Try smallcloudai/Refact-1_6B-fim

simonw commented 10 months ago

https://huggingface.co/smallcloudai/Refact-1_6B-fim - via https://news.ycombinator.com/item?id=37381862

simonw commented 10 months ago

>>> checkpoint = "smallcloudai/Refact-1_6B-fim"
>>> device = "cpu"
>>> from transformers import AutoModelForCausalLM, AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained(checkpoint)
Downloading (…)okenizer_config.json: 100%|████████████████████████████████████████████████████████████████████████████████████████| 717/717 [00:00<00:00, 1.27MB/s]
Downloading (…)olve/main/vocab.json: 100%|██████████████████████████████████████████████████████████████████████████████████████| 777k/777k [00:00<00:00, 5.37MB/s]
Downloading (…)olve/main/merges.txt: 100%|██████████████████████████████████████████████████████████████████████████████████████| 442k/442k [00:00<00:00, 3.40MB/s]
Downloading (…)/main/tokenizer.json: 100%|████████████████████████████████████████████████████████████████████████████████████| 2.06M/2.06M [00:00<00:00, 8.14MB/s]
Downloading (…)cial_tokens_map.json: 100%|████████████████████████████████████████████████████████████████████████████████████████| 532/532 [00:00<00:00, 1.54MB/s]
>>> 
>>> model = AutoModelForCausalLM.from_pretrained(checkpoint, trust_remote_code=True).to(device)
Downloading (…)lve/main/config.json: 100%|████████████████████████████████████████████████████████████████████████████████████████| 764/764 [00:00<00:00, 2.29MB/s]
Downloading (…)ration_gpt_refact.py: 100%|████████████████████████████████████████████████████████████████████████████████████| 2.00k/2.00k [00:00<00:00, 4.00MB/s]
A new version of the following files was downloaded from https://huggingface.co/smallcloudai/Refact-1_6B-fim:
- configuration_gpt_refact.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.
Downloading (…)deling_gpt_refact.py: 100%|████████████████████████████████████████████████████████████████████████████████████| 23.8k/23.8k [00:00<00:00, 58.1MB/s]
A new version of the following files was downloaded from https://huggingface.co/smallcloudai/Refact-1_6B-fim:
- modeling_gpt_refact.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.
Downloading pytorch_model.bin:   9%|███████▊                                                                                   | 545M/6.34G [00:13<02:28, 39.0MB/s]

simonw commented 10 months ago

% find .cache | grep deling
.cache/huggingface/modules/transformers_modules/smallcloudai/Refact-1_6B-fim/bdd25a3e0a3440b12964be202e27e9ee2cfce835/modeling_gpt_refact.py
.cache/huggingface/hub/models--smallcloudai--Refact-1_6B-fim/snapshots/bdd25a3e0a3440b12964be202e27e9ee2cfce835/modeling_gpt_refact.py
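
The download warning above suggests pinning a revision so that new versions of the remote code files are not fetched silently. A minimal sketch of that, assuming the commit hash visible in the cache paths is the one to pin; `load_pinned` is a hypothetical helper name, and the `transformers` import is deferred so the module can be imported without the library installed:

```python
CHECKPOINT = "smallcloudai/Refact-1_6B-fim"
# Commit hash copied from the snapshot directory in the `find` output above.
REVISION = "bdd25a3e0a3440b12964be202e27e9ee2cfce835"


def load_pinned(device="cpu"):
    """Load tokenizer and model at a fixed revision (hypothetical helper)."""
    # Imported lazily so this file can be imported without transformers.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT, revision=REVISION)
    model = AutoModelForCausalLM.from_pretrained(
        CHECKPOINT,
        revision=REVISION,
        trust_remote_code=True,  # still required: the model ships custom code
    ).to(device)
    return tokenizer, model
```

With the revision fixed, the reviewed copy of modeling_gpt_refact.py should be the one that keeps being used until the hash is changed on purpose.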

simonw commented 10 months ago

import time

start = time.time()

prompt = '<fim_prefix>def print_hello_world():\n    """<fim_suffix>\n    print("Hello world!")<fim_middle>'

inputs = tokenizer.encode(prompt, return_tensors="pt").to(device)
outputs = model.generate(inputs, max_length=100, temperature=0.2)
print("-"*80)
print(tokenizer.decode(outputs[0]))

end = time.time()

print(f"Time elapsed: {end-start} seconds")

simonw commented 10 months ago

>>> outputs = model.generate(inputs, max_length=100, temperature=0.2)
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.
print("-"*80)
print(tokenizer.decode(outputs[0]))

end = time.time()

print(f"Time elapsed: {end-start} seconds")
>>> print("-"*80)
--------------------------------------------------------------------------------
>>> print(tokenizer.decode(outputs[0]))
<fim_prefix>def print_hello_world():
    """<fim_suffix>
    print("Hello world!")<fim_middle>Prints 'Hello world!'"""<|endoftext|>
>>> 
>>> end = time.time()
>>> 
>>> print(f"Time elapsed: {end-start} seconds")
Time elapsed: 1.2332758903503418 seconds
>>> outputs
tensor([[    1,   589,  1459,    81,  7656,    81,  5860,  2262,   284,  1524,
             3,   284,  1459,   440,  8279,  5788, 15981,     2,  4014,   101,
           330,  8279,  5788, 20149,  2993,     0]])
>>> tokenizer.decode(outputs[0])
'<fim_prefix>def print_hello_world():\n    """<fim_suffix>\n    print("Hello world!")<fim_middle>Prints \'Hello world!\'"""<|endoftext|>'
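
The warning about the attention mask and pad token id can probably be avoided by calling the tokenizer directly (which returns both `input_ids` and `attention_mask`) and passing `pad_token_id` explicitly. A sketch, assuming the same `model` and `tokenizer` objects as above; `generate_with_mask` is a hypothetical helper name and `max_new_tokens=64` is an arbitrary choice:

```python
def generate_with_mask(model, tokenizer, prompt, device="cpu",
                       max_new_tokens=64, **generate_kwargs):
    """Generate with an explicit attention mask and pad token id."""
    # Calling the tokenizer (rather than tokenizer.encode) returns a mapping
    # that holds both input_ids and attention_mask.
    enc = tokenizer(prompt, return_tensors="pt").to(device)
    outputs = model.generate(
        enc["input_ids"],
        attention_mask=enc["attention_mask"],
        # Same fallback the warning describes, just stated explicitly.
        pad_token_id=tokenizer.eos_token_id,
        max_new_tokens=max_new_tokens,
        **generate_kwargs,
    )
    return tokenizer.decode(outputs[0])
```

Usage would look like `generate_with_mask(model, tokenizer, prompt, temperature=0.2)`.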

simonw commented 10 months ago

>>> prompt_template = "<empty_output>SYSTEM {system}\n" \
...                   "<empty_output>USER {query}\n" \
...                   "<empty_output>ASSISTANT"
>>> prompt = prompt_template.format(system="You are a programming assistant",
...                                 query="How do I sort a list in Python?")
>>> 
>>> 
>>> inputs = tokenizer.encode(prompt, return_tensors="pt").to(device)
>>> outputs = model.generate(inputs, max_length=100, temperature=0.2)
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.
print("-"*80)
print(tokenizer.decode(outputs[0]))

>>> print("-"*80)
--------------------------------------------------------------------------------
>>> print(tokenizer.decode(outputs[0]))
<empty_output>SYSTEM You are a programming assistant
<empty_output>USER How do I sort a list in Python?
<empty_output>ASSISTANT To sort a list in Python, you can use the sorted() function. For example, if you have a list called my_list and you want to sort it in ascending order, you can use the following code: sorted_list = sorted(my_list).
<empty_output><|endoftext|>

simonw commented 10 months ago

prompt_template = (
    "<empty_output>SYSTEM {system}\n"
    "<empty_output>USER {query}\n"
    "<empty_output>ASSISTANT"
)

def prompt(query, system="You are a programming assistant"):
    prompt_s = prompt_template.format(system=system, query=query)
    start = time.time()
    inputs = tokenizer.encode(prompt_s, return_tensors="pt").to(device)
    outputs = model.generate(inputs, max_length=4096, temperature=0.2)
    result = tokenizer.decode(outputs[0])
    end = time.time()
    print(f"Generation took {end-start:.2f} seconds")
    print()
    print(result)
    return result

simonw commented 10 months ago

>>> prompt("jq to extract the id and title keys from an array of objects")
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.
Generation took 1.68 seconds

<empty_output>SYSTEM You are a programming assistant
<empty_output>USER jq to extract the id and title keys from an array of objects
<empty_output>ASSISTANT jq '.[] | {id, title}'
<empty_output><|endoftext|>
"<empty_output>SYSTEM You are a programming assistant\n<empty_output>USER jq to extract the id and title keys from an array of objects\n<empty_output>ASSISTANT jq '.[] | {id, title}'\n<empty_output><|endoftext|>"
>>> 
>>> 
>>> prompt("SQL to join the sales and locations table on sales.location_id and sum up the amount column per location")
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.
Generation took 5.46 seconds

<empty_output>SYSTEM You are a programming assistant
<empty_output>USER SQL to join the sales and locations table on sales.location_id and sum up the amount column per location
<empty_output>ASSISTANT SELECT location_id, SUM(amount) AS total_sales
FROM sales
JOIN locations ON sales.location_id = locations.location_id
GROUP BY location_id;
<empty_output><|endoftext|>
'<empty_output>SYSTEM You are a programming assistant\n<empty_output>USER SQL to join the sales and locations table on sales.location_id and sum up the amount column per location\n<empty_output>ASSISTANT SELECT location_id, SUM(amount) AS total_sales\nFROM sales\nJOIN locations ON sales.location_id = locations.location_id\nGROUP BY location_id;\n<empty_output><|endoftext|>'
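
One cheap way to sanity-check generated SQL like the query above is to run it against an in-memory SQLite database. The schema below is an assumption (the real tables are never shown here): it gives both tables a `location_id` column, as the model's JOIN clause implies. Under that assumption the unqualified `location_id` is ambiguous and SQLite rejects the query, while a version with qualified column names runs.

```python
import sqlite3

# Assumed schema, not from the thread: both tables have a location_id column.
SCHEMA = """
CREATE TABLE locations (location_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE sales (id INTEGER PRIMARY KEY, location_id INTEGER, amount REAL);
INSERT INTO locations VALUES (1, 'North'), (2, 'South');
INSERT INTO sales (location_id, amount) VALUES (1, 10.0), (1, 5.0), (2, 7.5);
"""

# The query exactly as the model produced it.
GENERATED_SQL = """
SELECT location_id, SUM(amount) AS total_sales
FROM sales
JOIN locations ON sales.location_id = locations.location_id
GROUP BY location_id;
"""

# Same query with every column qualified by its table.
QUALIFIED_SQL = """
SELECT sales.location_id, SUM(sales.amount) AS total_sales
FROM sales
JOIN locations ON sales.location_id = locations.location_id
GROUP BY sales.location_id
ORDER BY sales.location_id;
"""


def try_query(conn, sql):
    """Return (rows, None) on success or (None, error message) on failure."""
    try:
        return conn.execute(sql).fetchall(), None
    except sqlite3.Error as exc:
        return None, str(exc)


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
generated_rows, generated_error = try_query(conn, GENERATED_SQL)
qualified_rows, qualified_error = try_query(conn, QUALIFIED_SQL)
print("generated:", generated_rows, generated_error)
print("qualified:", qualified_rows, qualified_error)
```

If the real `locations` table used a differently named key (for example `id`), the generated query would fail for a different reason, so the check is only as good as the assumed schema.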

simonw commented 10 months ago

Tried switching device to mps but got this:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 5, in prompt
  File "/Users/simon/.local/share/virtualenvs/llm-p4p8CDpq/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/Users/simon/.local/share/virtualenvs/llm-p4p8CDpq/lib/python3.10/site-packages/transformers/generation/utils.py", line 1642, in generate
    return self.sample(
  File "/Users/simon/.local/share/virtualenvs/llm-p4p8CDpq/lib/python3.10/site-packages/transformers/generation/utils.py", line 2724, in sample
    outputs = self(
  File "/Users/simon/.local/share/virtualenvs/llm-p4p8CDpq/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/Users/simon/.cache/huggingface/modules/transformers_modules/smallcloudai/Refact-1_6B-fim/bdd25a3e0a3440b12964be202e27e9ee2cfce835/modeling_gpt_refact.py", line 540, in forward
    transformer_outputs = self.transformer(
  File "/Users/simon/.local/share/virtualenvs/llm-p4p8CDpq/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/Users/simon/.cache/huggingface/modules/transformers_modules/smallcloudai/Refact-1_6B-fim/bdd25a3e0a3440b12964be202e27e9ee2cfce835/modeling_gpt_refact.py", line 422, in forward
    alibi = get_alibi_biases(hidden_states.shape[0], seq_length_with_past,
NotImplementedError: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript (most recent call last):
  File "/Users/simon/.cache/huggingface/modules/transformers_modules/smallcloudai/Refact-1_6B-fim/bdd25a3e0a3440b12964be202e27e9ee2cfce835/modeling_gpt_refact.py", line 103, in get_alibi_biases
        mask = torch.ones((T, T), device=dev, dtype=torch.bool)

    m = _get_slopes(attn_heads, dev)
        ~~~~~~~~~~~ <--- HERE

    # Calculate distances $[0, 1, \dots, N]$
  File "/Users/simon/.cache/huggingface/modules/transformers_modules/smallcloudai/Refact-1_6B-fim/bdd25a3e0a3440b12964be202e27e9ee2cfce835/modeling_gpt_refact.py", line 66, in _get_slopes
    m_0 = 2.0 ** (-8.0 / n)
    # $2^{-1\frac{8}{n}}, 2^{-2 \frac{8}{n}}, 2^{-3 \frac{8}{n}}, \dots$
    m = torch.pow(m_0, torch.arange(1, 1 + n, device=dev))
        ~~~~~~~~~ <--- HERE

    # If `n_heads` is not a power of 2, then we add the remaining slopes.
RuntimeError: The operator 'aten::pow.Scalar_out' is not currently implemented for the MPS device. If you want this op to be added in priority during the prototype phase of this feature, please comment on https://github.com/pytorch/pytorch/issues/77764. As a temporary fix, you can set the environment variable `PYTORCH_ENABLE_MPS_FALLBACK=1` to use the CPU as a fallback for this op. WARNING: this will be slower than running natively on MPS.
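
The error message names its own temporary fix: set `PYTORCH_ENABLE_MPS_FALLBACK=1` so unsupported ops fall back to the CPU. A sketch of doing that from Python, assuming it is safest to set the variable before `torch` is imported; `pick_device` is a hypothetical helper name:

```python
import os

# Set before importing torch so the fallback is in place from the start.
os.environ["PYTORCH_ENABLE_MPS_FALLBACK"] = "1"


def pick_device():
    """Return "mps" when the MPS backend is available, else "cpu"."""
    import torch  # deliberately imported after the environment variable is set

    return "mps" if torch.backends.mps.is_available() else "cpu"
```

As the message warns, ops that fall back will run slower than native MPS, so this may not end up faster than plain CPU for this model.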

simonw commented 10 months ago

I wonder if I could get this to stream? generate() docstring has this:

 |          streamer (`BaseStreamer`, *optional*):
 |              Streamer object that will be used to stream the generated sequences. Generated tokens are passed
 |              through `streamer.put(token_ids)` and the streamer is responsible for any further processing.

simonw commented 10 months ago

Useful example of streaming code: https://github.com/jerryjliu/llama_index/blob/1dffadee72addff5c66b6739909c06081b2673c2/llama_index/llms/huggingface.py#L234
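
A possible way to use that `streamer` argument is the `TextIteratorStreamer` class from transformers: run `generate()` in a background thread and iterate over the streamer for decoded text chunks. This is a sketch and has not been run against this model; `stream_generate` is a hypothetical helper name and `max_new_tokens=256` is an arbitrary choice:

```python
from threading import Thread


def stream_generate(model, tokenizer, prompt_text, device="cpu",
                    max_new_tokens=256):
    """Yield decoded text chunks while generate() runs in a thread."""
    # Imported lazily so this file can be imported without transformers.
    from transformers import TextIteratorStreamer

    inputs = tokenizer.encode(prompt_text, return_tensors="pt").to(device)
    # skip_prompt drops the echoed prompt; skip_special_tokens is forwarded
    # to the tokenizer's decode call.
    streamer = TextIteratorStreamer(
        tokenizer, skip_prompt=True, skip_special_tokens=True
    )
    thread = Thread(
        target=model.generate,
        args=(inputs,),
        kwargs={"streamer": streamer, "max_new_tokens": max_new_tokens},
    )
    thread.start()
    for chunk in streamer:  # blocks until the next chunk is ready
        yield chunk
    thread.join()
```

Usage would be along the lines of `for chunk in stream_generate(model, tokenizer, prompt_s): print(chunk, end="", flush=True)`.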

simonw commented 10 months ago

I got a very basic LLM plugin working:

from transformers import AutoModelForCausalLM, AutoTokenizer
import llm

checkpoint = "smallcloudai/Refact-1_6B-fim"

prompt_template = (
    "<empty_output>SYSTEM {system}\n"
    "<empty_output>USER {query}\n"
    "<empty_output>ASSISTANT"
)

def run_prompt(model, device, tokenizer, query, system="You are a programming assistant"):
    prompt_s = prompt_template.format(system=system, query=query)
    inputs = tokenizer.encode(prompt_s, return_tensors="pt").to(device)
    outputs = model.generate(inputs, max_length=4096, temperature=0.2)
    result = tokenizer.decode(outputs[0])
    return result

@llm.hookimpl
def register_models(register):
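    # model_id is already "refact-<device>", so only the GPU model needs an alias.
    # "cuda" is the PyTorch device string for an NVIDIA GPU; "gpu" is not valid.
    register(Refact(device="cpu"))
    register(Refact(device="cuda"), aliases=("refact-gpu",))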

class Refact(llm.Model):
    def __init__(self, device):
        self.model_id = "refact-{}".format(device)
        self._device = device
        self._model = None
        self._tokenizer = None

    def execute(self, prompt, stream, response, conversation):
        if self._tokenizer is None:
            self._tokenizer = AutoTokenizer.from_pretrained(checkpoint)
        if self._model is None:
            self._model = AutoModelForCausalLM.from_pretrained(
                checkpoint, trust_remote_code=True
            ).to(self._device)
        result = run_prompt(
            self._model,
            self._device,
            self._tokenizer,
            query=prompt.prompt,
            system=prompt.system or "You are a programming assistant",
        )
        return [result]

Needs conversation support, and also needs to parse the returned output and strip out everything except the response. It currently does this:

llm -m refact-cpu 'Python to read rows as dicts from a CSV'

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.
<empty_output>SYSTEM You are a programming assistant
<empty_output>USER Python to read rows as dicts from a CSV
<empty_output>ASSISTANT def read_csv(file_path):
    with open(file_path, 'r') as file:
        reader = csv.DictReader(file)
        for row in reader:
            yield row
<empty_output><|endoftext|>
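
A rough sketch of the output parsing mentioned above: keep only the text after the last `<empty_output>ASSISTANT` marker and peel off the trailing `<empty_output>` and `<|endoftext|>` tokens. `extract_response` is a hypothetical helper name, and the marker strings are taken from the outputs shown here rather than from any documentation:

```python
ASSISTANT_MARKER = "<empty_output>ASSISTANT"
TRAILING_TOKENS = ("<|endoftext|>", "<empty_output>")


def extract_response(decoded: str) -> str:
    """Return only the assistant's reply from a decoded generation."""
    # Keep what follows the last assistant marker; if the marker is missing,
    # fall back to returning the whole string.
    _, found, tail = decoded.rpartition(ASSISTANT_MARKER)
    if not found:
        return decoded.strip()
    tail = tail.strip()
    # Peel off trailing special tokens in whatever order they appear.
    stripped = True
    while stripped:
        stripped = False
        for token in TRAILING_TOKENS:
            if tail.endswith(token):
                tail = tail[: -len(token)].rstrip()
                stripped = True
    return tail
```

In the plugin this could wrap the return value of `run_prompt`, for example `return [extract_response(result)]`.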

mitya52 commented 10 months ago

@simonw hi! Nice to see you're interested in our model. I should mention that we already have this model integrated into our open source coding assistance system. BTW, the model will run faster if you pass torch_dtype=torch.float16 to AutoModelForCausalLM.from_pretrained
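
A sketch of what that suggestion might look like in code. `load_half_precision` is a hypothetical helper name, and defaulting to a CUDA device is an assumption: half precision is usually aimed at GPUs, and some PyTorch versions may lack float16 kernels for certain CPU ops.

```python
def load_half_precision(checkpoint="smallcloudai/Refact-1_6B-fim",
                        device="cuda"):
    """Load the model with float16 weights (hypothetical helper)."""
    # Imported lazily so this file can be imported without torch installed.
    import torch
    from transformers import AutoModelForCausalLM

    return AutoModelForCausalLM.from_pretrained(
        checkpoint,
        trust_remote_code=True,
        torch_dtype=torch.float16,  # the dtype suggested in the comment above
    ).to(device)
```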

mitya52 commented 10 months ago

@simonw about format

FIM works with both PSM and SPM formats, but SPM is preferable for this model. Something like this: change <fim_prefix>def print_hello_world():\n """<fim_suffix>\n print("Hello world!")<fim_middle> to <fim_suffix>\n print("Hello world!")<fim_prefix>def print_hello_world():\n """<fim_middle>
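
The two orderings can be sketched as a small prompt builder. `fim_prompt` is a hypothetical helper name; the token strings are the ones used earlier in this thread, and the example prefix and suffix use the four-space indentation from the original FIM prompt:

```python
FIM_PREFIX = "<fim_prefix>"
FIM_SUFFIX = "<fim_suffix>"
FIM_MIDDLE = "<fim_middle>"


def fim_prompt(prefix: str, suffix: str, order: str = "spm") -> str:
    """Build a fill-in-the-middle prompt in PSM or SPM order."""
    if order == "psm":
        # Prefix, suffix, middle: the format used earlier in this thread.
        return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}"
    if order == "spm":
        # Suffix, prefix, middle: the ordering described as preferable.
        return f"{FIM_SUFFIX}{suffix}{FIM_PREFIX}{prefix}{FIM_MIDDLE}"
    raise ValueError(f"unknown order: {order!r}")


prefix = 'def print_hello_world():\n    """'
suffix = '\n    print("Hello world!")'
print(fim_prompt(prefix, suffix, order="psm"))
print(fim_prompt(prefix, suffix, order="spm"))
```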

Chat format has some limitations with system prompt, see this code

simonw / public-notes

Try smallcloudai/Refact-1_6B-fim #12