Open simonw opened 10 months ago
>>> checkpoint = "smallcloudai/Refact-1_6B-fim"
>>> device = "cpu"
>>> from transformers import AutoModelForCausalLM, AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained(checkpoint)
Downloading (…)okenizer_config.json: 100%|████████████████████████████████████████████████████████████████████████████████████████| 717/717 [00:00<00:00, 1.27MB/s]
Downloading (…)olve/main/vocab.json: 100%|██████████████████████████████████████████████████████████████████████████████████████| 777k/777k [00:00<00:00, 5.37MB/s]
Downloading (…)olve/main/merges.txt: 100%|██████████████████████████████████████████████████████████████████████████████████████| 442k/442k [00:00<00:00, 3.40MB/s]
Downloading (…)/main/tokenizer.json: 100%|████████████████████████████████████████████████████████████████████████████████████| 2.06M/2.06M [00:00<00:00, 8.14MB/s]
Downloading (…)cial_tokens_map.json: 100%|████████████████████████████████████████████████████████████████████████████████████████| 532/532 [00:00<00:00, 1.54MB/s]
>>>
>>> model = AutoModelForCausalLM.from_pretrained(checkpoint, trust_remote_code=True).to(device)
Downloading (…)lve/main/config.json: 100%|████████████████████████████████████████████████████████████████████████████████████████| 764/764 [00:00<00:00, 2.29MB/s]
Downloading (…)ration_gpt_refact.py: 100%|████████████████████████████████████████████████████████████████████████████████████| 2.00k/2.00k [00:00<00:00, 4.00MB/s]
A new version of the following files was downloaded from https://huggingface.co/smallcloudai/Refact-1_6B-fim:
- configuration_gpt_refact.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.
Downloading (…)deling_gpt_refact.py: 100%|████████████████████████████████████████████████████████████████████████████████████| 23.8k/23.8k [00:00<00:00, 58.1MB/s]
A new version of the following files was downloaded from https://huggingface.co/smallcloudai/Refact-1_6B-fim:
- modeling_gpt_refact.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.
Downloading pytorch_model.bin: 9%|███████▊ | 545M/6.34G [00:13<02:28, 39.0MB/s]
% find .cache | grep deling
.cache/huggingface/modules/transformers_modules/smallcloudai/Refact-1_6B-fim/bdd25a3e0a3440b12964be202e27e9ee2cfce835/modeling_gpt_refact.py
.cache/huggingface/hub/models--smallcloudai--Refact-1_6B-fim/snapshots/bdd25a3e0a3440b12964be202e27e9ee2cfce835/modeling_gpt_refact.py
import time
start = time.time()
prompt = '<fim_prefix>def print_hello_world():\n """<fim_suffix>\n print("Hello world!")<fim_middle>'
inputs = tokenizer.encode(prompt, return_tensors="pt").to(device)
outputs = model.generate(inputs, max_length=100, temperature=0.2)
print("-"*80)
print(tokenizer.decode(outputs[0]))
end = time.time()
print(f"Time elapsed: {end-start} seconds")
>>> outputs = model.generate(inputs, max_length=100, temperature=0.2)
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.
print("-"*80)
print(tokenizer.decode(outputs[0]))
end = time.time()
print(f"Time elapsed: {end-start} seconds")
>>> print("-"*80)
--------------------------------------------------------------------------------
>>> print(tokenizer.decode(outputs[0]))
<fim_prefix>def print_hello_world():
"""<fim_suffix>
print("Hello world!")<fim_middle>Prints 'Hello world!'"""<|endoftext|>
>>>
>>> end = time.time()
>>>
>>> print(f"Time elapsed: {end-start} seconds")
Time elapsed: 1.2332758903503418 seconds
>>> outputs
tensor([[ 1, 589, 1459, 81, 7656, 81, 5860, 2262, 284, 1524,
3, 284, 1459, 440, 8279, 5788, 15981, 2, 4014, 101,
330, 8279, 5788, 20149, 2993, 0]])
>>> tokenizer.decode(outputs[0])
'<fim_prefix>def print_hello_world():\n """<fim_suffix>\n print("Hello world!")<fim_middle>Prints \'Hello world!\'"""<|endoftext|>'
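The decoded output echoes the whole prompt, so the generated middle still has to be cut out of it. A minimal sketch, assuming the completion is everything between <fim_middle> and <|endoftext|>; extract_middle and fill_in are hypothetical helper names, not part of the model's API:

```python
FIM_MIDDLE = "<fim_middle>"
EOS = "<|endoftext|>"


def extract_middle(decoded):
    """Return only the text generated after <fim_middle> (empty if absent)."""
    _, _, completion = decoded.partition(FIM_MIDDLE)
    # Drop the end-of-text marker and anything after it, if present.
    return completion.split(EOS, 1)[0]


def fill_in(prefix, suffix, decoded):
    """Splice the generated middle back between the prefix and the suffix."""
    return prefix + extract_middle(decoded) + suffix


decoded = (
    '<fim_prefix>def print_hello_world():\n    """<fim_suffix>\n'
    '    print("Hello world!")<fim_middle>Prints \'Hello world!\'"""<|endoftext|>'
)
print(extract_middle(decoded))
```

Passing the original prefix and suffix to fill_in then gives back the completed function as one string.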
>>> prompt_template = "<empty_output>SYSTEM {system}\n" \
... "<empty_output>USER {query}\n" \
... "<empty_output>ASSISTANT"
>>> prompt = prompt_template.format(system="You are a programming assistant",
... query="How do I sort a list in Python?")
>>>
>>>
>>> inputs = tokenizer.encode(prompt, return_tensors="pt").to(device)
>>> outputs = model.generate(inputs, max_length=100, temperature=0.2)
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.
print("-"*80)
print(tokenizer.decode(outputs[0]))
>>> print("-"*80)
--------------------------------------------------------------------------------
>>> print(tokenizer.decode(outputs[0]))
<empty_output>SYSTEM You are a programming assistant
<empty_output>USER How do I sort a list in Python?
<empty_output>ASSISTANT To sort a list in Python, you can use the sorted() function. For example, if you have a list called my_list and you want to sort it in ascending order, you can use the following code: sorted_list = sorted(my_list).
<empty_output><|endoftext|>
prompt_template = (
    "<empty_output>SYSTEM {system}\n"
    "<empty_output>USER {query}\n"
    "<empty_output>ASSISTANT"
)

def prompt(query, system="You are a programming assistant"):
    prompt_s = prompt_template.format(system=system, query=query)
    start = time.time()
    inputs = tokenizer.encode(prompt_s, return_tensors="pt").to(device)
    outputs = model.generate(inputs, max_length=4096, temperature=0.2)
    result = tokenizer.decode(outputs[0])
    end = time.time()
    print(f"Generation took {end-start:.2f} seconds")
    print()
    print(result)
    return result
>>> prompt("jq to extract the id and title keys from an array of objects")
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.
Generation took 1.68 seconds
<empty_output>SYSTEM You are a programming assistant
<empty_output>USER jq to extract the id and title keys from an array of objects
<empty_output>ASSISTANT jq '.[] | {id, title}'
<empty_output><|endoftext|>
"<empty_output>SYSTEM You are a programming assistant\n<empty_output>USER jq to extract the id and title keys from an array of objects\n<empty_output>ASSISTANT jq '.[] | {id, title}'\n<empty_output><|endoftext|>"
>>>
>>>
>>> prompt("SQL to join the sales and locations table on sales.location_id and sum up the amount column per location")
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.
Generation took 5.46 seconds
<empty_output>SYSTEM You are a programming assistant
<empty_output>USER SQL to join the sales and locations table on sales.location_id and sum up the amount column per location
<empty_output>ASSISTANT SELECT location_id, SUM(amount) AS total_sales
FROM sales
JOIN locations ON sales.location_id = locations.location_id
GROUP BY location_id;
<empty_output><|endoftext|>
'<empty_output>SYSTEM You are a programming assistant\n<empty_output>USER SQL to join the sales and locations table on sales.location_id and sum up the amount column per location\n<empty_output>ASSISTANT SELECT location_id, SUM(amount) AS total_sales\nFROM sales\nJOIN locations ON sales.location_id = locations.location_id\nGROUP BY location_id;\n<empty_output><|endoftext|>'
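The warning that repeats in the transcripts above asks for an attention mask and a pad token id. A minimal sketch of one way to supply both, assuming that padding with the EOS token is acceptable (the log says that is what transformers falls back to anyway); generate_with_mask is a hypothetical helper name:

```python
def generate_with_mask(model, tokenizer, prompt_s, device="cpu", **generate_kwargs):
    """Run generate() with an explicit attention mask and pad token id."""
    # Calling the tokenizer directly (instead of tokenizer.encode) returns
    # both input_ids and attention_mask.
    inputs = tokenizer(prompt_s, return_tensors="pt").to(device)
    # Assumption: reuse the EOS token for padding unless the caller overrides it.
    generate_kwargs.setdefault("pad_token_id", tokenizer.eos_token_id)
    outputs = model.generate(**inputs, **generate_kwargs)
    return tokenizer.decode(outputs[0])
```

It would be called as generate_with_mask(model, tokenizer, prompt_s, device, max_length=100, temperature=0.2), with the same model and tokenizer objects loaded earlier.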
Tried switching the device to mps, but got this:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "<stdin>", line 5, in prompt
File "/Users/simon/.local/share/virtualenvs/llm-p4p8CDpq/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/Users/simon/.local/share/virtualenvs/llm-p4p8CDpq/lib/python3.10/site-packages/transformers/generation/utils.py", line 1642, in generate
return self.sample(
File "/Users/simon/.local/share/virtualenvs/llm-p4p8CDpq/lib/python3.10/site-packages/transformers/generation/utils.py", line 2724, in sample
outputs = self(
File "/Users/simon/.local/share/virtualenvs/llm-p4p8CDpq/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/Users/simon/.cache/huggingface/modules/transformers_modules/smallcloudai/Refact-1_6B-fim/bdd25a3e0a3440b12964be202e27e9ee2cfce835/modeling_gpt_refact.py", line 540, in forward
transformer_outputs = self.transformer(
File "/Users/simon/.local/share/virtualenvs/llm-p4p8CDpq/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/Users/simon/.cache/huggingface/modules/transformers_modules/smallcloudai/Refact-1_6B-fim/bdd25a3e0a3440b12964be202e27e9ee2cfce835/modeling_gpt_refact.py", line 422, in forward
alibi = get_alibi_biases(hidden_states.shape[0], seq_length_with_past,
NotImplementedError: The following operation failed in the TorchScript interpreter.
Traceback of TorchScript (most recent call last):
File "/Users/simon/.cache/huggingface/modules/transformers_modules/smallcloudai/Refact-1_6B-fim/bdd25a3e0a3440b12964be202e27e9ee2cfce835/modeling_gpt_refact.py", line 103, in get_alibi_biases
mask = torch.ones((T, T), device=dev, dtype=torch.bool)
m = _get_slopes(attn_heads, dev)
~~~~~~~~~~~ <--- HERE
# Calculate distances $[0, 1, \dots, N]$
File "/Users/simon/.cache/huggingface/modules/transformers_modules/smallcloudai/Refact-1_6B-fim/bdd25a3e0a3440b12964be202e27e9ee2cfce835/modeling_gpt_refact.py", line 66, in _get_slopes
m_0 = 2.0 ** (-8.0 / n)
# $2^{-1\frac{8}{n}}, 2^{-2 \frac{8}{n}}, 2^{-3 \frac{8}{n}}, \dots$
m = torch.pow(m_0, torch.arange(1, 1 + n, device=dev))
~~~~~~~~~ <--- HERE
# If `n_heads` is not a power of 2, then we add the remaining slopes.
RuntimeError: The operator 'aten::pow.Scalar_out' is not currently implemented for the MPS device. If you want this op to be added in priority during the prototype phase of this feature, please comment on https://github.com/pytorch/pytorch/issues/77764. As a temporary fix, you can set the environment variable `PYTORCH_ENABLE_MPS_FALLBACK=1` to use the CPU as a fallback for this op. WARNING: this will be slower than running natively on MPS.
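The error message itself suggests the fallback environment variable. A small sketch of setting it from Python rather than the shell; I am assuming it has to be in place before torch is first imported, so this would go at the very top of the script. enable_mps_fallback is a hypothetical helper name:

```python
import os


def enable_mps_fallback():
    """Ask PyTorch to run ops that MPS lacks on the CPU instead."""
    # setdefault keeps any value that was already exported in the shell.
    return os.environ.setdefault("PYTORCH_ENABLE_MPS_FALLBACK", "1")


enable_mps_fallback()
# Assumption: import torch and transformers only after this point.
```

As the warning says, ops that fall back to the CPU will be slower than running natively on MPS.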
I wonder if I could get this to stream? The generate() docstring has this:
| streamer (`BaseStreamer`, *optional*):
| Streamer object that will be used to stream the generated sequences. Generated tokens are passed
| through `streamer.put(token_ids)` and the streamer is responsible for any further processing.
Useful example of streaming code: https://github.com/jerryjliu/llama_index/blob/1dffadee72addff5c66b6739909c06081b2673c2/llama_index/llms/huggingface.py#L234
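Along the same lines, here is a minimal sketch of streaming through the streamer argument quoted above. It assumes a transformers version that ships TextIteratorStreamer; stream_generate is a hypothetical helper name. generate() runs in a background thread so the caller can iterate over text chunks as they arrive:

```python
from threading import Thread


def stream_generate(model, tokenizer, prompt_s, device="cpu", streamer=None, **generate_kwargs):
    """Yield chunks of generated text while generate() runs in a thread."""
    if streamer is None:
        # Imported lazily so the helper can be defined without transformers.
        from transformers import TextIteratorStreamer
        streamer = TextIteratorStreamer(tokenizer, skip_prompt=True)
    inputs = tokenizer.encode(prompt_s, return_tensors="pt").to(device)
    kwargs = dict(inputs=inputs, streamer=streamer, **generate_kwargs)
    # generate() blocks until it finishes, so it gets its own thread.
    thread = Thread(target=model.generate, kwargs=kwargs)
    thread.start()
    for chunk in streamer:
        yield chunk
    thread.join()
```

Usage would look something like: for chunk in stream_generate(model, tokenizer, prompt_s, device, max_length=4096, temperature=0.2): print(chunk, end="", flush=True).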
I got a very basic LLM plugin working:
from transformers import AutoModelForCausalLM, AutoTokenizer
import llm

checkpoint = "smallcloudai/Refact-1_6B-fim"

prompt_template = (
    "<empty_output>SYSTEM {system}\n"
    "<empty_output>USER {query}\n"
    "<empty_output>ASSISTANT"
)


def run_prompt(model, device, tokenizer, query, system="You are a programming assistant"):
    prompt_s = prompt_template.format(system=system, query=query)
    inputs = tokenizer.encode(prompt_s, return_tensors="pt").to(device)
    outputs = model.generate(inputs, max_length=4096, temperature=0.2)
    result = tokenizer.decode(outputs[0])
    return result


@llm.hookimpl
def register_models(register):
    # Each alias matches the device its model runs on.
    # NOTE: PyTorch has no "gpu" device string, so the "gpu" variant will
    # likely need "cuda" (or "mps") before it can actually load.
    register(Refact(device="gpu"), aliases=("refact-gpu",))
    register(Refact(device="cpu"), aliases=("refact-cpu",))


class Refact(llm.Model):
    def __init__(self, device):
        self.model_id = "refact-{}".format(device)
        self._device = device
        self._model = None
        self._tokenizer = None

    def execute(self, prompt, stream, response, conversation):
        # Load the tokenizer and model lazily, on first use.
        if self._tokenizer is None:
            self._tokenizer = AutoTokenizer.from_pretrained(checkpoint)
        if self._model is None:
            self._model = AutoModelForCausalLM.from_pretrained(
                checkpoint, trust_remote_code=True
            ).to(self._device)
        result = run_prompt(
            self._model,
            self._device,
            self._tokenizer,
            query=prompt.prompt,
            system=prompt.system or "You are a programming assistant",
        )
        return [result]
Needs conversation support, and also needs to parse the returned output and strip out everything except the response. It currently does this:
llm -m refact-cpu 'Python to read rows as dicts from a CSV'
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.
<empty_output>SYSTEM You are a programming assistant
<empty_output>USER Python to read rows as dicts from a CSV
<empty_output>ASSISTANT def read_csv(file_path):
    with open(file_path, 'r') as file:
        reader = csv.DictReader(file)
        for row in reader:
            yield row
<empty_output><|endoftext|>
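For the "strip out everything except the response" part, a minimal sketch, assuming the reply always sits between the last <empty_output>ASSISTANT marker and the next <empty_output> or <|endoftext|>, as in the transcripts in this thread; extract_response is a hypothetical helper name:

```python
ASSISTANT_MARKER = "<empty_output>ASSISTANT"
STOP_MARKERS = ("<empty_output>", "<|endoftext|>")


def extract_response(decoded):
    """Return just the assistant's reply from a decoded chat transcript."""
    # Keep only what follows the last ASSISTANT marker.
    _, found, response = decoded.rpartition(ASSISTANT_MARKER)
    if not found:
        # No marker at all: fall back to the whole text.
        return decoded.strip()
    # Cut the reply at the first stop marker that follows it.
    for marker in STOP_MARKERS:
        response = response.split(marker, 1)[0]
    return response.strip()


decoded = (
    "<empty_output>SYSTEM You are a programming assistant\n"
    "<empty_output>USER jq to extract the id and title keys from an array of objects\n"
    "<empty_output>ASSISTANT jq '.[] | {id, title}'\n"
    "<empty_output><|endoftext|>"
)
print(extract_response(decoded))
```

In the plugin this could wrap the value returned by run_prompt before execute() hands it back.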
@simonw hi! Nice to see you're interested in our model. I should say that we already have this model integrated into our open source coding assistance system. BTW, the model will run faster if you pass torch_dtype=torch.float16 to AutoModelForCausalLM.from_pretrained.
@simonw about the format:
FIM works with both PSM and SPM formats, but SPM is preferable for this model.
Something like this:
<fim_prefix>def print_hello_world():\n """<fim_suffix>\n print("Hello world!")<fim_middle>
to
<fim_suffix>\n print("Hello world!")<fim_prefix>def print_hello_world():\n """<fim_middle>
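A small sketch of building either ordering from the same prefix and suffix, following the two layouts shown above; fim_prompt and its order parameter are hypothetical names for illustration:

```python
def fim_prompt(prefix, suffix, order="spm"):
    """Build a fill-in-the-middle prompt in SPM or PSM token order."""
    if order == "psm":
        # Prefix, then suffix, then the middle to be generated.
        return f"<fim_prefix>{prefix}<fim_suffix>{suffix}<fim_middle>"
    if order == "spm":
        # Suffix first, then prefix, then the middle to be generated.
        return f"<fim_suffix>{suffix}<fim_prefix>{prefix}<fim_middle>"
    raise ValueError(f"unknown FIM order: {order!r}")


prefix = 'def print_hello_world():\n    """'
suffix = '\n    print("Hello world!")'
print(fim_prompt(prefix, suffix))
```

Defaulting to "spm" reflects the recommendation above; passing order="psm" reproduces the layout used in the first experiment.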
The chat format has some limitations with the system prompt; see this code
https://huggingface.co/smallcloudai/Refact-1_6B-fim - via https://news.ycombinator.com/item?id=37381862