helloNarehase / CoreML_Llama3

Llama3 converted to CoreML

Python code that generates text from the converted CoreML model #2

Open jean-anton opened 4 days ago

jean-anton commented 4 days ago

Hello,

I'm new to transformers and Core ML. I converted the Llama-3.2-1B-Instruct model from https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct to a Core ML model using: python convert.py --model_dir /Users/jg/Documents/huggingface/models/Llama-3.2-1B-Instruct/ --output_dir ./Llama-3.2-1B-Instruct.mlpackage

Could you please provide Python code that generates text from the converted Core ML model using the ANE on my MacBook M2?

With the code I tried, I get errors like:

import os

import coremltools as ct
from transformers import AutoTokenizer

# Load the converted Core ML package; ComputeUnit.ALL allows CPU, GPU and ANE
model = ct.models.MLModel('Llama-3.2-1B-Instruct.mlpackage', compute_units=ct.ComputeUnit.ALL)

tokenizer = AutoTokenizer.from_pretrained(
    "meta-llama/Llama-3.2-1B-Instruct", token=os.environ["HF_TOKEN"]
)

def generate_text(prompt, max_length=100):
    # Tokenize the prompt into NumPy arrays
    inputs = tokenizer(prompt, return_tensors='np')

    input_dict = {
        'input_ids': inputs['input_ids'],
        'attention_mask': inputs['attention_mask']
    }

    predictions = model.predict(input_dict)

    generated_text = predictions['output_ids']
    generated_text = tokenizer.decode(generated_text[0], skip_special_tokens=True)

    return generated_text[:max_length]

prompt = 'Hello, how are you?'
generated_text = generate_text(prompt)
print(generated_text)
/Users/jg/Devel/Projects/Pycharm/CoreML_Llama3/.venv/bin/python /Users/jg/Devel/Projects/Pycharm/CoreML_Llama3/main_gen_text_lama405B_test1.py 
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
    - Avoid using `tokenizers` before the fork if possible
    - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
Traceback (most recent call last):
  File "/Users/jg/Devel/Projects/Pycharm/CoreML_Llama3/main_gen_text_lama405B_test1.py", line 29, in <module>
    generated_text = generate_text(prompt)
  File "/Users/jg/Devel/Projects/Pycharm/CoreML_Llama3/main_gen_text_lama405B_test1.py", line 20, in generate_text
    predictions = model.predict(input_dict)
  File "/Users/jg/Devel/Projects/Pycharm/CoreML_Llama3/.venv/lib/python3.8/site-packages/coremltools/models/model.py", line 777, in predict
    return self._get_predictions(self.__proxy__,
  File "/Users/jg/Devel/Projects/Pycharm/CoreML_Llama3/.venv/lib/python3.8/site-packages/coremltools/models/model.py", line 825, in _get_predictions
    preprocess_method(data)
  File "/Users/jg/Devel/Projects/Pycharm/CoreML_Llama3/.venv/lib/python3.8/site-packages/coremltools/models/model.py", line 765, in verify_and_convert_input_dict
    self._verify_input_dict(d)
  File "/Users/jg/Devel/Projects/Pycharm/CoreML_Llama3/.venv/lib/python3.8/site-packages/coremltools/models/model.py", line 950, in _verify_input_dict
    self._verify_input_name_exists(input_dict)
  File "/Users/jg/Devel/Projects/Pycharm/CoreML_Llama3/.venv/lib/python3.8/site-packages/coremltools/models/model.py", line 993, in _verify_input_name_exists
    raise KeyError(err_msg.format(given_input, self._model_input_names_set))
KeyError: 'Provided key "attention_mask", in the input dict, does not match any of the model input name(s), which are: {\'input_ids\', \'causal_mask\'}'
helloNarehase commented 4 days ago

The attention_mask should be replaced with the causal_mask key!

helloNarehase commented 4 days ago
input_dict = {
    'input_ids': inputs['input_ids'],
    'causal_mask': inputs['attention_mask']
}
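
Note that the tokenizer's attention_mask is a padding mask of ones, which is not the same thing as a causal mask; depending on how the model was exported, causal_mask is typically an additive float mask over the sequence. A minimal sketch of building one (the expected shape and dtype are assumptions, check the converted model's input description):

import numpy as np

seq_len = inputs['input_ids'].shape[-1]

# Additive causal mask: 0.0 where attention is allowed, a large negative
# value above the diagonal so future positions are masked out.
causal_mask = np.triu(np.full((seq_len, seq_len), -1e9, dtype=np.float32), k=1)
causal_mask = causal_mask[None, None, :, :]   # -> (1, 1, seq_len, seq_len)

input_dict = {
    'input_ids': inputs['input_ids'],
    'causal_mask': causal_mask
}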
jean-anton commented 4 days ago

Thank you for your response, but now I get another error. Could you please share full Python code that you have tested with this model and that runs on the ANE?

new error:

Error: value type not convertible:
[[128000   9906     11   1268    527    499     30]]
Traceback (most recent call last):
  File "/Users/jg/Devel/Projects/Pycharm/CoreML_Llama3/main_gen_text_lama405B_test1.py", line 29, in <module>
    generated_text = generate_text(prompt)
  File "/Users/jg/Devel/Projects/Pycharm/CoreML_Llama3/main_gen_text_lama405B_test1.py", line 20, in generate_text
    predictions = model.predict(input_dict)
  File "/Users/jg/Devel/Projects/Pycharm/CoreML_Llama3/.venv/lib/python3.8/site-packages/coremltools/models/model.py", line 777, in predict
    return self._get_predictions(self.__proxy__,
  File "/Users/jg/Devel/Projects/Pycharm/CoreML_Llama3/.venv/lib/python3.8/site-packages/coremltools/models/model.py", line 827, in _get_predictions
    return proxy.predict(data, state)
RuntimeError: value type not convertible
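
The "value type not convertible" error generally means the NumPy arrays being passed have a dtype Core ML cannot accept; the tokenizer returns int64 token IDs, which usually need to be cast down. A minimal sketch of the cast, assuming the model takes int32 IDs and a float mask:

import numpy as np

ids = inputs['input_ids'].astype(np.int32)   # Core ML multi-arrays do not take int64
seq_len = ids.shape[-1]
causal_mask = np.triu(np.full((seq_len, seq_len), -1e9, dtype=np.float32), k=1)[None, None, :, :]

predictions = model.predict({'input_ids': ids, 'causal_mask': causal_mask})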
helloNarehase commented 4 days ago

Unfortunately, the model currently does not work on the ANE, and we are still investigating this issue! I will update the inference-related logic shortly, so please wait. Thank you.

jean-anton commented 4 days ago

ok thank you, so I will wait for your update!

helloNarehase commented 3 days ago

Sorry for the delay! I’ve updated the code to make the model compatible with AutoTokenizer. Take a look at the “Inference.ipynb” file, and feel free to let me know if you run into any problems.

helloNarehase commented 3 days ago

The model outputs tokens, not text.
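
For example, whatever sequence of token IDs the generation loop produces can be turned back into text with the same tokenizer (a minimal sketch; generated_ids is a hypothetical name for that array):

text = tokenizer.decode(generated_ids[0].tolist(), skip_special_tokens=True)
print(text)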

jean-anton commented 3 days ago

Thank you for your response, but I have tried to get the text out of the tokens without success. Could you please share code for that? I tried:

from transformers import AutoTokenizer
import coremltools as ct
import os
import numpy as np

model_path = "Llama-3.2-1B-Instruct.mlpackage"

tokenizer = AutoTokenizer.from_pretrained(
    "meta-llama/Llama-3.2-1B-Instruct", token=os.environ["HF_TOKEN"]
)

mlmodel_fp16 = ct.models.MLModel(model_path)

inputs = tokenizer("Hello how are you?", return_tensors='np')

tok = inputs['input_ids']

st_len = tok.shape[-1]

state = mlmodel_fp16.make_state()  # initialize the model state used across the generation loop

max_length = 100  # Maximum length of the generated response
eos_token_id = tokenizer.eos_token_id  # EOS token ID

temperature = 0.7  # Temperature parameter

while st_len < max_length:
    # Build the additive mask for the current (incremented) sequence length
    mask = np.full((1, st_len := st_len + 1), -1e9)
    mask = np.triu(mask, k=1)
    mask = np.hstack(
        [np.zeros((1, 1)), mask]
    )[None, None, :, :]

    input_dict = {
        'input_ids': tok.astype(np.int32),
        'causal_mask': mask.astype(np.int32)
    }

    preds = mlmodel_fp16.predict(input_dict, state=state)

    # Sample the next token from the temperature-scaled softmax of the logits
    logits = preds['logits']
    logits = logits / temperature
    probs = np.exp(logits) / np.sum(np.exp(logits))
    pre_toks = np.random.choice(logits.shape[-1], p=probs[0])

    # Append the sampled token and stop at the end-of-sequence token
    tok = np.concatenate([tok, [[pre_toks]]], axis=1)

    if pre_toks == eos_token_id:
        break

# Decode the generated tokens
output_text = tokenizer.decode(tok[0].tolist(), skip_special_tokens=True)

print(output_text)

but I get:

 hello how are you? (1) Theodds, Theodds, Theodds, Theodds, Theodds, Theodds, Theodds, Theodds, Theodds, Theodds, Theodds, Theodds, Theodds, Theodds, Theodds, Theodds, Theodds, Theodds, Theodds, Theodds, Theodds, Theodds, Theodds

Process finished with exit code 0
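
The repeating output may point at the decoding loop rather than the converted weights; one thing worth double-checking is that the sampling step operates on the logits of the most recent position with a numerically stable softmax. A minimal sketch of that step, purely as something to try, assuming logits has shape (1, seq_len, vocab_size):

# Take the logits of the last position, scale by temperature, and sample
# from a max-subtracted (numerically stable) softmax.
last = logits[0, -1, :] / temperature
last = last - last.max()
probs = np.exp(last) / np.exp(last).sum()
next_tok = int(np.random.choice(last.shape[-1], p=probs))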