intel / neural-speed

An innovative library for efficient LLM inference via low-bit quantization
https://github.com/intel/neural-speed
Apache License 2.0

Error at Colab Inference of neural-chat-7b-v3-1 Model #90

Closed dopc closed 9 months ago

dopc commented 10 months ago

Hey! Thanks for the great project and for sharing it with the community.

I am trying to run inference with the HF neural-chat model.

What I tried

In Colab,

!pip install intel-extension-for-transformers intel-extension-for-pytorch accelerate datasets neural-speed -q

from transformers import AutoTokenizer, TextStreamer
from intel_extension_for_transformers.transformers import AutoModelForCausalLM

model_name = "Intel/neural-chat-7b-v3-1"     # Hugging Face model_id or local model
prompt = "Once upon a time, there existed a little girl,"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
inputs = tokenizer(prompt, return_tensors="pt").input_ids
streamer = TextStreamer(tokenizer)           # print tokens as they are generated

model = AutoModelForCausalLM.from_pretrained(model_name, load_in_4bit=True)  # 4-bit weight-only quantization
outputs = model.generate(inputs, streamer=streamer, max_new_tokens=300)

Behaviour

I got an error; the full trace is below.

No CUDA runtime is found, using CUDA_HOME='/usr/local/cuda'
/usr/local/lib/python3.10/dist-packages/huggingface_hub/utils/_token.py:88: UserWarning: 
The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
  warnings.warn(
2024-01-24 07:20:44 [INFO] cpu device is used.
2024-01-24 07:20:44 [INFO] Applying Weight Only Quantization.
2024-01-24 07:20:44 [INFO] Using LLM runtime.

cmd: ['python', PosixPath('/usr/local/lib/python3.10/dist-packages/neural_speed/convert/convert_mistral.py'), '--outfile', 'runtime_outs/ne_mistral_f32.bin', '--outtype', 'f32', 'Intel/neural-chat-7b-v3-1']

---------------------------------------------------------------------------

AssertionError                            Traceback (most recent call last)

<ipython-input-2-c1da0f81c837> in <cell line: 10>()
      8 streamer = TextStreamer(tokenizer)
      9 
---> 10 model = AutoModelForCausalLM.from_pretrained(model_name, load_in_4bit=True)
     11 outputs = model.generate(inputs, streamer=streamer, max_new_tokens=300)

1 frames

/usr/local/lib/python3.10/dist-packages/neural_speed/__init__.py in init(self, model_name, use_quant, use_cache, use_gptq, use_awq, weight_dtype, alg, group_size, scale_dtype, compute_dtype, use_ggml)
    115         if not use_cache or not os.path.exists(fp32_bin):
    116             convert_model(model_name, fp32_bin, "f32")
--> 117             assert os.path.exists(fp32_bin), "Fail to convert pytorch model"
    118 
    119         if not use_quant:

AssertionError: Fail to convert pytorch model
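For what it's worth, the assert hides the converter's own error, so the failing step can be re-run in isolation with the exact cmd from the log above. A minimal sketch (paths copied verbatim from the trace; adjust for your environment):

import subprocess

# Re-run the converter that from_pretrained invoked, so its own error message
# is visible instead of the generic "Fail to convert pytorch model" assert.
subprocess.run([
    "python",
    "/usr/local/lib/python3.10/dist-packages/neural_speed/convert/convert_mistral.py",
    "--outfile", "runtime_outs/ne_mistral_f32.bin",
    "--outtype", "f32",
    "Intel/neural-chat-7b-v3-1",
], check=True)  # check=True raises immediately if the conversion fails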

What I ask

  1. Why do I get this error when running the official example?
  2. How can I fix it?

Thanks.

Zhenzhong1 commented 10 months ago

@dopc Hi, the reason is that we previously did not support loading this model directly from the HF Hub; model_name = "Intel/neural-chat-7b-v3-1" had to be a local path.
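(For anyone stuck on an older release, a minimal workaround sketch along those lines: download the checkpoint to a local directory first, then pass that path instead of the hub id. snapshot_download here comes from huggingface_hub, not from neural-speed.)

from huggingface_hub import snapshot_download
from intel_extension_for_transformers.transformers import AutoModelForCausalLM

# Fetch the weights locally, since the pre-#93 converter expected a
# local checkpoint directory rather than a Hugging Face model id.
local_path = snapshot_download(repo_id="Intel/neural-chat-7b-v3-1")
model = AutoModelForCausalLM.from_pretrained(local_path, load_in_4bit=True)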

We have added this feature in https://github.com/intel/neural-speed/pull/93; both a local path and a HF model id are now supported.

Please reinstall Neural Speed from source and try the HF model id again.
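(In Colab, one way to do the source reinstall is a direct pip VCS install; this is a sketch, not the project's documented install path, so check the repo's README if the build fails.)

!pip uninstall -y neural-speed
!pip install -q git+https://github.com/intel/neural-speed.git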

My tests: [screenshots of successful runs with the HF model id attached]

dopc commented 10 months ago

Thanks @Zhenzhong1 👍

Zhenzhong1 commented 9 months ago

You are welcome~

I closed this issue. If you have more questions, please feel free to ask.