intel / neural-speed

An innovative library for efficient LLM inference via low-bit quantization
https://github.com/intel/neural-speed
Apache License 2.0

Loading checkpoint shards takes too long #251

Open irjawais opened 4 months ago

irjawais commented 4 months ago

When I load the "meta-llama/Meta-Llama-3-8B-Instruct" model like this:

from transformers import AutoTokenizer, TextStreamer
from intel_extension_for_transformers.transformers import AutoModelForCausalLM

model_name = "meta-llama/Meta-Llama-3-8B-Instruct"  # Hugging Face model_id or local model
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
streamer = TextStreamer(tokenizer)
model = AutoModelForCausalLM.from_pretrained(model_name, load_in_4bit=True)

it hangs, and the only way to recover is to restart the instance.

Is there an issue with my spec?

My instance spec: Ubuntu, 32 GB RAM.

irjawais commented 4 months ago

warnings.warn(
Loading checkpoint shards:  75%|█████████████████████████████████████████████████████████████████████████████████████████████████████████          | 3/4 [01:53<00:37, 37.72s/it]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.10/dist-packages/intel_extension_for_transformers/transformers/modeling/modeling_auto.py", line 593, in from_pretrained
    model.init(  # pylint: disable=E1123
  File "/usr/local/lib/python3.10/dist-packages/neural_speed/__init__.py", line 182, in init
    assert os.path.exists(fp32_bin), "Fail to convert pytorch model"
AssertionError: Fail to convert pytorch model
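The traceback shows that from_pretrained first converts the checkpoint to an fp32 binary on disk (the fp32_bin the assert checks for) before quantizing, so the assertion fires when that intermediate file was never written, which usually means the conversion died part-way, e.g. killed by the kernel OOM killer. A back-of-envelope sizing sketch follows (the ~8B parameter count is the published figure for Llama-3-8B; the rest is assumption, not from this thread):

    # Rough footprint of materializing the model in fp32.
    n_params = 8.0e9              # approx. parameter count of Meta-Llama-3-8B-Instruct
    fp32_bytes = n_params * 4     # 4 bytes per float32 parameter
    print(f"~{fp32_bytes / 2**30:.0f} GiB")   # ~30 GiB

On top of that, the bf16 shards being loaded already need roughly half as much again, so peak usage during conversion can plausibly exceed a 32 GB instance.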

intellinjun commented 4 months ago

@irjawais Can you check the memory usage when converting the model? From your description, it seems that there may be insufficient memory.
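One way to check is to log RAM usage from a background thread while the conversion runs. A minimal sketch, assuming psutil is installed (pip install psutil); the 2-second interval is an arbitrary choice:

    import threading
    import time

    import psutil

    def log_memory(interval_s: float = 2.0) -> None:
        # Periodically print system RAM usage so a spike during the
        # fp32 conversion step shows up in the console.
        while True:
            vm = psutil.virtual_memory()
            print(f"RAM: {vm.used / 2**30:.1f} / {vm.total / 2**30:.1f} GiB ({vm.percent}%)")
            time.sleep(interval_s)

    threading.Thread(target=log_memory, daemon=True).start()
    # ...then run the from_pretrained(...) call from the snippet above.

If the Python process is killed outright rather than hanging, dmesg | grep -i oom on the host should show whether the kernel OOM killer stepped in.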