intel / intel-extension-for-transformers


Segmentation fault (core dumped) #917

Closed · mryvae closed this issue 11 months ago

mryvae commented 11 months ago

When I try to run this demo:

```python
from transformers import AutoTokenizer, TextStreamer
from intel_extension_for_transformers.transformers import AutoModelForCausalLM

model_name = "/itrex/neural-chat-7b-v1-1"  # Hugging Face model_id or local model
prompt = "Once upon a time, there existed a little girl,"

tokenizer = AutoTokenizer.from_pretrained(model_name, local_files_only=True)
inputs = tokenizer(prompt, return_tensors="pt").input_ids
streamer = TextStreamer(tokenizer)

model = AutoModelForCausalLM.from_pretrained(model_name, local_files_only=True,
                                             load_in_4bit=True, use_llm_runtime=True)
outputs = model.generate(inputs, streamer=streamer, max_new_tokens=30)
```

The logs I get are as follows:

```
model_quantize_internal: model size = 26148.23 MB
model_quantize_internal: quant size = 4884.87 MB
AVX:1 AVX2:1 AVX512F:1 AVX_VNNI:0 AVX512_VNNI:1 AMX_INT8:0 AMX_BF16:0 AVX512_BF16:0 AVX512_FP16:0
beam_size: 1, do_sample: 0, top_k: 40, top_p: 0.950000
model.cpp: loading model from runtime_outs/ne_mpt_q_int4_jblas_cint8_g32.bin
init: n_vocab = 50279
init: n_embd  = 4096
init: n_mult  = 4096
init: n_head  = 32
init: n_layer = 32
init: n_rot   = 32
init: n_ff    = 16384
init: n_parts = 1
load: ne ctx size = 4884.93 MB
load: mem required = 13076.93 MB (+ memory per state)
Segmentation fault (core dumped)
```
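Judging from the log, the crash seems to happen during or right after loading the quantized weights (the last line printed is `load: mem required = 13076.93 MB`). To at least see which Python call is active when the native code crashes, CPython's standard-library `faulthandler` can be enabled at the top of the script; this is a debugging sketch, not a fix:

```python
# Add at the very top of the demo script; on SIGSEGV, CPython prints the
# Python-level traceback to stderr before the process dies.
import faulthandler
faulthandler.enable()
```

The same effect is available without editing the script by running `python -X faulthandler demo.py`.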

Why is this happening?

I have confirmed that the RAM is large enough to load the LLM model:

```
(neuralchat) root@9a18ae20d7ff:/intel-extension-for-transformers/script# free -h
               total        used        free      shared  buff/cache   available
Mem:           251Gi       7.2Gi       111Gi        34Mi       131Gi       242Gi
Swap:          976Mi          0B       976Mi
```
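One other variable worth ruling out is a stale or partially written quantized cache file. The path below is taken verbatim from the log above; that deleting it forces the runtime to re-quantize on the next run is an assumption on my part, suggested by the `model_quantize_internal` lines:

```python
import os

# Path taken from the log; remove a possibly corrupt cached quantization
# so the next run rebuilds it from the original weights (assumption: the
# runtime regenerates this file automatically when it is missing).
cached = "runtime_outs/ne_mpt_q_int4_jblas_cint8_g32.bin"
if os.path.exists(cached):
    os.remove(cached)
```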

kunger97 commented 11 months ago

Have you solved this problem? If so, how?