When I try to use this demo:
```python
from transformers import AutoTokenizer, TextStreamer
from intel_extension_for_transformers.transformers import AutoModelForCausalLM

model_name = "/itrex/neural-chat-7b-v1-1"  # Hugging Face model_id or local model
prompt = "Once upon a time, there existed a little girl,"

tokenizer = AutoTokenizer.from_pretrained(model_name, local_files_only=True)
inputs = tokenizer(prompt, return_tensors="pt").input_ids
streamer = TextStreamer(tokenizer)

model = AutoModelForCausalLM.from_pretrained(model_name, local_files_only=True, load_in_4bit=True, use_llm_runtime=True)
outputs = model.generate(inputs, streamer=streamer, max_new_tokens=30)
```
The logs I get are as follows:
```
model_quantize_internal: model size = 26148.23 MB
model_quantize_internal: quant size = 4884.87 MB
AVX:1 AVX2:1 AVX512F:1 AVX_VNNI:0 AVX512_VNNI:1 AMX_INT8:0 AMX_BF16:0 AVX512_BF16:0 AVX512_FP16:0
beam_size: 1, do_sample: 0, top_k: 40, top_p: 0.950000
model.cpp: loading model from runtime_outs/ne_mpt_q_int4_jblas_cint8_g32.bin
init: n_vocab = 50279
init: n_embd  = 4096
init: n_mult  = 4096
init: n_head  = 32
init: n_layer = 32
init: n_rot   = 32
init: n_ff    = 16384
init: n_parts = 1
load: ne ctx size = 4884.93 MB
load: mem required = 13076.93 MB (+ memory per state)
Segmentation fault (core dumped)
```
Why is this happening?
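To help isolate it, a variant I can also try is bypassing the C++ LLM runtime entirely (just a sketch; it assumes `use_llm_runtime=False` falls back to the non-runtime 4-bit path for this checkpoint):

```python
from transformers import AutoTokenizer, TextStreamer
from intel_extension_for_transformers.transformers import AutoModelForCausalLM

model_name = "/itrex/neural-chat-7b-v1-1"
prompt = "Once upon a time, there existed a little girl,"

tokenizer = AutoTokenizer.from_pretrained(model_name, local_files_only=True)
inputs = tokenizer(prompt, return_tensors="pt").input_ids
streamer = TextStreamer(tokenizer)

# Same 4-bit load, but with the LLM runtime disabled, to check whether
# the segmentation fault is specific to the runtime path (assumption:
# this flag combination is supported for this checkpoint).
model = AutoModelForCausalLM.from_pretrained(
    model_name, local_files_only=True, load_in_4bit=True, use_llm_runtime=False
)
outputs = model.generate(inputs, streamer=streamer, max_new_tokens=30)
```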
I have confirmed that there is more than enough RAM to load the model:
```
(neuralchat) root@9a18ae20d7ff:/intel-extension-for-transformers/script# free -h
               total        used        free      shared  buff/cache   available
Mem:           251Gi       7.2Gi       111Gi        34Mi       131Gi       242Gi
Swap:          976Mi          0B       976Mi
```
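If it helps, I can also rerun with Python's `faulthandler` enabled so that the SIGSEGV at least dumps the Python-level traceback (a minimal sketch; the crashing native frame itself will not appear in that dump):

```python
import faulthandler

# Install fault handlers so that a SIGSEGV prints the Python-level
# traceback before the process dies; this goes at the very top of
# the demo script, before the model is loaded.
faulthandler.enable()
```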