intel-analytics / ipex-llm

Accelerate local LLM inference and finetuning (LLaMA, Mistral, ChatGLM, Qwen, Mixtral, Gemma, Phi, MiniCPM, Qwen-VL, MiniCPM-V, etc.) on Intel XPU (e.g., local PC with iGPU and NPU, discrete GPU such as Arc, Flex and Max); seamlessly integrate with llama.cpp, Ollama, HuggingFace, LangChain, LlamaIndex, vLLM, GraphRAG, DeepSpeed, Axolotl, etc.
Apache License 2.0

Second text generation crashes when running chatglm2-6b inference with bigdl-llm and fp8 #9331

Open Fred-cell opened 12 months ago

Fred-cell commented 12 months ago

bigdl-core-xe 2.4.0b20231101
bigdl-core-xe-esimd 2.4.0b20231101
bigdl-llm 2.4.0b20231101

With an input prompt of length 2016 and max_new_tokens=1024, the first inference succeeds, but the second fails with a "Segmentation fault" error, as shown below:

Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 7/7 [00:09<00:00, 1.39s/it]
2023-11-01 22:15:38,504 - bigdl.llm.transformers.utils - INFO - Converting the current model to fp8 format......
<class 'transformers_modules.modeling_chatglm.ChatGLMForConditionalGeneration'>
/root/anaconda3/envs/bigdl-llm/lib/python3.10/site-packages/bigdl/llm/transformers/models/chatglm2.py:137: UserWarning: IPEX XPU dedicated fusion passes are enabled in ScriptGraph non profiling execution mode. Please enable profiling execution mode to retrieve device guard. (Triggered internally at /build/intel-pytorch-extension/csrc/gpu/jit/fusion_pass.cpp:826.)
  query_layer = apply_rotary_pos_emb(query_layer, rotary_pos_emb)
=========First token cost 6.7524 s=========
=========Rest tokens cost average 0.0391 s (574 tokens in all)=========
Segmentation fault (core dumped)

hkvision commented 12 months ago

@cyita Take a look at it?

cyita commented 12 months ago

Hi Fred, I was unable to reproduce your error. Could you try unsetting SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS?
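For anyone launching inference from a Python script rather than an interactive shell, the suggestion above might be sketched as follows. This is only an illustration: the helper name `unset_immediate_commandlists` is hypothetical, and the assumption that the variable must be removed before the GPU runtime is imported is not confirmed in this thread. In a shell, the equivalent would be `unset SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS`.

```python
import os

# Variable name taken from the comment above. Whether removing it avoids
# the segmentation fault has not been verified here.
VAR = "SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS"


def unset_immediate_commandlists(env=os.environ):
    """Remove the variable if it is set; return its previous value or None.

    Hypothetical helper for illustration, not part of bigdl-llm.
    """
    return env.pop(VAR, None)


# Assumption: call this before importing torch / bigdl.llm, so the runtime
# never sees the variable during its own initialization.
previous = unset_immediate_commandlists()
print(f"{VAR} was {previous!r}; still set: {VAR in os.environ}")
```

Removing the key from `os.environ` also affects child processes started afterwards, which matters if the model is launched through `subprocess`.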

cyita commented 12 months ago

You can raise the open file limit for the current shell using ulimit -n 3000.
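If changing the shell limit is inconvenient, roughly the same effect can be obtained from inside the Python process. The sketch below uses the standard-library `resource` module (Unix only). The helper name `raise_open_file_limit` is hypothetical, and the value 3000 is simply the one from the comment above. It raises only the soft limit and never exceeds the hard limit, so it does not need root privileges.

```python
import resource


def raise_open_file_limit(target=3000):
    """Raise the soft RLIMIT_NOFILE to `target`, capped at the hard limit.

    Hypothetical helper for illustration. Returns the soft limit in effect.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    # Leave the limit alone if it is already unlimited or high enough.
    if soft == resource.RLIM_INFINITY or soft >= target:
        return soft
    # An unprivileged process cannot raise the soft limit above the hard one.
    new_soft = target if hard == resource.RLIM_INFINITY else min(target, hard)
    resource.setrlimit(resource.RLIMIT_NOFILE, (new_soft, hard))
    return new_soft


print("open file soft limit:", raise_open_file_limit(3000))
```

The change applies only to the current process and its children, much as `ulimit -n` applies only to the current shell session.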