Second text generation crashes when running chatglm2-6b inference with bigdl-llm and fp8

Fred-cell commented 12 months ago

bigdl-core-xe 2.4.0b20231101
bigdl-core-xe-esimd 2.4.0b20231101
bigdl-llm 2.4.0b20231101

Input prompt 2016, max_new_tokens=1024. The first inference is OK; the second fails with a "Segmentation fault" error, as shown below:

Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 7/7 [00:09<00:00, 1.39s/it]
2023-11-01 22:15:38,504 - bigdl.llm.transformers.utils - INFO - Converting the current model to fp8 format......
<class 'transformers_modules.modeling_chatglm.ChatGLMForConditionalGeneration'>
/root/anaconda3/envs/bigdl-llm/lib/python3.10/site-packages/bigdl/llm/transformers/models/chatglm2.py:137: UserWarning: IPEX XPU dedicated fusion passes are enabled in ScriptGraph non profiling execution mode. Please enable profiling execution mode to retrieve device guard. (Triggered internally at /build/intel-pytorch-extension/csrc/gpu/jit/fusion_pass.cpp:826.)
  query_layer = apply_rotary_pos_emb(query_layer, rotary_pos_emb)
=========First token cost 6.7524 s=========
=========Rest tokens cost average 0.0391 s (574 tokens in all)=========
Segmentation fault (core dumped)

hkvision commented 12 months ago

@cyita Could you take a look at this?

cyita commented 12 months ago

Hi Fred, I could not reproduce your error. Could you try unsetting SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS and running again?
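For anyone who launches the test from a Python script rather than a shell, a minimal sketch of the same idea is below. It removes the variable from the current process's environment. The helper name unset_env_var is made up for illustration, and it is an assumption that removing the variable in-process, before the GPU runtime is imported, has the same effect as unsetting it in the shell.

```python
import os


def unset_env_var(name: str) -> bool:
    """Remove `name` from this process's environment.

    Returns True if the variable was set, False if it was already absent.
    Hypothetical helper, not part of bigdl-llm.
    """
    return os.environ.pop(name, None) is not None


# Assumption: this runs before any import that initializes the GPU runtime.
was_set = unset_env_var("SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS")
print("variable was set:", was_set)
```

If the script is always started from a shell, unsetting the variable there before launching is the more direct route.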

cyita commented 12 months ago

You can raise the system open file limit using ulimit -n 3000.
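If changing the shell limit is inconvenient, a rough in-process equivalent can be sketched with Python's standard resource module (Unix only). The function name raise_open_file_limit is hypothetical. The sketch only raises the soft limit, never lowers it, and it cannot go above the hard limit without extra privileges, so the requested value is capped there.

```python
import resource


def raise_open_file_limit(target: int = 3000) -> int:
    """Raise the soft open-file limit of this process toward `target`.

    Returns the soft limit in effect afterwards. Hypothetical helper.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)

    # Cap the request at the hard limit unless the hard limit is unlimited.
    if hard != resource.RLIM_INFINITY:
        target = min(target, hard)

    # Leave the limit alone if it is already unlimited or high enough.
    if soft == resource.RLIM_INFINITY or soft >= target:
        return soft

    resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
    return resource.getrlimit(resource.RLIMIT_NOFILE)[0]


print("soft open-file limit:", raise_open_file_limit(3000))
```

The new limit applies only to the current process and its children, so this has to run in the same process that loads the model.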

intel-analytics / ipex-llm #9331