intel / xFasterTransformer


Crash when using CB mode with multi-rank #440

Closed a3213105 closed 1 month ago

a3213105 commented 1 month ago

`RUN_WORKLOAD="python /root/test.py -m /mnt/nvme1/llm_model/chatglm3-6b-32k-cpu/ -t /mnt/nvme1/llm_model/chatglm3-6b-32k -d bf16 --kv_cache_dtype int8 -c 1"

OMP_NUM_THREADS=10 LD_PRELOAD=libiomp5.so mpirun \
  -n 1 numactl -N 0 -p 8 ${RUN_WORKLOAD} : \
  -n 1 numactl -N 1 -p 9 ${RUN_WORKLOAD} : \
  -n 1 numactl -N 2 -p 10 ${RUN_WORKLOAD} : \
  -n 1 numactl -N 3 -p 11 ${RUN_WORKLOAD} : \
  -n 1 numactl -N 4 -p 12 ${RUN_WORKLOAD} : \
  -n 1 numactl -N 5 -p 13 ${RUN_WORKLOAD} : \
  -n 1 numactl -N 6 -p 14 ${RUN_WORKLOAD} : \
  -n 1 numactl -N 7 -p 15 ${RUN_WORKLOAD}`
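For anyone trying to reproduce this with a different rank count, the launch line above follows a regular pattern: rank i is bound to NUMA node i with `numactl -N` and prefers memory node i + 8 with `numactl -p`. Below is a minimal sketch that rebuilds the same command string. The helper name `build_mpirun_command` and its `mem_node_offset` parameter are hypothetical and not part of xFasterTransformer or of the original script.

```python
def build_mpirun_command(workload, num_ranks=8, mem_node_offset=8):
    """Return an MPMD-style mpirun command with one rank per NUMA node.

    Hypothetical helper: it only mirrors the pattern of the command
    in this report and does not launch anything.
    """
    segments = []
    for rank in range(num_ranks):
        # Rank i is bound to CPU node i (-N) and prefers memory node
        # i + mem_node_offset (-p), matching "-N 0 -p 8" ... "-N 7 -p 15".
        segments.append(
            f"-n 1 numactl -N {rank} -p {rank + mem_node_offset} {workload}"
        )
    # mpirun separates per-rank sub-commands with " : ".
    return "mpirun " + " : ".join(segments)


cmd = build_mpirun_command("${RUN_WORKLOAD}", num_ranks=8)
# Print one sub-command per line, with shell line continuations.
print(cmd.replace(" : ", " : \\\n  "))
```

Calling it with `num_ranks=1` or `num_ranks=2` gives a smaller launch line, which may help check whether the crash depends on the number of ranks.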

crash results:

ENABLE_TUNED_COMM is enabled for faster reduceAdd.
ENABLE_TUNED_COMM is enabled for faster reduceAdd.
ENABLE_TUNED_COMM is enabled for faster reduceAdd.
ENABLE_TUNED_COMM is enabled for faster reduceAdd.
ENABLE_TUNED_COMM is enabled for faster reduceAdd.
ENABLE_TUNED_COMM is enabled for faster reduceAdd.
ENABLE_TUNED_COMM is enabled for faster reduceAdd.
ENABLE_TUNED_COMM is enabled for faster reduceAdd.
[hbm01:50418:0:50418] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x563774802740)
[hbm01:50419:0:50419] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x59c87fe71840)
[hbm01:50420:0:50420] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x6226e4433d80)
[hbm01:50424:0:50424] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x58213b842bc0)
[hbm01:50425:0:50425] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x6385bc3d80c0)
[hbm01:50421:0:50421] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x63b865c08d80)
[hbm01:50422:0:50422] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x603dd065ff80)
[hbm01:50423:0:50423] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x63c50dcbc900)
malloc(): corrupted top size
malloc(): corrupted top size
malloc(): corrupted top size
malloc(): corrupted top size
malloc(): corrupted top size
==== backtrace (tid: 50424) ====
0 /lib/x86_64-linux-gnu/libucs.so.0(ucs_handle_error+0x2e4) [0x72d6c3a73fc4]
1 /lib/x86_64-linux-gnu/libucs.so.0(+0x24fec) [0x72d6c3a77fec]
2 /lib/x86_64-linux-gnu/libucs.so.0(+0x251aa) [0x72d6c3a781aa]
3 /lib/x86_64-linux-gnu/libc.so.6(+0x42520) [0x72d6fb842520]
4 /lib/x86_64-linux-gnu/libc.so.6(+0x1a6b55) [0x72d6fb9a6b55]
5 /root/miniforge3/envs/xft/lib/python3.9/site-packages/xfastertransformer-1.7.0-py3.9-linux-x86_64.egg/xfastertransformer/libxfastertransformer_pt.so(_ZN3xft5Model7forwardEb+0x71b) [0x72d6bb7c133d]
6 /root/miniforge3/envs/xft/lib/python3.9/site-packages/xfastertransformer-1.7.0-py3.9-linux-x86_64.egg/xfastertransformer/libxfastertransformer_pt.so(_ZN14TorchAutoModel9forwardCBEv+0x65) [0x72d6bb78142d]
7 /root/miniforge3/envs/xft/lib/python3.9/site-packages/xfastertransformer-1.7.0-py3.9-linux-x86_64.egg/xfastertransformer/libxfastertransformer_pt.so(_ZSt13invoke_implIN2at6TensorERKM14TorchAutoModelFS1_vERS2_JEET_St19invoke_memfun_refOT0_OT1DpOT2+0x84) [0x72d6bb7b42b7]
8 /root/miniforge3/envs/xft/lib/python3.9/site-packages/xfastertransformer-1.7.0-py3.9-linux-x86_64.egg/xfastertransformer/libxfastertransformer_pt.so(_ZSt8invokeIRKM14TorchAutoModelFN2at6TensorEvEJRS0_EENSt15invoke_resultIT_JDpT0_EE4typeEOS9DpOSA+0x58) [0x72d6bb7b31f4]
9 /root/miniforge3/envs/xft/lib/python3.9/site-packages/xfastertransformer-1.7.0-py3.9-linux-x86_64.egg/xfastertransformer/libxfastertransformer_pt.so(_ZNKSt12_Mem_fn_baseIM14TorchAutoModelFN2at6TensorEvELb1EEclIJRS0_EEEDTcl8invokedtdefpT6_M_pmfspcl7forwardIT_EfpEEEDpOS8+0x49) [0x72d6bb7b2313]
10 /root/miniforge3/envs/xft/lib/python3.9/site-packages/xfastertransformer-1.7.0-py3.9-linux-x86_64.egg/xfastertransformer/libxfastertransformer_pt.so(_ZN3c104guts6invokeIRM14TorchAutoModelFN2at6TensorEvEJRS2_EEENSt9enable_ifIX19is_member_pointer_vINSt5decayIT_E4typeEEENSt13invoke_resultISB_JDpT0_EE4typeEE4typeEOSBDpOSF+0x6f) [0x72d6bb7b07b5]
11 /root/miniforge3/envs/xft/lib/python3.9/site-packages/xfastertransformer-1.7.0-py3.9-linux-x86_64.egg/xfastertransformer/libxfastertransformer_pt.so(_ZN5torch6detail10WrapMethodIM14TorchAutoModelFN2at6TensorEvEEclEN3c1013intrusive_ptrIS2_NS8_6detail34intrusive_target_default_null_typeIS2_EEEE+0x49) [0x72d6bb7ae54b]
12 /root/miniforge3/envs/xft/lib/python3.9/site-packages/xfastertransformer-1.7.0-py3.9-linux-x86_64.egg/xfastertransformer/libxfastertransformer_pt.so(_ZN5torch6detail32call_torchbind_method_from_stackINS0_10WrapMethodIM14TorchAutoModelFN2at6TensorEvEEELb0EJLm0EEEEN3c104guts23infer_function_traits_t11return_typeERT_RSt6vectorINS9_6IValueESaISG_EESt16integer_sequenceImJXspT1_EEE+0x6f) [0x72d6bb7aa9f4]
13 /root/miniforge3/envs/xft/lib/python3.9/site-packages/xfastertransformer-1.7.0-py3.9-linux-x86_64.egg/xfastertransformer/libxfastertransformer_pt.so(_ZN5torch6detail32call_torchbind_method_from_stackINS0_10WrapMethodIM14TorchAutoModelFN2at6TensorEvEEELb0EEEN3c104guts23infer_function_traits_t11return_typeERT_RSt6vectorINS9_6IValueESaISG_EE+0x46) [0x72d6bb7a3811]
14 /root/miniforge3/envs/xft/lib/python3.9/site-packages/xfastertransformer-1.7.0-py3.9-linux-x86_64.egg/xfastertransformer/libxfastertransformer_pt.so(_ZN5torch6detail10BoxedProxyIN2at6TensorENS0_10WrapMethodIM14TorchAutoModelFS3_vEEEEclERSt6vectorIN3c106IValueESaISCEERS8+0x3f) [0x72d6bb79e5bf]
15 /root/miniforge3/envs/xft/lib/python3.9/site-packages/xfastertransformer-1.7.0-py3.9-linux-x86_64.egg/xfastertransformer/libxfastertransformer_pt.so(_ZZN5torch6class_I14TorchAutoModelE12defineMethodINS_6detail10WrapMethodIMS1_FN2at6TensorEvEEEEEPNS_3jit8FunctionESsT_SsSt16initializer_listINS_3argEEENUlRSt6vectorIN3c106IValueESaISK_EEEclESN+0x3a) [0x72d6bb796686]
16 /root/miniforge3/envs/xft/lib/python3.9/site-packages/xfastertransformer-1.7.0-py3.9-linux-x86_64.egg/xfastertransformer/libxfastertransformer_pt.so(_ZSt13invoke_implIvRZN5torch6class_I14TorchAutoModelE12defineMethodINS0_6detail10WrapMethodIMS2_FN2at6TensorEvEEEEEPNS0_3jit8FunctionESsT_SsSt16initializer_listINS0_3argEEEUlRSt6vectorIN3c106IValueESaISL_EEE_JSO_EESF_St14invoke_otherOT0DpOT1+0x3b) [0x72d6bb7b084d]
17 /root/miniforge3/envs/xft/lib/python3.9/site-packages/xfastertransformer-1.7.0-py3.9-linux-x86_64.egg/xfastertransformer/libxfastertransformer_pt.so(_ZSt10invoke_rIvRZN5torch6class_I14TorchAutoModelE12defineMethodINS0_6detail10WrapMethodIMS2_FN2at6TensorEvEEEEEPNS0_3jit8FunctionESsT_SsSt16initializer_listINS0_3argEEEUlRSt6vectorIN3c106IValueESaISL_EEE_JSO_EENSt9enable_ifIX16is_invocable_r_vISF_T0_DpT1_EESF_E4typeEOSSDpOST+0x3b) [0x72d6bb7ae623]
18 /root/miniforge3/envs/xft/lib/python3.9/site-packages/xfastertransformer-1.7.0-py3.9-linux-x86_64.egg/xfastertransformer/libxfastertransformer_pt.so(_ZNSt17_Function_handlerIFvRSt6vectorIN3c106IValueESaIS2_EEEZN5torch6class_I14TorchAutoModelE12defineMethodINS7_6detail10WrapMethodIMS9_FN2at6TensorEvEEEEEPNS7_3jit8FunctionESsT_SsSt16initializer_listINS7_3argEEEUlS5_E_E9_M_invokeERKSt9_AnydataS5+0x3b) [0x72d6bb7aaac1]
19 /root/miniforge3/envs/xft/lib/python3.9/site-packages/xfastertransformer-1.7.0-py3.9-linux-x86_64.egg/xfastertransformer/libxfastertransformer_pt.so(_ZNKSt8functionIFvRSt6vectorIN3c106IValueESaIS2EEEEclES5+0x4d) [0x72d6bb7884c1]
20 /root/miniforge3/envs/xft/lib/python3.9/site-packages/xfastertransformer-1.7.0-py3.9-linux-x86_64.egg/xfastertransformer/libxfastertransformer_pt.so(_ZN5torch3jit17BuiltinOpFunction3runERSt6vectorIN3c106IValueESaIS4_EE+0x2b) [0x72d6bb77d89d]
21 /root/miniforge3/envs/xft/lib/python3.9/site-packages/torch/lib/libtorch_python.so(+0xa10c6e) [0x72d6fa010c6e]
22 /root/miniforge3/envs/xft/lib/python3.9/site-packages/torch/lib/libtorch_python.so(+0xaf4581) [0x72d6fa0f4581]
23 /root/miniforge3/envs/xft/lib/python3.9/site-packages/torch/lib/libtorch_python.so(+0xab54fa) [0x72d6fa0b54fa]
24 /root/miniforge3/envs/xft/lib/python3.9/site-packages/torch/lib/libtorch_python.so(+0xab5728) [0x72d6fa0b5728]
25 /root/miniforge3/envs/xft/lib/python3.9/site-packages/torch/lib/libtorch_python.so(+0x4847bf) [0x72d6f9a847bf]
26 python(+0x15ef25) [0x58212cd0ef25]
27 python(_PyObject_MakeTpCall+0x316) [0x58212ccf5ba6]
28 python(+0x1a0791) [0x58212cd50791]
29 python(_PyObject_Call+0x10b) [0x58212cd0f32b]
30 python(+0xb9a38) [0x58212cc69a38]
31 python(_PyObject_MakeTpCall+0x316) [0x58212ccf5ba6]
32 python(_PyEval_EvalFrameDefault+0x535b) [0x58212cd93dbb]
33 python(_PyFunction_Vectorcall+0x19a) [0x58212cd4f88a]
34 python(_PyEval_EvalFrameDefault+0x609) [0x58212cd8f069]
35 python(_PyFunction_Vectorcall+0x19a) [0x58212cd4f88a]
36 python(_PyEval_EvalFrameDefault+0x3bc) [0x58212cd8ee1c]
37 python(+0x138550) [0x58212cce8550]
38 python(_PyEval_EvalCodeWithName+0x47) [0x58212cdcf047]
39 python(PyEval_EvalCodeEx+0x39) [0x58212cdcf089]
40 python(PyEval_EvalCode+0x1b) [0x58212cdcf0ab]
41 python(+0x251909) [0x58212ce01909]
42 python(+0x28c3a4) [0x58212ce3c3a4]
43 python(+0x118d33) [0x58212ccc8d33]
44 python(PyRun_SimpleFileExFlags+0x19c) [0x58212ce4683c]
45 python(Py_RunMain+0x395) [0x58212ce46f05]
46 python(Py_BytesMain+0x39) [0x58212ce47059]
47 /lib/x86_64-linux-gnu/libc.so.6(+0x29d90) [0x72d6fb829d90]
48 /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80) [0x72d6fb829e40]
49 python(+0x20bf1d) [0x58212cdbbf1d]
malloc(): corrupted top size
malloc(): corrupted top size

===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 0 PID 50418 RUNNING AT hbm01
= KILLED BY SIGNAL: 6 (Aborted)

===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 1 PID 50419 RUNNING AT hbm01
= KILLED BY SIGNAL: 6 (Aborted)

===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 2 PID 50420 RUNNING AT hbm01
= KILLED BY SIGNAL: 6 (Aborted)

===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 3 PID 50421 RUNNING AT hbm01
= KILLED BY SIGNAL: 6 (Aborted)

===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 4 PID 50422 RUNNING AT hbm01
= KILLED BY SIGNAL: 6 (Aborted)

===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 5 PID 50423 RUNNING AT hbm01
= KILLED BY SIGNAL: 6 (Aborted)

===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 6 PID 50424 RUNNING AT hbm01
= KILLED BY SIGNAL: 11 (Segmentation fault)

===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 7 PID 50425 RUNNING AT hbm01
= KILLED BY SIGNAL: 6 (Aborted)
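
The termination summary above is easier to compare across runs once it is reduced to one row per rank. The following is a rough sketch of a parser for the `BAD TERMINATION` blocks; the regular expression is inferred only from the lines shown in this report, and the function name `summarize_terminations` is hypothetical. The `sample` string repeats two entries from the log for illustration.

```python
import re
from collections import Counter

# Pattern inferred from the summary lines in this report. It tolerates
# either a space or a line break between the RANK and KILLED fields.
TERMINATION_RE = re.compile(
    r"RANK (\d+) PID (\d+) RUNNING AT (\S+)\s*=\s*"
    r"KILLED BY SIGNAL: (\d+) \(([^)]+)\)"
)


def summarize_terminations(log_text):
    """Return one dict per terminated rank found in log_text."""
    return [
        {"rank": int(r), "pid": int(p), "host": h, "signal": int(s), "name": n}
        for r, p, h, s, n in TERMINATION_RE.findall(log_text)
    ]


# Two entries copied from the log above, used as sample input.
sample = (
    "= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES "
    "= RANK 5 PID 50423 RUNNING AT hbm01 = KILLED BY SIGNAL: 6 (Aborted)\n"
    "= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES\n"
    "= RANK 6 PID 50424 RUNNING AT hbm01\n"
    "= KILLED BY SIGNAL: 11 (Segmentation fault)\n"
)

rows = summarize_terminations(sample)
by_signal = Counter(row["name"] for row in rows)
for row in rows:
    print(row["rank"], row["pid"], row["signal"], row["name"])
```

Run against the full log, this should show that rank 6 (PID 50424, the thread in the backtrace) is the only one reported as killed by signal 11, while the other seven ranks are reported as aborted with signal 6.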