nlpxucan / WizardLM

LLMs built upon Evol Instruct: WizardLM, WizardCoder, WizardMath

launch.py:428:sigkill_handler] Killing subprocess 31472 #93

Open lionday opened 1 year ago

lionday commented 1 year ago

When training WizardCoder, running it according to the training instructions results in an error. (My machine has 4x A100 GPUs.)

Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module utils...
Time to load utils op: 0.4519040584564209 seconds
Loading extension module utils...
Time to load utils op: 0.5020363330841064 seconds
Loading extension module utils...
Time to load utils op: 0.4016125202178955 seconds
Loading extension module utils...
Time to load utils op: 0.5018815994262695 seconds
Parameter Offload: Total persistent parameters: 2725888 in 322 params
[2023-06-25 10:47:37,222] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 31472
[2023-06-25 10:47:48,634] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 31473
[2023-06-25 10:48:00,042] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 31474
[2023-06-25 10:48:00,042] [INFO] [launch.py:428:sigkill_handler] Killing subprocess 31475

ChiYeungLaw commented 1 year ago

Since the log from "Allowing ninja..." through "...31475" does not contain the actual error message, it is hard for me to help. I think you need to make sure you have installed the environment correctly.
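A minimal sketch of one way to check the environment, assuming the training run depends on torch, deepspeed, transformers, and ninja (this package list is an assumption for illustration, not the repository's requirements file). It reports the installed version of each package, or None if the package is missing. DeepSpeed's own `ds_report` command prints a more detailed summary, including which of its ops are compatible with the installed toolchain.

```python
from importlib import metadata

# Packages assumed to matter for a DeepSpeed-based fine-tuning run.
# This list is illustrative only, not the repository's requirements file.
PACKAGES = ["torch", "deepspeed", "transformers", "ninja"]


def installed_versions(packages=PACKAGES):
    """Return {package: installed version string, or None if not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions().items():
        print(f"{name:>12}: {version or 'NOT INSTALLED'}")
```

Comparing this output against the versions the repository pins is a quick first check before digging into the launcher log.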

iawen commented 1 year ago

I've had a similar problem. You can check the system logs, for example with the dmesg command.
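A minimal sketch of what to look for in the dmesg output, assuming the workers were killed by the Linux out-of-memory (OOM) killer, which is one common reason a worker dies without printing a Python traceback. The kernel logs lines like "Out of memory: Killed process 31472 (python) ..." (older kernels write "Kill process"); the regex below is an assumption based on that format. Reading dmesg may require root on some systems.

```python
import re
import subprocess

# Matches OOM-killer lines such as
# "Out of memory: Killed process 31472 (python) total-vm:..."
# Older kernels print "Kill process" instead of "Killed process".
OOM_RE = re.compile(r"Out of memory: Kill(?:ed)? process (\d+)")


def oom_killed_pids(kernel_log: str) -> list[int]:
    """Return the PIDs the OOM killer reports having killed, in log order."""
    return [int(m.group(1)) for m in OOM_RE.finditer(kernel_log)]


def read_dmesg() -> str:
    """Read the kernel ring buffer; may need root (e.g. sudo) on some systems."""
    result = subprocess.run(["dmesg"], capture_output=True, text=True, check=True)
    return result.stdout


if __name__ == "__main__":
    # PIDs taken from the launcher's "Killing subprocess" lines in the report.
    training_pids = {31472, 31473, 31474, 31475}
    killed = oom_killed_pids(read_dmesg())
    hits = [pid for pid in killed if pid in training_pids]
    print("OOM-killed training workers:", hits or "none found")
```

If a training PID shows up, the host ran out of RAM rather than the job hitting a Python error; on the command line, `dmesg | grep -i "out of memory"` gives the same quick check.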