Hi Philipp @philschmid,
Thank you for the wonderful tutorial. Over the past few days I have been using your code to fine-tune Flan-T5-xxl on my own corpus with DeepSpeed, as you show here. I am using the same configuration as in your code, e.g., 8×A100, 100 CPU cores, 500 Mem. I have prepared my datasets and everything else is ready.
Almost everything works, but I am still one step away from success: when I execute 'deepspeed --num_gpus=8 scripts/run_seq2seq_deepspeed.py', everything seems to go fine at first. But as soon as the 8 GPUs start working (once the data has been loaded), the process is immediately terminated (killed).
I am fairly sure that all the environment requirements you noted are satisfied (batch size 8), but I cannot tell where the problem is. I do not understand the meaning of error code -7 (as in the last line above), even after searching extensively online. I am not familiar with the underlying mechanics of GPU parallelism. From your viewpoint, what are the most probable causes?
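
One possible way to read the exit code, as a hedged sketch only: it assumes the launcher reports the worker's return code using the Python `subprocess` convention, where a negative value -N means the child was terminated by signal N. That assumption is not confirmed by the log, and the helper name `describe_exit_code` is hypothetical, introduced only for illustration.

```python
import signal


def describe_exit_code(code: int) -> str:
    """Describe a return code, assuming the Python subprocess convention."""
    if code == 0:
        return "exited normally"
    if code > 0:
        return f"exited with status {code}"
    # Negative code -N: the child was terminated by signal N (POSIX only).
    signum = -code
    try:
        # Signal numbering is platform specific, so look the name up locally.
        name = signal.Signals(signum).name
    except ValueError:
        name = "unknown signal"
    return f"killed by signal {signum} ({name})"


if __name__ == "__main__":
    # Run this on the training machine so the local signal table is used.
    print(describe_exit_code(-7))
```

If the assumption holds, signal 7 on x86 Linux is SIGBUS. In containerized setups this is sometimes associated with too little shared memory (`/dev/shm`), but that is a guess to check rather than a diagnosis.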