wqw547243068 opened this issue 1 year ago
@wqw547243068, can you please share ds_report
output and stack trace?
Try setting batch_size=1.
The ds_report output is attached below:

DeepSpeed general environment info:
torch install path ............... ['/root/miniconda3/envs/deepspeed/lib/python3.7/site-packages/torch']
torch version .................... 1.11.0+cu113
deepspeed install path ........... ['/root/miniconda3/envs/deepspeed/lib/python3.7/site-packages/deepspeed']
deepspeed info ................... 0.9.0+970d827f, 970d827f, master
torch cuda version ............... 11.3
torch hip version ................ None
nvcc version ..................... 11.3
deepspeed wheel compiled w. ...... torch 0.0, cuda 0.0
Description
- Dataset: 1.1 GB Chinese corpus, about 2 million lines

Devices
- CPU=8, GPU=1, Memory=320 GB, Node=1, Type=A100-SXM-80GB

Question
The training process is always killed for OOM (out of memory), even though the total dataset size is only 1.1 GB, far smaller than the 320 GB of available memory.
- The program is killed during the tokenization step, when running:
cd DeepSpeedExamples/applications/DeepSpeed-Chat/training/step1_supervised_finetuning/training_scripts/other_language
sh run_chinese.sh
I'm wondering whether the entire dataset is loaded into memory, which would explain the OOM. Does the loading process support a streaming mode?
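For illustration, here is a rough sketch of what I mean by streaming; this is not the DeepSpeed-Chat loader, and the file name, text column, and tokenizer checkpoint are only placeholders. With Hugging Face datasets in streaming mode, the corpus would be read and tokenized lazily instead of being materialized in memory all at once:

```python
# Rough sketch of streaming tokenization (placeholder names, not the
# actual run_chinese.sh data pipeline). With streaming=True the corpus
# is read and tokenized batch by batch, so the full set of token
# tensors never has to sit in memory at once.
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom-560m")  # placeholder checkpoint

raw = load_dataset(
    "json",
    data_files="chinese_corpus.jsonl",  # placeholder file
    split="train",
    streaming=True,                     # returns an IterableDataset
)

def tokenize_fn(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = raw.map(tokenize_fn, batched=True)

# Examples are produced lazily as they are iterated over.
for example in tokenized.take(3):
    print(len(example["input_ids"]))
```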
How long does one epoch of training take on your hardware? 2 million lines could take a long time...
By my estimate, step 1 training would take about 20 hours per epoch, but I think the main problem is not in the training process itself. It is in the tokenization of the training dataset, before the torch.save(train_dataset, train_fname) stage. When I run this step, memory usage climbs to 320 GB (once I get into the training step, it uses 176 GB). Do you have any idea how I should resolve this problem?
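To make the idea concrete, here is a rough sketch of the kind of change I have in mind, assuming the current pipeline tokenizes all 2 million lines up front and saves the result with torch.save. The class and argument names below are made up for this example and are not the actual DeepSpeed-Chat code; the point is that tokenizing lazily in __getitem__ keeps only the raw strings (about 1.1 GB) in memory:

```python
# Illustrative sketch only: tokenize one sample at a time in __getitem__
# instead of pre-tokenizing the whole 2M-line corpus and saving it with
# torch.save. Class and argument names are invented for this example.
from torch.utils.data import Dataset


class LazyTokenizedDataset(Dataset):
    def __init__(self, lines, tokenizer, max_seq_len=512):
        self.lines = lines              # raw strings only (~1.1 GB), not token tensors
        self.tokenizer = tokenizer
        self.max_seq_len = max_seq_len

    def __len__(self):
        return len(self.lines)

    def __getitem__(self, idx):
        # Tokenization happens here, per sample, inside the DataLoader
        # workers, so peak memory stays close to the raw text size.
        enc = self.tokenizer(
            self.lines[idx],
            max_length=self.max_seq_len,
            truncation=True,
            padding="max_length",
            return_tensors="pt",
        )
        return {
            "input_ids": enc["input_ids"].squeeze(0),
            "attention_mask": enc["attention_mask"].squeeze(0),
        }
```

The trade-off is that every sample gets re-tokenized each epoch, but that cost should be small compared with the forward/backward pass.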