microsoft / DeepSpeed

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
https://www.deepspeed.ai/
Apache License 2.0

[BUG] Out of memory when training, and is streaming mode supported? #3214

Open wqw547243068 opened 1 year ago

wqw547243068 commented 1 year ago

description

  • dataset: 1.1 GB Chinese corpus, about 2 million lines
  • devices
    • CPU=8, GPU=1, Memory=320 GB, Node=1, Type=A100-SXM-80GB

question

The training process is always killed due to OOM (out of memory), even though the total dataset is only 1.1 GB, far smaller than the machine's memory (320 GB).

  • The program is killed during the tokenizing step

cd DeepSpeedExamples/applications/DeepSpeed-Chat/training/step1_supervised_finetuning/training_scripts/other_language
sh run_chinese.sh

I'm wondering whether the entire dataset is loaded into memory, which would lead to the OOM.

Does the loading process support streaming mode?
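
For reference, the kind of streaming being asked about does exist in the Hugging Face `datasets` library; whether the DeepSpeed-Chat step 1 data pipeline can use it is a separate question. A minimal sketch of the idea, where the corpus file name, the `text` column, and the tokenizer checkpoint are placeholders rather than values from the example scripts:

```python
# Minimal sketch of streaming-style loading with Hugging Face `datasets`.
# This only illustrates the concept the question asks about; it is not the
# code path used by the DeepSpeed-Chat step 1 scripts.
from datasets import load_dataset
from transformers import AutoTokenizer

# Placeholder corpus file and tokenizer -- substitute your own.
stream = load_dataset("json", data_files="my_chinese_corpus.jsonl",
                      split="train", streaming=True)
tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom-560m")

# With streaming=True, examples are read and tokenized lazily, batch by batch,
# instead of loading the full file and every tokenized tensor into memory up front.
tokenized = stream.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True,
)

for i, example in enumerate(tokenized):
    if i >= 2:  # just peek at a couple of examples
        break
    print(len(example["input_ids"]))
```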

tjruwase commented 1 year ago

@wqw547243068, can you please share ds_report output and stack trace?

Kayce001 commented 1 year ago

Try setting the batch size to 1.
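
This suggestion only helps if the OOM happens during training rather than during preprocessing. In DeepSpeed terms, "batch size 1" corresponds to the `train_micro_batch_size_per_gpu` config key; the DeepSpeed-Chat scripts normally set it via the training script's CLI flags (check `run_chinese.sh` for the exact argument name). A minimal sketch, assuming you build the DeepSpeed config yourself; `net` and the script name are placeholders:

```python
# Minimal sketch, not the DeepSpeed-Chat launcher path.
# Launch with the deepspeed launcher, e.g.: deepspeed tiny_batch_example.py
import torch
import deepspeed

ds_config = {
    "train_micro_batch_size_per_gpu": 1,   # "batch size = 1" per GPU per step
    "gradient_accumulation_steps": 16,     # recover a larger effective batch size
    "zero_optimization": {"stage": 2},     # shard optimizer state to save GPU memory
    "fp16": {"enabled": True},
    "optimizer": {"type": "Adam", "params": {"lr": 1e-5}},
}

net = torch.nn.Linear(1024, 1024)          # placeholder for the real model
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=net,
    model_parameters=net.parameters(),
    config=ds_config,
)
```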

chenyujie1127 commented 1 year ago

> @wqw547243068, can you please share ds_report output and stack trace?

The ds_report output is attached below:


DeepSpeed C++/CUDA extension op report

NOTE: Ops not installed will be just-in-time (JIT) compiled at runtime if needed. Op compatibility means that your system meet the required dependencies to JIT install the op.

JIT compiled ops requires ninja
ninja .................. [OKAY]

op name ................ installed .. compatible

[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-dev package with apt
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
[WARNING] please install triton==1.0.0 if you want to use sparse attention
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
utils .................. [NO] ....... [OKAY]

DeepSpeed general environment info:
torch install path ............... ['/root/miniconda3/envs/deepspeed/lib/python3.7/site-packages/torch']
torch version .................... 1.11.0+cu113
deepspeed install path ........... ['/root/miniconda3/envs/deepspeed/lib/python3.7/site-packages/deepspeed']
deepspeed info ................... 0.9.0+970d827f, 970d827f, master
torch cuda version ............... 11.3
torch hip version ................ None
nvcc version ..................... 11.3
deepspeed wheel compiled w. ...... torch 0.0, cuda 0.0

xinj7 commented 1 year ago

> description
>
>   • dataset: 1.1 GB Chinese corpus, about 2 million lines
>   • devices
>     • CPU=8, GPU=1, Memory=320 GB, Node=1, Type=A100-SXM-80GB
>
> question
>
> The training process is always killed due to OOM (out of memory), even though the total dataset is only 1.1 GB, far smaller than the machine's memory (320 GB).
>
>   • The program is killed during the tokenizing step
>
> cd DeepSpeedExamples/applications/DeepSpeed-Chat/training/step1_supervised_finetuning/training_scripts/other_language
> sh run_chinese.sh
>
> I'm wondering whether the entire dataset is loaded into memory, which would lead to the OOM.
>
> Does the loading process support streaming mode?

How long does one training epoch take on your hardware? 2 million lines could take a long time...

chenyujie1127 commented 1 year ago

> > description
> >
> >   • dataset: 1.1 GB Chinese corpus, about 2 million lines
> >   • devices
> >     • CPU=8, GPU=1, Memory=320 GB, Node=1, Type=A100-SXM-80GB
> >
> > question
> >
> > The training process is always killed due to OOM (out of memory), even though the total dataset is only 1.1 GB, far smaller than the machine's memory (320 GB).
> >
> >   • The program is killed during the tokenizing step
> >
> > cd DeepSpeedExamples/applications/DeepSpeed-Chat/training/step1_supervised_finetuning/training_scripts/other_language
> > sh run_chinese.sh
> >
> > I'm wondering whether the entire dataset is loaded into memory, which would lead to the OOM. Does the loading process support streaming mode?
>
> How long does one training epoch take on your hardware? 2 million lines could take a long time...

In my estimate, step 1 training would take about 20 hours per epoch, but I think the main problem is not in the training process itself. It is in the tokenization of the training dataset, which happens before the torch.save(train_dataset, train_fname) stage. When I run that step, memory usage climbs to 320 GB (once I reach the training step, it only uses 176 GB). Do you have any idea how I should resolve this problem?
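
The pattern described above (tokenize the whole corpus, then torch.save the resulting dataset) keeps every tokenized example in RAM at once, which is why peak memory is much higher during preprocessing than during training. One possible workaround, sketched below under the assumption that you can modify the data-preparation code, is to tokenize and save the corpus in shards so that only one chunk's tensors are resident at a time; the function, shard-file names, and tokenizer checkpoint are illustrative and not part of DeepSpeed-Chat:

```python
# Hedged sketch: shard-wise tokenization to bound peak CPU memory.
# All names here (tokenize_and_shard, shard file prefix) are illustrative;
# DeepSpeed-Chat's own data preparation does not work this way.
import torch
from transformers import AutoTokenizer

def tokenize_and_shard(lines, tokenizer, shard_size=100_000, max_length=512,
                       out_prefix="train_shard"):
    """Tokenize `lines` in chunks of `shard_size` and save each chunk to disk,
    so only one shard's tensors are held in memory at a time."""
    shard_paths = []
    for start in range(0, len(lines), shard_size):
        chunk = lines[start:start + shard_size]
        enc = tokenizer(chunk, truncation=True, max_length=max_length,
                        padding="max_length", return_tensors="pt")
        path = f"{out_prefix}_{start // shard_size}.pt"
        torch.save({"input_ids": enc["input_ids"],
                    "attention_mask": enc["attention_mask"]}, path)
        shard_paths.append(path)   # later loaded shard by shard for training
    return shard_paths

# Example usage (placeholder corpus file and tokenizer):
# tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom-560m")
# with open("chinese_corpus.txt") as f:
#     lines = [line.strip() for line in f if line.strip()]
# shards = tokenize_and_shard(lines, tokenizer)
```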