microsoft / DeepSpeed

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
https://www.deepspeed.ai/
Apache License 2.0

[BUG] Subpar time and accuracy performance when using Deepspeed compared to pure Pytorch #2724

Closed: AnthoJack closed this issue 1 year ago

AnthoJack commented 1 year ago

Describe the bug
I've created a program to train the Resnet50 model on the Caltech256 dataset using the Pytorch API and wanted to convert it to Deepspeed to hopefully take advantage of my 2 GPUs easily. However, when I tried it, not only did training take almost twice as long, but the resulting accuracy dropped from around 70% to around 40%.

To Reproduce
CaltechResnet.zip

Execute the provided Python files "CaltechResnet_torch.py" and "CaltechResnet_dp.py" and compare the results.

Expected behavior
I realise that I don't let Deepspeed optimize much in this first version, so I'm not as surprised that it takes longer to train, but I didn't expect the accuracy to be so much worse.
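For readers without the attachment, a typical Pytorch-to-Deepspeed conversion looks roughly like the sketch below. The model, optimizer, and config values are placeholders, not the exact contents of CaltechResnet_dp.py; recent Deepspeed versions accept a config dict via `config=` (older releases used `config_params=`).

```python
import torch
import torchvision
import deepspeed

# Placeholder model and optimizer; the attached script presumably builds its own
# ResNet50 and dataloaders, so treat this purely as an illustration.
model = torchvision.models.resnet50(num_classes=257)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
criterion = torch.nn.CrossEntropyLoss()

ds_config = {"train_micro_batch_size_per_gpu": 4}

# The engine owns device placement, the backward pass, and the optimizer step.
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model, optimizer=optimizer, config=ds_config
)

def train_step(inputs, labels):
    inputs = inputs.to(model_engine.device)
    labels = labels.to(model_engine.device)
    loss = criterion(model_engine(inputs), labels)
    model_engine.backward(loss)  # replaces loss.backward()
    model_engine.step()          # replaces optimizer.step() + optimizer.zero_grad()
    return loss.item()
```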

ds_report output

DeepSpeed C++/CUDA extension op report

NOTE: Ops not installed will be just-in-time (JIT) compiled at runtime if needed. Op compatibility means that your system meet the required dependencies to JIT install the op.

JIT compiled ops requires ninja
ninja .................. [OKAY]

op name ................ installed .. compatible

cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
[WARNING] please install triton==1.0.0 if you want to use sparse attention
sparse_attn ............ [NO] ....... [NO]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
async_io ............... [NO] ....... [OKAY]
utils .................. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
spatial_inference ...... [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]

DeepSpeed general environment info:
torch install path ............... ['/home/tsdreds/Repositories/giotto-deep_AJD-YBT/venv/lib/python3.8/site-packages/torch']
torch version .................... 1.13.0+cu117
torch cuda version ............... 11.7
torch hip version ................ None
nvcc version ..................... 11.4
deepspeed install path ........... ['/home/tsdreds/Repositories/DeepSpeed_YBT/deepspeed']
deepspeed info ................... 0.8.0+36b937da, 36b937da, fix-prototype-gelu-flops-compute
deepspeed wheel compiled w. ...... torch 1.13, cuda 11.7

System info (please complete the following information): Ubuntu 20.04, 2x RTX 3090, Python 3.8.10

Launcher context
Launch the "torch" version using python and the "dp" version using deepspeed.

Any idea what may cause such differences?

samadejacobs commented 1 year ago

@AnthoJack, a batch size of 4 may be too small to observe a speedup from parallel training. As for model accuracy, we tested the provided repro, and our current observation is that Deepspeed matches the baseline within some margin of error (please see below). We recommend you update to a recent version of Deepspeed, train the model a bit longer, and report back if there are still issues.
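For context on the batch-size remark: Deepspeed derives the global batch size as train_micro_batch_size_per_gpu × gradient_accumulation_steps × number of GPUs, so the same per-GPU batch gives a larger effective batch on 2 GPUs than in a single-process Pytorch run. An illustrative config follows; the values are examples only, not taken from the attached script.

```python
# Example values only, not the settings from CaltechResnet.zip.
world_size = 2  # one process per GPU under the deepspeed launcher
ds_config = {
    "train_micro_batch_size_per_gpu": 4,   # samples per GPU per forward pass
    "gradient_accumulation_steps": 1,
    # Global batch: 4 * 1 * 2 = 8. Deepspeed checks this product, so with
    # 2 GPUs each optimizer step sees twice as many samples as a
    # single-GPU Pytorch run with batch_size=4.
    "train_batch_size": 4 * 1 * world_size,
}
```

When comparing against the single-GPU Pytorch script, matching the effective global batch size (and, if needed, the learning rate) usually makes the accuracy curves easier to compare.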

Pytorch baseline

Training duration for this epoch: 342s

Eval

Epoch status: loss = 0.9698298805744358, accuracy = 79.49606299212599%

Test

Epoch status: loss = 0.9595162916080674, accuracy = 80.81889763779527%

Deepspeed

Training duration for this epoch: 369s
Eval
Training duration for this epoch: 370s
Eval
Epoch status: loss = 0.8349409562236061, accuracy = 83.24409448818898%

Test
Model status: loss = 0.8458209765867007, accuracy = 82.3325234293648%

AnthoJack commented 1 year ago

I've uninstalled Deepspeed and reinstalled it so I could have the 0.8.0 version (latest at this time), bumped the batch_size to 12 instead of 4, and trained the model for 3 epochs instead of 1, but I still get similar results.

Pure Pytorch version:

Epoch: 0
Training duration for this epoch: 107s
Eval
Epoch status: loss = 0.66783572167974, accuracy = 82.89763779527559%

Epoch: 1
Training duration for this epoch: 107s
Eval
Epoch status: loss = 0.5203173868318218, accuracy = 85.92125984251967%

Epoch: 2
Training duration for this epoch: 108s
Eval
Epoch status: loss = 0.5090007663288002, accuracy = 86.99212598425197%

Test
Model status: loss = 0.5772417564914053, accuracy = 85.87296077750781%

Deepspeed version:

Epoch: 0
Training duration for this epoch: 161s
Eval
Epoch status: loss = 2.5807703462074416, accuracy = 55.43307086614173%

Epoch: 1
Training duration for this epoch: 161s
Eval
Epoch status: loss = 1.1971479654106987, accuracy = 75.33858267716536%

Epoch: 2
Training duration for this epoch: 162s
Eval
Epoch status: loss = 0.8527409949126346, accuracy = 80.22047244094487%

Test
Model status: loss = 0.9118667843959549, accuracy = 78.20201318986463%

Could it be that having 2 GPUs in the Deepspeed version causes problems during the aggregation step? I've tried different variants of the same model, for example using one of Deepspeed's optimizers, or giving Deepspeed the dataset so it creates its own dataloader instead of working with a pre-built one, but the results were always even worse.
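On the aggregation question: under the deepspeed launcher each GPU runs its own process, and if the eval data is sharded across ranks, the loss/accuracy printed by one rank only covers that rank's samples unless the statistics are reduced across processes. Below is a minimal sketch of such a reduction using torch.distributed (which deepspeed.initialize sets up); the function and variable names are placeholders and are not taken from the attached script.

```python
import torch
import torch.distributed as dist

def global_eval_metrics(loss_sum, correct, count, device):
    """Combine per-rank eval statistics into a global average loss and accuracy."""
    stats = torch.tensor([loss_sum, float(correct), float(count)],
                         dtype=torch.float64, device=device)
    dist.all_reduce(stats, op=dist.ReduceOp.SUM)  # sum the statistics over all ranks
    loss_sum, correct, count = stats.tolist()
    return loss_sum / count, 100.0 * correct / count
```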

tjruwase commented 1 year ago

@AnthoJack, since we are unable to repro, is it okay to close for now?