Closed AnthoJack closed 1 year ago
@AnthoJack, a batch size of 4 may be too small to observe a speedup from parallel training. As for model accuracy, we tested the provided repro; our current observation is that DeepSpeed matches the baseline within some margin of error (please see below). We recommend you update to a recent version of DeepSpeed, train the model a bit longer, and report back if there are still issues.
Pytorch baseline
Training duration for this epoch: 342s
Eval Epoch status: loss = 0.9698298805744358, accuracy = 79.49606299212599%
Test Epoch status: loss = 0.9595162916080674, accuracy = 80.81889763779527%
Deepspeed
Training duration for this epoch: 369s
Training duration for this epoch: 370s
Eval Epoch status: loss = 0.8349409562236061, accuracy = 83.24409448818898%
Test Model status: loss = 0.8458209765867007, accuracy = 82.3325234293648%
I've uninstalled DeepSpeed and reinstalled it to get version 0.8.0 (the latest at this time), bumped the batch_size from 4 to 12, and trained the model for 3 epochs instead of 1, but I still get similar results.
Pure Pytorch version:
Epoch: 0 Training duration for this epoch: 107s Eval Epoch status: loss = 0.66783572167974, accuracy = 82.89763779527559%
Epoch: 1 Training duration for this epoch: 107s Eval Epoch status: loss = 0.5203173868318218, accuracy = 85.92125984251967%
Epoch: 2 Training duration for this epoch: 108s Eval Epoch status: loss = 0.5090007663288002, accuracy = 86.99212598425197%
Test Model status: loss = 0.5772417564914053, accuracy = 85.87296077750781%
Deepspeed version:
Epoch: 0 Training duration for this epoch: 161s Eval Epoch status: loss = 2.5807703462074416, accuracy = 55.43307086614173%
Epoch: 1 Training duration for this epoch: 161s Eval Epoch status: loss = 1.1971479654106987, accuracy = 75.33858267716536%
Epoch: 2 Training duration for this epoch: 162s Eval Epoch status: loss = 0.8527409949126346, accuracy = 80.22047244094487%
Test Model status: loss = 0.9118667843959549, accuracy = 78.20201318986463%
Could the fact that I have 2 GPUs in the DeepSpeed version cause problems during the aggregation step? I've tried different versions of the same model, using one of DeepSpeed's optimizers or giving DeepSpeed the dataset so it can create its own dataloader instead of working with a given one, but the results were always even worse.
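One thing worth checking here (a sketch of the data-parallel arithmetic, not code from the attached repro; the helper names are illustrative): with 2 GPUs, DeepSpeed's effective global batch size is the per-GPU micro-batch times the data-parallel degree, so the optimizer takes fewer, larger steps per epoch than the single-GPU baseline unless the learning rate is adjusted.

```python
# Illustrative sketch of DeepSpeed's batch-size arithmetic.
# These helpers are hypothetical, not part of the DeepSpeed API.

def effective_batch_size(micro_batch_per_gpu: int,
                         grad_accum_steps: int,
                         num_gpus: int) -> int:
    """Global batch size = micro-batch * accumulation steps * data-parallel degree."""
    return micro_batch_per_gpu * grad_accum_steps * num_gpus

def scaled_lr(base_lr: float, num_gpus: int) -> float:
    """Linear scaling rule: grow the learning rate with the data-parallel degree."""
    return base_lr * num_gpus

# With batch_size=12 per GPU on 2 GPUs and no gradient accumulation,
# the model effectively trains with batches of 24:
print(effective_batch_size(12, 1, 2))  # 24

# A common heuristic is to scale the base learning rate accordingly:
print(scaled_lr(1e-3, 2))  # 0.002
```

If the DeepSpeed run keeps the single-GPU learning rate while the global batch doubles, slower convergence in the first few epochs (as in the logs above) would be expected rather than a bug.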
@AnthoJack, since we are unable to repro, is it okay to close for now?
Describe the bug
I've created a program to train the ResNet-50 model on the Caltech-256 dataset using the PyTorch API and wanted to convert it to DeepSpeed to, hopefully, take advantage of my 2 GPUs easily. However, when I tried it, not only did training take almost twice as long, but the resulting accuracy also dropped from around 70% to around 40%.
To Reproduce
CaltechResnet.zip
Execute the provided Python files "CaltechResnet_torch.py" and "CaltechResnet_dp.py" and compare the results.
Expected behavior
I realise that I don't let DeepSpeed optimize much in this first version, so I'm not as surprised that it takes more time to train, but I didn't expect the accuracy results to be so much worse.
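For what it's worth, the relevant knobs on the DeepSpeed side live in the JSON config passed to `deepspeed.initialize`. A minimal sketch (illustrative values, not the ones from the attached repro) ties the global batch size to the per-GPU micro-batch and the number of ranks:

```json
{
  "train_micro_batch_size_per_gpu": 12,
  "gradient_accumulation_steps": 1,
  "train_batch_size": 24
}
```

DeepSpeed requires `train_batch_size` = `train_micro_batch_size_per_gpu` * `gradient_accumulation_steps` * number of data-parallel ranks, so running the same script on 2 GPUs doubles the effective batch size compared to the single-GPU baseline.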
ds_report output
DeepSpeed C++/CUDA extension op report
NOTE: Ops not installed will be just-in-time (JIT) compiled at runtime if needed. Op compatibility means that your system meets the required dependencies to JIT install the op.
JIT compiled ops require ninja
ninja .................. [OKAY]
op name ................ installed .. compatible
cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
[WARNING] please install triton==1.0.0 if you want to use sparse attention
sparse_attn ............ [NO] ....... [NO]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
async_io ............... [NO] ....... [OKAY]
utils .................. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
spatial_inference ...... [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
DeepSpeed general environment info:
torch install path ............... ['/home/tsdreds/Repositories/giotto-deep_AJD-YBT/venv/lib/python3.8/site-packages/torch']
torch version .................... 1.13.0+cu117
torch cuda version ............... 11.7
torch hip version ................ None
nvcc version ..................... 11.4
deepspeed install path ........... ['/home/tsdreds/Repositories/DeepSpeed_YBT/deepspeed']
deepspeed info ................... 0.8.0+36b937da, 36b937da, fix-prototype-gelu-flops-compute
deepspeed wheel compiled w. ...... torch 1.13, cuda 11.7
System info (please complete the following information): Ubuntu 20.04, 2x RTX 3090, Python 3.8.10
Launcher context
Launch the "torch" version using python and the "dp" version using deepspeed.
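Concretely, the launch commands would look something like this (the `--num_gpus` flag is an assumption; the repro may simply rely on DeepSpeed's default of using all visible GPUs):

```shell
# Baseline: plain single-process PyTorch training
python CaltechResnet_torch.py

# DeepSpeed: launched through the deepspeed runner, which spawns
# one process per GPU (here, both RTX 3090s)
deepspeed --num_gpus=2 CaltechResnet_dp.py
```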
Any idea what may cause such differences?