microsoft / DeepSpeed

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
https://www.deepspeed.ai/
Apache License 2.0

[BUG] Create zero equivalency unit test #2790

Open tjruwase opened 1 year ago

tjruwase commented 1 year ago

Starting point: https://github.com/microsoft/DeepSpeed/issues/966

Test matrix (a test sketch follows after the list):

  1. gradient accumulation: one vs many
  2. gpus: one vs many
  3. stages: 1 vs 2 vs 3
  4. dtype: bf16 vs fp16 vs fp32
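
For illustration only (this is not existing DeepSpeed test code), the matrix above could be expressed as a parametrized pytest along the lines of the sketch below; the run_training_and_get_losses helper and the tolerance values are hypothetical placeholders for the actual training harness.

# Rough sketch of the test matrix above, not DeepSpeed's actual test suite.
import pytest
import torch

ZERO_STAGES = [1, 2, 3]
DTYPES = ["fp32", "fp16", "bf16"]
GRAD_ACCUM_STEPS = [1, 4]     # one vs. many accumulation steps
WORLD_SIZES = [1, 2]          # one vs. many GPUs

# dtype -> loose absolute tolerance for comparing losses (assumed values)
TOLERANCES = {"fp32": 1e-6, "fp16": 1e-3, "bf16": 1e-2}


def run_training_and_get_losses(stage, dtype, grad_accum, world_size):
    """Hypothetical helper: run a short training job and return per-step losses.

    A real implementation would train the same tiny model/data either through
    deepspeed.initialize() (stage 1/2/3) or plain DDP (stage=None) and collect
    the loss at every optimizer step.
    """
    raise NotImplementedError("placeholder for the actual training harness")


@pytest.mark.parametrize("grad_accum", GRAD_ACCUM_STEPS)
@pytest.mark.parametrize("world_size", WORLD_SIZES)
@pytest.mark.parametrize("dtype", DTYPES)
@pytest.mark.parametrize("stage", ZERO_STAGES)
def test_zero_matches_ddp_baseline(stage, dtype, grad_accum, world_size):
    if world_size > torch.cuda.device_count():
        pytest.skip("not enough GPUs for this configuration")
    baseline = run_training_and_get_losses(None, dtype, grad_accum, world_size)
    zero = run_training_and_get_losses(stage, dtype, grad_accum, world_size)
    for step, (b, z) in enumerate(zip(baseline, zero)):
        assert abs(b - z) <= TOLERANCES[dtype], f"losses diverge at step {step}"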

@stas00

stas00 commented 1 year ago

For item (4), training and comparing the loss curves, you can probably use the HF Trainer + pytorch example program, so everything is already done for you - e.g. see the example of how I train opt-1.3b from scratch: https://github.com/huggingface/transformers/pull/21312. This will generate a tensorboard, and if you add --skip_memory_metrics 0 it will even print a detailed summary of allocated vs peak memory for each stage.

So we just need to add a set of ds_config files with auto values.
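
As a hedged sketch of what that could look like (the filenames and the exact set of fields are my assumption, not something decided in this thread), one config per ZeRO stage, leaving as "auto" the values the HF Trainer integration can fill in from its own command-line args:

# Sketch: generate one ds_config file per ZeRO stage, with "auto" for the
# values the HF Trainer integration derives from its own arguments.
import json

def make_ds_config(stage: int) -> dict:
    return {
        "train_batch_size": "auto",
        "train_micro_batch_size_per_gpu": "auto",
        "gradient_accumulation_steps": "auto",
        "gradient_clipping": "auto",
        "fp16": {"enabled": "auto"},
        "bf16": {"enabled": "auto"},
        "zero_optimization": {"stage": stage},
    }

for stage in (1, 2, 3):
    with open(f"ds_config_zero{stage}.json", "w") as f:
        json.dump(make_ds_config(stage), f, indent=2)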

I can help set this one up. It should be very trivial to do, as it's really just adding one of --fp16, --bf16, or nothing.

And we should also add 3 setups w/o deepspeed - just using DDP - so that would be another baseline for each of the dtypes.
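
To make that matrix concrete, here is a small hypothetical sketch that enumerates the launch commands: three dtype variants (--fp16, --bf16, or neither) crossed with plain DDP via torchrun and the three ZeRO stages. The script name run_clm.py, the nproc value, and the ds_config filenames are stand-ins, not choices made in this thread.

# Hypothetical enumeration of the run matrix: dtype flag x (DDP | ZeRO stage).
DTYPE_FLAGS = ["", "--fp16", "--bf16"]          # fp32 is the default (no flag)
BACKENDS = ["ddp", "zero1", "zero2", "zero3"]   # DDP baseline + ZeRO stages

def launch_cmd(backend: str, dtype_flag: str, nproc: int = 2) -> str:
    cmd = f"torchrun --nproc_per_node {nproc} run_clm.py {dtype_flag}".strip()
    if backend != "ddp":
        cmd += f" --deepspeed ds_config_{backend}.json"
    return cmd

for backend in BACKENDS:
    for dtype_flag in DTYPE_FLAGS:
        print(launch_cmd(backend, dtype_flag))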


To summarize, there are two parts to this project:

  1. fairly precise tests, with some tolerance along some dimensions
  2. actual training runs where we can see the real outcome - which makes it very easy to spot visually if something is not training well.

stas00 commented 1 year ago

Actually, I forgot that I developed a whole tool to do grid-search / matrix-of-options runs: https://github.com/huggingface/transformers/blob/main/scripts/benchmark/trainer-benchmark.py

CUDA_VISIBLE_DEVICES=0 python ./scripts/benchmark/trainer-benchmark.py \
--base-cmd \
' examples/pytorch/translation/run_translation.py --model_name_or_path t5-small \
--output_dir output_dir --do_train --label_smoothing 0.1 --logging_strategy no \
--save_strategy no --per_device_train_batch_size 32 --max_source_length 512 \
--max_target_length 512 --num_train_epochs 1 --overwrite_output_dir \
--source_lang en --target_lang ro --dataset_name wmt16 --dataset_config "ro-en" \
--source_prefix "translate English to Romanian: " --warmup_steps 50 \
--max_train_samples 20000 --dataloader_num_workers 2 ' \
--target-metric-key train_samples_per_second \
--report-metric-keys train_loss --repeat-times 1 --base-variation '--tf32 0' \
--variations '|--fp16|--bf16' '--tf32 0|--tf32 1'  

The last line demonstrates the 6 variations it will run with the same base script: fp16/bf16/fp32 vs tf32 (on/off) = 3*2 = 6 variations.

This may or may not be easier to use - not sure - but we have plenty of choices that work out of the box.

I think you just need to find a resource allocation for that and we can set up these jobs very quickly.

Then tensorboard all the results into the same base directory with --report_to tensorboard --logging_dir some_path, and it becomes very easy to run tensorboard --logdir some_path and review the outcome. Perhaps this can be automated as well, with the graphs emailed somewhere or posted to Slack or Teams.

xingchensong commented 1 year ago

Hi team, any updates?

brianyu-nexusflowai commented 10 months ago

Hey folks, is this still an active issue? I'm observing some differences in training between zero2 and zero3 using Llama models with the fixed rotary embedding cache init (https://github.com/microsoft/DeepSpeed/issues/4932#issuecomment-1900929748).

stas00 commented 10 months ago

As you can see from the discussion you linked to, there will be no equivalency in that particular case of Llama-2, due to how the buffers are created. I urge you to file an issue with HF Transformers and ask them to distribute the correct buffers with the model weights rather than leave them to be recalculated at model init time.

tohtana commented 1 month ago

We have scripts to compare DeepSpeed's results with PyTorch's: https://github.com/tohtana/validate_zero. Mixed precision support is limited, but we can start here.
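
For illustration only (this is not code from the validate_zero repo), the comparison step could be as simple as loading the per-step losses saved by a DeepSpeed run and by a plain PyTorch run and checking that they agree within a dtype-dependent tolerance; the file names and log format below are assumptions.

# Minimal sketch of the comparison step; assumes each run dumped a JSON list
# of per-step loss values (the file names/format are illustrative only).
import json

def compare_losses(deepspeed_log: str, pytorch_log: str, rtol: float = 1e-3) -> None:
    with open(deepspeed_log) as f:
        ds_losses = json.load(f)
    with open(pytorch_log) as f:
        pt_losses = json.load(f)
    assert len(ds_losses) == len(pt_losses), "runs logged a different number of steps"
    for step, (d, p) in enumerate(zip(ds_losses, pt_losses)):
        tol = rtol * max(abs(d), abs(p), 1.0)
        assert abs(d - p) <= tol, f"step {step}: deepspeed={d} pytorch={p}"

# Example with hypothetical file names; a bf16 run would use a looser tolerance:
# compare_losses("losses_deepspeed.json", "losses_pytorch.json", rtol=1e-2)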