tjruwase opened this issue 1 year ago
For the (4)th item, training and comparing the loss curve, you can probably use the HF Trainer + a PyTorch example program, so everything is already done for you. E.g., see the example of how I train opt-1.3b from scratch: https://github.com/huggingface/transformers/pull/21312
This will generate a tensorboard, and if you add --skip_memory_metrics 0 it'll even print a detailed summary of allocated vs. peak memory for each stage.
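For example, a minimal sketch of such a run, assuming the stock run_clm.py example script; the model, dataset, and paths here are placeholder assumptions:

```bash
# hedged sketch: train with the HF Trainer, log to tensorboard, and report
# allocated vs. peak memory for each stage via --skip_memory_metrics 0
python examples/pytorch/language-modeling/run_clm.py \
  --model_name_or_path facebook/opt-125m \
  --dataset_name wikitext --dataset_config_name wikitext-2-raw-v1 \
  --do_train --num_train_epochs 1 --per_device_train_batch_size 4 \
  --output_dir output_dir --overwrite_output_dir \
  --report_to tensorboard --logging_dir tb_logs/baseline \
  --skip_memory_metrics 0
```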
So we just need to add a set of ds_config files with "auto" values.
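Something along these lines, a minimal sketch of a ZeRO-3 config where everything the Trainer already knows is left as "auto" (the file name is just a placeholder):

```bash
# hedged sketch of a ds_config with "auto" values; the HF Trainer fills these
# in from its own command-line arguments at launch time
cat > ds_config_zero3.json <<'EOF'
{
  "fp16": { "enabled": "auto" },
  "bf16": { "enabled": "auto" },
  "zero_optimization": {
    "stage": 3,
    "overlap_comm": true,
    "contiguous_gradients": true,
    "stage3_gather_16bit_weights_on_model_save": true
  },
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto"
}
EOF
```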
I can help set this one up. It should be very trivial to do, as it's really just adding one of --fp16, --bf16, or nothing.
And also running the same 3 setups w/o DeepSpeed, just using DDP, so that would be another baseline for each of the dtypes.
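Putting those together, a hedged sketch of the full sweep; the script, config name, logging layout, and GPU count are placeholder assumptions:

```bash
# run each dtype variant twice: once under DeepSpeed ZeRO-3 and once as a
# plain DDP baseline; the empty string means full fp32
for dtype in "--fp16" "--bf16" ""; do
  tag=${dtype:-"--fp32"}; tag=${tag#--}
  # DeepSpeed run
  deepspeed --num_gpus 2 examples/pytorch/language-modeling/run_clm.py \
    --deepspeed ds_config_zero3.json $dtype \
    --model_name_or_path facebook/opt-125m \
    --dataset_name wikitext --dataset_config_name wikitext-2-raw-v1 \
    --do_train --num_train_epochs 1 \
    --output_dir out_ds_$tag --overwrite_output_dir \
    --report_to tensorboard --logging_dir tb_logs/ds_$tag
  # DDP baseline without DeepSpeed
  torchrun --nproc_per_node 2 examples/pytorch/language-modeling/run_clm.py \
    $dtype \
    --model_name_or_path facebook/opt-125m \
    --dataset_name wikitext --dataset_config_name wikitext-2-raw-v1 \
    --do_train --num_train_epochs 1 \
    --output_dir out_ddp_$tag --overwrite_output_dir \
    --report_to tensorboard --logging_dir tb_logs/ddp_$tag
done
```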
To summarize: the same HF Trainer run across the 3 dtypes, once with DeepSpeed (ds_config files with "auto" values) and once with plain DDP as the baseline.
Actually I forgot I developed a whole tool to do grid search / matrix of options runs: https://github.com/huggingface/transformers/blob/main/scripts/benchmark/trainer-benchmark.py
```bash
CUDA_VISIBLE_DEVICES=0 python ./scripts/benchmark/trainer-benchmark.py \
--base-cmd \
' examples/pytorch/translation/run_translation.py --model_name_or_path t5-small \
--output_dir output_dir --do_train --label_smoothing 0.1 --logging_strategy no \
--save_strategy no --per_device_train_batch_size 32 --max_source_length 512 \
--max_target_length 512 --num_train_epochs 1 --overwrite_output_dir \
--source_lang en --target_lang ro --dataset_name wmt16 --dataset_config "ro-en" \
--source_prefix "translate English to Romanian: " --warmup_steps 50 \
--max_train_samples 20000 --dataloader_num_workers 2 ' \
--target-metric-key train_samples_per_second --repeat-times 1 \
--report-metric-keys train_loss --base-variation '--tf32 0' \
--variations '|--fp16|--bf16' '--tf32 0|--tf32 1'
```
The last line demonstrates an example of the variations it will run with the same base script: {fp32, fp16, bf16} vs. tf32 (on/off) = 3*2 = 6 variations.
This may or may not be easier to use, I'm not sure, but we have plenty of choices that work out of the box.
I think you just need to find a resource allocation for that and we can set up these jobs very quickly.
Then have all the runs log tensorboard results into the same base directory (--report_to tensorboard --logging_dir some_path), and it'll produce something very easy to review with tensorboard --logdir some_path. Perhaps this can be automated as well, with the graphs emailed somewhere or posted to Slack or Teams.
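For instance, a hedged sketch of the review step, assuming each run logged into its own subdirectory under a shared base dir as in the sweep sketch above:

```bash
# each variant logs under its own subdir, e.g.:
#   tb_logs/ds_fp16, tb_logs/ds_bf16, tb_logs/ddp_fp32, ...
# a single tensorboard instance then shows all the loss curves side by side
tensorboard --logdir tb_logs
```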
Hi team, any updates?
Hey folks, is this still an active issue? I'm observing some differences in training between zero2 and zero3 using Llama models with the fixed rotary embedding cache init (https://github.com/microsoft/DeepSpeed/issues/4932#issuecomment-1900929748).
As you can see from the discussion you linked to, there will be no equivalency in that particular case of Llama-2 due to how the buffers are created. I urge you to file an issue with HF Transformers and ask them to distribute the correct buffers with the model weights rather than leaving them to be recalculated at model init time.
We have scripts to compare DeepSpeed's results with PyTorch: https://github.com/tohtana/validate_zero. The mixed-precision support is limited, but we can start here.
Starting point: https://github.com/microsoft/DeepSpeed/issues/966
Test matrix:
- gpus: one vs many
@stas00