huggingface / optimum-habana

Easy and lightning fast training of 🤗 Transformers on Habana Gaudi processor (HPU)
Apache License 2.0

--report_to tensorboard not working on multiple HPUs #1021

Closed by 12010486 2 months ago

12010486 commented 3 months ago

System Info

- Optimum Habana: main branch, at commit 8863f1cc2be695a59673fc8a8095e25101a45f3f
- SW stack: hl-1.15.0

Reproduction

```bash
HABANA_VISIBLE_MODULES="2,3,4,5" python ../gaudi_spawn.py --world_size 4 run_clm.py \
    --model_name_or_path google/gemma-2b-it \
    --per_device_train_batch_size 16 \
    --per_device_eval_batch_size 16 \
    --dataset_name mamamiya405/finred \
    --do_train \
    --do_eval \
    --output_dir ./test/4hpu_16bs_5e \
    --gaudi_config_name Habana/gpt2 \
    --use_habana \
    --gradient_checkpointing \
    --use_hpu_graphs_for_inference \
    --throughput_warmup_steps 3 \
    --bf16 \
    --evaluation_strategy epoch \
    --save_total_limit 1 \
    --num_train_epochs 5 \
    --report_to tensorboard \
    --profiling_warmup_steps 0
```

Expected behavior

The training and eval loss are not plotted per epoch in the file created for TensorBoard visualization, nor in the README file created to summarize the results; instead of the training and eval loss, "No log" is shown.

`--report_to tensorboard` works for the other logged values. Is this a known issue? Also, is there already some work going on related to this flag? I've seen in the code that we rely directly on the Hugging Face Transformers code for it, not on optimum-habana.
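For context, the "No log" entries most likely come from the default logging interval inherited from Transformers rather than from the TensorBoard writer itself: by default the `Trainer` only logs the loss every 500 optimizer steps, so an epoch with fewer steps produces no loss datapoint. A minimal sketch of these stock defaults, which `GaudiTrainingArguments` inherits (the `output_dir` value is a placeholder):

```python
from transformers import TrainingArguments

# Stock Transformers defaults: the loss is logged every `logging_steps`
# optimizer steps, not once per epoch.
args = TrainingArguments(output_dir="/tmp/demo", report_to=["tensorboard"])

print(args.logging_strategy)  # IntervalStrategy.STEPS
print(args.logging_steps)     # 500: an epoch shorter than 500 steps logs no loss
```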

12010486 commented 2 months ago

Adding the option `--logging_strategy epoch` adds datapoints and fixes the "No log" issue in the README. Plotting per epoch instead of per step would need a change in transformers.
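A minimal sketch of this workaround in script form, assuming `GaudiTrainingArguments` mirrors the CLI flags of `run_clm.py` (values copied from the reproduction command; model, datasets, and the `GaudiTrainer` wiring are omitted, and this is not a verified run):

```python
from optimum.habana import GaudiTrainingArguments

# Same settings as the reproduction command, plus the workaround:
# logging_strategy="epoch" makes the trainer record the loss at every epoch
# boundary, so both the TensorBoard event file and the README get datapoints.
training_args = GaudiTrainingArguments(
    output_dir="./test/4hpu_16bs_5e",
    use_habana=True,
    gaudi_config_name="Habana/gpt2",
    bf16=True,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    gradient_checkpointing=True,
    num_train_epochs=5,
    save_total_limit=1,
    evaluation_strategy="epoch",
    report_to=["tensorboard"],
    logging_strategy="epoch",  # the workaround described in this comment
)
```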