The training and eval loss are not plotted per epoch in the file created for TensorBoard visualization, nor in the README file created to summarize the results; instead of the training and eval loss, a "No log" entry is shown.
`--report_to tensorboard` does work for the other values. Is this a known issue?
Also, is there already some work going on related to this flag? I've seen in the code that we rely directly on HF code, not on optimum-habana.
Adding the option `--logging_strategy epoch` adds data points and fixes the "No log" entries in the README. Plotting per epoch instead of per step, however, requires a change in transformers.
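For intuition, here is a small self-contained sketch (hypothetical helper, not HF or optimum-habana code) of why the default step-based logging can produce no data points on a short run: with `logging_strategy="steps"` and the default `logging_steps=500`, a run shorter than 500 steps never triggers a logging event, so the loss column stays empty, while `logging_strategy="epoch"` fires once per epoch boundary.

```python
# Illustrative sketch of step-based vs. epoch-based logging triggers.
# logging_events() is a hypothetical helper, not part of transformers.
def logging_events(total_steps, steps_per_epoch, strategy, logging_steps=500):
    """Return the global steps at which a logging event would fire."""
    events = []
    for step in range(1, total_steps + 1):
        if strategy == "steps" and step % logging_steps == 0:
            events.append(step)
        elif strategy == "epoch" and step % steps_per_epoch == 0:
            events.append(step)
    return events

# Example: 5 epochs of 60 steps each (300 steps total).
# Step-based logging with the default interval of 500 never fires,
# which is consistent with the "No log" entries described above.
print(logging_events(300, 60, "steps"))   # no events
print(logging_events(300, 60, "epoch"))   # one event per epoch
```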
System Info
Information
Tasks
An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)

Reproduction
```shell
HABANA_VISIBLE_MODULES="2,3,4,5" python ../gaudi_spawn.py --world_size 4 run_clm.py \
    --model_name_or_path google/gemma-2b-it \
    --per_device_train_batch_size 16 \
    --per_device_eval_batch_size 16 \
    --dataset_name mamamiya405/finred \
    --do_train \
    --do_eval \
    --output_dir ./test/4hpu_16bs_5e \
    --gaudi_config_name Habana/gpt2 \
    --use_habana \
    --gradient_checkpointing \
    --use_hpu_graphs_for_inference \
    --throughput_warmup_steps 3 \
    --bf16 \
    --evaluation_strategy epoch \
    --save_total_limit 1 \
    --num_train_epochs 5 \
    --report_to tensorboard \
    --profiling_warmup_steps 0
```
Expected behavior
The training and eval loss should be logged per epoch in the TensorBoard event file and in the README summarizing the results, instead of "No log" being shown.