I cannot reproduce it, it works fine on my machine. I guess you're using version 1.4.0 of the SDK?
Yes, we are using 1.4.0. Can you share your gaudi_config.json, command, and run log?
I have not upgraded to 1.4.0 yet. Please first ensure that `pytest tests/test_gaudi_configuration.py tests/test_trainer_distributed.py tests/test_trainer.py` runs without any failure. If it does not, try with 1.3.0.
My Gaudi configuration is the following:
```json
{
  "use_habana_mixed_precision": true,
  "hmp_opt_level": "O1",
  "hmp_is_verbose": false,
  "use_fused_adam": true,
  "use_fused_clip_norm": true,
  "log_device_mem_alloc": false
}
```
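For a quick sanity check, a minimal sketch (the file path and the expected key set are assumptions, not from this thread) that verifies the JSON is well formed and contains the options shown above:

```python
import json

# Hypothetical path; the run below passes the file via --gaudi_config_name ../gaudi_config.json
CONFIG_PATH = "../gaudi_config.json"

# Keys used in the configuration shown above.
EXPECTED_KEYS = {
    "use_habana_mixed_precision",
    "hmp_opt_level",
    "hmp_is_verbose",
    "use_fused_adam",
    "use_fused_clip_norm",
    "log_device_mem_alloc",
}

with open(CONFIG_PATH) as f:
    gaudi_config = json.load(f)  # raises if the JSON is malformed

missing = EXPECTED_KEYS - gaudi_config.keys()
print("Missing keys:", missing or "none")
print("Config:", gaudi_config)
```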
I ran the command you gave with this Gaudi configuration and `--save_steps 5`. Here is the log: roberta-large.log
Update: I tried 1.3.0 and it works fine there as well, so the issue comes from 1.4.0.
I see some kind of HCL timeout error when saving the optimizer:
```
/home/jenkins/workspace/cdsoftwarebuilder/create-binaries-from-sw-sources---bp-dt/repos/hcl/src/platform/gaudi/infra/hcl_command_submission.cpp::147(onWatcher): The condition [ rc == 0 ] failed. hlthunk_wait_for_cs() failed with rc(-110) status(2) csid(395642) csHandle(560395)
```
Also, the roberta-large optimizer state is very big:
```
-rw-r--r-- 1 root root 2834713829 Apr 5 13:15 optimizer.pt
```
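That size is roughly what one would expect: assuming roberta-large has about 355M parameters and Adam keeps two fp32 moment tensors per parameter, the optimizer state alone is around 2.8 GB. A back-of-envelope check (the parameter count is an approximation, not measured from this run):

```python
# Rough estimate of Adam optimizer state size for roberta-large.
num_params = 355_000_000      # approximate parameter count, depends on the task head
num_moments = 2               # Adam stores exp_avg and exp_avg_sq per parameter
bytes_per_moment = 4          # moments kept in fp32

estimated_bytes = num_params * num_moments * bytes_per_moment
print(f"~{estimated_bytes / 1e9:.2f} GB")  # ~2.84 GB, consistent with optimizer.pt above
```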
I tried https://github.com/HabanaAI/Model-References/blob/1.4.0/PyTorch/nlp/finetuning/huggingface/bert/transformers/src/transformers/trainer.py#L1766 just to check whether saving a small amount of data works, and it does. But we still need to save the rest of the optimizer state too.
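As an illustration of that kind of experiment (a sketch, not the code behind the link above; file names are hypothetical and `optimizer` is assumed to be the optimizer instance from the run), one can compare saving only the optimizer's param_groups against the full state dict, which also contains the large moment tensors:

```python
import torch

state_dict = optimizer.state_dict()

# Small save: hyperparameters only (learning rate, betas, etc.), no moment tensors.
torch.save(state_dict["param_groups"], "param_groups_only.pt")

# Full save: includes exp_avg / exp_avg_sq for every parameter, i.e. the ~2.8 GB payload.
torch.save(state_dict, "optimizer_full.pt")
```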
When we change the output directory to /tmp rather than example/question_answer, the issue is gone. @regisss when you said you tested on the 1.3 Docker image and it worked, which output directory did you specify?
I think it was ./roberta_large_8x_bf16_lazy. I cannot be 100% sure because I cleaned it up, but the only thing I changed was the path to the Gaudi config.
Do you know if anything has changed in 1.4.0 regarding process synchronization in HCCL? Shall I call `torch.distributed.barrier()` before saving?
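For context, the kind of guard being asked about would look roughly like this (a sketch, assuming an initialized torch.distributed process group; it is not necessarily the fix that was eventually merged):

```python
import torch.distributed as dist

def save_checkpoint_synchronized(save_fn, output_dir):
    """Let every rank reach this point before and after the save, so no process
    reads or deletes checkpoint files while another is still writing them."""
    if dist.is_available() and dist.is_initialized():
        dist.barrier()   # wait for all ranks before touching the output directory
    save_fn(output_dir)  # e.g. the Trainer's checkpoint-saving routine
    if dist.is_available() and dist.is_initialized():
        dist.barrier()   # make sure the save is complete everywhere before continuing
```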
@regisss did you see any issue now? I didn't see any model call it before saving a checkpoint on release 1.4. Also, when we change the output directory so that we don't save into example/question_answer, where run_qa.py is, the HCL issue is gone.
I've been able to reproduce it. The fact that this does not occur in single-card training and that errors are a bit random confirms that it is a synchronization issue. It happens when a checkpoint is saved: some processes are accessing a file/folder that some others are deleting.
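To make the race concrete: if every rank rotates (deletes) old checkpoints while other ranks are still reading or writing files in the same directory, intermittent failures like the HCL timeout above can occur. A minimal sketch of the usual guard (hypothetical helper name; PR #20 may implement this differently):

```python
import shutil
import torch.distributed as dist

def rotate_checkpoints(old_checkpoint_dirs):
    """Delete old checkpoint folders from the main process only, with barriers so
    no other rank touches those folders while they are being removed."""
    is_distributed = dist.is_available() and dist.is_initialized()

    if is_distributed:
        dist.barrier()  # all ranks have finished writing their part of the checkpoint

    if not is_distributed or dist.get_rank() == 0:
        for ckpt_dir in old_checkpoint_dirs:
            shutil.rmtree(ckpt_dir, ignore_errors=True)  # only rank 0 deletes

    if is_distributed:
        dist.barrier()  # deletion done; every rank now sees a consistent directory
```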
I just pushed a fix in #20. Let me know if it works well now @libinta
Closing this issue as PR #20 was approved and merged.
With the following command, roberta-large fails on 8 cards:
```
python ../gaudi_spawn.py --world_size 8 --use_mpi run_qa.py --model_name_or_path roberta-large --gaudi_config_name ../gaudi_config.json --dataset_name squad --do_train --do_eval --per_device_train_batch_size 12 --per_device_eval_batch_size 8 --learning_rate 3e-5 --num_train_epochs 2 --max_seq_length 384 --doc_stride 128 --output_dir ./roberta_large_8x_bf16_lazy --use_habana --use_lazy_mode
```
To make the issue easier to reproduce, add `--save_steps 5` to the command.
It's related to the save portion; we need to find out which of the saved files triggers it: the configuration, the checkpoint, the tokenizer config, or the special tokens.
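One way to narrow that down (a sketch assuming the `model`, `tokenizer`, and `optimizer` objects from the run are available; directory names are hypothetical) is to trigger each save individually and see which call reproduces the HCL timeout:

```python
import os
import torch

os.makedirs("debug_save", exist_ok=True)

# Save each artifact separately to see which call reproduces the failure.
model.config.save_pretrained("debug_save/config")              # model configuration only
model.save_pretrained("debug_save/model")                      # model weights + config
tokenizer.save_pretrained("debug_save/tokenizer")              # tokenizer config and special tokens files
torch.save(optimizer.state_dict(), "debug_save/optimizer.pt")  # the ~2.8 GB optimizer state
```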