huggingface / optimum-habana

Easy and lightning fast training of 🤗 Transformers on Habana Gaudi processor (HPU)
Apache License 2.0

Roberta large 8x run failed #15

Closed libinta closed 2 years ago

libinta commented 2 years ago

With the following command, roberta-large training fails at 8x:

python ../gaudi_spawn.py --world_size 8 --use_mpi run_qa.py --model_name_or_path roberta-large --gaudi_config_name ../gaudi_config.json --dataset_name squad --do_train --do_eval --per_device_train_batch_size 12 --per_device_eval_batch_size 8 --learning_rate 3e-5 --num_train_epochs 2 --max_seq_length 384 --doc_stride 128 --output_dir ./roberta_large_8x_bf16_lazy --use_habana --use_lazy_mode

To make the issue easier to reproduce, add --save_steps 5 to the command.

It's related to the saving step; we need to find out which save triggers it: the model configuration, the checkpoint, the tokenizer config, or the special tokens.
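
To narrow that down, one option is to log each artifact written during a checkpoint, so the last message printed before the HCL timeout points at the culprit. This is a minimal sketch, not code from the repo; the helper and log messages are hypothetical:

```python
# Hypothetical helper: log each save step of a checkpoint so the last line
# printed before the hang identifies which artifact triggers the issue.
import logging
import os

import torch

logger = logging.getLogger(__name__)

def save_checkpoint_verbose(model, tokenizer, optimizer, output_dir):
    os.makedirs(output_dir, exist_ok=True)
    logger.info("Saving model config and weights")
    model.save_pretrained(output_dir)       # config.json + pytorch_model.bin
    logger.info("Saving tokenizer config and special tokens")
    tokenizer.save_pretrained(output_dir)   # tokenizer_config.json, special_tokens_map.json, ...
    logger.info("Saving optimizer state")
    torch.save(optimizer.state_dict(), os.path.join(output_dir, "optimizer.pt"))
    logger.info("Checkpoint written to %s", output_dir)
```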

regisss commented 2 years ago

I cannot reproduce it; it works fine on my machine. I guess you're using version 1.4.0 of the SDK?

libinta commented 2 years ago

Yes, we are using 1.4.0. Can you share your gaudi_config.json, command, and run log?

regisss commented 2 years ago

I have not upgraded to 1.4.0 yet. Please first ensure that pytest tests/test_gaudi_configuration.py tests/test_trainer_distributed.py tests/test_trainer.py runs without any failure. If it does not, try with 1.3.0.

My gaudi config is the following:

{
  "use_habana_mixed_precision": true,
  "hmp_opt_level": "O1",
  "hmp_is_verbose": false,
  "use_fused_adam": true,
  "use_fused_clip_norm": true,
  "log_device_mem_alloc": false
}
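
As a side note, a quick way to sanity-check that a gaudi_config.json contains the fields above before launching a run could look like this. This is a standard-library-only sketch, not how optimum-habana itself parses the file:

```python
# Sketch: verify the gaudi_config.json shown above has the expected fields.
# Standard library only; illustrative, not optimum-habana's own parsing.
import json

EXPECTED_KEYS = {
    "use_habana_mixed_precision", "hmp_opt_level", "hmp_is_verbose",
    "use_fused_adam", "use_fused_clip_norm", "log_device_mem_alloc",
}

with open("gaudi_config.json") as f:
    cfg = json.load(f)

missing = EXPECTED_KEYS - cfg.keys()
if missing:
    raise ValueError(f"gaudi_config.json is missing keys: {sorted(missing)}")
print("Gaudi config OK:", cfg)
```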

I ran the command you gave with this Gaudi configuration and --save_steps 5.

Here is the log: roberta-large.log

yeonsily commented 2 years ago

Update: I tried 1.3.0 and it works fine there too, so the issue comes from 1.4.0.

I see some kind of HCL timeout error when saving the optimizer:
/home/jenkins/workspace/cdsoftwarebuilder/create-binaries-from-sw-sources---bp-dt/repos/hcl/src/platform/gaudi/infra/hcl_command_submission.cpp::147(onWatcher): The condition [ rc == 0 ] failed. hlthunk_wait_for_cs() failed with rc(-110) status(2) csid(395642) csHandle(560395)

And the roberta-large optimizer is very big (about 2.8 GB):
-rw-r--r-- 1 root root **2834713829** Apr 5 13:15 optimizer.pt

I tried https://github.com/HabanaAI/Model-References/blob/1.4.0/PyTorch/nlp/finetuning/huggingface/bert/transformers/src/transformers/trainer.py#L1766 just to check whether saving a small amount of data works, and it does. But we still need to save the rest of the optimizer state too.
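
The experiment boils down to something like the sketch below (not the actual trainer.py code): swapping the full optimizer state dict for a tiny dummy payload isolates whether the size of optimizer.pt is the trigger.

```python
# Sketch of the experiment described above: save a tiny dummy payload instead
# of the ~2.8 GB optimizer state dict to check whether the hang depends on the
# size of the data being written. Not the actual trainer.py code.
import os

import torch

def save_optimizer(optimizer, output_dir, small_test=False):
    path = os.path.join(output_dir, "optimizer.pt")
    if small_test:
        torch.save({"dummy": torch.zeros(1)}, path)   # tiny payload for the test
    else:
        torch.save(optimizer.state_dict(), path)      # full state, ~2.8 GB here
```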

libinta commented 2 years ago

When we change the output directory to /tmp rather than example/question_answer, the issue goes away. @regisss, when you said it works on the 1.3 docker, which output directory did you specify?

regisss commented 2 years ago

I think it was ./roberta_large_8x_bf16_lazy. I cannot be 100% sure because I cleaned it up, but the only thing I changed was the path to the Gaudi config.

Do you know if anything has changed in 1.4.0 regarding process synchronization in HCCL? Shall I call torch.distributed.barrier() before saving?
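
For clarity, the guard would look roughly like this. A sketch only; whether HCCL on 1.4.0 actually needs it is exactly the open question:

```python
# Sketch of the proposed guard: make every rank reach the same point before the
# main process starts writing the checkpoint.
import torch.distributed as dist

def barrier_if_distributed():
    if dist.is_available() and dist.is_initialized():
        dist.barrier()  # blocks until all processes in the group arrive here

# Before saving:
# barrier_if_distributed()
# trainer.save_model(output_dir)
```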

libinta commented 2 years ago

@regisss do you still see the issue now? I didn't see any model call that (a barrier) before saving a checkpoint on R1.4. Also, when we avoid saving to the example/question_answer directory where run_qa.py lives, the HCL issue is gone.

regisss commented 2 years ago

I've been able to reproduce it. The fact that this does not occur in single-card training and that errors are a bit random confirms that it is a synchronization issue. It happens when a checkpoint is saved: some processes are accessing a file/folder that some others are deleting.
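
In other words, the race is around checkpoint rotation. The general shape of the guard is to let only the main process delete and rewrite checkpoint folders, fenced by barriers on both sides. This is a sketch of the pattern, not the actual diff in #20:

```python
# Sketch of the pattern for avoiding the race described above: only the main
# process deletes stale checkpoints and writes the new one, and barriers on
# both sides stop other ranks from touching a folder while it is being removed.
# Illustrative only; not the actual change merged in #20.
import shutil

import torch.distributed as dist

def rotate_and_save(trainer, new_checkpoint_dir, stale_checkpoint_dirs):
    is_dist = dist.is_available() and dist.is_initialized()
    if is_dist:
        dist.barrier()                           # all ranks reach the save point
    if trainer.args.local_rank in (-1, 0):       # main process only
        for stale in stale_checkpoint_dirs:
            shutil.rmtree(stale, ignore_errors=True)
        trainer.save_model(new_checkpoint_dir)
    if is_dist:
        dist.barrier()                           # others wait until the folder is stable
```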

I just pushed a fix in #20. Let me know if it works well now, @libinta.

regisss commented 2 years ago

Closing this issue as PR #20 was approved and merged.