Training of GPT2 hangs during Checkpoint stage #28662

Closed jchauhan closed 6 months ago

jchauhan commented 8 months ago

System Info


- `transformers` version: 4.38.0.dev0
- Platform: Linux-5.4.0-1043-gcp-x86_64-with-glibc2.31
- Python version: 3.10.0
- Huggingface_hub version: 0.20.3
- Safetensors version: 0.4.2
- Accelerate version: 0.26.1
- Accelerate config:    not found
- PyTorch version (GPU?): 2.1.2+cu121 (False)
- Tensorflow version (GPU?): not installed (NA)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Using GPU in script?: TPU
- Using distributed or parallel set-up in script?: xla_spawn script

GCP TPU v2.8 Architecture

Libraries installed

Who can help?

text models: @ArthurZucker and @younesbelkada trainer: @muellerzr and @pacman100

Reproduction




  1. Procure a GCP TPU v2.8 VM.
  2. Set up Transformers in a virtual env.
  3. Run a training command similar to the one below:
python ./transformers/examples/pytorch/ --num_cores 8  ./transformers/examples/pytorch/language-modeling/ --model_name_or_path "gpt2" \
    --train_file data.txt \
    --per_device_train_batch_size 2 \
    --per_device_eval_batch_size 2 \
    --do_train \
    --output_dir my-gpt \
    --overwrite_output_dir \
    --log_level debug \
    --save_steps 1000 \
    --cache_dir ./cache/ \
    --num_train_epochs 40

Expected behavior

Training and checkpointing should complete within a reasonable time (15 minutes). Training itself takes 5 minutes; however, checkpointing and saving the model never complete.
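
For a hang like this, it can help to see where the process is stuck. A minimal sketch, using only the standard-library `faulthandler` module: it dumps the stack of every thread if a call is still running after a timeout, without killing the run. The names `run_with_hang_dump` and `slow_save` are hypothetical stand-ins for illustration; nothing here is part of `transformers` or the training script.

```python
import faulthandler
import tempfile
import time


def run_with_hang_dump(fn, timeout_s, log_file):
    """Run fn(); if it is still running after timeout_s seconds, write the
    stack of every thread to log_file. The call itself is not interrupted."""
    faulthandler.dump_traceback_later(timeout_s, exit=False, file=log_file)
    try:
        return fn()
    finally:
        # Disarm the watchdog once fn() has returned or raised.
        faulthandler.cancel_dump_traceback_later()


def slow_save():
    # Hypothetical stand-in for a checkpoint save that stalls.
    time.sleep(0.5)
    return "saved"


if __name__ == "__main__":
    # faulthandler writes to the file descriptor, so a real file is needed.
    with tempfile.TemporaryFile(mode="w+") as log:
        print(run_with_hang_dump(slow_save, timeout_s=0.1, log_file=log))
        log.seek(0)
        print(log.read())  # stack dump naming the function that was running
```

In a real run, the same two `faulthandler` calls could be wrapped around the training entry point with a much longer timeout, so the dump shows which call the save is blocked in.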

ArthurZucker commented 8 months ago

I would recommend checking #26724 and trying the solution there. It might be that issue or, if saving still does not work, a concurrency problem in the saving code. That code was recently changed, cc @muellerzr 🤗
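
To illustrate the kind of concurrency problem that can make a save hang, here is a toy simulation using only the standard library. It assumes the save step contains a synchronization point that every worker must reach; if only one worker enters it, that worker waits indefinitely (here it times out instead). This is a sketch of the general failure mode, not a confirmed diagnosis of this issue, and `run_workers`, `save_all_ranks`, and `save_master_only` are hypothetical names, not Trainer or torch_xla APIs.

```python
import threading

WORLD_SIZE = 4  # hypothetical worker count for the simulation


def save_all_ranks(rank, barrier, timeout_s):
    # Every worker reaches the sync point; only rank 0 would write the file.
    barrier.wait(timeout_s)


def save_master_only(rank, barrier, timeout_s):
    # Buggy pattern: the sync point is guarded by a rank check, so rank 0
    # waits for peers that never arrive.
    if rank == 0:
        barrier.wait(timeout_s)


def run_workers(save_fn, timeout_s=1.0):
    """Run save_fn on WORLD_SIZE threads and return the sorted list of
    ranks that got stuck at the sync point (timed out)."""
    barrier = threading.Barrier(WORLD_SIZE)
    stuck = []
    lock = threading.Lock()

    def worker(rank):
        try:
            save_fn(rank, barrier, timeout_s)
        except threading.BrokenBarrierError:
            # Raised when the wait times out: the real code would hang here.
            with lock:
                stuck.append(rank)

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(WORLD_SIZE)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sorted(stuck)


if __name__ == "__main__":
    print("all ranks save:  stuck =", run_workers(save_all_ranks))
    print("master-only save: stuck =", run_workers(save_master_only, timeout_s=0.2))
```

The timeout exists only so the simulation terminates; a real collective without a timeout would simply block, which matches a save that never completes.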

