huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for PyTorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

`VisionTextDualEncoder`: Distributed training is always enabled #24924

Closed by phiyodr 1 year ago

phiyodr commented 1 year ago

System Info

Who can help?

No response

Information

Tasks

Reproduction

Hi,

I'm running the unchanged "VisionTextDualEncoder and CLIP model training example" on my local laptop (which has 1 GPU) and I wonder why it reports `distributed training: True` (and not `False`). From the output:

07/19/2023 15:21:22 - WARNING - __main__ - Process rank: 0, device: cuda:0, n_gpu: 1distributed training: True, 16-bits training: False

The above output originates from run_clip.py

    logger.warning(
        f"Process rank: {training_args.local_rank}, device: {training_args.device}, n_gpu: {training_args.n_gpu}"
        + f"distributed training: {bool(training_args.local_rank != -1)}, 16-bits training: {training_args.fp16}"
    )
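
For reference, the True/False in that message comes from the `local_rank != -1` check above; plugging in the value from my log line gives (a quick illustration, not extra instrumentation):

    # The check run_clip.py performs, using the values from the log output above
    local_rank = 0                   # "Process rank: 0" in the output
    print(bool(local_rank != -1))    # True -> logged as "distributed training: True"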

My questions:


Bigger picture: sometimes my training (on a cluster) hangs at the (n-1)-th iteration and never finishes. I wonder whether this is related to distributed training, and I don't know how to debug it.

100%|█████████▉| 2875/2876 [11:34<00:00,  4.10it/s]
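
(Not discussed in this thread, but one generic way to see where a hung run like this is stuck is the standard-library faulthandler module; a minimal sketch, to be placed near the top of the training script:)

    # Sketch only: periodically dump the stack of every Python thread to stderr,
    # so a hang at the last iterations shows which call never returns
    # (for example, a collective op waiting on another process).
    import faulthandler
    import sys

    faulthandler.dump_traceback_later(timeout=600, repeat=True, file=sys.stderr)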

Thanks in advance!

Expected behavior

I don't want to use distributed training, i.e. I expect `training_args.local_rank` to be `-1`.

sgugger commented 1 year ago

How are you launching the training script? Could you share that part?

phiyodr commented 1 year ago

I use the unchanged code from the example:

python examples/pytorch/contrastive-image-text/run_clip.py \
    --output_dir ./clip-roberta-finetuned \
    --model_name_or_path ./clip-roberta \
    --data_dir $PWD/data \
    --dataset_name ydshieh/coco_dataset_script \
    --dataset_config_name=2017 \
    --image_column image_path \
    --caption_column caption \
    --remove_unused_columns=False \
    --do_train  --do_eval \
    --per_device_train_batch_size="64" \
    --per_device_eval_batch_size="64" \
    --learning_rate="5e-5" --warmup_steps="0" --weight_decay 0.1 \
    --overwrite_output_dir

I use neither `python -m torch.distributed.launch ...` nor anything like `accelerate launch ...`, just plain `python ...` :)

Thank you in advance!

sgugger commented 1 year ago

That is really weird. @muellerzr could you have a look here to check we didn't mess something up with the Accelerate integration in the Trainer?

muellerzr commented 1 year ago

This is fine; however, the example scripts need to be updated, since checking `local_rank != -1` is the wrong check to use after the Accelerate integration. Will open a PR. You can confirm the training is not running in multi-GPU mode by adding the following to that warning:

    logger.warning(
        f"Process rank: {training_args.local_rank}, device: {training_args.device}, n_gpu: {training_args.n_gpu}"
        + f"distributed training: {bool(training_args.local_rank != -1)}, 16-bits training: {training_args.fp16}"
        + f"State: {training_args.distributed_state}"
    )

This will print the accelerator state, which shows:

Distributed environment: NO
Num processes: 1
Process index: 0
Local process index: 0
Device: cuda

Like we expect 😄
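
For completeness, a minimal sketch of the kind of check that does reflect the distributed state (assuming accelerate is installed; illustrative only, not necessarily the exact change made to the example scripts):

    # Sketch: query Accelerate's process state directly instead of local_rank != -1
    from accelerate import PartialState
    from accelerate.utils import DistributedType

    state = PartialState()  # roughly the state object Trainer exposes as training_args.distributed_state
    is_distributed = state.distributed_type != DistributedType.NO
    print(f"distributed training: {is_distributed}")  # False for a plain single-GPU `python run_clip.py` run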

muellerzr commented 1 year ago

All the examples are updated in #24956

phiyodr commented 1 year ago

Perfect! Thanks a lot for the clarification :+1: