huggingface / transformers

🤗 Transformers: State-of-the-art Machine Learning for PyTorch, TensorFlow, and JAX.
https://huggingface.co/transformers
Apache License 2.0

`VisionTextDualEncoder`: Distributed training is always enabled #24924

Closed by phiyodr 1 year ago

phiyodr commented 1 year ago

System Info

Who can help?

No response

Information

Tasks

Reproduction

Hi,

I'm running the unchanged "VisionTextDualEncoder and CLIP model training example" on my local laptop (which has 1 GPU) and I wonder why it reports `distributed training: True` (and not `False`). From the output:

07/19/2023 15:21:22 - WARNING - __main__ - Process rank: 0, device: cuda:0, n_gpu: 1distributed training: True, 16-bits training: False

The above output originates from run_clip.py

    logger.warning(
        f"Process rank: {training_args.local_rank}, device: {training_args.device}, n_gpu: {training_args.n_gpu}"
        + f"distributed training: {bool(training_args.local_rank != -1)}, 16-bits training: {training_args.fp16}"
    )
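
For reference, the True/False in that message comes from the `local_rank != -1` check above; plugging in the value from my log line gives (a quick illustration, not extra instrumentation):

    # The check run_clip.py performs, using the values from the log output above
    local_rank = 0                   # "Process rank: 0" in the output
    print(bool(local_rank != -1))    # True -> logged as "distributed training: True"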

My questions:


Bigger picture: sometimes my training (on a cluster) hangs at the (n-1)-th iteration and never finishes. I wonder whether this is related to distributed training, and I don't know how to debug it.

100%|█████████▉| 2875/2876 [11:34<00:00,  4.10it/s]
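
(Not discussed in this thread, but one generic way to see where a hung run like this is stuck is the standard-library faulthandler module; a minimal sketch, to be placed near the top of the training script:)

    # Sketch only: periodically dump the stack of every Python thread to stderr,
    # so a hang at the last iterations shows which call never returns
    # (for example, a collective op waiting on another process).
    import faulthandler
    import sys

    faulthandler.dump_traceback_later(timeout=600, repeat=True, file=sys.stderr)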

Thanks in advance!

Expected behavior

I don't want to use distributed training, i.e. I expect `training_args.local_rank` to be `-1`.

sgugger commented 1 year ago

How are you launching the training script? Could you share that part?

phiyodr commented 1 year ago

I use the unchanged code from the example:

python examples/pytorch/contrastive-image-text/run_clip.py \
    --output_dir ./clip-roberta-finetuned \
    --model_name_or_path ./clip-roberta \
    --data_dir $PWD/data \
    --dataset_name ydshieh/coco_dataset_script \
    --dataset_config_name=2017 \
    --image_column image_path \
    --caption_column caption \
    --remove_unused_columns=False \
    --do_train  --do_eval \
    --per_device_train_batch_size="64" \
    --per_device_eval_batch_size="64" \
    --learning_rate="5e-5" --warmup_steps="0" --weight_decay 0.1 \
    --overwrite_output_dir

I use neither `python -m torch.distributed.launch ...` nor anything like `accelerate launch ...`, just plain `python ...` :)

Thank you in advance!

sgugger commented 1 year ago

That is really weird. @muellerzr could you have a look here to check we didn't mess something up with the Accelerate integration in the Trainer?

muellerzr commented 1 year ago

This is fine; however, the example scripts need to be updated, since checking `local_rank != -1` is the wrong check to use after the Accelerate integration. Will open a PR. You can confirm the training is not running in multi-GPU mode by adding the following to that warning:

    logger.warning(
        f"Process rank: {training_args.local_rank}, device: {training_args.device}, n_gpu: {training_args.n_gpu}"
        + f"distributed training: {bool(training_args.local_rank != -1)}, 16-bits training: {training_args.fp16}"
        + f"State: {training_args.distributed_state}"
    )

This will print the accelerator state, which shows:

Distributed environment: NO
Num processes: 1
Process index: 0
Local process index: 0
Device: cuda

Like we expect 😄
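
For completeness, a minimal sketch of the kind of check that does reflect the distributed state (assuming accelerate is installed; illustrative only, not necessarily the exact change made to the example scripts):

    # Sketch: query Accelerate's process state directly instead of local_rank != -1
    from accelerate import PartialState
    from accelerate.utils import DistributedType

    state = PartialState()  # roughly the state object Trainer exposes as training_args.distributed_state
    is_distributed = state.distributed_type != DistributedType.NO
    print(f"distributed training: {is_distributed}")  # False for a plain single-GPU `python run_clip.py` run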

muellerzr commented 1 year ago

All the examples are updated in #24956

phiyodr commented 1 year ago

Perfect! Thanks a lot for the clarification :+1: