How are you launching the training script? Could you share that part?
I use the unchanged code from the example:
python examples/pytorch/contrastive-image-text/run_clip.py \
--output_dir ./clip-roberta-finetuned \
--model_name_or_path ./clip-roberta \
--data_dir $PWD/data \
--dataset_name ydshieh/coco_dataset_script \
--dataset_config_name=2017 \
--image_column image_path \
--caption_column caption \
--remove_unused_columns=False \
--do_train --do_eval \
--per_device_train_batch_size="64" \
--per_device_eval_batch_size="64" \
--learning_rate="5e-5" --warmup_steps="0" --weight_decay 0.1 \
--overwrite_output_dir
I use neither python -m torch.distributed.launch ... nor things like accelerate launch .... Just plain python ... :)
Thank you in advance!
That is really weird. @muellerzr could you have a look here to check we didn't mess something up with the Accelerate integration in the Trainer?
This is fine; the scripts need to be updated, however, as checking local_rank != -1 is the wrong check to use after the Accelerate integration. Will open a PR. You can confirm it's training on a single GPU (not multi-GPU) by adding the following to that warning:
logger.warning(
    f"Process rank: {training_args.local_rank}, device: {training_args.device}, n_gpu: {training_args.n_gpu}"
    + f"distributed training: {bool(training_args.local_rank != -1)}, 16-bits training: {training_args.fp16}"
    + f"State: {training_args.distributed_state}"
)
This will print the accelerator state, which shows:
Distributed environment: NO
Num processes: 1
Process index: 0
Local process index: 0
Device: cuda
Like we expect 😄
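For reference, here is a minimal sketch (not necessarily the exact change made to the examples) of a check based on the Accelerate-backed state instead of local_rank; it reuses the training_args and logger objects already defined in run_clip.py, and parallel_mode / distributed_state are existing TrainingArguments attributes:

# Sketch only: decide "distributed or not" from the state, not from local_rank != -1.
from transformers.training_args import ParallelMode

is_distributed = training_args.parallel_mode == ParallelMode.DISTRIBUTED
logger.warning(
    f"distributed training: {is_distributed}, "
    f"num processes: {training_args.distributed_state.num_processes}"
)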
All the examples are updated in #24956
Perfect! Thanks a lot for the clarification :+1:
System Info
transformers version: 4.32.0.dev0
Who can help?
No response
Information
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
Reproduction
Hi,
I'm running the unchanged "VisionTextDualEncoder and CLIP model training example" on my local laptop (which has 1 GPU) and wonder why it claims to do distributed training: True (and not False). The output in question (the startup warning) originates from run_clip.py. training_args.local_rank=-1 according to TrainingArguments, but it is somehow set to 0 in this example and I don't know why. Passing local_rank=-1 to the run_clip.py example script does not show any effect.
My questions:
1. Why is local_rank set to 0?
2. Does local_rank=0 really mean that distributed training in Trainer is enabled? (I'm new to Trainer and usually work with DistributedDataParallel.)
3. Bigger picture: sometimes my training (on a cluster) hangs at the (n-1)-th iteration and never finishes. I wonder if this has to do with distributed training, and I don't know how to debug it (see the sketch after this list for one quick check).
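A quick, hypothetical check (not part of run_clip.py) for whether a real process group was initialized, which is what actual distributed training would require:

# Hypothetical snippet: a genuinely distributed run has an initialized
# torch.distributed process group; a plain `python` launch does not.
import torch.distributed as dist

if dist.is_available() and dist.is_initialized():
    print(f"distributed: rank {dist.get_rank()} of {dist.get_world_size()} processes")
else:
    print("no process group initialized -> single-process training")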
Thanks in advance!
Expected behavior
I don't want to use distributed training, i.e. training_args.local_rank = -1.
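A minimal sketch of how to observe the reported behavior outside the training script — hypothetical, and the exact values depend on the transformers version in use:

# Hypothetical snippet, not part of run_clip.py: inspect what TrainingArguments
# reports for a plain `python` launch on a single GPU.
from transformers import TrainingArguments

args = TrainingArguments(output_dir="tmp")
print(args.device)         # forces the device/state setup, e.g. cuda:0
print(args.local_rank)     # 0 rather than -1 on the version reported here
print(args.parallel_mode)  # not ParallelMode.DISTRIBUTED for a single-process run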