jxmorris12 / vec2text

utilities for decoding deep representations (like sentence embeddings) back to text

DDP is not working #19

Closed siebeniris closed 11 months ago

siebeniris commented 11 months ago

Hi, I would like to get DDP working for training, but I immediately got this warning when running the program:

12/11/2023 16:32:07 - WARNING - vec2text.experiments - Process rank: 0, device: cuda:0, n_gpu: 1, fp16 training: False, bf16 training: False

The following is my bash script for running DDP in the cloud with a Singularity container. Is there something I should have set up that I didn't? Any help would be appreciated.

#!/bin/bash

LANG=$1
MODEL=$2
EMBEDDER=$3
DATASET=$4
EXP_GROUP_NAME=$5
EPOCH=$6
BATCH_SIZE=$7
MAX_LENGTH=$8

export NCCL_P2P_LEVEL=NVL
echo "language $LANG"
echo "model $MODEL"
echo "embedder $EMBEDDER"
echo "dataset $DATASET"
echo "exp_group_name $EXP_GROUP_NAME"
echo "epochs $EPOCH"
echo "batch size $BATCH_SIZE"
echo "max length $MAX_LENGTH"

echo "nvidia"
nvidia-smi

torchrun -m vec2text.run \
    --overwrite_output_dir \
    --per_device_train_batch_size ${BATCH_SIZE} \
    --per_device_eval_batch_size ${BATCH_SIZE} \
    --max_seq_length ${MAX_LENGTH} \
    --model_name_or_path ${MODEL} \
    --dataset_name ${DATASET} \
    --embedder_model_name ${EMBEDDER} \
    --num_repeat_tokens 16 \
    --embedder_no_grad True \
    --num_train_epochs ${EPOCH} \
    --max_eval_samples 500 \
    --eval_steps 20000 \
    --warmup_steps 10000 \
    --use_frozen_embeddings_as_input True \
    --experiment inversion \
    --lr_scheduler_type constant_with_warmup \
    --exp_group_name ${EXP_GROUP_NAME} \
    --learning_rate 0.001 \
    --output_dir ./saves/inverters/${DATASET}_${LANG} \
    --save_steps 2000 \
    --use_wandb 1 \
    --ddp_find_unused_parameters True

--gres=gpu:a40:2 is set for the Singularity container.
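One thing worth checking: torchrun launches a single worker per node unless told otherwise, so a two-GPU run needs an explicit worker count. A minimal sketch (only the --nproc_per_node flag differs from the script above; the remaining flags are abbreviated, not part of the original script):

```shell
# Spawn one worker per GPU; torchrun sets RANK, LOCAL_RANK, and
# WORLD_SIZE in each worker's environment.
torchrun --nproc_per_node=2 -m vec2text.run \
    --per_device_train_batch_size ${BATCH_SIZE} \
    --ddp_find_unused_parameters True
    # ... remaining flags as in the script above
```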

jxmorris12 commented 11 months ago

Hi. I don't think this is an error with vec2text. It looks like either (1) you're launching DDP with only one GPU [via torchrun?], or (2) that print statement is coming from within the training process, which in your setting should run as two concurrent processes, each with one GPU.
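To illustrate the second point: under torchrun, each worker reads its own rank from the environment, so a "rank: 0, device: cuda:0, n_gpu: 1" line per process is expected even in a healthy two-GPU run. A hypothetical sketch of how such a per-worker banner could be built (worker_banner is illustrative, not vec2text code):

```python
import os

def worker_banner():
    """Build the per-process log line a DDP worker would print.

    torchrun exports RANK, LOCAL_RANK, and WORLD_SIZE to each worker,
    and each worker typically drives exactly one GPU, hence n_gpu: 1
    per process.
    """
    rank = int(os.environ.get("RANK", 0))
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    world_size = int(os.environ.get("WORLD_SIZE", 1))
    return (f"Process rank: {rank}, device: cuda:{local_rank}, "
            f"n_gpu: 1 (of {world_size} workers)")
```

With two workers, rank 0 and rank 1 each print their own line, which is why seeing "rank: 0" alone does not prove only one GPU is in use.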

siebeniris commented 11 months ago

Hi, I think I made a mistake by setting find_unused_parameters=True. DDP does work well for precomputing hypotheses. Thanks!