loubnabnl / santacoder-finetuning

Fine-tune SantaCoder for Code/Text Generation.
Apache License 2.0

I used multiple GPUs for training, but got the following errors. How can I solve them? #15

Closed. chen-lee-li closed this issue 1 year ago.

chen-lee-li commented 1 year ago

lib/python3.10/site-packages/torch/distributed/launch.py:181: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use-env is set by default in torchrun.
If your script expects --local-rank argument to be set, please
change it to read from os.environ['LOCAL_RANK'] instead. See
https://pytorch.org/docs/stable/distributed.html#launch-utility for
further instructions

warnings.warn(
WARNING:torch.distributed.run:
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
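Note that both messages above are warnings rather than fatal errors: torch.distributed.launch still works but is slated for removal, with torchrun as its replacement. If train.py parsed a --local-rank command-line argument, a minimal sketch of the migration the warning suggests (assuming the standard LOCAL_RANK environment variable that torchrun sets for each process) would be:

    import os

    # torchrun exports each process's rank as an environment variable,
    # so read LOCAL_RANK from os.environ instead of parsing --local-rank.
    local_rank = int(os.environ.get("LOCAL_RANK", 0))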

loubnabnl commented 1 year ago

If you didn't change the training code, this command shouldn't raise the issue:

python -m torch.distributed.launch \
        --nproc_per_node number_of_gpus train.py \
        --model_path="bigcode/santacoder" \
        --dataset_name="bigcode/the-stack-dedup" \
        --subset="data/shell" \
        --data_column "content" \
        --split="train" \
        --seq_length 2048 \
        --max_steps 30000 \
        --batch_size 2 \
        --gradient_accumulation_steps 8 \
        --learning_rate 5e-5 \
        --num_warmup_steps 500 \
        --eval_freq 3000 \
        --save_freq 3000 \
        --log_freq 1 \
        --num_workers="$(nproc)"

If the issue persists, can you provide details about your command and library versions?
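Since torch.distributed.launch is deprecated, the same run can also be launched with torchrun. This is a sketch of the equivalent invocation, assuming the same train.py and flags as above (number_of_gpus is a placeholder for your GPU count); the OMP_NUM_THREADS warning can likewise be addressed by setting the variable explicitly, e.g. prefixing the command with OMP_NUM_THREADS=1:

    torchrun --nproc_per_node number_of_gpus train.py \
            --model_path="bigcode/santacoder" \
            --dataset_name="bigcode/the-stack-dedup" \
            --subset="data/shell" \
            --data_column "content" \
            --split="train" \
            --seq_length 2048 \
            --max_steps 30000 \
            --batch_size 2 \
            --gradient_accumulation_steps 8 \
            --learning_rate 5e-5 \
            --num_warmup_steps 500 \
            --eval_freq 3000 \
            --save_freq 3000 \
            --log_freq 1 \
            --num_workers="$(nproc)"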