Hi,
Thanks for sharing your code.
I plan to re-run your shared InCoder fine-tuning code, but I ran into some problems while fine-tuning.
Specifically, I used your shared Docker image to run the code. My deep learning server has 4 × RTX 3090 GPUs, and when I run your code unchanged everything works, but I noticed that only one GPU is being used. I checked the source code and guessed the batch size (which is set to 1) might be the reason, so I tried a larger batch size, e.g. 4, but then I got a ZeroDivisionError. Here is the detailed output:
model parameters: 1312063488
Token indices sequence length is longer than the specified maximum sequence length for this model (2905 > 2048). Running this sequence through the model will result in indexing errors
finish loading: 10000
finish loading: 20000
finish loading: 30000
finish loading: 40000
finish loading: 50000
finish loading: 60000
finish loading: 70000
finish loading: 80000
finish loading: 90000
finish loading: 100000
finish loading: 110000
finish loading: 120000
../../ft_data/ft_train.jsonl total size: 122847
../../ft_data/ft_eval.jsonl total size: 1000
/root/miniconda3/lib/python3.10/site-packages/torch/nn/parallel/_functions.py:68: UserWarning: Was asked to gather along dimension 0, but all input tensors were scalars; will instead unsqueeze and return a vector.
warnings.warn('Was asked to gather along dimension 0, but all '
NCCL Error 2: unhandled system error
/root/miniconda3/lib/python3.10/site-packages/torch/optim/lr_scheduler.py:131: UserWarning: Detected call of `lr_scheduler.step()` before `optimizer.step()`. In PyTorch 1.1.0 and later, you should call them in the opposite order: `optimizer.step()` before `lr_scheduler.step()`. Failure to do this will result in PyTorch skipping the first value of the learning rate schedule. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
warnings.warn("Detected call of `lr_scheduler.step()` before `optimizer.step()`. "
Traceback (most recent call last):
File "/home/CLM/clm-apr/incoder_finetune/finetune.py", line 135, in <module>
fine_tune(
File "/home/CLM/clm-apr/incoder_finetune/finetune.py", line 117, in fine_tune
round(sum(training_loss) / len(training_loss), 4),
ZeroDivisionError: division by zero
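The division that fails is the average over training_loss, so it looks like that list is empty when the summary is printed. A guard like the sketch below avoids the crash itself (this is just my own sketch, not code from your repo), but it probably only masks whatever leaves training_loss empty:

# Hypothetical guard around the failing line in finetune.py (my own sketch):
# skip the averaging when no loss values have been collected, instead of dividing by zero.
if training_loss:
    avg_loss = round(sum(training_loss) / len(training_loss), 4)
else:
    avg_loss = float("nan")  # nothing was accumulated in this logging interval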
Did you run into the same problem when fine-tuning? And how should I solve it?
Yes, this code is written for training on a single GPU. If you want to leverage all your GPUs, you can increase the batch size and wrap the model with torch.nn.DataParallel.
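Roughly like the sketch below; I'm assuming the facebook/incoder-1B checkpoint (it matches the ~1.3B parameter count in your log), and the input tensor is a dummy standing in for whatever finetune.py actually feeds the model:

import torch
import torch.nn as nn
from transformers import AutoModelForCausalLM

# Load the model (assumed checkpoint) and replicate it on all visible GPUs;
# DataParallel splits each batch along dimension 0, one chunk per GPU.
model = AutoModelForCausalLM.from_pretrained("facebook/incoder-1B").to("cuda")
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)

# Use a per-step batch size that is a multiple of the GPU count (e.g. 4 on 4 x 3090)
# so every replica gets at least one sample. Dummy batch: 4 sequences of 128 token ids.
input_ids = torch.randint(0, 50000, (4, 128), device="cuda")

# Each replica returns its own scalar loss, which DataParallel gathers into a vector
# (that is where the "gather along dimension 0 ... scalars" warning comes from),
# so reduce it to a scalar before calling backward().
loss = model(input_ids, labels=input_ids).loss.mean()
loss.backward()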