Hi,
Thanks for sharing your code.
I plan to re-run your shared InCoder fine-tuning code, but I ran into some problems while fine-tuning.
Specifically, I used your shared Docker image to run the code. My deep learning server has 4 × RTX 3090 GPUs, and when I run your code unchanged everything works, but I noticed that only one GPU is being used. I checked the source code and guessed the batch size (which is set to 1) might be the reason, so I tried a larger batch size, e.g. 4, but then I got a ZeroDivisionError. Here is the detailed output:
model parameters: 1312063488
Token indices sequence length is longer than the specified maximum sequence length for this model (2905 > 2048). Running this sequence through the model will result in indexing errors
finish loading: 10000
finish loading: 20000
finish loading: 30000
finish loading: 40000
finish loading: 50000
finish loading: 60000
finish loading: 70000
finish loading: 80000
finish loading: 90000
finish loading: 100000
finish loading: 110000
finish loading: 120000
../../ft_data/ft_train.jsonl total size: 122847
../../ft_data/ft_eval.jsonl total size: 1000
/root/miniconda3/lib/python3.10/site-packages/torch/nn/parallel/_functions.py:68: UserWarning: Was asked to gather along dimension 0, but all input tensors were scalars; will instead unsqueeze and return a vector.
warnings.warn('Was asked to gather along dimension 0, but all '
NCCL Error 2: unhandled system error
/root/miniconda3/lib/python3.10/site-packages/torch/optim/lr_scheduler.py:131: UserWarning: Detected call of `lr_scheduler.step()` before `optimizer.step()`. In PyTorch 1.1.0 and later, you should call them in the opposite order: `optimizer.step()` before `lr_scheduler.step()`. Failure to do this will result in PyTorch skipping the first value of the learning rate schedule. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
warnings.warn("Detected call of `lr_scheduler.step()` before `optimizer.step()`. "
Traceback (most recent call last):
File "/home/CLM/clm-apr/incoder_finetune/finetune.py", line 135, in <module>
fine_tune(
File "/home/CLM/clm-apr/incoder_finetune/finetune.py", line 117, in fine_tune
round(sum(training_loss) / len(training_loss), 4),
ZeroDivisionError: division by zero
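The division that fails is the average over training_loss, so it looks like that list is empty when the summary is printed. A guard like the sketch below avoids the crash itself (this is just my own sketch, not code from your repo), but it probably only masks whatever leaves training_loss empty:

# Hypothetical guard around the failing line in finetune.py (my own sketch):
# skip the averaging when no loss values have been collected, instead of dividing by zero.
if training_loss:
    avg_loss = round(sum(training_loss) / len(training_loss), 4)
else:
    avg_loss = float("nan")  # nothing was accumulated in this logging interval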
Did you run into the same problem when fine-tuning? And how should I solve it?
Yes, this code is written for training on a single GPU. If you want to leverage all your GPUs, you can increase the batch size and wrap the model with torch.nn.DataParallel.
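Roughly like the sketch below; I'm assuming the facebook/incoder-1B checkpoint (it matches the ~1.3B parameter count in your log), and the input tensor is a dummy standing in for whatever finetune.py actually feeds the model:

import torch
import torch.nn as nn
from transformers import AutoModelForCausalLM

# Load the model (assumed checkpoint) and replicate it on all visible GPUs;
# DataParallel splits each batch along dimension 0, one chunk per GPU.
model = AutoModelForCausalLM.from_pretrained("facebook/incoder-1B").to("cuda")
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)

# Use a per-step batch size that is a multiple of the GPU count (e.g. 4 on 4 x 3090)
# so every replica gets at least one sample. Dummy batch: 4 sequences of 128 token ids.
input_ids = torch.randint(0, 50000, (4, 128), device="cuda")

# Each replica returns its own scalar loss, which DataParallel gathers into a vector
# (that is where the "gather along dimension 0 ... scalars" warning comes from),
# so reduce it to a scalar before calling backward().
loss = model(input_ids, labels=input_ids).loss.mean()
loss.backward()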