Open luckyfish0826 opened 1 year ago
Update: the same error appears in both 0.2.3 and 0.2.5.
Do you have gradient accumulation steps larger than your dataset size?
Not quite sure about this. In my case I changed nothing but the dummy.json file. There seems to be a minimum conversation count required; after testing, we found it's about 100. Really weird.
Oh, I ran into the same problem before. In my case it was because I used a small dataset and set a gradient accumulation step count larger than the dataset size. Training became normal after I reduced the accumulation steps. Maybe your situation is similar. Hope this can inspire you!
Thank you, I'm new to LLMs. I basically understand your point; I'll try it and see what happens.
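The arithmetic behind the accumulation-steps suggestion can be sketched in a few lines. All numbers below are hypothetical stand-ins, not settings taken from this issue:

```python
# Why gradient accumulation larger than the dataset can break training:
# the optimizer only steps once every `grad_accum_steps` micro-batches,
# so with too few batches per epoch it never takes a complete step.
dataset_size = 45          # conversations kept in dummy.json
per_device_batch_size = 1
grad_accum_steps = 128     # hypothetical training-script setting

batches_per_epoch = dataset_size // per_device_batch_size
updates_per_epoch = batches_per_epoch // grad_accum_steps
print(updates_per_epoch)   # 0 -> no complete optimizer update per epoch
```

If `updates_per_epoch` comes out 0, lowering `grad_accum_steps` (or adding data) is the usual fix.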
At first we edited the dummy.json file, changing "my name is Vicuna" to "my name is XXXXX" and keeping all the other conversations (910 in total), then trained. The new model worked fine in English, but failed when we asked it questions in other languages.
So, to narrow down the problem, we made the same change but kept only the 45 "who are you" conversations (deleting the other 865) and trained again. This time we got the following error:
RuntimeError: The size of tensor a (32768512) must match the size of tensor b (262148096) at non-singleton dimension 0
The full traceback is below. Can anyone help?
Not sure whether this really qualifies as an issue, but we couldn't find a better place to ask.
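The dataset edit described above can be sketched as follows, assuming the ShareGPT-style schema FastChat's training data uses (`{"id", "conversations": [{"from", "value"}]}`). The two samples here are hypothetical stand-ins for the real dummy.json contents:

```python
# Hypothetical samples in the ShareGPT-style schema used by FastChat training data.
data = [
    {"id": "1", "conversations": [
        {"from": "human", "value": "Who are you?"},
        {"from": "gpt", "value": "my name is Vicuna."}]},
    {"id": "2", "conversations": [
        {"from": "human", "value": "What is 2+2?"},
        {"from": "gpt", "value": "4."}]},
]

# Rename the assistant everywhere, as in the first experiment.
for sample in data:
    for turn in sample["conversations"]:
        turn["value"] = turn["value"].replace("my name is Vicuna", "my name is XXXXX")

# Keep only the identity conversations, mirroring the 45-sample subset.
kept = [s for s in data
        if any("who are you" in t["value"].lower() for t in s["conversations"])]
print(len(kept))  # 1
```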
```
2023-05-08 10:33:13.000 [INFO] [Driver] Traceback (most recent call last):
  File "/home/xxxxxx/source/FastChat/fastchat/train/train_mem.py", line 13, in <module>
    train()
  File "/home/xxxxxx/source/FastChat/fastchat/train/train.py", line 245, in train
    trainer.train()
  File "/home/xxxxxx/miniconda3/envs/fschat/lib/python3.10/site-packages/transformers/trainer.py", line 1662, in train
    return inner_training_loop(
  File "/home/xxxxxx/miniconda3/envs/fschat/lib/python3.10/site-packages/transformers/trainer.py", line 1996, in _inner_training_loop
    self.optimizer.step()
  File "/home/xxxxxx/miniconda3/envs/fschat/lib/python3.10/site-packages/torch/optim/lr_scheduler.py", line 68, in wrapper
    return wrapped(*args, **kwargs)
  File "/home/xxxxxx/miniconda3/envs/fschat/lib/python3.10/site-packages/torch/optim/optimizer.py", line 140, in wrapper
    out = func(*args, **kwargs)
  File "/home/xxxxxx/miniconda3/envs/fschat/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/home/xxxxxx/miniconda3/envs/fschat/lib/python3.10/site-packages/torch/optim/adamw.py", line 162, in step
    adamw(params_with_grad,
  File "/home/xxxxxx/miniconda3/envs/fschat/lib/python3.10/site-packages/torch/optim/adamw.py", line 219, in adamw
    func(params,
  File "/home/xxxxxx/miniconda3/envs/fschat/lib/python3.10/site-packages/torch/optim/adamw.py", line 273, in _single_tensor_adamw
    exp_avg.mul_(beta1).add_(grad, alpha=1 - beta1)
RuntimeError: The size of tensor a (32768512) must match the size of tensor b (262148096) at non-singleton dimension 0
```
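The failing line, `exp_avg.mul_(beta1).add_(grad, alpha=1 - beta1)`, updates AdamW's first-moment state in place, so the error means the optimizer's saved state no longer matches the gradient's shape. Note that 262148096 = 8 × 32768512, which suggests flattened or differently-sharded parameters, e.g. a stale checkpoint in `output_dir` being resumed against a different run configuration; that is a guess, not something confirmed in this thread. A minimal sketch, not FastChat's code, that reproduces the same failure mode by corrupting the state on purpose:

```python
import torch

# Deliberately give AdamW's first-moment state the wrong shape, mimicking a
# stale or differently-sharded optimizer checkpoint. foreach=False forces the
# _single_tensor_adamw path seen in the traceback.
param = torch.nn.Parameter(torch.zeros(8))
opt = torch.optim.AdamW([param], lr=1e-3, foreach=False)

param.grad = torch.ones(8)
opt.step()  # creates exp_avg / exp_avg_sq shaped like the parameter

opt.state[param]["exp_avg"] = torch.zeros(64)  # wrong shape on purpose

param.grad = torch.ones(8)
err_msg = ""
try:
    opt.step()
except RuntimeError as err:
    err_msg = str(err)
print(err_msg)  # same kind of size-mismatch error as in the log
```

Deleting any leftover `checkpoint-*` directories in `output_dir` before retraining on the shrunken dataset would rule out the stale-state explanation.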