philschmid / deep-learning-pytorch-huggingface

MIT License

Error when training peft model example #18

Open · Tachyon5 opened this issue 1 year ago

Tachyon5 commented 1 year ago

Hi, I am trying to run the example notebook training/peft-flan-t5-int8-summarization.ipynb on a p3dn.24xlarge (8 GPUs, 96 vCPUs, 768 GB RAM, 256 GB total VRAM). I am running the example directly on the machine, exactly as written, but I get the following error when calling train():

```python
trainer.train()
```

```
/home/ubuntu/.local/lib/python3.8/site-packages/transformers/optimization.py:391: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
  warnings.warn(
  0%|          | 0/1155 [00:00<?, ?it/s]
You're using a T5TokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
/home/ubuntu/.local/lib/python3.8/site-packages/bitsandbytes/autograd/_functions.py:298: UserWarning: MatMul8bitLt: inputs will be cast from torch.float32 to float16 during quantization
  warnings.warn(f"MatMul8bitLt: inputs will be cast from {A.dtype} to float16 during quantization")

Traceback (most recent call last):
  in <module>:1
  File "/home/ubuntu/.local/lib/python3.8/site-packages/transformers/trainer.py", line 1633, in train
    return inner_training_loop(
  File "/home/ubuntu/.local/lib/python3.8/site-packages/accelerate/utils/memory.py", line 124, in decorator
    return function(batch_size, *args, **kwargs)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/transformers/trainer.py", line 1902, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/transformers/trainer.py", line 2645, in training_step
    loss = self.compute_loss(model, inputs)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/transformers/trainer.py", line 2677, in compute_loss
    outputs = model(**inputs)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 171, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 181, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/home/ubuntu/.local/lib/python3.8/site-packages/torch/nn/parallel/parallel_apply.py", line 89, in parallel_apply
    output.reraise()
  File "/home/ubuntu/.local/lib/python3.8/site-packages/torch/_utils.py", line 644, in reraise
    raise exception
RuntimeError: Caught RuntimeError in replica 0 on device 0.

Original Traceback (most recent call last):
  File "/home/ubuntu/.local/lib/python3.8/site-packages/torch/nn/parallel/parallel_apply.py", line 64, in _worker
    output = module(*input, **kwargs)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/peft/peft_model.py", line 667, in forward
    return self.base_model(
  File "/home/ubuntu/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/transformers/models/t5/modeling_t5.py", line 1667, in forward
    encoder_outputs = self.encoder(
  File "/home/ubuntu/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/transformers/models/t5/modeling_t5.py", line 1061, in forward
    layer_outputs = checkpoint(
  File "/home/ubuntu/.local/lib/python3.8/site-packages/torch/utils/checkpoint.py", line 249, in checkpoint
    return CheckpointFunction.apply(function, preserve, *args)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/torch/autograd/function.py", line 506, in apply
    return super().apply(*args, **kwargs)  # type: ignore[misc]
  File "/home/ubuntu/.local/lib/python3.8/site-packages/torch/utils/checkpoint.py", line 107, in forward
    outputs = run_function(*args)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/transformers/models/t5/modeling_t5.py", line 1057, in custom_forward
    return tuple(module(*inputs, use_cache, output_attentions))
  File "/home/ubuntu/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/transformers/models/t5/modeling_t5.py", line 693, in forward
    self_attention_outputs = self.layer[0](
  File "/home/ubuntu/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/transformers/models/t5/modeling_t5.py", line 600, in forward
    attention_output = self.SelfAttention(
  File "/home/ubuntu/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/transformers/models/t5/modeling_t5.py", line 572, in forward
    attn_output = self.o(attn_output)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/bitsandbytes/nn/modules.py", line 242, in forward
    out = bnb.matmul(x, self.weight, bias=self.bias, state=self.state)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/bitsandbytes/autograd/_functions.py", line 488, in matmul
    return MatMul8bitLt.apply(A, B, out, bias, state)
  File "/home/ubuntu/.local/lib/python3.8/site-packages/torch/autograd/function.py", line 506, in apply
    return super().apply(*args, **kwargs)  # type: ignore[misc]
  File "/home/ubuntu/.local/lib/python3.8/site-packages/bitsandbytes/autograd/_functions.py", line 397, in forward
    output += torch.matmul(subA, state.subB)
RuntimeError: mat1 and mat2 shapes cannot be multiplied (2048x2 and 1x4096)
```

philschmid commented 1 year ago

Exact dataset? Exact code? No changes at all?

Tachyon5 commented 1 year ago

Yes, I just ran the code as it is in the notebook. The only differences are the machine and that I ran it in the IPython REPL.

tomdzh commented 1 year ago

Got the same issue. It seems to happen only on multi-GPU machines. For instance, a g5.4xlarge works but a g5.12xlarge doesn't.
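(A workaround sketch, not from the thread: hiding all but one GPU before CUDA is initialized makes the Trainer see a single device, so it never wraps the model in DataParallel.)

```python
import os

# Hedged workaround: restrict the process to one GPU. This must run before
# torch initializes CUDA, i.e. before importing torch/transformers.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

import torch

assert torch.cuda.device_count() == 1  # Trainer now skips DataParallel
```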

Tachyon5 commented 1 year ago

I suspected exactly that. I spun up a single-GPU machine and it works, but it's very slow. I'm going to see whether creating a custom device_map has any effect.
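A minimal sketch of that idea, assuming the checkpoint id used in the notebook (verify against your copy): passing `device_map={"": 0}` pins every module to GPU 0, which sidesteps the multi-GPU replication at the cost of using only one card.

```python
from transformers import AutoModelForSeq2SeqLM

# Sketch: load the int8 model onto a single device instead of letting
# accelerate shard or replicate it across GPUs.
model = AutoModelForSeq2SeqLM.from_pretrained(
    "philschmid/flan-t5-xxl-sharded-fp16",  # checkpoint from the notebook
    load_in_8bit=True,
    device_map={"": 0},  # "" is the root module, so everything lands on cuda:0
)
```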

philschmid commented 1 year ago

Thank you, @tomdzh! Yes, the int-8 example is not yet working on a multi-GPU setup. You would need to combine PEFT with DeepSpeed (DS) or FSDP for that.
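For reference, a minimal sketch of how the Trainer hooks into DeepSpeed (untested with this notebook; `ds_config.json` is a hypothetical ZeRO config you would write, and the script must then be started with a distributed launcher such as `deepspeed` or `torchrun` rather than plain `python`):

```python
from transformers import Seq2SeqTrainingArguments

# Sketch: the Trainer enables DeepSpeed when TrainingArguments.deepspeed
# points at a ZeRO config file; combined with PEFT this shards training
# across GPUs instead of relying on DataParallel replication.
training_args = Seq2SeqTrainingArguments(
    output_dir="output",            # placeholder
    per_device_train_batch_size=8,  # placeholder values
    learning_rate=1e-3,
    deepspeed="ds_config.json",     # hypothetical ZeRO config path
)
```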

tomdzh commented 1 year ago

@philschmid it would be great to publish a blog post about combining PEFT with FSDP :).