Thanks to your help, the dataset problem is resolved. A few more missing-data errors did come up afterwards, but the missing items were in the files downloaded from the data hub, so they were easy to fix.
However, training is now blocked again by "RuntimeError: shape '[16, 2048, 32, 128]' is invalid for input of size 33554432".
I know the shape comes from .view(bsz, q_len, self.num_heads, self.head_dim), and that 'bsz' and 'q_len' are determined by per_device_train_batch_size and model_max_length in "finetune_lora.sh". So I tried changing 'bsz' from 16 to 4, but then got "RuntimeError: shape '[4, 2048, 32, 128]' is invalid for input of size 8388608". Changing 'q_len' did not help either.
I also checked whether the global batch size (128) of the KoLLaVA-v1.5-Synatra-7B fine-tuning recipe contributes to this problem, but that does not seem to be it either. (To keep the global batch size at 128, I set gradient_accumulation_steps to 4, since my server has 2 GPUs.)
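For reference, the global-batch-size arithmetic I used (with the values mentioned above: batch size 16 per device, 2 GPUs, gradient accumulation 4) works out like this:

```python
# Global batch size check for the settings described above.
per_device_train_batch_size = 16
num_gpus = 2
gradient_accumulation_steps = 4

global_batch_size = (per_device_train_batch_size
                     * num_gpus
                     * gradient_accumulation_steps)
print(global_batch_size)  # 128, matching the recipe's target
```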
The error also seems entirely unrelated to 'self.num_heads' and 'self.head_dim', so which value should I modify?
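A quick sanity check on the numbers in the error message (assuming the standard 7B attention dims, num_heads=32 and head_dim=128, i.e. hidden_size=4096) suggests the mismatch is in the sequence-length dimension:

```python
# Numbers taken from the failing .view() call and its error message,
# assuming num_heads=32, head_dim=128 (hidden_size = 32 * 128 = 4096).
bsz, q_len = 16, 2048
num_heads, head_dim = 32, 128
hidden_size = num_heads * head_dim  # 4096

expected = bsz * q_len * num_heads * head_dim  # elements .view() asks for
actual = 33554432                              # elements the tensor has

print(expected)                       # 134217728, 4x larger than 'actual'
print(actual // (bsz * hidden_size))  # 512: the tensor's real sequence length
```

The same arithmetic holds for the bsz=4 run (8388608 // (4 * 4096) is also 512), so in both cases the incoming tensor has a sequence length of 512 while q_len is 2048; that is, the inconsistency lies between q_len and the tensor's actual sequence length, not in bsz, num_heads, or head_dim.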
Thanks again; below is just the error portion of the console output.
wandb: 🚀 View run at https://wandb.ai/jiwon_ha/huggingface/runs/d5rk2eng
0%| | 0/4543 [00:00<?, ?it/s]/home/work/anaconda3/envs/kollava/lib/python3.10/site-packages/torch/utils/checkpoint.py:464: UserWarning: torch.utils.checkpoint: the use_reentrant parameter should be passed explicitly. In version 2.4 we will raise an exception if use_reentrant is not passed. use_reentrant=False is recommended, but if you need to preserve the current default behavior, you can pass use_reentrant=True. Refer to docs for more details on the differences between the two variants.
warnings.warn(
Traceback (most recent call last):
  File "/home/work/testdataset1/KoLLaVA/llava/train/train_xformers.py", line 13, in <module>
    train()
  File "/home/work/testdataset1/KoLLaVA/llava/train/train.py", line 933, in train
    trainer.train()
  File "/home/work/anaconda3/envs/kollava/lib/python3.10/site-packages/transformers/trainer.py", line 1539, in train
    return inner_training_loop(
  File "/home/work/anaconda3/envs/kollava/lib/python3.10/site-packages/transformers/trainer.py", line 1809, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/home/work/anaconda3/envs/kollava/lib/python3.10/site-packages/transformers/trainer.py", line 2654, in training_step
    loss = self.compute_loss(model, inputs)
  File "/home/work/anaconda3/envs/kollava/lib/python3.10/site-packages/transformers/trainer.py", line 2679, in compute_loss
    outputs = model(**inputs)
  File "/home/work/anaconda3/envs/kollava/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/work/anaconda3/envs/kollava/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/work/anaconda3/envs/kollava/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/home/work/anaconda3/envs/kollava/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1735, in forward
    loss = self.module(*inputs, **kwargs)
  File "/home/work/anaconda3/envs/kollava/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/work/anaconda3/envs/kollava/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1582, in _call_impl
    result = forward_call(*args, **kwargs)
  File "/home/work/anaconda3/envs/kollava/lib/python3.10/site-packages/peft/peft_model.py", line 922, in forward
    return self.base_model(
  File "/home/work/anaconda3/envs/kollava/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/work/anaconda3/envs/kollava/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1582, in _call_impl
    result = forward_call(*args, **kwargs)
  File "/home/work/testdataset1/KoLLaVA/llava/model/language_model/llava_llama.py", line 88, in forward
    return super().forward(
  File "/home/work/anaconda3/envs/kollava/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 806, in forward
    outputs = self.model(
  File "/home/work/anaconda3/envs/kollava/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/work/anaconda3/envs/kollava/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1582, in _call_impl
    result = forward_call(*args, **kwargs)
  File "/home/work/anaconda3/envs/kollava/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 685, in forward
    layer_outputs = torch.utils.checkpoint.checkpoint(
  File "/home/work/anaconda3/envs/kollava/lib/python3.10/site-packages/torch/_compile.py", line 24, in inner
    return torch._dynamo.disable(fn, recursive)(*args, **kwargs)
  File "/home/work/anaconda3/envs/kollava/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 451, in _fn
    return fn(*args, **kwargs)
  File "/home/work/anaconda3/envs/kollava/lib/python3.10/site-packages/torch/_dynamo/external_utils.py", line 36, in inner
    return fn(*args, **kwargs)
  File "/home/work/anaconda3/envs/kollava/lib/python3.10/site-packages/torch/utils/checkpoint.py", line 487, in checkpoint
    return CheckpointFunction.apply(function, preserve, *args)
  File "/home/work/anaconda3/envs/kollava/lib/python3.10/site-packages/torch/autograd/function.py", line 598, in apply
    return super().apply(*args, **kwargs)  # type: ignore[misc]
  File "/home/work/anaconda3/envs/kollava/lib/python3.10/site-packages/torch/utils/checkpoint.py", line 262, in forward
    outputs = run_function(*args)
  File "/home/work/anaconda3/envs/kollava/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 681, in custom_forward
    return module(*inputs, output_attentions, None)
  File "/home/work/anaconda3/envs/kollava/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/work/anaconda3/envs/kollava/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1582, in _call_impl
    result = forward_call(*args, **kwargs)
  File "/home/work/anaconda3/envs/kollava/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 408, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
  File "/home/work/anaconda3/envs/kollava/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/work/anaconda3/envs/kollava/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1582, in _call_impl
    result = forward_call(*args, **kwargs)
  File "/home/work/testdataset1/KoLLaVA/llava/train/llama_xformers_attn_monkey_patch.py", line 42, in xformers_forward
    .view(bsz, q_len, self.num_heads, self.head_dim)
RuntimeError: shape '[16, 2048, 32, 128]' is invalid for input of size 33554432