haotian-liu / LLaVA

[NeurIPS'23 Oral] Visual Instruction Tuning (LLaVA) built towards GPT-4V level capabilities and beyond.
https://llava.hliu.cc
Apache License 2.0

training error [Question] #614

Open ai1361720220000 opened 1 year ago

ai1361720220000 commented 1 year ago

Question

I followed finetune.sh; the only differences are that I removed `--deepspeed ./scripts/zero3.json` and set `--group_by_modality_length` to `False`.

```
Original Traceback (most recent call last):
  File "/opt/conda/envs/llava/lib/python3.10/site-packages/torch/nn/parallel/parallel_apply.py", line 64, in _worker
    output = module(*input, **kwargs)
  File "/opt/conda/envs/llava/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/notebook/data/personal/80303875/LLaVA/llava/model/language_model/llava_llama.py", line 78, in forward
    outputs = self.model(
  File "/opt/conda/envs/llava/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/envs/llava/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 685, in forward
    layer_outputs = torch.utils.checkpoint.checkpoint(
  File "/opt/conda/envs/llava/lib/python3.10/site-packages/torch/utils/checkpoint.py", line 249, in checkpoint
    return CheckpointFunction.apply(function, preserve, *args)
  File "/opt/conda/envs/llava/lib/python3.10/site-packages/torch/autograd/function.py", line 506, in apply
    return super().apply(*args, **kwargs)  # type: ignore[misc]
  File "/opt/conda/envs/llava/lib/python3.10/site-packages/torch/utils/checkpoint.py", line 107, in forward
    outputs = run_function(*args)
  File "/opt/conda/envs/llava/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 681, in custom_forward
    return module(*inputs, output_attentions, None)
  File "/opt/conda/envs/llava/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/opt/conda/envs/llava/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 408, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
  File "/opt/conda/envs/llava/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/notebook/data/personal/80303875/LLaVA/llava/train/llama_flash_attn_monkey_patch.py", line 87, in forward
    output_unpad = flash_attn_unpadded_qkvpacked_func(
  File "/opt/conda/envs/llava/lib/python3.10/site-packages/flash_attn/flash_attn_interface.py", line 406, in flash_attn_varlen_qkvpacked_func
    return FlashAttnVarlenQKVPackedFunc.apply(
  File "/opt/conda/envs/llava/lib/python3.10/site-packages/torch/autograd/function.py", line 506, in apply
    return super().apply(*args, **kwargs)  # type: ignore[misc]
  File "/opt/conda/envs/llava/lib/python3.10/site-packages/flash_attn/flash_attn_interface.py", line 123, in forward
    out, q, k, v, out_padded, softmax_lse, S_dmask, rng_state = _flash_attn_varlen_forward(
  File "/opt/conda/envs/llava/lib/python3.10/site-packages/flash_attn/flash_attn_interface.py", line 52, in _flash_attn_varlen_forward
    out, q, k, v, out_padded, softmax_lse, S_dmask, rng_state = flash_attn_cuda.varlen_fwd(
RuntimeError: FlashAttention only support fp16 and bf16 data type
```
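The final RuntimeError comes from the flash-attn CUDA kernel, which only accepts fp16/bf16 tensors, so the activations reaching it here were presumably fp32 (PyTorch's default dtype) once the DeepSpeed config that sets up mixed precision was removed. A minimal standalone sketch of that dtype issue (plain PyTorch, not LLaVA code):

```python
import torch
import torch.nn as nn

# Standalone illustration: flash-attn kernels reject fp32, which is
# PyTorch's default parameter dtype.
layer = nn.Linear(128, 128)   # stands in for a LLaMA attention projection
x = torch.randn(4, 128)

print(layer.weight.dtype)     # torch.float32 -> this is what trips the kernel

# Mixed-precision training (e.g. bf16, as the zero3 setup configures)
# effectively hands flash-attn tensors like this instead:
layer_bf16 = layer.to(torch.bfloat16)
y = layer_bf16(x.to(torch.bfloat16))
print(y.dtype)                # torch.bfloat16 -> accepted by flash-attn
```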

haotian-liu commented 1 year ago

Please do not remove the DeepSpeed config, as it is crucial for the proper distributed training settings.
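For reference, the relevant finetune.sh flags roughly map onto Hugging Face `TrainingArguments` as in the sketch below; the `deepspeed` and `bf16` fields are the ones that matter here, and the other values are placeholders rather than LLaVA's exact settings:

```python
from transformers import TrainingArguments

# Hedged sketch of how the relevant finetune.sh flags reach the HF Trainer;
# output_dir and the batch/checkpointing values are illustrative placeholders.
training_args = TrainingArguments(
    output_dir="./checkpoints/llava-finetune",  # hypothetical path
    deepspeed="./scripts/zero3.json",  # keep this: ZeRO-3 sharding + mixed precision
    bf16=True,                         # keeps activations in bf16, as flash-attn requires
    per_device_train_batch_size=16,
    gradient_checkpointing=True,
)
```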

ai1361720220000 commented 1 year ago

It works now, thanks a lot!! But I found that an error occasionally appears during runs; it succeeds after re-running a few times.