modelscope / swift

ms-swift: Use PEFT or Full-parameter to finetune 300+ LLMs or 50+ MLLMs. (Qwen2, GLM4v, Internlm2.5, Yi, Llama3, Llava-Video, Internvl2, MiniCPM-V, Deepseek, Baichuan2, Gemma2, Phi3-Vision, ...)
https://swift.readthedocs.io/zh-cn/latest/
Apache License 2.0

Finetuning Florence on the forgetting problem #1267

Open lucasjinreal opened 2 weeks ago

lucasjinreal commented 2 weeks ago

Hi, when finetuning the florence-ft model, it forgets its old knowledge. (We are not using Florence directly; we train it and then adopt its vision encoder into a large LLM instead.)

Is there a way to keep the original ability while continuing to train it on new data?

tastelikefeet commented 1 week ago

This is a common problem. We have many generic multi-modal datasets such as okvqa/gqa; you can train this model on your data mixed with these datasets: https://github.com/modelscope/swift/blob/main/docs/source_en/LLM/Supported-models-datasets.md#Datasets (search for "multi-modal" on that page to find them).
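For illustration, here is a minimal sketch of the replay idea, mixing task-specific samples with samples from generic multi-modal datasets; the jsonl file names, the field layout, and the 25% replay ratio are assumptions, not swift's own API:

```python
import json
import random

def mix_datasets(new_samples, generic_samples, replay_ratio=0.25, seed=0):
    """Mix task-specific samples with 'replay' samples drawn from generic
    multi-modal datasets (e.g. okvqa/gqa) to reduce catastrophic forgetting."""
    rng = random.Random(seed)
    n_replay = int(len(new_samples) * replay_ratio)
    replay = rng.sample(generic_samples, min(n_replay, len(generic_samples)))
    mixed = list(new_samples) + replay
    rng.shuffle(mixed)
    return mixed

# Hypothetical jsonl files: one sample per line with "query", "response", "images".
with open("my_task.jsonl") as f:
    new_samples = [json.loads(line) for line in f]
with open("generic_multimodal.jsonl") as f:
    generic_samples = [json.loads(line) for line in f]

with open("mixed_train.jsonl", "w") as f:
    for sample in mix_datasets(new_samples, generic_samples):
        f.write(json.dumps(sample, ensure_ascii=False) + "\n")
```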

lucasjinreal commented 1 week ago

Hello, I found that training florence2 hits errors on an A100, while it is OK on a 910B. Is there any reason for this?

The error is extremely weird.

```
../aten/src/ATen/native/cuda/Indexing.cu:1289: indexSelectLargeIndex: block: [203,0,0], thread: [107,0,0] Assertion srcIndex < srcSelectDimSize failed.
(the same assertion repeats for threads [108,0,0] through [127,0,0])
rank6: Traceback (most recent call last):
rank6:   File "pretrain_flr2.py", line 861, in
rank6:   File "pretrain_flr2.py", line 834, in train
rank6:   File "/usr/local/lib/python3.8/dist-packages/transformers/trainer.py", line 1885, in train
rank6:     return inner_training_loop(
rank6:   File "/usr/local/lib/python3.8/dist-packages/transformers/trainer.py", line 2216, in _inner_training_loop
rank6:     tr_loss_step = self.training_step(model, inputs)
rank6:   File "/usr/local/lib/python3.8/dist-packages/transformers/trainer.py", line 3241, in training_step
rank6:   File "/usr/local/lib/python3.8/dist-packages/torch/cuda/memory.py", line 162, in empty_cache
rank6: RuntimeError: CUDA error: device-side assert triggered
rank6: CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
rank6: For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
rank6: Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
[2024-07-08 16:54:35,787] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 96561
[2024-07-08 16:54:36,102] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 96562
[2024-07-08 16:54:36,522] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 96563
[2024-07-08 16:54:36,847] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 96585
[2024-07-08 16:54:37,239] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 96691
[2024-07-08 16:54:37,631] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 96756
[2024-07-08 16:54:38,025] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 96757
[2024-07-08 16:54:38,025] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 96758
```

This is not caused by OOM, since I reduced the batch size to a minimum of 2; my 910B can train with batch size 5, and the A100 has 80G of memory.

I have no clue why it behaves like this. Any thoughts on it?

hjh0119 commented 1 week ago

@lucasjinreal hi
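For reference, the indexSelectLargeIndex assertion generally means an embedding or index_select lookup received an index outside its table (for example a token id that is >= the vocabulary size, or a position id past the end of a positional-embedding table). Two generic ways to get a readable error; this is only a sketch, and `model` / `batch` stand for whatever your training script builds:

```python
import os

# Must be set before CUDA is initialized; makes the error surface at the real op
# instead of at an unrelated later call such as empty_cache.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch

def debug_one_batch_on_cpu(model, batch):
    """Replay a suspect batch on CPU: the same out-of-range index that triggers
    the opaque device-side assert raises a readable IndexError here."""
    model_cpu = model.to("cpu")
    cpu_batch = {k: (v.cpu() if torch.is_tensor(v) else v) for k, v in batch.items()}
    with torch.no_grad():
        return model_cpu(**cpu_batch)
```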

lucasjinreal commented 1 week ago

I took a deep dive into it.

I found this happens only when a large-scale dataset is applied, which is very weird.

Not sure if it is caused by some bad samples in the data.

hjh0119 commented 1 week ago

I suspect it might be due to excessively long data, as the max length for Florence is relatively small (I recall it being 1024).
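A quick way to check this hypothesis is to tokenize the responses offline and count how many exceed 1024 tokens. A minimal sketch, assuming a jsonl training file with a "response" field (the checkpoint name, file path, and field name are assumptions):

```python
import json
from transformers import AutoProcessor

# Florence-2 checkpoints ship a custom processor, hence trust_remote_code=True.
processor = AutoProcessor.from_pretrained("microsoft/Florence-2-base-ft", trust_remote_code=True)
tokenizer = processor.tokenizer
MAX_LEN = 1024  # the limit suspected above

n_long = 0
with open("train.jsonl") as f:
    for i, line in enumerate(f):
        sample = json.loads(line)
        ids = tokenizer(sample["response"])["input_ids"]
        if len(ids) > MAX_LEN:
            n_long += 1
            print(f"sample {i}: {len(ids)} tokens")
print(f"{n_long} samples exceed {MAX_LEN} tokens")
```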

lucasjinreal commented 1 week ago

If the length is too long, it shouldn't make the whole program crash, should it? I am not sure about this; could it be? What will happen if the length exceeds 1024?

hjh0119 commented 1 week ago

The error seems to be caused by excessively long inputs. Nonetheless, I will add an update with a maximum length limit to prevent this situation.

lucasjinreal commented 1 week ago

Hi, I forced model_max_length to 4096 and the same error still occurred. When using another, relatively small dataset, everything is normal.

This is driving me mad; any thoughts?

Should I force a limit on the character length of the answers?

hjh0119 commented 1 week ago

The model itself does not support excessively long inputs, so setting model_max_length to 4096 does not solve the problem. Therefore, I have added length truncation to prevent crashes due to excessively long inputs.
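Conceptually the truncation just clips every per-token tensor to the supported length before it reaches the embedding layers; a minimal sketch of the idea (not the exact change made in swift):

```python
import torch

def clip_batch(batch: dict, max_length: int = 1024) -> dict:
    """Clip (batch, seq_len) tensors such as input_ids, attention_mask and labels
    so that no position index can exceed the model's embedding tables."""
    clipped = {}
    for key, value in batch.items():
        if torch.is_tensor(value) and value.dim() == 2:
            clipped[key] = value[:, :max_length]
        else:
            clipped[key] = value
    return clipped
```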

lucasjinreal commented 1 week ago

Hi, I tried training it further, and the purpose is just to extend it to longer outputs, but I found that it is not merely unsupported, it crashes.

This is not what I expected, since the output labels should not be limited to 1024 for a generative model.

hjh0119 commented 1 week ago

Are you suggesting to truncate only the input? I'm not sure if this is feasible because in the official example, max_new_tokens in the generate function is set to 1024.

lucasjinreal commented 1 week ago

The input_ids should be the question, such as <OCR> etc. (translated into short text later); it is the labels that are limited to 1024, and I don't know why.

If the labels exceed 1024, it just crashes.

Theoretically it shouldn't.

My purpose in retraining is actually to extend it to 4096, and now I am stuck.
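One direction, assuming the 1024 limit really comes from a learned positional-embedding table in the decoder (I have not verified Florence-2's internals): that table would have to be grown to 4096 and the new rows trained further. A generic sketch of growing such a table; locating the actual module inside the model is left open:

```python
import torch
import torch.nn as nn

def extend_learned_position_embeddings(old_emb: nn.Embedding, new_max: int = 4096) -> nn.Embedding:
    """Grow a learned positional-embedding table so that longer sequences no
    longer index past its end; new rows start from the last learned position."""
    old_max, dim = old_emb.weight.shape
    new_emb = nn.Embedding(new_max, dim)
    with torch.no_grad():
        new_emb.weight[:old_max] = old_emb.weight
        new_emb.weight[old_max:] = old_emb.weight[-1]
    return new_emb
```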