This is a common problem, and we have many generic multi-modal datasets such as okvqa/gqa; you can train this model on your data mixed with these datasets: https://github.com/modelscope/swift/blob/main/docs/source_en/LLM/Supported-models-datasets.md#Datasets. Search for "multi-modal" on that page to find them.
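For what it's worth, a minimal sketch of the mixing idea using the HuggingFace datasets library (file names and the sampling ratio are placeholders, and both datasets are assumed to already share the same columns; swift's own dataset options may cover this directly):

```python
# Hypothetical sketch (not the actual swift command): interleave a custom
# dataset with a generic multi-modal VQA dataset so the model keeps seeing
# general data during fine-tuning. Paths and the 0.7/0.3 ratio are
# placeholders, and both JSONL files are assumed to share the same schema.
from datasets import load_dataset, interleave_datasets

custom_ds = load_dataset("json", data_files="my_custom_data.jsonl", split="train")
generic_ds = load_dataset("json", data_files="okvqa_converted.jsonl", split="train")

mixed = interleave_datasets(
    [custom_ds, generic_ds],
    probabilities=[0.7, 0.3],  # 70% custom samples, 30% generic samples
    seed=42,
)
```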
Hello, I found that training florence2 produces errors on A100, while training on 910B is OK. Is there any reason for this?
The error is extremely weird.
../aten/src/ATen/native/cuda/Indexing.cu:1289: indexSelectLargeIndex: block: [203,0,0], thread: [107,0,0] Assertion srcIndex < srcSelectDimSize failed.
../aten/src/ATen/native/cuda/Indexing.cu:1289: indexSelectLargeIndex: block: [203,0,0], thread: [108,0,0] Assertion srcIndex < srcSelectDimSize failed.
(the same assertion is repeated for threads [109,0,0] through [126,0,0])
../aten/src/ATen/native/cuda/Indexing.cu:1289: indexSelectLargeIndex: block: [203,0,0], thread: [127,0,0] Assertion srcIndex < srcSelectDimSize failed.
rank6: Traceback (most recent call last):
rank6:   File "pretrain_flr2.py", line 861, in <module>
rank6:   File "pretrain_flr2.py", line 834, in train
rank6:   File "/usr/local/lib/python3.8/dist-packages/transformers/trainer.py", line 1885, in train
rank6:     return inner_training_loop(
rank6:   File "/usr/local/lib/python3.8/dist-packages/transformers/trainer.py", line 2216, in _inner_training_loop
rank6:     tr_loss_step = self.training_step(model, inputs)
rank6:   File "/usr/local/lib/python3.8/dist-packages/transformers/trainer.py", line 3241, in training_step
rank6:   File "/usr/local/lib/python3.8/dist-packages/torch/cuda/memory.py", line 162, in empty_cache
rank6: RuntimeError: CUDA error: device-side assert triggered
rank6: CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
rank6: For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
rank6: Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
[2024-07-08 16:54:35,787] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 96561
[2024-07-08 16:54:36,102] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 96562
[2024-07-08 16:54:36,522] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 96563
[2024-07-08 16:54:36,847] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 96585
[2024-07-08 16:54:37,239] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 96691
[2024-07-08 16:54:37,631] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 96756
[2024-07-08 16:54:38,025] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 96757
[2024-07-08 16:54:38,025] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 96758
This is not caused by OOM, since I reduced the batch size to 2; my 910B can train with batch size 5, and the A100 has 80 GB of memory.
I have no clue why it behaves like this. Any ideas?
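Since the assert is raised asynchronously, the traceback above points at torch.cuda.empty_cache rather than the op that actually failed. A hedged way to localize it, following the hint in the error message itself (the script name is taken from the traceback; everything else is an assumption):

```python
# Hypothetical debug launcher: force synchronous CUDA kernel launches so the
# device-side assert surfaces at the op that actually triggered it.
# CUDA_LAUNCH_BLOCKING must be set before the GPU is initialized, i.e. before
# the training script runs.
import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import runpy
# "pretrain_flr2.py" is the training entry point from the traceback above.
runpy.run_path("pretrain_flr2.py", run_name="__main__")
```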
@lucasjinreal hi, does --use_flash_attn true cause the same error? I tested it before and it was fine.
I took a deep dive into it.
I found this happens only when a large-scale dataset is used, which is very weird.
I'm not sure if it is caused by some bad samples in the data.
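One way to check the bad-sample hypothesis is to tokenize the targets offline and flag anything longer than the 1024-token budget; a sketch assuming a JSONL file with a "response" field and the HuggingFace Florence-2 processor (file name, field name, and the 1024 limit are assumptions):

```python
# Hypothetical scan: flag samples whose target text tokenizes to more than
# 1024 tokens, which would be consistent with the indexSelect assert above.
import json
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("microsoft/Florence-2-large", trust_remote_code=True)
MAX_LEN = 1024  # assumed decoder length budget for Florence-2

with open("train.jsonl") as f:
    for lineno, line in enumerate(f, 1):
        sample = json.loads(line)
        n_tokens = len(processor.tokenizer(sample["response"])["input_ids"])
        if n_tokens > MAX_LEN:
            print(f"line {lineno}: response is {n_tokens} tokens (> {MAX_LEN})")
```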
I suspect it might be due to excessively long data, as the max length for Florence is relatively small (I recall it being 1024).
If the data is too long, should it really crash the whole program? I am not sure about this; could it? What happens if the length exceeds 1024?
The error seems to be caused by excessively long inputs. Nonetheless, I will add an update with a maximum length limit to prevent this situation.
Hi, I forced model_max_length to 4096, but the same error still occurred, while training on another, relatively small dataset is normal.
I am driven mad by this; any thoughts?
Should I force-limit the character length of the answers?
The model itself does not support excessively long inputs, so setting model_max_length to 4096 does not solve the problem. Therefore, I have added length truncation to prevent crashes due to excessively long inputs.
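For reference, the guard being described amounts to clipping the tokenized tensors before the forward pass; a minimal sketch, not swift's actual implementation (the 1024 constant and function name are assumptions):

```python
# Hypothetical truncation guard: clip input_ids, attention_mask and labels
# along the sequence dimension so over-long samples are shortened instead of
# crashing the embedding lookup.
import torch

MAX_LENGTH = 1024  # assumed maximum text length for Florence-2

def truncate_batch(input_ids: torch.Tensor,
                   attention_mask: torch.Tensor,
                   labels: torch.Tensor):
    """Clip every sequence in the batch to at most MAX_LENGTH tokens."""
    return (input_ids[:, :MAX_LENGTH],
            attention_mask[:, :MAX_LENGTH],
            labels[:, :MAX_LENGTH])
```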
Hi, I am trying to train it further, and the purpose is just to extend it to longer outputs. But I found that it doesn't merely lack support for them: it crashes.
That is unexpected, since the output labels should not be limited to 1024 for a generative model.
Are you suggesting to truncate only the input? I'm not sure if this is feasible because in the official example, max_new_tokens in the generate function is set to 1024.
The input_ids should be the question, such as <OCR> etc. (translated into a short text later); it is the labels that are limited to 1024, and I don't know why.
If the labels exceed 1024, it just crashes.
Theoretically it shouldn't.
The purpose of my retraining is actually to extend it to 4096, and now I am stuck.
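For what it's worth, the indexSelectLargeIndex assert is what an embedding lookup with an out-of-range index produces, and a learned position-embedding table of size 1024 in the decoder would explain why labels longer than that crash rather than merely degrade. A hedged sanity check one could run before the forward pass (argument names and the 1024 default are assumptions):

```python
# Hypothetical pre-forward check: the device-side assert above is consistent
# with an embedding index exceeding its table, e.g. position ids beyond the
# decoder's learned position embeddings.
def check_batch(input_ids, labels, vocab_size, max_positions=1024):
    assert int(input_ids.max()) < vocab_size, "token id outside the vocabulary"
    assert labels.shape[1] <= max_positions, (
        f"labels have {labels.shape[1]} tokens but the decoder only has "
        f"{max_positions} position embeddings"
    )
```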
Hi, when fine-tuning the florence-ft model, the model forgets its old knowledge. (We don't use Florence directly; we train it and then attach its vision encoder to a large LLM instead.)
Is there a way to keep the original ability while continuing to train it on new data?