haotian-liu / LLaVA

[NeurIPS'23 Oral] Visual Instruction Tuning (LLaVA) built towards GPT-4V level capabilities and beyond.
https://llava.hliu.cc
Apache License 2.0

[Usage] error about Chinese data finetuning #862

Closed · CrazyBrick closed this issue 10 months ago

CrazyBrick commented 10 months ago

Describe the issue

Issue: Fine-tuning with the provided dataset works fine, but fine-tuning with local Chinese data raises the error below. It is the same problem as issue #134, but that issue has been closed.

Command:

sh finetune_lora.sh  # with data_path and image_folder changed to point to the local Chinese data

finetune_lora.sh Log:

Traceback (most recent call last):
  File "/workspace/LLaVA/llava/train/train_mem.py", line 13, in <module>
    train()
  File "/workspace/LLaVA/llava/train/train.py", line 932, in train
    trainer.train()
  File "/xxx/python3.10/site-packages/transformers/trainer.py", line 1539, in train
    return inner_training_loop(
  File "/xxx/python3.10/site-packages/transformers/trainer.py", line 1809, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/xxx/python3.10/site-packages/transformers/trainer.py", line 2654, in training_step
    loss = self.compute_loss(model, inputs)
  File "/xxx/python3.10/site-packages/transformers/trainer.py", line 2679, in compute_loss
    outputs = model(**inputs)
  File "/xxx/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/xxx/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/xxx/python3.10/site-packages/deepspeed/runtime/engine.py", line 1735, in forward
    loss = self.module(*inputs, **kwargs)
  File "/xxx/python3.10/site-packages/torch/nn/modules/module.py", line 1538, in _call_impl
    result = forward_call(*args, **kwargs)
  File "/xxx/python3.10/site-packages/peft/peft_model.py", line 922, in forward
    return self.base_model(
  File "/xxx/python3.10/site-packages/torch/nn/modules/module.py", line 1538, in _call_impl
    result = forward_call(*args, **kwargs)
  File "/workspace/LLaVA/llava/model/language_model/llava_llama.py", line 79, in forward
    ) = self.prepare_inputs_labels_for_multimodal(
  File "/workspace/LLaVA/llava/model/llava_arch.py", line 178, in prepare_inputs_labels_for_multimodal
    cur_image_features = image_features[cur_image_idx]
IndexError: index 16 is out of bounds for dimension 0 with size 16.

I've tried deleting every explicit '\n' in the JSON file, for example: "value": "<image>\n图中的人是xxx,请以这个人为主体描述一下图中内容" → "value": "<image>图中的人是xxx,请以这个人为主体描述一下图中内容" (roughly: "The person in the picture is xxx, please describe the image with this person as the subject"), but it doesn't work.
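For reference, my understanding of the expected per-sample layout (a sketch assuming the standard LLaVA instruction-tuning JSON with id / image / conversations fields; the id and image path below are made-up placeholders):

    # Sketch of one fine-tuning sample; concrete values are placeholders,
    # the human "value" is taken from the example above.
    sample = {
        "id": "000001",
        "image": "my_images/000001.jpg",  # relative to image_folder
        "conversations": [
            {"from": "human", "value": "<image>\n图中的人是xxx,请以这个人为主体描述一下图中内容"},
            {"from": "gpt", "value": "图中是..."},  # assistant answer
        ],
    }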

Etelis commented 10 months ago

Had the same issue today with English data; the problems were `\n` and also `"\` in some cases.

Check those.

CrazyBrick commented 10 months ago

> Had the same issue today with English data; the problems were `\n` and also `"\` in some cases. Check those.

Thank you. I checked before, but there is no '\' except '\n'. I printed some variables:

            for i in range(num_images + 1):
                cur_new_input_embeds.append(cur_input_embeds_no_im[i])
                cur_new_labels.append(cur_labels_noim[i])
                if i < num_images:
                    print(f"Batch Index: {batch_idx}\n, Current Image Index: {cur_image_idx}\n, Num Images: {num_images}")

in file `LLaVA/llava/model/llava_arch.py`, around line 178. When it runs normally, Num Images is 1; when it errors, Num Images is 4.

bingwork commented 10 months ago

Suggest making sure one sample is processed correctly at https://github.com/haotian-liu/LLaVA/blob/main/llava/train/train.py#L402 (for example, based on which conversation template you choose). If the input_ids and labels/targets are as expected, then there will be no errors in the later steps. @CrazyBrick
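A rough way to eyeball one processed sample (a sketch, not the exact LLaVA API: it assumes `tokenizer` is the model's tokenizer and `sample` is one element returned by the training dataset with "input_ids" and "labels"; in LLaVA the image placeholder id is IMAGE_TOKEN_INDEX = -200 and masked label positions are -100, so negative ids are filtered out before decoding):

    # Hedged sketch: inspect one processed training sample before launching the run.
    IMAGE_TOKEN_INDEX = -200  # LLaVA's placeholder token id for <image>
    IGNORE_INDEX = -100       # label positions masked out of the loss

    input_ids = sample["input_ids"].tolist()
    labels = sample["labels"].tolist()

    num_image_tokens = sum(1 for t in input_ids if t == IMAGE_TOKEN_INDEX)
    print("image placeholders in input_ids:", num_image_tokens)  # expect 1 for an image sample

    # Decode the surrounding text (drop the special negative ids first).
    print(tokenizer.decode([t for t in input_ids if t >= 0]))
    print(tokenizer.decode([t for t in labels if t >= 0]))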

henrycjh commented 10 months ago

@CrazyBrick Based on your message, I think you have a similar problem to mine, and it is because the data preprocessing is wrong. You should check that for every instance in your data list, mostly in multi-turn conversations, the `<image>` tag appears only once.
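A quick way to check this over the whole JSON (a sketch assuming the standard id / image / conversations layout; the file name is a placeholder):

    import json

    # Hypothetical path to the custom fine-tuning data.
    with open("my_chinese_finetune.json", encoding="utf-8") as f:
        data = json.load(f)

    for sample in data:
        text = "".join(turn["value"] for turn in sample["conversations"])
        n = text.count("<image>")
        # A sample with an "image" field should contain exactly one <image> tag;
        # a text-only sample should contain none.
        expected = 1 if "image" in sample else 0
        if n != expected:
            print(sample.get("id"), "has", n, "<image> tags, expected", expected)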

CrazyBrick commented 10 months ago

> You should check that for every instance in your data list, mostly in multi-turn conversations, the `<image>` tag appears only once.

Thanks, your guess is correct: I do have multi-turn dialogues with multiple `<image>` tags. Yesterday I regenerated a version with only one dialogue turn. Num Images has become what I expected, but the index still increases until `IndexError: index 16 is out of bounds for dimension 0 with size 16`. I have no clue...

henrycjh commented 10 months ago

@CrazyBrick That is weird; maybe you should find a way to print out the actual instance that causes this error. Because the images and texts are loaded separately, the length of image_features is already fixed (e.g. 16 in your case) before the texts are processed. Within one batch, cur_image_idx should only increase in two cases: first, when there is no image in the instance; second, when there is exactly one `<image>` in the instance, so cur_image_idx increases only once per instance. If there are two or more `<image>` tags in an instance, cur_image_idx increases twice or more and then exceeds the bound of the already-fixed image_features.
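In other words (a toy sketch of the bookkeeping, not the real prepare_inputs_labels_for_multimodal code):

    # image_features has one row per image in the batch, but cur_image_idx
    # advances once per <image> tag (or once for a text-only sample), so extra
    # tags walk the index past the end of image_features.
    batch_texts = ["<image> q1", "<image> q2 <image> q3"]  # 2nd sample has 2 tags
    num_image_features = 2                                  # one image per sample

    cur_image_idx = 0
    for text in batch_texts:
        num_images = text.count("<image>")
        for _ in range(max(num_images, 1)):  # a text-only sample still consumes one slot
            if cur_image_idx >= num_image_features:
                raise IndexError(
                    f"index {cur_image_idx} is out of bounds for size {num_image_features}"
                )
            cur_image_idx += 1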

CrazyBrick commented 10 months ago

Thanks @henrycjh, quite weird. I debugged for a while but didn't find anything strange. But after regenerating the custom dataset as multi-turn dialogues with a single `<image>` tag, there seem to be no errors. I don't understand why, but in the end it runs.

But I have to admit that my fine-tuning did not achieve the expected result and seems to have had no effect at all. I don't know if it's due to too little data or other parameter reasons.

henrycjh commented 10 months ago

@CrazyBrick Glad you made it work. I guess the main reason the fine-tuning has no effect is that the LLM is pretrained mainly on English data, so it may still have little effect even if you use a lot of Chinese data during the fine-tuning stage.

CrazyBrick commented 10 months ago

@henrycjh thank you for your help. I selected a portion of the dataset to generate English data, but the predictions are quite poor. I am confused about how to achieve an effective fine-tune, how many constraints are needed, and what kind of results can be expected. I've opened a new issue, #884.

ScottishFold007 commented 6 months ago

Did you do the two-stage training from scratch? In that case, introducing Chinese data in stage 1 and mixing some into stage 2 might work much better. That's what I did myself, and the model's Chinese ability still improved visibly.
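Concretely, mixing Chinese samples into the stage-2 data can be as simple as concatenating the two JSON lists (a sketch; the file names are placeholders, not necessarily the files you use):

    import json
    import random

    # Placeholder file names: the English stage-2 mixture plus the Chinese data.
    with open("llava_v1_5_mix665k.json", encoding="utf-8") as f:
        english_mix = json.load(f)
    with open("chinese_instructions.json", encoding="utf-8") as f:
        chinese = json.load(f)

    mixed = english_mix + chinese
    random.shuffle(mixed)

    with open("stage2_mixed.json", "w", encoding="utf-8") as f:
        json.dump(mixed, f, ensure_ascii=False, indent=2)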

ScottishFold007 commented 6 months ago

I'm assuming you're translating the Chinese data from English? That's what I did, so I ran into the same situation where multiple `<image>` tags appeared in a single conversation.