thunlp / LLaVA-UHD

LLaVA-UHD: an LMM Perceiving Any Aspect Ratio and High-Resolution Images

RuntimeError: Given groups=1, weight of size [1024, 3, 14, 14], expected input xxx to have 3 channels, but got xx channels instead #25

Open wnzhyee opened 2 months ago

wnzhyee commented 2 months ago

Hi, I met the same problem when reproducing the pretraining step; it happens when calculating patch_embedding in

https://github.com/thunlp/LLaVA-UHD/blob/302301bc2175f7e717fb8548516188e89f649753/llava_uhd/train/llava-uhd/adapt_clip.py#L79

It seems a similar problem was raised about 2 months ago; is there any specific timeline for fixing this issue?
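For context, the error is easy to reproduce outside the repo. A minimal sketch (not the repo's code), assuming the weight shape [1024, 3, 14, 14] from the traceback corresponds to CLIP's patch-embedding Conv2d:

# Standalone sketch: the CLIP patch embedding is a Conv2d with in_channels=3,
# so any input whose channel dimension is 3*k with k != 1 triggers the
# channel-mismatch RuntimeError from the issue title.
import torch
import torch.nn as nn

patch_embedding = nn.Conv2d(3, 1024, kernel_size=14, stride=14, bias=False)  # weight: [1024, 3, 14, 14]

ok = patch_embedding(torch.randn(1, 3, 336, 336))   # works, output [1, 1024, 24, 24]
print(ok.shape)

try:
    patch_embedding(torch.randn(1, 12, 336, 336))    # e.g. k == 4 slices stacked on dim 1
except RuntimeError as e:
    print(e)  # "... expected input ... to have 3 channels, but got 12 channels instead"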

wnzhyee commented 2 months ago

Here are some of my attempts to work around this error:

https://github.com/thunlp/LLaVA-UHD/blob/302301bc2175f7e717fb8548516188e89f649753/llava_uhd/train/llava-uhd/adapt_clip.py#L311-L357

In the forward of adapt_CLIPVisionTower, the input images tensor always has shape [B, 3*k, 336, 336], but the code only handles the cases k == 1 and k == 8; for any other value of k it raises an error.

So I changed it to:

if images.shape[1] % 3 == 0:
    # Number of channel-stacked slices per sample (k), instead of hard-coding 1 or 8.
    chunk_size = images.shape[1] // 3

    image_features = []
    split_images = torch.chunk(images, chunks=chunk_size, dim=1)
    slice_w_nums = []
    slice_h_nums = []
    abstract_w_nums = []
    abstract_h_nums = []

    for i in range(len(origin_image_widths)):
        slice_w_num, slice_h_num, abstract_w_num, abstract_h_num = get_patch_nums(
            origin_image_widths[i], origin_image_heights[i])
        slice_w_nums.append(slice_w_num)
        slice_h_nums.append(slice_h_num)
        abstract_w_nums.append(abstract_w_num)
        abstract_h_nums.append(abstract_h_num)

    for i, image in enumerate(split_images):
        # The last chunk uses the abstract (overview) patch counts; the others use the slice patch counts.
        if i == chunk_size - 1:
            image_forward_out = self.vision_tower(
                image.to(device=self.device, dtype=self.dtype).unsqueeze(0),
                output_hidden_states=True,
                w_patch_num=abstract_w_nums,
                h_patch_num=abstract_h_nums)
        else:
            image_forward_out = self.vision_tower(
                image.to(device=self.device, dtype=self.dtype).unsqueeze(0),
                output_hidden_states=True,
                w_patch_num=slice_w_nums,
                h_patch_num=slice_h_nums)

        image_feature = self.feature_select(image_forward_out).to(image.dtype)
        # observed: image_feature.shape == torch.Size([4, 576, 1024])
        image_features.append(image_feature)

With this change, the error is no longer raised.
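For what it's worth, here is a quick standalone check (a sketch with assumed shapes, not the repo's code) that torch.chunk splits the stacked channels into k tensors of shape [B, 3, 336, 336] for any k:

import torch

B, k = 4, 5                                   # any number of slices per sample
images = torch.randn(B, 3 * k, 336, 336)      # channel-stacked slices, as in the forward pass
chunks = torch.chunk(images, chunks=images.shape[1] // 3, dim=1)
print(len(chunks), chunks[0].shape)           # 5 torch.Size([4, 3, 336, 336])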

But after this, in adapt_llava.py, there seem to be two more magic numbers, 8 and 4:

https://github.com/thunlp/LLaVA-UHD/blob/302301bc2175f7e717fb8548516188e89f649753/llava_uhd/train/llava-uhd/adapt_llava.py#L201-L204

I don't know what 8 and 4 mean, but this code raises another error:

IndexError: index 24 is out of bounds for dimension 0 with size 24

Also, in this function cur_image_idx should presumably be incremented, but I can't see any cur_image_idx += 1 in the loop. If I read it correctly, cur_new_input_embeds and cur_new_labels always use the same subset of image_features.
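To illustrate the concern, a toy sketch with made-up data (not the repo's code): the consumer of a flat feature list has to advance its cursor once per consumed image, otherwise every sample reads the same entry and a later bounds check against the expected total can fail.

image_features = [f"feat_{i}" for i in range(3)]   # stand-in for per-image features

cur_image_idx = 0
picked = []
for _ in range(len(image_features)):
    picked.append(image_features[cur_image_idx])
    cur_image_idx += 1        # without this increment, picked == ['feat_0', 'feat_0', 'feat_0']
print(picked)                 # ['feat_0', 'feat_1', 'feat_2']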

Please check this code and help clear up my confusion, thanks!

ParadoxZW commented 1 month ago

Hi @wnzhyee !

I've released another implementation of LLaVA-UHD here, which I believe is more stable and elegant. The code of the new repo originates from this one, but its overall quality is improved, and the training program has been tested to run normally without bugs.

When I reviewed this old repo and tried to fix this RuntimeError issue, I found that it contains a lot of hidden bugs and calculations with incorrect logic (violating the spirit of the original paper), and that it misses some necessary processing (such as image normalization). So I decided to rewrite the code and do my best to fix all these issues. I have now open-sourced my rewritten version.

You are very welcome to use it, and I look forward to your feedback.