jchenghu / ExpansionNet_v2

Implementation code of the work "Exploiting Multiple Sequence Lengths in Fast End to End Training for Image Captioning"
https://arxiv.org/abs/2208.06551
MIT License

I take the liberty of bothering you again to ask how to visualize attention in the same way as your paper Fig.4, thanks! #25

Open qqq-gif opened 1 day ago

qqq-gif commented 1 day ago

I take the liberty of bothering you again to ask how to visualize attention in the same way as your paper Fig.4, thanks!

jchenghu commented 1 day ago

Hi, don't worry, I'm glad to help. Have you tried running the repo on Linux? I hope it's going well :-)

Regarding the question: we extracted the attention coefficients from all decoding layers, along with the decoded text, and stored for each input the cross-attention weights. Suppose the decoded sequence is d1, d2, d3, d4; since we have 3 decoding layers, there are 3 cross-attention weight tensors of size [batch_size, num_heads=8, sequence=4, visual_features=144], one per layer. We selected one of them and plotted the softmaxed weights over the image.
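To make the bookkeeping concrete, here is a minimal, self-contained sketch (toy tensors and hypothetical names, not the actual ExpansionNet_v2 code) of how per-head cross-attention weights can be computed and stored in the nested [batch][t][layer][head] layout used below:

    import math
    import torch

    def cross_attention_weights(query, keys, num_heads):
        # softmax(Q K^T / sqrt(d_head)) per head
        # query: [bs, seq_len, d_model], keys: [bs, num_feats, d_model]
        bs, seq_len, d_model = query.shape
        num_feats = keys.shape[1]
        d_head = d_model // num_heads
        q = query.view(bs, seq_len, num_heads, d_head).transpose(1, 2)   # [bs, heads, seq, d_head]
        k = keys.view(bs, num_feats, num_heads, d_head).transpose(1, 2)  # [bs, heads, feats, d_head]
        scores = q @ k.transpose(-2, -1) / math.sqrt(d_head)             # [bs, heads, seq, feats]
        return torch.softmax(scores, dim=-1)

    # toy dimensions matching the description: 3 decoding layers, 8 heads, 144 visual features
    bs, seq_len, num_feats, d_model = 1, 4, 144, 512
    num_layers, num_heads = 3, 8
    visual_feats = torch.randn(bs, num_feats, d_model)
    decoder_states = torch.randn(bs, seq_len, d_model)   # stand-in for the decoder input

    # nested list indexed as [batch][t][layer][head] -> tensor of 144 coefficients
    attention_coeffs = [[[[None] * num_heads for _ in range(num_layers)]
                         for _ in range(seq_len)] for _ in range(bs)]
    for layer in range(num_layers):
        # in a real decoder each layer has its own queries; the toy reuses the same states
        w = cross_attention_weights(decoder_states, visual_feats, num_heads)  # [bs, heads, seq, 144]
        for b in range(bs):
            for t in range(seq_len):
                for h in range(num_heads):
                    attention_coeffs[b][t][layer][h] = w[b, h, t]   # .tolist() is applied later

    print(attention_coeffs[0][0][0][0].shape)   # torch.Size([144])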

Unfortunately I don't have a clean way to do it, since we just needed the visualization to work, but I've found the snippet we used for that particular part. At the end of the evaluation phase in test.py you can do something like this:

    interested_image = 2481   # randomly selected test image index
    import copy
    import cv2
    import numpy as np

    img_path, _ = mscoco_dataset.get_image_path(interested_image, MsCocoDatasetKarpathy.TestSet_ID)

    print("Id: " + str(interested_image) + " ------------------------------------------------")
    print("Image path: " + str(img_path))

    head = 0       # attention head to visualize
    n_layer = 1    # decoding layer to visualize
    description = pred_dict[0]['caption'].split(' ')
    seq_len = len(description)
    for t in range(seq_len):
        # 144 softmaxed cross-attention weights for word t (12 x 12 grid of visual features)
        coeffs_t = attention_coeffs[0][t][n_layer][head].tolist()

        img = cv2.imread(img_path)
        resized_img = cv2.resize(img, (384, 384))
        height, width, _ = resized_img.shape
        original_rgb = copy.copy(resized_img)

        # spread each coefficient over its 32 x 32 pixel patch of the 384 x 384 image
        pixel_cumulative_atten = np.zeros((144, height, width, 1), dtype=np.float32)
        for coeff_idx in range(len(coeffs_t)):
            ll_row = coeff_idx // 12   # row / col in the 12 x 12 feature grid
            ll_col = coeff_idx % 12
            fl_row = ll_row * 32       # top-left pixel of the corresponding patch
            fl_col = ll_col * 32
            pixel_cumulative_atten[coeff_idx, fl_row:fl_row + 32, fl_col:fl_col + 32, :] = coeffs_t[coeff_idx]

        # keep the strongest coefficient for every pixel and normalize by the global maximum
        pixel_cumulative_atten_nms = pixel_cumulative_atten.max(axis=0)
        pixel_cumulative_atten_nms_norm = pixel_cumulative_atten_nms / pixel_cumulative_atten_nms.max()

        # darken the image where attention is low, keep it bright where attention is high
        correction = 0.7
        interpol_coeff = pixel_cumulative_atten_nms_norm
        brightness = 254 * (1 - interpol_coeff * correction)
        new_rgb = np.uint8((brightness +
                            np.uint16(original_rgb) * interpol_coeff * correction).clip(0, 254))

        print("Word: " + str(description[t]))

        name = img_path  # + "_" + str(t)
        cv2.namedWindow(name, cv2.WINDOW_AUTOSIZE)
        cv2.moveWindow(name, 40, 30)
        cv2.imshow(name, new_rgb)
        cv2.waitKey(0)

    cv2.destroyAllWindows()
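One practical note: if you run test.py on a headless server, cv2.imshow needs a display, so you may prefer to save the heat maps instead. A minimal variant inside the loop (the file name pattern is just an example):

    out_path = "attn_" + str(t) + "_" + str(description[t]) + ".png"
    cv2.imwrite(out_path, new_rgb)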

This assumes you have the list of predictions in pred_dict

    pred_dict[0]['caption'].split(' ')
              |_ batch index 

and that you have stored the attention coefficients of your custom model

   attention_coeffs[0][t][n_layer][head].tolist()
                    |  |     |      |_ head index
                    |  |     |_ layer index
                    |  |__  sequence index
                    |_ batch index

In our case we organized the attention coefficients that way, as a nested "list of lists of lists", which is not pretty, but it is entirely up to you how to extract the attention from your model. At the end of the day, the coeffs_t variable should contain a vector of 144 softmaxed weights.
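As a quick sanity check (using the variable names of the snippet above), each coeffs_t should hold 144 values that approximately sum to 1, since they come out of a softmax over the visual features:

    assert len(coeffs_t) == 144
    print(sum(coeffs_t))   # should be close to 1.0 for every word t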

I leave to you the part of collecting the list of predictions and the attention coefficients from your model, which should be the most straightforward part.

qqq-gif commented 15 hours ago

Thank you for your reply. May I ask: if I set fp=16 when extracting the features, will I still reach 139.5 with intensive training?

jchenghu commented 5 hours ago

Yes, it should still reach around 139.5. In my experience, casting the model to fp=16 does not impact the result at all.
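For reference, a minimal sketch of what casting the feature extractor to FP16 looks like in PyTorch (a generic torchvision backbone and a dummy batch stand in for the repo's actual encoder; it assumes a GPU is available):

    import torch
    import torchvision

    # hypothetical stand-in backbone, with weights and inputs both cast to half precision
    backbone = torchvision.models.resnet50(weights=None).cuda().half().eval()
    images = torch.randn(2, 3, 384, 384).cuda().half()   # inputs must be FP16 as well
    with torch.no_grad():
        feats = backbone(images)
    print(feats.dtype)   # torch.float16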