Open qqq-gif opened 1 day ago
Hi, no worries, I'm glad to help. Have you tried running the repo on Linux? I hope it's going well :-)
Regarding the question: we extracted the attention coefficients from all decoding layers together with the decoded text, and stored for each input the cross-attention weights. Suppose the decoded sequence is d1, d2, d3, d4. Since we have 3 decoding layers, there are 3 cross-attention weight tensors of size [batch_size, num_heads=8, sequence=4, visual_features=144], one per layer. We selected one of them and plotted its softmaxed weights over the image.
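For concreteness, here is a minimal self-contained sketch (not the repository code; the tensor names and random values are only illustrative) of what those per-layer cross-attention weights look like for a 12x12 grid of visual features:

```python
import torch

batch_size, num_heads, seq_len, num_vis, head_dim = 1, 8, 4, 144, 64

# one cross-attention weight tensor per decoding layer
attention_per_layer = []
for layer in range(3):
    q = torch.randn(batch_size, num_heads, seq_len, head_dim)  # decoder queries (one per word)
    k = torch.randn(batch_size, num_heads, num_vis, head_dim)  # keys from the 12x12 visual features
    weights = torch.softmax(q @ k.transpose(-2, -1) / head_dim ** 0.5, dim=-1)
    attention_per_layer.append(weights)  # [batch_size, num_heads=8, sequence=4, visual_features=144]

# weights of word t=2 in layer 1, head 0: a vector of 144 values summing to 1
coeffs_t = attention_per_layer[1][0, 0, 2]
print(coeffs_t.shape, coeffs_t.sum())  # torch.Size([144]) tensor(1.)
```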
I don't have a clean way to do it, unfortunately, since we just needed the visualization to work, but I've found the snippet we used for that particular visualization. At the end of the evaluation phase in test.py you can do something like this:
```python
# Attention visualization snippet: run at the end of the evaluation phase in test.py.
# For each generated word, the 144 cross-attention weights (a 12x12 grid over the
# 384x384 image) are mapped back onto 32x32 pixel patches and used to highlight the image.
import copy

import cv2
import numpy as np

interested_image = 2481  # randomly selected test image
img_path, _ = mscoco_dataset.get_image_path(interested_image, MsCocoDatasetKarpathy.TestSet_ID)
print("Id: " + str(interested_image) + " ----------------------------------------------------")
print("Image path: " + str(img_path))

head = 0      # attention head to visualize
n_layer = 1   # decoding layer to visualize
description = pred_dict[0]['caption'].split(' ')
seq_len = len(description)

img = cv2.imread(img_path)
resized_img = cv2.resize(img, (384, 384))
height, width, _ = resized_img.shape

for t in range(seq_len):
    # 144 softmaxed cross-attention weights of word t for the chosen layer and head
    coeffs_t = attention_coeffs[0][t][n_layer][head].tolist()
    original_rgb = copy.copy(resized_img)

    # paint each coefficient onto its 32x32 pixel patch of the 12x12 feature grid
    pixel_cumulative_atten = np.zeros((144, height, width, 1), dtype=np.float32)
    for coeff_idx in range(len(coeffs_t)):
        ll_row = coeff_idx // 12  # feature grid row / col
        ll_col = coeff_idx % 12
        fl_row = ll_row * 32      # top-left pixel of the patch
        fl_col = ll_col * 32
        pixel_cumulative_atten[coeff_idx, fl_row:fl_row + 32, fl_col:fl_col + 32, :] = coeffs_t[coeff_idx]

    # keep the strongest coefficient for every pixel and normalize by the global maximum
    pixel_cumulative_atten_nms = pixel_cumulative_atten.max(axis=0, keepdims=False)
    max_value = pixel_cumulative_atten_nms.max()
    pixel_cumulative_atten_nms_norm = pixel_cumulative_atten_nms / max_value

    # blend the image toward white where attention is low
    correction = 0.7
    interpol_coeff = pixel_cumulative_atten_nms_norm
    brightness = 254 * (1 - interpol_coeff * correction)
    new_rgb = np.uint8((brightness + np.uint16(original_rgb) * interpol_coeff * correction).clip(0, 254))

    print("Word: " + str(description[t]))
    name = img_path  # + "_" + str(t)
    cv2.namedWindow(name, cv2.WINDOW_AUTOSIZE)
    cv2.moveWindow(name, 40, 30)
    cv2.imshow(name, new_rgb)
    cv2.waitKey(0)

cv2.destroyAllWindows()
```
This assumes that you have the list of predictions in pred_dict:

```
pred_dict[0]['caption'].split(' ')
          |_ batch index
```

and that you have stored the attention coefficients of your custom model:

```
attention_coeffs[0][t][n_layer][head].tolist()
                 |  |  |        |_ head index
                 |  |  |_ layer index
                 |  |_ sequence index
                 |_ batch index
```
In our case, we organized the attention coefficients that way, as a "list of lists of lists of lists" and so on, which is not pretty, but it is entirely up to you how to get the attention out of your model. At the end of the day, the "coeffs_t" variable should contain a vector of 144 softmaxed weights.
I leave to you the part of extracting the list of predictions and the attention coefficients from your model, which should be the most straightforward part. One possible way to arrange them is sketched below.
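For example, a small hypothetical helper (not part of the repository; the function name and argument are assumptions) that turns a per-layer list of cross-attention tensors of shape [batch, heads, seq, 144] into the nested attention_coeffs[batch][t][n_layer][head] layout used above could look like this:

```python
def to_nested_lists(cross_attn_per_layer):
    # cross_attn_per_layer: list with one tensor per decoding layer,
    # each of shape [batch_size, num_heads, seq_len, 144]
    batch, heads, seq, _ = cross_attn_per_layer[0].shape
    return [[[[layer[b, h, t] for h in range(heads)]   # 144-dim weight vector per head
              for layer in cross_attn_per_layer]       # layer index
             for t in range(seq)]                      # sequence index
            for b in range(batch)]                     # batch index

# attention_coeffs = to_nested_lists(attention_per_layer)
# coeffs_t = attention_coeffs[0][t][n_layer][head].tolist()  -> list of 144 floats
```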
Thank you for your reply. May I ask: if I set fp=16 when extracting the features, will I still reach 139.5 with intensive training?
Yes, it should still reach around 139.5. In my experience, casting the model to fp=16 does not impact the result at all.
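Just to illustrate what I mean by casting (this is a generic sketch, not the repository's extraction script; the backbone below is a dummy stand-in and a CUDA device is assumed):

```python
import torch
import torch.nn as nn

# dummy stand-in for the visual backbone used during feature extraction
backbone = nn.Sequential(nn.Conv2d(3, 64, 3, stride=2),
                         nn.ReLU(),
                         nn.AdaptiveAvgPool2d(12)).cuda().half().eval()

images = torch.randn(2, 3, 384, 384, device="cuda", dtype=torch.float16)
with torch.no_grad():
    features = backbone(images)           # activations computed in fp16
print(features.dtype, features.shape)     # torch.float16 torch.Size([2, 64, 12, 12])
```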
I take the liberty of bothering you again to ask how to visualize attention in the same way as Fig. 4 of your paper, thanks!