hehefan / P4Transformer

Implementation of the "Point 4D Transformer Networks for Spatio-Temporal Modeling in Point Cloud Videos" paper.
MIT License

visualize transformer's attention #16

Open weiyutao886 opened 2 years ago

weiyutao886 commented 2 years ago

I want to visualize the transformer's attention. I see that Fig. 4 in your paper visualizes it. Can you tell me where and how to visualize it? Can you share the visualization code? Thank you.

hehefan commented 2 years ago

Hi,

The visualization in the paper is generated by Mayavi. You can use it to visualize the self-attention map attn at https://github.com/hehefan/P4Transformer/blob/main/modules-pytorch-1.8.1/transformer.py#L60.

Best regards.

weiyutao886 commented 2 years ago

Thank you for your reply. Did you save the attn data during training and then visualize it, or did you visualize it directly? In addition, should I refer to https://docs.enthought.com/mayavi/mayavi/auto/mlab_helper_functions.html#points3d? Is that the relevant function to use? The data is a sequence; do I need to process attn to separate out a single sample before visualization? I don't know much about visualization, so I'm sorry to bother you again.

hehefan commented 2 years ago

Hi,

Apologies for my late reply.

I saved the attn data during the evaluation. I used the points3d function of Mayavi to visualize each frame. Also, note that you need to save the position of each query area.
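
For reference, a minimal sketch of that kind of per-frame visualization (file names and array shapes are assumptions for illustration, not from the repo) could look like:

    import numpy as np
    from mayavi import mlab

    points = np.loadtxt('frame_points.txt')    # hypothetical file: (N, 3) coordinates of one frame
    weights = np.loadtxt('frame_attn.txt')     # hypothetical file: (N,) attention weight per point

    mlab.figure(bgcolor=(1, 1, 1))
    # points3d takes one scalar per point; the scalars are mapped through the colormap.
    mlab.points3d(points[:, 0], points[:, 1], points[:, 2], weights,
                  scale_mode='none', scale_factor=0.02, colormap='jet')
    mlab.show()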

Best.

weiyutao886 commented 2 years ago

I see that the dimension of attn is [batch_size, head, c*l, c*l], where batch_size is the batch, head is the number of transformer heads, and c*l is the product of the number of frames and the number of points, as you mention in the paper. However, attn contains no point positions. How can I visualize the shape of the person shown in your paper from attn? In other words, visualizing attn only visualizes its weights, and without point positions attn alone cannot be visualized. I'm sorry to trouble you again.

hehefan commented 2 years ago

Hi,

Point features and attentions are associated with point positions/coordinates in point-based methods. You can manually associate them by modifying the code.

In [batch_size, head, c1, c2], c1 indexes the queries and c2 holds the attention weights over the tokens.
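
For illustration only (the tensor names and sizes below are assumptions based on this thread, including the number of heads), indexing one query's attention row and folding it back into frames and points could look like:

    import torch

    B, H, L, N = 14, 8, 12, 64               # assumed batch, heads, frames, points per frame
    attn = torch.rand(B, H, L * N, L * N)    # placeholder standing in for the real attention map
    attn = attn.softmax(dim=-1)

    b, h, query_idx = 0, 0, 100              # pick one sample, one head, one query token
    weights = attn[b, h, query_idx]          # shape [L*N]: this query's attention to every token
    weights = weights.reshape(L, N)          # shape [L, N]: one attention row per frame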

Best.

weiyutao886 commented 2 years ago

So you mean that c1 represents the query points Q, and c2 holds the weight parameters corresponding to each query point, so I can take out the c2 weights and assign them to the point cloud coordinates to realize the visualization, right? That is what I am doing at present, but there are only 128 points in the point cloud of each frame, so the visualization effect is not good. Do you have any suggestions?

weiyutao886 commented 2 years ago

I read your code. The data you input into the transformer has shape [14, 12, 1024, 64], that is, 12 frames with 64 points in each frame, so attn covers 12 * 64 points. That means there are only 64 points per frame; if I visualize frame by frame, only those 64 points have weights, and the visualized point cloud of each frame is just those 64 points. However, the visualizations in your paper are composed of many more points. How do you handle this?

weiyutao886 commented 2 years ago

At present, my understanding of the visualization is to assign each frame's point cloud its corresponding weights from attn. However, there are only 64 points per frame, so this visualization seems to show very little. I hope you can point out my problem.

hehefan commented 2 years ago

Hi, you need to upsample points via the feature propagation operation in PointNet++.

weiyutao886 commented 2 years ago

Thank you for your patient reply. Do you mean that I need to add a PointNet++ before the transformer to upsample, in order to get visual results similar to those in your paper? I have another question: how many points do I need to upsample to? Too many points will cause CUDA out-of-memory problems.

hehefan commented 2 years ago

Nope.

First, when I made the visualization, I saved the input point clouds and the corresponding subsampled self-attention weights. Because the input is 2048 points, the visualization is of 2048 points.

Second, what the feature propagation operation does is interpolate the subsampled weights back to the original input points based on distance. Suppose a is an original point, and b and c are subsampled points with attentions B and C, respectively. Then a's attention will be B/||b-a|| + C/||c-a||.
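
As a rough sketch of that interpolation (an assumed k-nearest-neighbour, inverse-distance version, not the repo's feature propagation code):

    import numpy as np

    def propagate_attention(P, Q, A, k=3, eps=1e-8):
        """Interpolate attention A (M,) from subsampled points Q (M, 3)
        back to the original points P (N, 3), PointNet++-style."""
        # Pairwise distances between original and subsampled points: (N, M).
        dist = np.sqrt(np.sum((P[:, None, :] - Q[None, :, :]) ** 2, axis=-1))
        idx = np.argsort(dist, axis=1)[:, :k]                        # k nearest subsampled points
        w = 1.0 / (np.take_along_axis(dist, idx, axis=1) + eps)      # inverse-distance weights
        w = w / w.sum(axis=1, keepdims=True)                         # normalize per original point
        return np.sum(A[idx] * w, axis=1)                            # (N,) interpolated attention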

weiyutao886 commented 2 years ago

For the MSR model I do this:

    def forward(self, input):                                              # [B, L, N, 3]
        device = input.get_device()

        # Save the input point clouds to disk.
        input2 = input.cpu().detach()
        input2 = input2.reshape(input2.shape[0] * input2.shape[2] * input2.shape[3], input2.shape[1])[:, :3]
        print('input2 =', input2.shape)
        np.savetxt(r"/root/autodl-tmp/result1/result1.txt", input2)

        xyzs, features = self.tube_embedding(input)                        # [B, L, n, 3], [B, L, C, n]

        # Save the downsampled coordinates to disk.
        input3 = xyzs.cpu().detach()
        input3 = input3.reshape(input3.shape[0] * input3.shape[2] * input3.shape[3], input3.shape[1])[:, :3]
        print('input3 =', input3.shape)
        np.savetxt(r"/root/autodl-tmp/result1/result2.txt", input3)

For the transformer I do this:

    dots = einsum('b h i d, b h j d -> b h i j', q, k) * self.scale
    attn = dots.softmax(dim=-1)

    # Save the attention map of the first sample and first head.
    attn1 = attn.cpu().detach()
    attn2 = attn1[:1, :1, :, :]
    attn2 = attn2.reshape(attn2.shape[0] * attn2.shape[1] * attn2.shape[2], attn2.shape[3])
    result2 = np.array(attn2)
    np.savetxt('/root/autodl-tmp/result1/attnresult.txt', result2)

My first question is whether this way of saving the data is correct. My second question: I assign the obtained attn weights to the xyzs point cloud, because I think the xyzs points contain the coordinates of the points in attn and correspond to them one-to-one. But you said that attn should be assigned to the 2048-point input point cloud. However, the input has 2048 points while attn covers only 64 points per frame; is that OK? Do you mean to directly merge the two? I'm very interested in your point cloud sequence research, but I still have problems with the visualization. Thank you for your patience.

hehefan commented 2 years ago

Hi,

It is not so complicated.

Suppose P with shape N x 3 is the input point cloud and Q with shape M x 3 is the downsampled points, where M < N. The downsampled points have attention weights A with shape M x 1. Because there are multiple Transformer layers, you may select an intermediate layer. Also, because there are multiple heads, you need to select one head.

Then all you need to do is transfer the attention weights A to the input point cloud based on P and Q. Here is a very simple snippet:

    import numpy as np

    # Squared distance from every input point in P to every downsampled point in Q.
    dist = np.expand_dims(P, 1) - np.expand_dims(Q, 0)
    dist = np.sum(dist * dist, -1)
    # Index of the nearest downsampled point for each input point.
    idx = np.argmin(dist, 1)
    # Copy that point's attention weight to the input point.
    attn = A[idx]

The attn is exactly what you want. You may also use the feature propagation operation to do this.
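
For completeness, a hypothetical way to wire that snippet to the files saved earlier in this thread (the paths and the reduction to one weight per token are assumptions, not from the repo):

    import numpy as np

    P = np.loadtxt('/root/autodl-tmp/result1/result1.txt')      # (N, 3) input points
    Q = np.loadtxt('/root/autodl-tmp/result1/result2.txt')      # (M, 3) downsampled points
    A = np.loadtxt('/root/autodl-tmp/result1/attnresult.txt')   # saved attention

    # If the saved attention is still an (M, M) matrix, reduce it to one weight per token first
    # (for example a mean over the query dimension; see the discussion further down this thread).
    if A.ndim == 2:
        A = A.mean(axis=0)                                      # (M,)

    dist = np.expand_dims(P, 1) - np.expand_dims(Q, 0)          # (N, M, 3)
    dist = np.sum(dist * dist, -1)                              # (N, M) squared distances
    idx = np.argmin(dist, 1)                                    # nearest downsampled point per input point
    attn = A[idx]                                               # (N,) one weight per input point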

weiyutao886 commented 2 years ago

Thank you very much for the code you provided; I can now do an initial visualization of attn. There is one problem I still don't understand. The number of input frames is 24, but because the method uses a temporal stride, the downsampled points have 12 frames. How can I best match the input frames with the frames of the downsampled points? For example, the first input frame corresponds to the first downsampled frame, but what about the second and third input frames?

weiyutao886 commented 2 years ago

Here I save the input [14, 24, 1024, 3] as a file of shape [N, 3], where each batch sample corresponds to 24 frames, and the downsampled points [14, 12, 64, 3] as [M, 3], where each batch sample corresponds to 12 frames. When I select frames for the visualization, how can I match each input frame with the downsampled points? Here are some of my visualization results; what could cause the gap between these results and yours? [attached visualization screenshots] Thanks for your help.
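
One generic, unconfirmed way to pair the frames (assuming a temporal stride of 2, so each downsampled frame covers two consecutive input frames) would be:

    import numpy as np

    n_in, n_down = 24, 12                   # input frames and downsampled frames from this thread
    stride = n_in // n_down
    frame_map = np.arange(n_in) // stride   # input frame f -> downsampled frame frame_map[f]
    print(frame_map)                        # [0 0 1 1 2 2 ... 11 11]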

create7859 commented 7 months ago

(Quoting hehefan's reply above about transferring the attention weights A of the downsampled points Q back to the input point cloud P with the nearest-neighbour snippet.)

Can you provide a detailed explanation of how the attention weight becomes [M, 1] in size? I'm curious how to obtain a single weight per token from the softmax matrix, which initially has size [frame_length x tokens per frame, frame_length x tokens per frame].

whu-lyh commented 3 months ago

Any new progress on how to visualize the attention?

create7859 commented 3 months ago

I'm not sure if the weight described by the authors is obtained in this way, but I averaged over the dimension in the attention matrix where softmax was not applied to get the (M, 1) weight they mentioned, and the visualization result was quite meaningful. However, I used a different dataset and task, only adopting the 4D conv structure.
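
As a rough sketch of that reduction (file name and shapes are assumed), starting from a single head's (M, M) matrix where softmax was applied over the last dimension:

    import numpy as np

    attn = np.loadtxt('attnresult.txt')   # assumed: one head's (M, M) attention, softmaxed over axis 1
    # Averaging over axis 0 (the queries, where softmax was not applied) gives, for each token,
    # the average attention it receives from all queries.
    weights = attn.mean(axis=0)           # (M,)
    weights = weights.reshape(-1, 1)      # (M, 1)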

whu-lyh commented 3 months ago

thanks