iamrakesh28 / Video-Prediction

Implementation of Transformer Encoder Decoder Architecture for Video Predictions

the Channel dim of a color picture #5

Open Angry-Echo opened 1 year ago

Angry-Echo commented 1 year ago

Hello~ I am studying your code and I have a question about how the model handles color images: I can't find the RGB channel dim when the frame sequence is fed into the model. In multi_head_attention.py, at the beginning of the call method (right after self.wq(q), and I know self.wq is a conv layer), your comment says: #(batch_size, num_heads, seq_len_q, rows, cols, depth). Where is the channel dim? My understanding of these six dimensions is: seq_len_q is the length of the frame sequence, num_heads × depth = d_model, rows is the image height H, and cols is the image width W.
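
To make my understanding concrete, here is a minimal sketch of what I imagine happens to an RGB frame sequence (the kernel size, variable names, and reshape steps are my own assumptions, not taken from your code):

```python
import tensorflow as tf

batch_size, seq_len_q, rows, cols = 2, 4, 16, 16
d_model, num_heads = 64, 8
depth = d_model // num_heads  # 8

# RGB frames: the channel dim of size 3 is still present here
frames = tf.random.normal((batch_size, seq_len_q, rows, cols, 3))

# A conv projection like self.wq: 3 input channels -> d_model output channels
# (kernel_size=3 is just my guess)
wq = tf.keras.layers.Conv2D(d_model, kernel_size=3, padding="same")

# Conv2D only accepts 4-D input, so merge batch and time before applying it
x = tf.reshape(frames, (batch_size * seq_len_q, rows, cols, 3))
x = wq(x)  # (batch_size * seq_len_q, rows, cols, d_model)

# Split d_model into (num_heads, depth) and move num_heads forward
x = tf.reshape(x, (batch_size, seq_len_q, rows, cols, num_heads, depth))
x = tf.transpose(x, perm=[0, 4, 1, 2, 3, 5])
print(x.shape)  # (batch_size, num_heads, seq_len_q, rows, cols, depth)
```

Is this roughly what the code does, i.e. the 3 RGB channels get projected into d_model and that is what later appears as num_heads × depth?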

I sincerely hope you can answer my doubts, and if you don't mind, could I ask you for some pointers on the field of video prediction? I am trying to do some research on predicting image sequences with a Transformer.

Angry-Echo commented 1 year ago

Oh! I guess the depth is the channel, is that right?

Angry-Echo commented 1 year ago

But I still have the same question as in the other issue: the conv layer expects [batch_size, rows, cols, depth], so how can the additional seq_len dim be fed into the conv layer?
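
For reference, here is how I imagine the extra seq_len dim could be handled around a Conv2D layer (just my own guess using standard Keras tools, not necessarily what your code does):

```python
import tensorflow as tf

batch_size, seq_len, rows, cols, channels = 2, 4, 16, 16, 3
frames = tf.random.normal((batch_size, seq_len, rows, cols, channels))

conv = tf.keras.layers.Conv2D(64, kernel_size=3, padding="same")

# Option 1: TimeDistributed applies the same conv to each of the seq_len frames
y = tf.keras.layers.TimeDistributed(conv)(frames)
print(y.shape)  # (batch_size, seq_len, rows, cols, 64)

# Option 2: merge batch and seq_len, run the conv, then split them back
z = tf.reshape(frames, (batch_size * seq_len, rows, cols, channels))
z = conv(z)
z = tf.reshape(z, (batch_size, seq_len, rows, cols, 64))
print(z.shape)  # (batch_size, seq_len, rows, cols, 64)
```

Is one of these what your implementation does, or is the seq_len dim handled in some other way?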