jy0205 / LaVIT

LaVIT: Empower the Large Language Model to Understand and Generate Visual Content

few questions #35

Open ProjectDisR opened 2 weeks ago

ProjectDisR commented 2 weeks ago

https://github.com/jy0205/LaVIT/blob/a0f6ef08d888be5b0b2a99ff66de90f9f3d7f4d0/VideoLaVIT/models/transform.py#L138 Since `start_indexs` are already obtained in line 131, why is this part of the code still needed for selection? And following on from this, why use 12 rather than num_frames (24 in line 104)?

jy0205 commented 2 weeks ago

Thanks for your attention! 1) Since we use the mpeg4-part2 protocol to extract the motion vectors, where 11 P frames follow each I frame, the code at line 138 ensures the pattern of 1 I + 11 P frames: we filter out and skip the frames that do not follow this pattern. 2) We use 24 frames as one unit to encode motion, i.e., two groups of "1 I + 11 P" frames.
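To make the filtering concrete, here is a minimal sketch (not the repo's actual `transform.py` code; frame types, `valid_start_indices`, and the toy sequence are illustrative assumptions) of keeping only I-frame indices that are followed by exactly 11 P frames:

```python
# Hypothetical illustration of the "1 I + 11 P" filtering described above.
# Frame types would come from an mpeg4-part2 bitstream: 'I' starts a group,
# ideally followed by 11 'P' frames.
GOP_SIZE = 12      # 1 I frame + 11 P frames
UNIT_FRAMES = 24   # two GOPs are encoded together as one motion unit

def valid_start_indices(frame_types):
    """Keep only I-frame indices whose next 11 frames are all P frames."""
    starts = [i for i, t in enumerate(frame_types) if t == 'I']
    valid = []
    for s in starts:
        group = frame_types[s + 1 : s + GOP_SIZE]
        if len(group) == GOP_SIZE - 1 and all(t == 'P' for t in group):
            valid.append(s)
    return valid

# Example: the second I frame is followed by only 10 P frames before the
# next I frame, so it breaks the pattern and is skipped.
types = ['I'] + ['P'] * 11 + ['I'] + ['P'] * 10 + ['I'] + ['P'] * 11
print(valid_start_indices(types))  # -> [0, 23]
```

Under this sketch, a 24-frame motion unit would then be formed from two consecutive valid groups.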

ProjectDisR commented 2 weeks ago

Thanks for the reply.

Does this mean that, from line 131, an I frame may not always be followed by 11 P frames, so further filtering is needed in line 138? I'm not familiar with mpeg4-part2; does it strictly require the "1 I + 11 P" pattern? I still don't understand why filtering is necessary here, or what would happen without it.

jy0205 commented 1 week ago

Only the P frames have motion vectors; the motion vectors of an I frame are all zeros. We train our motion tokenizer following the "1 I + 11 P" pattern. However, some frames of an mpeg4-part2 re-encoded video may not follow this pattern, so during inference we filter them out to keep the input consistent with the training procedure.
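The zero-motion behavior for I frames can be sketched as follows (a toy illustration under assumed shapes; `motion_for_frame` and the 4x4 motion-field size are hypothetical, not from the repo):

```python
import numpy as np

# Only P frames carry motion vectors in mpeg4-part2; I frames are
# intra-coded, so their motion field is filled with zeros.
H, W = 4, 4  # toy motion-field resolution

def motion_for_frame(frame_type, mv):
    """Return the motion field for a frame: decoded vectors for P, zeros for I."""
    if frame_type == 'I':
        return np.zeros((H, W, 2), dtype=np.float32)
    return mv  # (H, W, 2) motion vectors decoded for a P frame

mv_p = np.ones((H, W, 2), dtype=np.float32)
print(motion_for_frame('I', mv_p).sum())  # -> 0.0
print(motion_for_frame('P', mv_p).sum())  # -> 32.0
```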