Hello, thank you for open-sourcing such exhilarating work!
I have some questions regarding the article:
Firstly, in Table 2, the Oracle's token number and Performance are presented. I am curious about how these token numbers and their performance are determined. Is it by statistically averaging the correct and minimal number of tokens for each answer?
Secondly, Table 4 compares the performance of $M^3$ with various heuristics-based sampling baselines. In the code, however, the matryoshka_vis_token_process function seems to perform only an average pooling operation, yet $M^3$ is significantly more powerful than the Average Pooling baseline.
Additionally, I think the title of Table 1 should be "Image Understanding" instead of "Video Understanding."
Is it by statistically averaging the correct and minimal number of tokens for each answer? -> Yes. For samples whose predictions are wrong at every scale, we use 1 token.
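To make the oracle computation concrete, here is a minimal sketch of the rule described in this reply. It is not the evaluation script from the repository: the function name `oracle_stats`, the input layout, and the toy token scales are all hypothetical and chosen only for illustration.

```python
def oracle_stats(correct_by_scale, fallback_tokens=1):
    """Return (average oracle token count, oracle accuracy).

    correct_by_scale: one dict per sample, mapping a token count (scale)
    to whether the model answered that sample correctly at that scale.
    This layout is an assumption made for the sketch.
    """
    if not correct_by_scale:
        raise ValueError("need at least one sample")

    total_tokens = 0
    num_correct = 0
    for sample in correct_by_scale:
        solved_scales = [tokens for tokens, ok in sample.items() if ok]
        if solved_scales:
            # Oracle picks the smallest scale that answers correctly.
            total_tokens += min(solved_scales)
            num_correct += 1
        else:
            # Wrong at every scale: count the sample as using 1 token.
            total_tokens += fallback_tokens

    n = len(correct_by_scale)
    return total_tokens / n, num_correct / n


# Toy data with hypothetical scales of 1, 9, and 36 tokens.
samples = [
    {1: True, 9: True, 36: True},     # solved with 1 token
    {1: False, 9: True, 36: True},    # needs 9 tokens
    {1: False, 9: False, 36: False},  # never solved, counted as 1 token
    {1: False, 9: False, 36: True},   # needs 36 tokens
]
avg_tokens, accuracy = oracle_stats(samples)
print(avg_tokens, accuracy)
```

Under this rule, samples that are never answered correctly lower the average token count without raising accuracy, which is worth keeping in mind when reading the Oracle row.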
The difference between M3 and Average Pooling in Table 4 is that M3 trains the whole LMM with average pooling, while Average Pooling in Table 4 is a heuristic applied only at inference time.
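For reference, the pooling operation itself is the same in both cases; only whether the LMM was trained on the pooled tokens differs. Below is a minimal pure-Python sketch of average pooling over a grid of visual tokens. It is not the repository's matryoshka_vis_token_process; the function name `average_pool_tokens`, the nested-list layout, and the toy 4x4 grid are assumptions made for illustration.

```python
def average_pool_tokens(grid, stride):
    """Average-pool an H x W grid of feature vectors with a square window.

    grid: H x W nested list, each cell a list of floats (one visual token).
    stride: window size and step; must divide both H and W.
    Returns an (H // stride) x (W // stride) grid of pooled tokens.
    """
    height, width = len(grid), len(grid[0])
    if height % stride or width % stride:
        raise ValueError("stride must divide both grid dimensions")

    channels = len(grid[0][0])
    pooled = []
    for top in range(0, height, stride):
        row = []
        for left in range(0, width, stride):
            # Collect every token inside the current window.
            window = [
                grid[top + dy][left + dx]
                for dy in range(stride)
                for dx in range(stride)
            ]
            # Average each channel over the window.
            row.append(
                [sum(tok[c] for tok in window) / len(window) for c in range(channels)]
            )
        pooled.append(row)
    return pooled


# Toy 4x4 grid of single-channel tokens with values 0..15.
grid = [[[float(r * 4 + c)] for c in range(4)] for r in range(4)]
pooled_2x2 = average_pool_tokens(grid, 2)  # 16 tokens -> 4 tokens
pooled_1x1 = average_pool_tokens(grid, 4)  # 16 tokens -> 1 token
print(pooled_2x2, pooled_1x1)
```

Applying this pooling only at inference gives the Average Pooling baseline; applying it during training as well, so the LMM learns to use the coarser tokens, is what distinguishes M3.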
I think the title of Table 1 should be "Image Understanding" instead of "Video Understanding." -> Thanks! We will update it in the next version!