mu-cai / matryoshka-mm

Matryoshka Multimodal Models
https://matryoshka-mm.github.io/
Apache License 2.0

[Question] About number of Oracle tokens and heuristics based sampling baselines #2

Closed NIneeeeeem closed 3 months ago

NIneeeeeem commented 3 months ago

Question

Hello, thank you for open-sourcing such exciting work!

I have some questions regarding the article. Firstly, Table 2 presents the Oracle's token number and performance. I am curious how these token numbers and their performance are determined. Is it by averaging, over all samples, the minimal number of tokens that yields a correct answer? Secondly, I've noticed that Table 4 compares the performance of $M^3$ against various heuristics-based sampling baselines, yet in the code the matryoshka_vis_token_process function seems to perform only an average pooling operation, while $M^3$ is significantly more powerful than the Average Pooling baseline. Additionally, I think the title of Table 1 should be "Image Understanding" instead of "Video Understanding."

mu-cai commented 3 months ago

Great questions!

  1. Is it by statistically averaging the correct and minimal number of tokens for each answer? -> Yes. For samples whose predictions are wrong at all scales, we count 1 token for those samples.
  2. The difference between M3 and Average Pooling in Table 4 is that M3 trains the whole LMM with average pooling, while Average Pooling in Table 4 is a heuristic applied only at inference time.
  3. I think the title of Table 1 should be "Image Understanding" instead of "Video Understanding." -> Thanks! We will update this in the next version!
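For point 1, the oracle accounting described above can be sketched as follows. This is a hypothetical illustration, not the paper's evaluation code: the function name `oracle_stats`, the input format (one dict of scale -> correctness per sample), and the scale set are all assumptions for the sake of the example.

```python
def oracle_stats(correct_by_scale, scales=(576, 144, 36, 9, 1)):
    """Compute oracle token count and accuracy per the author's reply.

    correct_by_scale: list of dicts mapping token count -> bool, one per
    sample, saying whether the prediction at that scale was correct.
    Each sample is charged the smallest scale at which the model answers
    correctly; samples wrong at every scale are charged 1 token.
    (Hypothetical sketch; the exact accounting in the repo may differ.)
    """
    token_counts, n_correct = [], 0
    for sample in correct_by_scale:
        ok = [s for s in sorted(scales) if sample.get(s)]
        if ok:
            token_counts.append(ok[0])  # minimal scale with a correct answer
            n_correct += 1
        else:
            token_counts.append(1)      # wrong at all scales -> count 1 token
    avg_tokens = sum(token_counts) / len(token_counts)
    accuracy = n_correct / len(correct_by_scale)
    return avg_tokens, accuracy

# Example: one sample correct at 9 and 576 tokens, one wrong everywhere.
samples = [{576: True, 9: True}, {576: False}]
print(oracle_stats(samples))  # -> (5.0, 0.5)
```

The oracle is thus an upper bound: it assumes a perfect per-sample choice of the smallest sufficient scale.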
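For point 2, the inference-time operation under discussion is plain 2D average pooling over the grid of visual tokens; the training recipe, not the pooling op, is what separates M3 from the Average Pooling baseline. A minimal numpy sketch of such multi-scale pooling, assuming a 24x24 grid of patch tokens (the function name and shapes are illustrative, not the repo's matryoshka_vis_token_process itself):

```python
import numpy as np

def pool_vis_tokens(tokens: np.ndarray, out_side: int) -> np.ndarray:
    """Average-pool a square grid of visual tokens to out_side x out_side.

    tokens: (N, D) array of N = side*side patch tokens in row-major order.
    Illustrative stand-in for average pooling at Matryoshka scales.
    """
    n, d = tokens.shape
    side = int(np.sqrt(n))
    assert side * side == n and side % out_side == 0
    k = side // out_side  # pooling kernel size per output cell
    grid = tokens.reshape(side, side, d)
    # Group each k x k neighborhood and average it into one token.
    pooled = grid.reshape(out_side, k, out_side, k, d).mean(axis=(1, 3))
    return pooled.reshape(out_side * out_side, d)

# 576 patch tokens (24x24 grid) pooled down through 576/144/36/9/1 scales.
tokens = np.random.rand(576, 8)
for s in (24, 12, 6, 3, 1):
    print(s * s, pool_vis_tokens(tokens, s).shape)
```

Applied only at inference on a model trained with full-resolution tokens, this pooling is a heuristic (the Table 4 baseline); M3 instead trains the whole LMM to consume the pooled tokens at every scale.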
NIneeeeeem commented 3 months ago

Thank you for the explanation!