sudo-Boris / mr-Blip

Official Implementation of "The Surprising Effectiveness of Multimodal Large Language Models for Video Moment Retrieval"
BSD 3-Clause "New" or "Revised" License

Annoying numbers #2

Closed: pangzss closed this issue 2 months ago

pangzss commented 2 months ago

Hi, thank you for the good work and the code release!

While going through the code, I found some operations related to "annoying" numbers. https://github.com/sudo-Boris/mr-Blip/blob/23b6970b0c1166dfc2676fce7368844181d23585/lavis/models/blip2_mr_models/blip2_mr.py#L123C11-L129

Could you explain why numbers that are tokenized into 2 tokens are "annoying"?

Thank you very much.

sudo-Boris commented 2 months ago

Hi @pangzss,

thank you for your interest in Mr. BLIP!

That is a good question. In general, we have found that we achieve the best performance when the context is designed to be as consistent and "straightforward" as possible for the model.

Having a number suddenly tokenized as two tokens changes the pattern of the input that the model learns to rely on. This is especially true when all frames are concatenated first and the timestamps are appended afterwards (i.e., the data is not interleaved): $f_1, f_2, ..., f_F, t_1, t_2, ..., t_F$. In the interleaved setting this shift of course happens as well, and it even shifts the positions of the frames.

The model seems to perform better if it can learn to associate frame 1 ($idx_{f_1} = 1$) with timestamp 1 ($idx_{t_1} = F + 1$). If a number is now tokenized as two tokens, the model first has to learn to associate a frame with two tokens, and it also has to account for the shift of all following timestamps: if $t_2$ ($idx_{t_2} = F + 2$) consists of two tokens, timestamp 3 is no longer at the index it used to be ($idx_{t_3} = F + 4$ instead of $idx_{t_3} = F + 3$).
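
To make the shift concrete, here is a minimal sketch (not taken from the repo): it uses a Flan-T5 tokenizer to count the tokens per timestamp and prints the token index at which each timestamp starts. The checkpoint name and the simplification of tokenizing each number in isolation are assumptions of the sketch, not the actual Mr. BLIP prompt format.

```python
# Minimal sketch (assumptions: checkpoint name, numbers tokenized in isolation).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-xl")

def token_starts(timestamps):
    """Index of the first token of each timestamp in the concatenated sequence."""
    starts, pos = [], 0
    for t in timestamps:
        starts.append(pos)
        pos += len(tokenizer.encode(str(t), add_special_tokens=False))
    return starts

# 112 is one of the two-token numbers mentioned below, so every timestamp
# after it starts one position later than in the single-token-only sequence.
print(token_starts([110, 111, 113, 114]))
print(token_starts([110, 112, 113, 114]))
```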

This number tokenization is fairly arbitrary because the T5 tokenizer uses a standard subword vocabulary. Up to 150, there are three numbers that are tokenized as two tokens: 112, 128, and 135. In that range, the little "annoying numbers" workaround helps. For datasets like ActivityNet, where the videos can get much longer, we don't use this trick beyond 200, because above 200 there are far too many "annoying" numbers.
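
If you want to check which numbers are affected for your own backbone, a small sketch like the following lists every integer up to 150 that the tokenizer splits into more than one piece (again, the checkpoint name is an assumption; any T5/Flan-T5 tokenizer can be substituted):

```python
# Sketch: enumerate the "annoying" (multi-token) integers for a T5-style tokenizer.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-xl")

for n in range(151):
    ids = tokenizer.encode(str(n), add_special_tokens=False)
    if len(ids) > 1:
        # Print the number together with its subword pieces.
        print(n, tokenizer.convert_ids_to_tokens(ids))
```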

In general, number tokenization has been a long-standing issue for LLMs, especially for arithmetic operations! I can recommend having a look at this interesting recent blog post showing the severe effects of number-tokenization design.

In the end, if we used a different LLM backbone with a different tokenizer (a better one w.r.t. number tokenization), we wouldn't need this little workaround.

I hope this makes sense! :)

pangzss commented 2 months ago

Thank you for the thorough explanation! The reference is very helpful as well!