mlfoundations / MINT-1T

MINT-1T: A one trillion token multimodal interleaved dataset.

Why is OBELICS generally better than MINT-1T (HTML)? #4

Closed lijinginfo closed 2 months ago

lijinginfo commented 5 months ago

Why is OBELICS generally better than MINT-1T (HTML)? Is the main advantage of MINT-1T over OBELICS primarily related to PDFs?

anas-awadalla commented 4 months ago

Hi @lijinginfo! Thanks for your interest, and sorry it took so long to get to this. I was waiting for us to release our latest report so I could share more solid evidence as to why MINT-1T (HTML) is closer to OBELICS than the v1 paper suggested.

One thing I will say is that the results in the v1 paper were flawed in a few ways: 1) we did not ablate demonstration separators for the in-context learning prompts (see the sketch below), and 2) we did not include VQAv2 evals in the "a" plot.
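For concreteness, "demonstration separators" are the strings inserted between few-shot examples in the prompt, and evaluation scores can be sensitive to that choice. Here is a minimal sketch of what ablating them looks like; this is not our evaluation code, and the `build_icl_prompt` helper, the Question/Answer format, and the separator set are all hypothetical:

```python
# Illustrative sketch (not the MINT-1T evaluation harness): shows how the
# choice of demonstration separator changes an in-context-learning prompt.

def build_icl_prompt(demos, query, separator="\n\n"):
    """Join few-shot demonstrations and the final query with a separator."""
    parts = [f"Question: {q}\nAnswer: {a}" for q, a in demos]
    parts.append(f"Question: {query}\nAnswer:")
    return separator.join(parts)

demos = [
    ("What color is the sky?", "blue"),
    ("How many legs does a cat have?", "four"),
]

# Ablating the separator means scoring the model under each variant and
# reporting the best (or all of them), since results can swing between choices.
for sep in ["\n\n", "\n", " "]:
    prompt = build_icl_prompt(demos, "What is 2 + 2?", separator=sep)
    print(repr(sep), "->", prompt.count("Question:"), "demos+query,", len(prompt), "chars")
```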

We reran results for MINT-1T (HTML) and OBELICS using both the XGen-MM and Idefics2 architectures and find that performance is very close. Here are some numbers:

[Figure: bar plots of MINT-1T (HTML) vs. OBELICS performance]

We also reran experiments for the HTML + PDF data from MINT-1T and now see a much bigger performance gap! (Note: this is just for the XGen-MM architecture experiments.)

[Figure: shot-scaling results for MINT-1T (HTML + PDF)]

Thanks!

anas-awadalla commented 2 months ago

Closing this but feel free to reopen if you have any more questions!