mlfoundations / MINT-1T

MINT-1T: A one trillion token multimodal interleaved dataset.

Why is MINT data interleaved? #10

Closed by nhsjgczryf 1 month ago

nhsjgczryf commented 1 month ago

I downloaded some of the arXiv and PDF data from the MINT-1T dataset, but the data in the tar file is organized as pairs of TIFF and JSON files rather than as interleaved documents like in MMC4. Where did I go wrong?

anas-awadalla commented 1 month ago

Each TIFF is a multi-frame file: it holds one frame per image, so it has several frames when the document has more than one image. A document and its images are the JSON and TIFF files that share the same base name.
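For anyone reading a shard by hand, here is a minimal sketch of that pairing, not the project's official loader. The names `iter_documents` and `load_frames` are hypothetical. It assumes the images use a `.tiff` or `.tif` extension and that Pillow is installed for decoding. The JSON schema is not described in this thread, so the sketch only parses the JSON and does not say how its contents line up with the frames.

```python
import io
import json
import os
import tarfile


def iter_documents(tar):
    """Yield (base_name, parsed_json, tiff_bytes) for each JSON/TIFF pair.

    `tar` is an already opened tarfile.TarFile. Files are matched purely by
    shared base name. Entries that never find a partner are skipped, which is
    an assumption of this sketch rather than documented dataset behavior.
    """
    pending = {}  # base name -> {"json": bytes, "tiff": bytes}
    for member in tar:
        if not member.isfile():
            continue
        base, ext = os.path.splitext(member.name)
        ext = ext.lower().lstrip(".")
        if ext not in ("json", "tiff", "tif"):
            continue
        kind = "json" if ext == "json" else "tiff"
        entry = pending.setdefault(base, {})
        entry[kind] = tar.extractfile(member).read()
        if "json" in entry and "tiff" in entry:
            del pending[base]
            yield base, json.loads(entry["json"]), entry["tiff"]


def load_frames(tiff_bytes):
    """Decode every frame of a multi-frame TIFF into a list of PIL images."""
    # Pillow is imported lazily so the pairing code above stays stdlib-only.
    from PIL import Image, ImageSequence

    with Image.open(io.BytesIO(tiff_bytes)) as img:
        return [frame.copy() for frame in ImageSequence.Iterator(img)]
```

To use it, open a shard with `tarfile.open(path)`, loop over `iter_documents(tar)`, and call `load_frames(tiff_bytes)` on each result to get one image per frame. Pairing by base name instead of by position in the archive means the sketch does not depend on the JSON and TIFF being adjacent in the tar.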