I downloaded some arXiv and PDF data from the MINT-1T dataset, but I found that the data in the tar file is organized in pairs of tiff and json files, rather than interleaved data like in MMC4. Where did I go wrong?
Each document and its associated images are stored as a TIFF file and a JSON file with the same base name. The TIFF holds the document's images and contains multiple frames when the document has more than one image.
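To make the pairing concrete, here is a minimal sketch of walking one tar shard and matching each JSON file to the TIFF with the same base name. The function name `iter_documents` is hypothetical, the `.tif`/`.tiff`/`.json` extensions are an assumption about how the shard names its members, and nothing is assumed about the fields inside the JSON.

```python
import json
import os
import tarfile
from collections import defaultdict


def iter_documents(tar_path):
    """Yield (base_name, metadata, tiff_bytes) for each TIFF/JSON pair in a tar shard.

    Hypothetical helper: it buffers the whole shard in memory, which is
    simple but may be too heavy for very large shards.
    """
    groups = defaultdict(dict)
    with tarfile.open(tar_path) as tar:
        for member in tar:
            if not member.isfile():
                continue
            # Files belonging to the same document share a base name.
            base, ext = os.path.splitext(member.name)
            ext = ext.lower()
            if ext == ".json":
                groups[base]["json"] = tar.extractfile(member).read()
            elif ext in (".tif", ".tiff"):
                groups[base]["tiff"] = tar.extractfile(member).read()
    for base, parts in groups.items():
        # Skip entries that are missing either half of the pair.
        if "json" in parts and "tiff" in parts:
            yield base, json.loads(parts["json"]), parts["tiff"]
```

The returned TIFF bytes can then be decoded frame by frame with an image library. With Pillow, for example, `Image.open(io.BytesIO(tiff_bytes))` followed by `ImageSequence.Iterator(...)` should yield one frame at a time. How those frames map back to positions in the text depends on the JSON contents, which this sketch does not interpret.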