mlfoundations / MINT-1T

MINT-1T: A one trillion token multimodal interleaved dataset.

How to align image data in json file with tiff image? #9

Closed chenyehuang closed 3 months ago

chenyehuang commented 3 months ago

I ran some experiments and found that the order of the image entries in the JSON file does not align one-to-one with the TIFF images. I also can't tell what data the image sha256 in the JSON was computed from. Could you clarify?

anas-awadalla commented 3 months ago

Hello! Which subset of the data are you referring to? We are aware of a problem with the PDF subset where images on the same page can sometimes be ordered differently from each other (and yes, there is a bug with the hash in this scenario; we will see if we can fix it). You should still be able to map from TIFF to JSON, though, as the images appear in the order used in the texts list.
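A minimal sketch of the order-based mapping described above. It assumes the JSON record has a `texts` list in which image positions are marked by `None`, and that the TIFF frames have already been read into a list in file order. Both are assumptions about the layout, and `align_frames_to_slots` is a hypothetical helper, not part of the dataset tooling.

```python
from typing import Any, List, Optional, Tuple


def align_frames_to_slots(
    texts: List[Optional[str]], frames: List[Any]
) -> List[Tuple[int, Any]]:
    """Pair the k-th TIFF frame with the k-th image slot in `texts`.

    Assumption: a None entry in `texts` marks a position occupied by an image.
    """
    # Indices in the interleaved sequence where an image is expected.
    slots = [i for i, text in enumerate(texts) if text is None]
    if len(slots) != len(frames):
        raise ValueError(
            f"{len(slots)} image slots in texts but {len(frames)} TIFF frames"
        )
    # Order-based pairing: no hash lookup is involved.
    return list(zip(slots, frames))


# Toy data standing in for one document (not taken from the dataset).
texts = ["Intro paragraph", None, "Caption text", None]
frames = ["frame_0", "frame_1"]
pairs = align_frames_to_slots(texts, frames)
```

The pairing relies purely on position, so it does not correct cases where images on the same page come out in a different order.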

chenyehuang commented 3 months ago

Thanks for your reply. I did have the problem with the PDF subset. In addition, is the sha256 hash computed from the image data alone, or is other information included? Why do I get a different result from the one in the JSON file when I compute the sha256 of the image?

anas-awadalla commented 3 months ago

It is computed from the images, but before conversion to TIFF format, so unfortunately they no longer match :(. It was a mistake on our end. Thanks for bringing this up, though! I will update the README and maybe ping here if I get the chance to fix it.
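To illustrate why the hashes diverge: sha256 is computed over the encoded bytes, so re-encoding the same pixels into another container changes the digest. The sketch below uses `zlib` with two made-up container prefixes as stand-ins for the original format and TIFF; it does not reproduce the actual conversion pipeline.

```python
import hashlib
import zlib


def sha256_hex(data: bytes) -> str:
    """Return the hex sha256 digest of a byte string."""
    return hashlib.sha256(data).hexdigest()


# Stand-in for decoded pixel data (hypothetical, not a real image).
pixels = bytes(range(256)) * 4

# Two different encodings of the same pixels; the prefixes are invented
# markers playing the role of "original format" and "TIFF".
encoding_a = b"FMT_A" + zlib.compress(pixels, 1)
encoding_b = b"FMT_B" + zlib.compress(pixels, 9)

hash_a = sha256_hex(encoding_a)
hash_b = sha256_hex(encoding_b)

# Decoding both encodings recovers identical pixel data.
decoded_a = zlib.decompress(encoding_a[5:])
decoded_b = zlib.decompress(encoding_b[5:])

print("same pixels:", decoded_a == decoded_b)
print("same hash:  ", hash_a == hash_b)
```

Hashing the TIFF bytes therefore cannot reproduce a digest that was taken over the pre-conversion file, even when the decoded image content is unchanged.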

chenyehuang commented 3 months ago

Thank you for answering my questions.