mlfoundations / MINT-1T

MINT-1T: A one trillion token multimodal interleaved dataset.

Would you share your data processing code? #11

Closed by MarStarck 1 month ago

MarStarck commented 2 months ago

For example, the PDF pipeline?

anas-awadalla commented 2 months ago

Yes, I plan to do this, but the code needs some reworking before I can share it publicly. I will keep you updated.

anas-awadalla commented 1 month ago

I have uploaded the code for processing a PDF here. Hope this is helpful.