huggingface / OBELICS

Code used for the creation of OBELICS, an open, massive and curated collection of interleaved image-text web documents, containing 141M documents, 115B text tokens and 353M images.
https://huggingface.co/datasets/HuggingFaceM4/OBELICS
Apache License 2.0
171 stars, 9 forks

Missing TextMediaPairsExtractor from the repo #7

Open kckishan opened 2 months ago

kckishan commented 2 months ago

Hi, can you share the TextMediaPairsExtractor that you refer to in obelics/visualization/global_visualization.py?

HugoLaurencon commented 1 month ago

Hi, it was not very useful in the end, so I would recommend commenting out the parts where it's mentioned in global_visualization.py.
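
To find the parts to comment out, something like the following may help. It is only a sketch that lists every line mentioning the class name; the helper `find_mentions` is hypothetical and not part of the OBELICS repo, and the path assumes you run it from the root of a local checkout.

```python
from pathlib import Path


def find_mentions(source: str, name: str = "TextMediaPairsExtractor"):
    """Return (line_number, line) pairs for every line that mentions `name`."""
    return [
        (number, line.rstrip())
        for number, line in enumerate(source.splitlines(), start=1)
        if name in line
    ]


if __name__ == "__main__":
    # Path taken from the issue; adjust it to your own checkout if needed.
    path = Path("obelics/visualization/global_visualization.py")
    if path.exists():
        for number, line in find_mentions(path.read_text(encoding="utf-8")):
            print(f"{path}:{number}: {line}")
    else:
        print(f"{path} not found; run this from the repository root.")
```

Each reported line (the import and any usage) is a candidate for commenting out. Code that depends on the results of those lines may need the same treatment.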