huggingface / OBELICS

Code used for the creation of OBELICS, an open, massive and curated collection of interleaved image-text web documents, containing 141M documents, 115B text tokens and 353M images.
https://huggingface.co/datasets/HuggingFaceM4/OBELICS
Apache License 2.0

Which folder to use? #2

Closed: mckinziebrandon closed this issue 9 months ago

mckinziebrandon commented 10 months ago

Hi, excellent work!

I've read the README a few times but I'm still not sure which directory (obelics or build_obelics) should be used to create the dataset. They both seem to do similar things in similar ways.

Also, it would be ideal if there were a set of instructions for obtaining the dataset, rather than a directory of scripts that each take different arguments.

HugoLaurencon commented 9 months ago

Hi @mckinziebrandon, thanks for your interest.

You should use the build_obelics folder: it contains the scripts used to create the dataset. These scripts call the methods defined in the obelics folder.
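For anyone looking for a starting point, here is a minimal sketch that lists the scripts in build_obelics from a local clone. It is not part of the repository. The helper name `list_pipeline_scripts`, the clone path `OBELICS`, and the assumption that the scripts are `.py` files are all illustrative, and sorting by file name is only a convenience, not a documented run order.

```python
from pathlib import Path
from typing import List


def list_pipeline_scripts(repo_root: str, scripts_dir: str = "build_obelics") -> List[str]:
    """Return the names of the Python scripts in `scripts_dir`, sorted by file name.

    Hypothetical helper: it assumes the scripts are plain .py files sitting
    directly inside the folder.
    """
    folder = Path(repo_root) / scripts_dir
    if not folder.is_dir():
        raise FileNotFoundError(f"{folder} does not exist; is repo_root a clone of the repo?")
    return sorted(p.name for p in folder.glob("*.py"))


if __name__ == "__main__":
    # "OBELICS" is an assumed path to a local clone of the repository.
    try:
        for name in list_pipeline_scripts("OBELICS"):
            print(name)
    except FileNotFoundError as err:
        print(err)
```

Each script still has to be opened to see which arguments it expects, since those are not described in this thread.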

mckinziebrandon commented 9 months ago

Great, thanks!