huggingface / OBELICS

Code used for the creation of OBELICS, an open, massive and curated collection of interleaved image-text web documents, containing 141M documents, 115B text tokens and 353M images.
https://huggingface.co/datasets/HuggingFaceM4/OBELICS
Apache License 2.0

Search engine over the training data #5

Open aleSuglia opened 7 months ago

aleSuglia commented 7 months ago

Hello all,

Thanks a lot for releasing this dataset. I was wondering whether you are planning to release any kind of "search engine" over it, similar in spirit to what others have started doing for LLM training data (e.g., the ROOTS Search tool).

Many thanks, Alessandro

HugoLaurencon commented 1 month ago

No, but we have a Nomic map: https://atlas.nomic.ai/map/obelics