Code used for the creation of OBELICS, an open, massive and curated collection of interleaved image-text web documents, containing 141M documents, 115B text tokens and 353M images.
Hello all,
Thanks a lot for releasing this dataset. I was wondering whether you are planning to release any form of "search engine" over it, similar in spirit to what others have started doing for LLM data (e.g., the ROOTS Search tool).
Many thanks, Alessandro