huggingface / OBELICS

Code used for the creation of OBELICS, an open, massive and curated collection of interleaved image-text web documents, containing 141M documents, 115B text tokens and 353M images.
https://huggingface.co/datasets/HuggingFaceM4/OBELICS
Apache License 2.0

Training Details #1

Closed by vateye 8 months ago

vateye commented 11 months ago

Hi, I have a question about the pre-training stage for OBELISC. Did you use the whole document for pre-training, or just each image and its paired text?

HugoLaurencon commented 10 months ago

Hi @vateye, thanks for your question. We used the whole document whenever possible. However, some documents contain more tokens than the maximum sequence length allowed by the pre-trained LM, so in those cases we had to truncate.
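
For readers wondering what this kind of truncation might look like, here is a minimal sketch. It is not the code used for training. The document representation (a list of `("text", ...)` / `("image", ...)` pairs), the function name `truncate_document`, the `tokens_per_image` cost, and the whitespace split standing in for a real tokenizer are all assumptions made for illustration.

```python
from typing import List, Tuple

# Hypothetical representation of an interleaved document: a list of
# ("text", string) or ("image", url) pairs. This is NOT the OBELICS schema.
Item = Tuple[str, str]


def truncate_document(
    items: List[Item],
    max_tokens: int,
    tokens_per_image: int = 64,  # assumed fixed token cost per image
) -> List[Item]:
    """Keep items from the start of the document until the budget runs out."""
    kept: List[Item] = []
    budget = max_tokens
    for kind, value in items:
        if budget <= 0:
            break
        if kind == "image":
            # An image cannot be kept partially, so stop if it does not fit.
            if tokens_per_image > budget:
                break
            kept.append((kind, value))
            budget -= tokens_per_image
        else:
            # Whitespace splitting is a stand-in for the LM's real tokenizer.
            words = value.split()
            if not words:
                continue
            kept_words = words[:budget]
            kept.append(("text", " ".join(kept_words)))
            budget -= len(kept_words)
    return kept


if __name__ == "__main__":
    doc = [("text", "a b c"), ("image", "img1"), ("text", "d e f g")]
    print(truncate_document(doc, max_tokens=6, tokens_per_image=2))
```

A document that fits within `max_tokens` comes back whole, which matches the "whole document whenever possible" behaviour described above; only longer documents lose their tail.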