huggingface / OBELICS

Code used for the creation of OBELICS, an open, massive and curated collection of interleaved image-text web documents, containing 141M documents, 115B text tokens and 353M images.
https://huggingface.co/datasets/HuggingFaceM4/OBELICS
Apache License 2.0

Releasing trained topic models? #8

Open vishaal27 opened 2 months ago

vishaal27 commented 2 months ago

Hey, thanks for the great work! Do you plan to release your trained LDA model used for the analysis in Sec. 4.2? Thanks!

HugoLaurencon commented 1 month ago

Hi, thanks! I don't think I still have it, but it didn't take long to train: I ran it on my personal computer for 1 day, so it should be reproducible.