huggingface / OBELICS

Code used for the creation of OBELICS, an open, massive and curated collection of interleaved image-text web documents, containing 141M documents, 115B text tokens and 353M images.
https://huggingface.co/datasets/HuggingFaceM4/OBELICS
Apache License 2.0

How to use LDA for topic modeling #12

Open jrryzh opened 1 month ago

jrryzh commented 1 month ago

Thanks again for your work! The paper says the topic modeling of OBELICS was done with LDA. I am wondering which specific LDA implementation was used, what settings were used to train the model, and, most importantly, how the topic names were derived from the keywords and weights (for example, by using LLMs). Thank you for answering!

HugoLaurencon commented 2 weeks ago

We used this implementation: https://mimno.github.io/Mallet/topics. I don't remember the parameters, but they should be the default ones. Yes, we used ChatGPT to generate the topic names from the keywords!
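
For anyone trying to reproduce the keyword-to-topic-name step, here is a minimal sketch. It is not the code used for OBELICS. It assumes the topic keys file written by MALLET's `train-topics --output-topic-keys` option has one topic per line in the form `id<TAB>weight<TAB>space-separated keywords`. The function names, the sample keywords, and the prompt wording are all hypothetical, and the sketch only builds the prompt text; sending it to ChatGPT or another LLM is left out.

```python
from typing import List, Tuple

Topic = Tuple[int, float, List[str]]


def parse_topic_keys(text: str) -> List[Topic]:
    """Parse topic keys text, assuming 'id<TAB>weight<TAB>keywords' per line."""
    topics: List[Topic] = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        topic_id, weight, words = line.split("\t", 2)
        topics.append((int(topic_id), float(weight), words.split()))
    return topics


def build_naming_prompt(keywords: List[str], max_words: int = 20) -> str:
    """Build a hypothetical prompt asking an LLM to name a topic from its keywords."""
    shown = ", ".join(keywords[:max_words])
    return (
        "The following keywords, ordered from most to least important, "
        "were produced by an LDA topic model for a single topic:\n"
        f"{shown}\n"
        "Reply with a short name (at most four words) for this topic."
    )


if __name__ == "__main__":
    # Made-up sample data, only to illustrate the assumed file layout.
    sample = (
        "0\t0.25\tgame team season player win\n"
        "1\t0.25\trecipe cook sauce butter oven\n"
    )
    for topic_id, weight, keywords in parse_topic_keys(sample):
        print(f"--- topic {topic_id} (weight {weight}) ---")
        print(build_naming_prompt(keywords))
```

Each printed prompt can then be pasted into a chat model, or sent through whichever API client you use, and the reply stored as the topic name.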