minimalparts / nonce2vec

Incremental learning of word embeddings with context informativeness.

hi~which version of wikipedia corpus was used? #2

Closed: willanxywc closed this issue 6 years ago

willanxywc commented 6 years ago

Which version of the Wikipedia corpus did you use to pre-train N2V? It's not clearly stated in the paper, and there are so many versions and copies of Wikipedia available.

minimalparts commented 6 years ago

We used a rather old dump from November 2015, which we had lying around already pre-processed. I don't think there should be too much variation across dumps, but if you really want to replicate everything from scratch, I think I still have the tokenised sentences somewhere and could get them to you (contact me offline). Also, the gensim model we trained is available here.
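For anyone trying to replicate the setup, a quick sanity check on the downloaded background model is to load it with gensim and query it. This is a minimal sketch only: the filename and the assumption that the model was saved with gensim's `model.save()` (rather than exported in word2vec text/binary format) are mine, not stated in this thread.

```python
# Minimal sketch: load the pre-trained background Word2Vec model with gensim.
# The filename below is hypothetical; adjust it to whatever the download provides.
from gensim.models import Word2Vec

model = Word2Vec.load("wiki_all.model")  # assumes a gensim-native save, not word2vec format

# Basic sanity checks on the loaded model.
print(len(model.wv.key_to_index))              # vocabulary size (gensim >= 4.0)
print(model.wv.most_similar("london", topn=5)) # nearest neighbours of a common word
```

If the file turns out to be a word2vec-format export instead, `gensim.models.KeyedVectors.load_word2vec_format` is the corresponding loader.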

willanxywc commented 6 years ago

Thanks~