jhlau / doc2vec

Python scripts for training/testing paragraph vectors
Apache License 2.0

train Wikipedia's Chinese data sets for doc2vec #5

Closed yuquanle closed 7 years ago

yuquanle commented 7 years ago

I have trained word2vec on Chinese Wikipedia. I would like to use this method to train doc2vec on Wikipedia's Chinese data set on Ubuntu. What should I do?

imtypist commented 7 years ago

+1. The process gets killed when I train on Wikipedia's Chinese data. Maybe it ran out of RAM.

jhlau commented 7 years ago

You'll need to preprocess the Chinese text and tokenise the words first. The current script (doc2vec.py) calls TaggedLineDocument:

docs = g.doc2vec.TaggedLineDocument(train_corpus)

This tokenises words based on whitespace only, which will not work for Chinese. In any case, this is not an issue with the program; if you have more questions on how to use gensim's doc2vec for Chinese text, please ask on their forum.
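To illustrate the preprocessing step described above, here is a minimal sketch that writes a corpus in the format TaggedLineDocument expects: one document per line, tokens separated by spaces. The character-level segmenter below is a crude stand-in chosen for this example; in practice you would use a real Chinese segmenter such as jieba (not part of this repo).

```python
import re

def tokenise(text):
    # Crude stand-in for a proper Chinese segmenter (e.g. jieba):
    # splits CJK text into single characters and keeps ASCII runs intact.
    return re.findall(r"[\u4e00-\u9fff]|[A-Za-z0-9]+", text)

def preprocess(docs, out_path):
    # One document per line, tokens joined by spaces, because gensim's
    # TaggedLineDocument splits lines on whitespace only.
    with open(out_path, "w", encoding="utf-8") as f:
        for doc in docs:
            f.write(" ".join(tokenise(doc)) + "\n")

# Hypothetical toy corpus for demonstration.
docs = ["我爱自然语言处理 word2vec", "维基百科中文数据"]
preprocess(docs, "train_corpus.txt")

# The resulting file can then be fed to the existing call in doc2vec.py:
#   docs = g.doc2vec.TaggedLineDocument("train_corpus.txt")
```

With word-level segmentation substituted in, the rest of doc2vec.py should work unchanged, since the corpus is now whitespace-tokenised like an English one.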