Closed yuquanle closed 7 years ago
+1. The process gets killed when I train on Wikipedia's Chinese data; it may have run out of RAM.
You'll need to preprocess the Chinese text and tokenise the words first. The current script (doc2vec.py) calls TaggedLineDocument: docs = g.doc2vec.TaggedLineDocument(train_corpus)
TaggedLineDocument only tokenises on whitespace, which will not work for Chinese. In any case, this is not a bug in the program; if you have further questions about using gensim's doc2vec for Chinese text, please ask on their forum.
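To illustrate the preprocessing step: the idea is to rewrite the corpus so that tokens are already separated by spaces before TaggedLineDocument sees it. A real segmenter such as jieba is the usual choice for Chinese; the sketch below uses only the standard library and a naive character-level split (every CJK character becomes its own token), purely as an assumption-laden illustration of the pipeline. File names like `zhwiki_segmented.txt` are hypothetical.

```python
def segment_line(line):
    """Naively segment a line: each CJK character becomes its own token,
    while runs of non-CJK characters (Latin words, digits) stay together."""
    tokens = []
    prev_cjk = True  # force a new token at the start of the line
    for ch in line.strip():
        if ch.isspace():
            prev_cjk = True  # next non-CJK char starts a fresh token
            continue
        is_cjk = '\u4e00' <= ch <= '\u9fff'  # CJK Unified Ideographs block
        if is_cjk or prev_cjk or not tokens:
            tokens.append(ch)
        else:
            tokens[-1] += ch
        prev_cjk = is_cjk
    return ' '.join(tokens)

def preprocess_corpus(src_path, dst_path):
    """Write a whitespace-tokenised copy of the corpus, one document per line,
    so the existing doc2vec.py pipeline can read it unchanged."""
    with open(src_path, encoding='utf-8') as src, \
         open(dst_path, 'w', encoding='utf-8') as dst:
        for line in src:
            dst.write(segment_line(line) + '\n')

# Afterwards, the existing call in doc2vec.py works as-is, e.g.:
# docs = g.doc2vec.TaggedLineDocument('zhwiki_segmented.txt')
```

For real training you would replace `segment_line` with a proper word segmenter (e.g. `jieba.cut`), since character-level tokens lose word boundaries, but the file-rewriting structure stays the same.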
I have already trained a word2vec model on Chinese Wikipedia. I would like to use this method to train on the Chinese Wikipedia dataset on Ubuntu; what should I do?