jasperzhong closed this issue 6 years ago.
Hi, I haven't tried it on Chinese data, but I assume a special preprocessing step is required, since Chinese text isn't easily segmented into words. The GloVe word vectors should also be replaced with Chinese word embeddings.
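For illustration, a minimal segmentation sketch using jieba (one common Chinese tokenizer; the sentence is just an example, and any segmenter with similar output would do):

```python
# Sketch of Chinese word segmentation with jieba.
import jieba

text = "今天天气很好"  # example sentence: "The weather is nice today"
tokens = list(jieba.cut(text))  # jieba.cut returns a generator of tokens
print(tokens)  # e.g. ['今天', '天气', '很', '好']
```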
Yes, spaCy only supports segmenting Chinese words; it doesn't support other basic NLP functions for Chinese, such as POS tagging and NER. However, StanfordCoreNLP can do this. I wonder whether this model uses features that need NLP functions beyond word segmentation. If so, I suppose I should switch to StanfordCoreNLP. And yes, I used Chinese word2vec vectors from here.
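For reference, a minimal sketch of getting Chinese segmentation, POS tags, and entities via Stanza, Stanford's official Python NLP library (the older stanfordcorenlp wrapper exposes similar functionality); the example sentence is made up and this is untested against this repo's pipeline:

```python
# Sketch: Chinese segmentation, POS tagging, and NER with Stanza.
import stanza

stanza.download("zh")  # one-time model download
nlp = stanza.Pipeline(lang="zh", processors="tokenize,pos,ner")

doc = nlp("李雷在北京大学学习。")  # example sentence
for sent in doc.sentences:
    for word in sent.words:
        print(word.text, word.upos)  # token and universal POS tag
for ent in doc.ents:
    print(ent.text, ent.type)  # named entities
```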
@zhongyuchen It is probably better to use FastText than word2vec :) https://fasttext.cc/docs/en/crawl-vectors.html
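For example, the Chinese crawl vectors from that page can be loaded with gensim (a sketch, assuming you downloaded the text-format cc.zh.300.vec file linked there):

```python
# Sketch: loading the FastText Chinese crawl vectors with gensim,
# assuming cc.zh.300.vec (text format) has been downloaded from
# https://fasttext.cc/docs/en/crawl-vectors.html
from gensim.models import KeyedVectors

vectors = KeyedVectors.load_word2vec_format("cc.zh.300.vec", binary=False)
print(vectors["北京"].shape)  # (300,) if the word is in vocabulary
```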
I wonder if there are any Chinese datasets available in the same format. I got the DuReader dataset from here, but found that the answers are not spans from the context.
@yangyuji12 You can try translating SQuAD paragraphs and questions into Chinese and then finding the answer span in the translated paragraph (this could be the most difficult part). The authors describe this approach in the "Data Augmentation by backtranslation" section.
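For the span-recovery step, one rough approach (not necessarily what the authors did) is fuzzy matching of the back-translated answer against the back-translated paragraph; a sketch with the standard library, where the strings are hypothetical placeholders:

```python
# Sketch: after back-translating paragraph and answer to Chinese,
# recover the answer span by approximate matching (stdlib difflib).
from difflib import SequenceMatcher

def find_span(paragraph: str, answer: str):
    """Return (start, end) offsets of the best approximate match of answer."""
    matcher = SequenceMatcher(None, paragraph, answer, autojunk=False)
    m = matcher.find_longest_match(0, len(paragraph), 0, len(answer))
    if m.size == 0:
        return None
    return m.a, m.a + m.size

# Hypothetical example strings:
span = find_span("这家公司在1995年成立于北京。", "1995年成立")
print(span)  # character offsets of the matched block, or None
```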
I just changed `nlp = spacy.blank("en")` to `nlp = spacy.blank("zh")`. Is that OK?
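A quick check of what that change gives you (a sketch; note that you only get segmentation this way, not POS or NER, and the spaCy versions of that era delegated Chinese segmentation to jieba, which must be installed):

```python
# Sketch: spacy.blank("zh") provides a Chinese tokenizer only --
# no POS tags or entities. Older spaCy versions require jieba
# to be installed for the actual segmentation.
import spacy

nlp = spacy.blank("zh")
doc = nlp("今天天气很好")  # example sentence
print([token.text for token in doc])  # segmented tokens
```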