localminimum / QANet

A Tensorflow implementation of QANet for machine reading comprehension
MIT License

Can it support Chinese? #28

Closed jasperzhong closed 6 years ago

jasperzhong commented 6 years ago

I just changed nlp = spacy.blank("en") to nlp = spacy.blank("zh"). Is that OK?

localminimum commented 6 years ago

Hi, I haven't tried it on Chinese data, but I assume a special preprocessing step is required for Chinese text, since it can't be split into words and characters as easily as English. The GloVe word vectors should also be replaced with Chinese word vectors.
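To make the preprocessing point concrete, here is a minimal sketch of a character-level tokenizer that also records character offsets, which span-based models need in order to map an answer back to token positions. It is only an illustration under stated assumptions: the name char_tokenize and the regular expression are hypothetical and not part of this repository, and real Chinese data would probably be better served by a proper word segmenter.

```python
import re

# Assumption: one CJK ideograph per token, a run of ASCII letters/digits as one
# token, and any other non-space character (e.g. punctuation) as its own token.
_TOKEN_RE = re.compile(r"[\u4e00-\u9fff]|[A-Za-z0-9]+|[^\s\u4e00-\u9fffA-Za-z0-9]")


def char_tokenize(text):
    """Return (tokens, spans); each span is a (start, end) character offset."""
    tokens, spans = [], []
    for match in _TOKEN_RE.finditer(text):
        tokens.append(match.group())
        spans.append((match.start(), match.end()))
    return tokens, spans


tokens, spans = char_tokenize("QANet 支持中文吗?")
print(tokens)
print(spans)
```

The offsets are what matter here: whichever tokenizer is used, each token has to be traceable to its position in the original context.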

jasperzhong commented 6 years ago

Yes, spaCy only supports word segmentation for Chinese. It doesn't support other basic NLP functions, such as POS tagging and NER. StanfordCoreNLP can do these, however. I wonder whether this model uses any features that need NLP functions beyond word segmentation. If so, I suppose I should switch to StanfordCoreNLP. And yes, I used Chinese word2vec vectors from here.

maciejbiesek commented 6 years ago

@zhongyuchen It is probably better to use fastText than word2vec :) https://fasttext.cc/docs/en/crawl-vectors.html
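For reference, a rough sketch of reading vectors in the textual .vec layout (a header line with the word count and dimension, then one word and its values per line). The function name load_vec and the vocab filter are hypothetical helpers for illustration, not this repository's loader, and the sample data is made up.

```python
import io


def load_vec(lines, vocab=None):
    """Parse "<count> <dim>" followed by "<word> <v1> ... <v_dim>" rows.

    `lines` is any iterable of strings, for example an open text file.
    If `vocab` is given, only words in it are kept (saves memory).
    """
    it = iter(lines)
    _count, dim = map(int, next(it).split())
    vectors = {}
    for line in it:
        parts = line.rstrip().split(" ")
        word, values = parts[0], parts[1:]
        if len(values) != dim:
            continue  # skip malformed rows
        if vocab is not None and word not in vocab:
            continue
        vectors[word] = [float(v) for v in values]
    return dim, vectors


# Tiny in-memory sample standing in for a real vector file.
sample = io.StringIO("2 3\n中文 0.1 0.2 0.3\n模型 0.4 0.5 0.6\n")
dim, vectors = load_vec(sample, vocab={"中文"})
print(dim, vectors)
```

Filtering by the dataset's vocabulary while reading avoids holding the full vector file in memory.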

zhixiaochuan12 commented 6 years ago

I wonder whether any Chinese datasets in the same format are available. I got the DuReader dataset from here, but found that its answers are not spans from the context.

maciejbiesek commented 6 years ago

@yangyuji12 You can try translating the SQuAD paragraphs and questions into Chinese and then locating the answer in the translated paragraph (this could be the most difficult part). The authors described this approach in the "Data Augmentation by backtranslation" section.
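For the "find an answer in paragraph" step, one simple heuristic is to slide a window over the translated paragraph and keep the substring most similar to the translated answer. The sketch below uses difflib from the Python standard library; it is an assumption for illustration, not necessarily the procedure from the paper, and find_closest_span, the window slack of 2 characters, and the min_ratio threshold are all hypothetical choices.

```python
from difflib import SequenceMatcher


def find_closest_span(context, answer, min_ratio=0.6):
    """Return the (start, end) character span of `context` most similar to
    `answer`, or None if no window reaches `min_ratio`."""
    best_ratio, best_span = 0.0, None
    n = len(answer)
    # Try windows slightly shorter and longer than the answer itself.
    for size in range(max(1, n - 2), n + 3):
        for start in range(0, len(context) - size + 1):
            window = context[start:start + size]
            ratio = SequenceMatcher(None, window, answer).ratio()
            if ratio > best_ratio:
                best_ratio, best_span = ratio, (start, start + size)
    return best_span if best_ratio >= min_ratio else None


context = "模型在2018年由谷歌大脑提出"
span = find_closest_span(context, "谷歌的大脑")  # answer wording differs slightly
print(span, context[span[0]:span[1]])
```

This brute-force search is quadratic in paragraph length, which is acceptable for short paragraphs but would need pruning on a full dataset.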