hitvoice / DrQA

A pytorch implementation of Reading Wikipedia to Answer Open-Domain Questions.

Using DrQA on an Chinese dataset #22


kaihuchen commented 6 years ago

Is it expected that this code can be applied to a Chinese language dataset with only minor changes?

I understand that I will need to provide the following:

Would very much appreciate any insights if there are any known reasons why this is not expected to work.

hitvoice commented 6 years ago

Yes, only the modifications you mentioned should be needed. The trickiest part will be the Chinese SpaCy models, which are not officially supported.

kaihuchen commented 6 years ago

@hitvoice Much appreciated for the confirmation! One more question: given that Chinese text has no natural word boundaries, does it make any difference to DrQA whether the dataset is word-segmented first (分词, e.g. with a tool such as Jieba)? Or can I assume that, since SpaCy does tokenization in its own way, I don't have to do anything special in this respect?

hitvoice commented 6 years ago

You should tokenize your Chinese data first. Prepare your data as "这是 一个 分词 后 的 样例" ("This is a segmented example" — tokens separated by spaces) and provide the corresponding POS and NER tags. This is not a simple copy-and-paste job; a fair amount of work and modification is needed for Chinese support.
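A minimal sketch of the preprocessing step described above — segmenting Chinese text into space-separated tokens with Jieba and collecting POS tags alongside. This is my own illustration, not code from DrQA; the `segment` function name is hypothetical, and the fallback branch (per-character "tokens" with a dummy tag) only exists so the snippet runs even without Jieba installed. NER tags would still need a separate Chinese NER tool, since Jieba only provides POS-style flags.

```python
# Sketch: produce the space-separated token format described above.
# Assumes the third-party Jieba library (pip install jieba); falls back
# to per-character tokens with a dummy "x" tag when it is unavailable.
try:
    import jieba.posseg as pseg
    HAVE_JIEBA = True
except ImportError:
    HAVE_JIEBA = False


def segment(text):
    """Return (tokens, pos_tags) for a Chinese sentence.

    With Jieba, tokens are words and tags are Jieba's POS flags;
    without it, tokens are single characters (format illustration only).
    """
    if HAVE_JIEBA:
        pairs = [(word, flag) for word, flag in pseg.cut(text)]
    else:
        pairs = [(ch, "x") for ch in text if not ch.isspace()]
    tokens = [w for w, _ in pairs]
    tags = [t for _, t in pairs]
    return tokens, tags


if __name__ == "__main__":
    tokens, tags = segment("这是一个分词后的样例")
    print(" ".join(tokens))  # e.g. "这是 一个 分词 后 的 样例"
    print(" ".join(tags))
```

The space-joined token string is what would replace the raw sentence in the dataset; the same join applies to the tag sequence, keeping tokens and tags aligned one-to-one.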