kaihuchen opened this issue 6 years ago
Yes, only the modifications you mentioned should be needed. The trickiest part will be the Chinese spaCy models, which are not officially supported.
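For illustration (this snippet is mine, not from DrQA), here is a minimal sketch of what "not officially supported" means in practice: spaCy will happily build a blank Chinese pipeline, but with no pretrained Chinese model there is no tagger or NER component, so the annotations DrQA's preprocessing relies on come back empty.

```python
# A minimal illustration (my own snippet, not from DrQA) of the gap:
# spaCy can construct a blank Chinese pipeline, but without an official
# pretrained Chinese model there is no tagger or NER component.
# Note: on spaCy v2 the Chinese tokenizer also requires the jieba package.
import spacy

nlp = spacy.blank("zh")          # tokenizer only, no trained components
doc = nlp("这是一个分词后的样例")

print([t.text for t in doc])     # segmented tokens
print([t.pos_ for t in doc])     # all empty strings: no tagger loaded
print(doc.ents)                  # (): no NER loaded
```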
@hitvoice Much appreciated for the confirmation! One more question: given that Chinese does not have natural word boundaries, does it make any difference to DrQA whether a Chinese dataset is tokenized first (i.e., 分词, word segmentation, using a tool such as Jieba)? Or can I assume that, since spaCy does its own tokenization, I don't have to do anything special in this respect?
You should tokenize your Chinese data first. Prepare your data as "这是 一个 分词 后 的 样例" ("this is a tokenized example"; separate tokens by spaces) and provide the corresponding POS and NER tags. This is not a simple copy-and-paste job; a fair amount of work and modification is needed for Chinese support.
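As a hedged sketch of that preparation step, the following uses Jieba's `posseg` module to produce space-separated tokens plus a parallel list of POS tags. The `tokenize_zh` helper and the exact output format DrQA expects are assumptions on my part, and Jieba does not provide NER tags, so those would have to come from a separate tool.

```python
# A sketch of the preparation step using Jieba's posseg module. The
# tokenize_zh helper and the exact output format are assumptions; Jieba
# also has no NER, so NER tags would need a separate tool (e.g. a
# Chinese spaCy pipeline or HanLP).
import jieba.posseg as pseg

def tokenize_zh(text):
    """Segment Chinese text; return parallel lists of tokens and POS tags."""
    pairs = [(p.word, p.flag) for p in pseg.cut(text)]
    tokens = [w for w, _ in pairs]
    pos_tags = [f for _, f in pairs]
    return tokens, pos_tags

tokens, pos_tags = tokenize_zh("这是一个分词后的样例")
print(" ".join(tokens))    # e.g. "这是 一个 分词 后 的 样例"
print(" ".join(pos_tags))  # Jieba's POS tag set, e.g. "r m n f uj n"
```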
Is it expected that this code can be applied to a Chinese language dataset with only minor changes?
I understand that I will need to provide the following:
I would very much appreciate any insight into known reasons why this might not work.