What Chinese datasets are available for training decaNLP?

salesforce / decaNLP

The Natural Language Decathlon: A Multitask Challenge for NLP

BSD 3-Clause "New" or "Revised" License

What Chinese datasets are available for training decaNLP? #29

Closed: threefoldo closed this issue 6 years ago

threefoldo commented 6 years ago

It's a little difficult to find Chinese datasets suitable for training decaNLP. Right now, all I have is: (1) Douban movie reviews for sentiment analysis; (2) WebQA from Baidu. Are there any other datasets that can be used for training?

bmccann commented 6 years ago

Well, any data you can find should be easy enough to include, but I myself am not familiar with the landscape of Chinese datasets. We can keep this issue open for a while to see if anyone else watching might know of some good ones.
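For readers wondering what "including" a dataset might involve: decaNLP frames every task as question answering over a context, so a sentiment dataset can be cast as (context, question, answer) triples with a fixed question. The sketch below is only an illustration under assumptions. The function name `review_to_triple`, the dictionary keys, the wording of the question, and the 1/0 label encoding are all hypothetical, not decaNLP's actual loader format.

```python
def review_to_triple(review_text, label):
    """Cast one sentiment example as a (context, question, answer) triple.

    Assumptions (hypothetical, not taken from decaNLP itself):
    label is 1 for positive and 0 for negative, and the output keys
    are 'context', 'question', and 'answer'.
    """
    question = "这条影评是正面的还是负面的?"  # "Is this review positive or negative?"
    answer = "正面" if label == 1 else "负面"
    return {"context": review_text, "question": question, "answer": answer}


triple = review_to_triple("剧情紧凑,演员表现出色。", 1)
print(triple)
```

Keeping the answer words ("正面" and "负面") inside the question text mirrors the idea that the model can copy its answer from the question.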

dfenglei commented 6 years ago

The following is the paper for DuReader: a Chinese Machine Reading Comprehension Dataset from Real-world Applications: https://arxiv.org/abs/1711.05073

threefoldo commented 6 years ago

Thanks. I have already looked at this dataset. Among its 90k questions, most answers are long sentences rather than short phrases extracted from the input. Maybe it could be preprocessed somehow before being fed to decaNLP.
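One possible preprocessing step, sketched here under assumptions, is to keep only the examples whose answer is short and appears verbatim in the context. The function name `keep_short_extractive`, the keys `context`, `question`, and `answer`, and the 20-character threshold are hypothetical. DuReader's real schema is different and would need to be mapped into this shape first. Length is measured in characters because Chinese text is not whitespace-tokenized.

```python
import json


def keep_short_extractive(examples, max_answer_chars=20):
    """Keep examples whose answer is a short, verbatim span of the context.

    The keys and the default threshold are illustrative assumptions,
    not part of DuReader or decaNLP.
    """
    kept = []
    for ex in examples:
        answer = ex["answer"].strip()
        is_short = 0 < len(answer) <= max_answer_chars
        if is_short and answer in ex["context"]:
            kept.append({
                "context": ex["context"],
                "question": ex["question"],
                "answer": answer,
            })
    return kept


# Two toy examples (made up for illustration, not drawn from DuReader).
examples = [
    {"context": "北京是中华人民共和国的首都。",
     "question": "中国的首都是哪里?",
     "answer": "北京"},
    {"context": "感冒通常一周左右可以自愈。",
     "question": "感冒了怎么办?",
     "answer": "建议多喝水、多休息,必要时及时就医并遵医嘱服药。"},
]

filtered = keep_short_extractive(examples)
for ex in filtered:
    # ensure_ascii=False keeps the Chinese text readable in the output.
    print(json.dumps(ex, ensure_ascii=False))
```

A filter like this discards the long, free-form answers, so it trades dataset size for answers that look more like extractive spans. How much of the data survives would have to be checked on the real dataset.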