huggingface / datasets

🤗 The largest hub of ready-to-use datasets for ML models with fast, easy-to-use and efficient data manipulation tools
https://huggingface.co/docs/datasets
Apache License 2.0
19.31k stars 2.7k forks

add MLDoc dataset #517

Open jxmorris12 opened 4 years ago

jxmorris12 commented 4 years ago

Hi,

I'd like to suggest that someone add MLDoc, a multilingual news topic classification dataset.

The dataset appears to contain news stories in multiple languages, each of which can be classified into one of four hierarchical groups: CCAT (Corporate/Industrial), ECAT (Economics), GCAT (Government/Social), and MCAT (Markets). There are 13 languages: Dutch, French, German, Chinese, Japanese, Russian, Portuguese, Spanish, Latin American Spanish, Italian, Danish, Norwegian, and Swedish.
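
For anyone picking this up, here is a rough sketch of how the four topic codes listed above might be encoded as class ids in a loader. The names `MLDOC_LABELS`, `LABEL_TO_ID`, and `encode_label` are hypothetical, and the sorted-order id assignment is an assumption, not something taken from the MLDoc release.

```python
# Hypothetical label schema; codes and names come from the description above.
MLDOC_LABELS = {
    "CCAT": "Corporate/Industrial",
    "ECAT": "Economics",
    "GCAT": "Government/Social",
    "MCAT": "Markets",
}

# Assumption: integer ids are assigned in sorted order of the label codes.
LABEL_TO_ID = {code: idx for idx, code in enumerate(sorted(MLDOC_LABELS))}


def encode_label(code: str) -> int:
    """Map a topic code such as "GCAT" to its integer class id."""
    try:
        return LABEL_TO_ID[code.strip().upper()]
    except KeyError:
        # Fail loudly on codes outside the four known groups.
        raise ValueError(f"unknown MLDoc label: {code!r}") from None
```

Whatever ordering the real loader ends up using, it should be fixed once so the ids stay consistent across all the language splits.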

GuillemGSubies commented 3 years ago

Any updates on this?

albertvillanova commented 3 years ago

This request is still an open issue waiting to be addressed by any community member, @GuillemGSubies.