add MLDoc dataset

jxmorris12 commented 4 years ago

Hi,

I am recommending that someone add MLDoc, a multilingual news topic classification dataset.

Here's a link to the GitHub repo: https://github.com/facebookresearch/MLDoc
and the paper: http://www.lrec-conf.org/proceedings/lrec2018/pdf/658.pdf

It looks like the dataset contains news stories in multiple languages, each classified into one of four hierarchical groups: CCAT (Corporate/Industrial), ECAT (Economics), GCAT (Government/Social), and MCAT (Markets). There are 13 languages: Dutch, French, German, Chinese, Japanese, Russian, Portuguese, Spanish, Latin American Spanish, Italian, Danish, Norwegian, and Swedish.
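
For anyone picking this up, here is a minimal sketch of how the four labels might be encoded. The label codes and descriptions come from the list above; the integer ids, the `LABEL<TAB>text` line layout, and the names `MLDOC_LABELS`, `LABEL_TO_ID`, and `parse_line` are assumptions made for illustration and are not taken from the MLDoc repo.

```python
from typing import Tuple

# Label codes and descriptions as listed in the comment above.
MLDOC_LABELS = {
    "CCAT": "Corporate/Industrial",
    "ECAT": "Economics",
    "GCAT": "Government/Social",
    "MCAT": "Markets",
}

# Hypothetical integer ids, assigned in alphabetical order of the codes.
LABEL_TO_ID = {code: i for i, code in enumerate(sorted(MLDOC_LABELS))}


def parse_line(line: str) -> Tuple[int, str]:
    """Split one 'LABEL<TAB>text' line (assumed layout) into (label_id, text)."""
    code, text = line.rstrip("\n").split("\t", 1)
    return LABEL_TO_ID[code], text
```

An unknown label code raises a `KeyError`, which seems preferable to silently mapping it to some default class.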

GuillemGSubies commented 3 years ago

Any updates on this?

albertvillanova commented 3 years ago

This request is still an open issue waiting to be addressed by any community member, @GuillemGSubies.

huggingface / datasets

add MLDoc dataset #517