Question about RCV1 dataset

morningmoni / HiLAP

Code for paper "Hierarchical Text Classification with Reinforced Label Assignment" EMNLP 2019


Question about RCV1 dataset #8

Closed by erichen510 4 years ago

erichen510 commented 4 years ago

When I ran the experiment on RCV1, TextCNN reached results similar to those in your paper. The BERT model, however, shows a large gap: it is about 10% lower than the result reported in another unpublished paper.

For the file lyrl2004_tokens_train.dat.gz, do I need some additional preprocessing? The text in the file is unreadable.

Thanks~

morningmoni commented 4 years ago

lyrl2004_tokens_train.dat.gz looks like the processed file for RCV1. I think you can get the raw dataset here: https://trec.nist.gov/data/reuters/reuters.html
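
If it helps to inspect the processed file, here is a rough sketch of a reader. It assumes the file uses the `.I <doc id>` / `.W` / token-lines layout that the LYRL2004 token files are usually distributed in; I have not checked this against your copy. As far as I know, the tokens in that release are already stemmed with stop words removed, which would explain why the text looks unreadable. The function names and the sample tokens below are made up for illustration.

```python
import gzip
from typing import Iterable, Iterator, List, Tuple


def parse_smart_tokens(lines: Iterable[str]) -> Iterator[Tuple[int, List[str]]]:
    """Yield (doc_id, tokens) pairs from lines in the assumed .I/.W layout."""
    doc_id = None
    tokens: List[str] = []
    for raw in lines:
        line = raw.strip()
        if line.startswith(".I "):
            # A new document starts: emit the previous one, if any.
            if doc_id is not None:
                yield doc_id, tokens
            doc_id = int(line.split()[1])
            tokens = []
        elif line == ".W" or not line:
            # Skip the body marker and blank separator lines.
            continue
        elif doc_id is not None:
            tokens.extend(line.split())
    # Emit the last document in the stream.
    if doc_id is not None:
        yield doc_id, tokens


def read_token_file(path: str) -> Iterator[Tuple[int, List[str]]]:
    """Stream documents from a gzipped token file (hypothetical helper)."""
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as fh:
        yield from parse_smart_tokens(fh)


# Made-up sample in the assumed layout, not real RCV1 content.
sample = [
    ".I 1\n", ".W\n", "recov excit\n", "bring mexic\n", "\n",
    ".I 2\n", ".W\n", "stock market\n",
]
docs = list(parse_smart_tokens(sample))
```

If the tokens do turn out to be stemmed, feeding them to BERT's WordPiece tokenizer would probably hurt, since BERT was pretrained on natural text; that could be part of the gap, though this is only a guess.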

erichen510 commented 4 years ago

Thank you so much!