mchen24 / iclr2017

Doc2VecC from the paper "Efficient Vector Representation for Documents through Corruption"
Apache License 2.0

Document classification dataset for replication #6

koustuvsinha opened this issue 6 years ago (status: Open)

koustuvsinha commented 6 years ago

Hi, thanks for this amazing paper! I was wondering if you could provide the document classification subset you collected, with 300,000 documents and 100 categories. I would like to use the same dataset to replicate this paper as well as some baselines. Thanks!

mchen24 commented 6 years ago

Thanks for your interest. I don't have access to the preprocessed set at this moment, so you might need to wait for a week. If you can't wait, here's a thread discussing how to generate clean text from the Wikipedia data dump. It should be straightforward to run the code on the processed data.
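
In the meantime, a minimal sketch of one common way to stream plain text out of a Wikipedia dump, using gensim's `WikiCorpus`; the file names below are placeholders, and this is not necessarily the exact pipeline used for the paper's subset:

```python
# Minimal sketch: turn a Wikipedia XML dump into one plain-text article
# per line with gensim's WikiCorpus. File names here are placeholders.
from gensim.corpora import WikiCorpus

dump_path = "enwiki-latest-pages-articles.xml.bz2"  # placeholder dump file
out_path = "wiki_plain.txt"

# Passing dictionary={} skips vocabulary construction, which we don't
# need just to stream out text.
wiki = WikiCorpus(dump_path, dictionary={})

with open(out_path, "w", encoding="utf-8") as out:
    for i, tokens in enumerate(wiki.get_texts()):
        # get_texts() yields each article as a list of tokens
        # (str in recent gensim releases; bytes in some older ones).
        out.write(" ".join(tokens) + "\n")
        if (i + 1) % 10000 == 0:
            print(f"{i + 1:,} articles written")
```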

koustuvsinha commented 6 years ago

Thanks! If you can just provide the page names and their respective categories, I can extract and preprocess the text myself.
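
If only page titles and categories end up being shared, one option is pulling plain text per title from the MediaWiki API; a rough sketch, where the example title is hypothetical and not necessarily in the actual subset:

```python
# Rough sketch: fetch the plain-text extract of a Wikipedia page by
# title via the MediaWiki API (TextExtracts). Example title is
# hypothetical, not necessarily a page from the paper's subset.
import requests

API = "https://en.wikipedia.org/w/api.php"

def fetch_plain_text(title):
    """Return the plain-text extract of one page (the API limits
    whole-page extracts to a single page per request)."""
    params = {
        "action": "query",
        "prop": "extracts",
        "explaintext": 1,  # strip wiki markup, return plain text
        "titles": title,
        "format": "json",
    }
    pages = requests.get(API, params=params).json()["query"]["pages"]
    # 'pages' is keyed by page id and holds exactly one entry here.
    return next(iter(pages.values())).get("extract", "")

print(fetch_plain_text("Word embedding")[:200])
```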

mchen24 commented 6 years ago

Here's the processed wiki subset: http://www.cse.wustl.edu/~mchen/code/enwiki100.tar.gz
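
For anyone replicating this, a stdlib-only sketch for fetching and unpacking the archive; the output directory name is arbitrary:

```python
# Sketch: download and unpack the enwiki100 archive linked above,
# using only the standard library. Output directory name is arbitrary.
import tarfile
import urllib.request

url = "http://www.cse.wustl.edu/~mchen/code/enwiki100.tar.gz"
urllib.request.urlretrieve(url, "enwiki100.tar.gz")

with tarfile.open("enwiki100.tar.gz", "r:gz") as tar:
    tar.extractall("enwiki100")  # extracts into ./enwiki100/
```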