koustuvsinha opened this issue 6 years ago
Thanks for your interest. I don't have access to the preprocessed dataset at the moment; you might need to wait about a week. Here's a thread discussing how to generate clean text from a Wikipedia data dump if you can't wait. It should be straightforward to run the code on the processed data.
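For reference, something along these lines should work for the dump-processing route (a minimal sketch, not the exact code from that thread; it assumes gensim >= 4 is installed, and the dump filename and output path are placeholders):

```python
# Minimal sketch: extract plain text from a Wikipedia XML dump with gensim's
# WikiCorpus. Assumes gensim >= 4 (tokens come back as str) and that a dump
# such as enwiki-latest-pages-articles.xml.bz2 was downloaded beforehand.
from gensim.corpora.wikicorpus import WikiCorpus

DUMP_PATH = "enwiki-latest-pages-articles.xml.bz2"  # placeholder local path

# WikiCorpus strips wiki markup and tokenizes each article; passing
# dictionary={} skips the (slow) vocabulary-building pass.
wiki = WikiCorpus(DUMP_PATH, dictionary={})

with open("wiki_text.txt", "w", encoding="utf-8") as out:
    for i, tokens in enumerate(wiki.get_texts()):
        out.write(" ".join(tokens) + "\n")
        if i % 10000 == 0:
            print(f"processed {i} articles")
```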
On Wed, Dec 6, 2017 at 1:30 PM, Koustuv Sinha wrote:
Hi, thanks for this amazing paper! I was wondering if you could provide the document classification subset you collected, with 300,000 documents and 100 categories. I would like to use the same dataset to replicate this paper as well as some baselines. Thanks!
Thanks. If you can just provide the page names and their respective categories, I can extract and preprocess the text myself.
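For context, this is roughly what I have in mind for the extraction step (a minimal sketch against the public MediaWiki API, assuming requests is available; the example title is a placeholder, and rate limiting / retries are omitted for brevity):

```python
# Minimal sketch: fetch the plain-text extract of a Wikipedia page by title
# via the MediaWiki API. Given the list of page names and their categories,
# this loop would build the corpus.
import requests

API_URL = "https://en.wikipedia.org/w/api.php"

def fetch_plaintext(title):
    """Return the plain-text extract of one Wikipedia page, or None."""
    params = {
        "action": "query",
        "prop": "extracts",
        "explaintext": 1,
        "format": "json",
        "titles": title,
    }
    resp = requests.get(API_URL, params=params, timeout=30)
    resp.raise_for_status()
    pages = resp.json()["query"]["pages"]
    page = next(iter(pages.values()))
    return page.get("extract")

text = fetch_plaintext("Machine learning")  # placeholder title
print(text[:200] if text else "no extract returned")
```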
Here's the processed wiki subset: http://www.cse.wustl.edu/~mchen/code/enwiki100.tar.gz
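To grab and unpack it (a quick sketch; assumes the URL above is still live and that the archive fits on disk):

```python
# Download and extract the released subset.
import tarfile
import urllib.request

URL = "http://www.cse.wustl.edu/~mchen/code/enwiki100.tar.gz"
urllib.request.urlretrieve(URL, "enwiki100.tar.gz")

with tarfile.open("enwiki100.tar.gz", "r:gz") as tar:
    tar.extractall("enwiki100")  # archive layout is not documented here
```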