mchen24 / iclr2017

Doc2VecC from the paper "Efficient Vector Representation for Documents through Corruption"
Apache License 2.0

Document classification dataset for replication #6

koustuvsinha opened this issue 6 years ago (status: Open)

koustuvsinha commented 6 years ago

Hi, thanks for this amazing paper! I was wondering if you could provide the document classification subset you collected, with 300,000 documents and 100 categories. I would like to use the same dataset to replicate this paper as well as some baselines. Thanks!

mchen24 commented 6 years ago

Thanks for your interest. I don't have access to the preprocessed set at this moment, so you might need to wait for a week. If you can't wait, here's a thread discussing how to generate clean text from the Wikipedia data dump. It should be straightforward to run the code on the processed data.
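
In the meantime, a minimal sketch of one common way to stream plain text out of a Wikipedia dump, using gensim's `WikiCorpus`; the file names below are placeholders, and this is not necessarily the exact pipeline used for the paper's subset:

```python
# Minimal sketch: turn a Wikipedia XML dump into one plain-text article
# per line with gensim's WikiCorpus. File names here are placeholders.
from gensim.corpora import WikiCorpus

dump_path = "enwiki-latest-pages-articles.xml.bz2"  # placeholder dump file
out_path = "wiki_plain.txt"

# Passing dictionary={} skips vocabulary construction, which we don't
# need just to stream out text.
wiki = WikiCorpus(dump_path, dictionary={})

with open(out_path, "w", encoding="utf-8") as out:
    for i, tokens in enumerate(wiki.get_texts()):
        # get_texts() yields each article as a list of tokens
        # (str in recent gensim releases; bytes in some older ones).
        out.write(" ".join(tokens) + "\n")
        if (i + 1) % 10000 == 0:
            print(f"{i + 1:,} articles written")
```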

koustuvsinha commented 6 years ago

Thanks! If you can just provide the page names and their respective categories, I can extract and preprocess the text myself.
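
If only page titles and categories end up being shared, one option is pulling plain text per title from the MediaWiki API; a rough sketch, where the example title is hypothetical and not necessarily in the actual subset:

```python
# Rough sketch: fetch the plain-text extract of a Wikipedia page by
# title via the MediaWiki API (TextExtracts). Example title is
# hypothetical, not necessarily a page from the paper's subset.
import requests

API = "https://en.wikipedia.org/w/api.php"

def fetch_plain_text(title):
    """Return the plain-text extract of one page (the API limits
    whole-page extracts to a single page per request)."""
    params = {
        "action": "query",
        "prop": "extracts",
        "explaintext": 1,  # strip wiki markup, return plain text
        "titles": title,
        "format": "json",
    }
    pages = requests.get(API, params=params).json()["query"]["pages"]
    # 'pages' is keyed by page id and holds exactly one entry here.
    return next(iter(pages.values())).get("extract", "")

print(fetch_plain_text("Word embedding")[:200])
```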

mchen24 commented 6 years ago

Here's the processed wiki subset: http://www.cse.wustl.edu/~mchen/code/enwiki100.tar.gz
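
For anyone replicating this, a stdlib-only sketch for fetching and unpacking the archive; the output directory name is arbitrary:

```python
# Sketch: download and unpack the enwiki100 archive linked above,
# using only the standard library. Output directory name is arbitrary.
import tarfile
import urllib.request

url = "http://www.cse.wustl.edu/~mchen/code/enwiki100.tar.gz"
urllib.request.urlretrieve(url, "enwiki100.tar.gz")

with tarfile.open("enwiki100.tar.gz", "r:gz") as tar:
    tar.extractall("enwiki100")  # extracts into ./enwiki100/
```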