linanqiu / reddit-dataset

Dataset of threads and comments from reddit

Duplicates in each file #1

Open PBobovsky opened 6 years ago

PBobovsky commented 6 years ago

After a brief look at the files, it seems that each file contains only about 2500 distinct comments, each repeated roughly 50 times. I do not see this mentioned anywhere, and it is a huge dealbreaker if you try to use this dump for anything semi-serious.
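A quick way to check this kind of duplication is to count how often each comment text occurs. The sketch below is only an illustration: it assumes the files are CSVs with a header row, and the column name passed to `read_column` is a hypothetical placeholder, not something confirmed by the repository.

```python
import csv
from collections import Counter


def read_column(path, column):
    """Read one column from a CSV file with a header row.

    The column name is an assumption; adjust it to the actual file layout.
    """
    with open(path, newline="", encoding="utf-8") as f:
        return [row[column] for row in csv.DictReader(f)]


def duplication_summary(comments):
    """Return (total, unique, max_repeats) for an iterable of comment strings."""
    counts = Counter(comments)
    total = sum(counts.values())                   # number of rows seen
    unique = len(counts)                           # number of distinct comment texts
    max_repeats = max(counts.values(), default=0)  # most frequent comment's count
    return total, unique, max_repeats
```

If the observation above holds, a file would report a total of roughly 50 times its unique count.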

LoganBeaudoin commented 5 years ago

Attached: NoRepeats.zip. I wrote a Bash script to remove all the repeated comments. The script is inside the zip file, along with another script that removes all blank comments.
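For anyone who cannot open the attachment, the same two cleanup steps can be sketched in a few lines of Python. This is not the Bash script from the zip file, just a minimal illustration that assumes the comments have already been loaded as a list of strings; `drop_repeats_and_blanks` is a hypothetical helper name.

```python
def drop_repeats_and_blanks(comments):
    """Return comments with blank entries and repeated texts removed.

    Keeps the first occurrence of each comment and preserves input order.
    """
    seen = set()
    kept = []
    for text in comments:
        if not text.strip():
            continue  # skip empty or whitespace-only comments
        if text in seen:
            continue  # skip exact repeats of an earlier comment
        seen.add(text)
        kept.append(text)
    return kept
```

Note that this treats two comments as repeats only when their text matches exactly, so copies that differ in whitespace or metadata would survive.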