rfordatascience / tidytuesday

Official repo for the #tidytuesday project
Creative Commons Zero v1.0 Universal

Sci-hub download statistics #169

Open rahulvenugopal opened 4 years ago

rahulvenugopal commented 4 years ago

Dear Team,

Here are the first three years of download statistics data from Sci-Hub. It would be interesting to see which journals are downloaded most, how many open access papers get downloaded just because the interface is so easy and efficient, and a lot more. Data link: https://t.co/7LW7Xs3heB?amp=1

Thank you

peranti commented 4 years ago

Interesting dataset!

RMHogervorst commented 4 years ago

The dataset is quite big: 4 million rows. I've created a repo that shows the download and processing. Because of the file size, I've also created an Open Science Framework project (see https://osf.io/cphtu/) that contains the original .7z file, an RDS file, an FST file, and an SQLite dataset. I'm checking whether I can add a zipped CSV file to GitHub too. Maybe that is small enough.

RMHogervorst commented 4 years ago

It was not small enough: still 194 MB. I guess we can make the dataset smaller if we extract the DOIs and PDF links. I did not check, but they should be the same. We can replace the DOIs with a small identifier, which will bring the file size of the main file down. It would mean two datasets instead of one.
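
The two-dataset idea above can be sketched roughly as follows. This is only a minimal illustration in Python, not the processing code from the repo: the function name `split_dois`, the `(timestamp, doi)` row layout, and the sample rows are all assumptions made up for the example.

```python
def split_dois(rows):
    """Split download rows into a DOI lookup table and a compact main table.

    `rows` is an iterable of (timestamp, doi) tuples. This layout is an
    assumption for illustration, not the actual Sci-Hub file format.
    """
    doi_ids = {}   # DOI string -> small integer id
    compact = []   # (timestamp, doi_id) rows for the main table
    for timestamp, doi in rows:
        # Assign the next unused integer the first time a DOI is seen;
        # repeated DOIs reuse the id they already have.
        doi_id = doi_ids.setdefault(doi, len(doi_ids) + 1)
        compact.append((timestamp, doi_id))
    # Lookup table as (doi_id, doi) pairs, ordered by id.
    lookup = sorted((doi_id, doi) for doi, doi_id in doi_ids.items())
    return lookup, compact


# Made-up sample rows (hypothetical timestamps and DOIs).
sample_rows = [
    ("2015-09-01 00:00:01", "10.1000/a"),
    ("2015-09-01 00:00:02", "10.1000/b"),
    ("2015-09-01 00:00:03", "10.1000/a"),
]
lookup, compact = split_dois(sample_rows)
```

Each long DOI string is then stored once in `lookup`, and the main table carries only a small integer per row. Joining the two tables on the id recovers the original rows.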

RMHogervorst commented 4 years ago

You can enrich the dataset by using the rcrossref package (https://docs.ropensci.org/rcrossref) to retrieve journals, titles, etc. Maybe we can attempt to answer questions such as: are retrieved DOIs cited more?

RMHogervorst commented 4 years ago

Or use altmetrics (https://docs.ropensci.org/rAltmetric) to see what happens to these papers. A big issue is that we don't have the counterfactuals: articles not searched for on Sci-Hub in that period.