Open rahulvenugopal opened 4 years ago
Interesting dataset!
The datasets are quite big: 4 million rows. I've created a repo that shows the download and processing. Because of the file size, I've also created an Open Science Framework project that contains the original .7z file, an RDS file, an FST file and a SQLite database. See https://osf.io/cphtu/ I'm checking whether I can also add a zipped CSV file to GitHub. Maybe that is small enough.
It was not small enough: still 194 MB. I guess we can make the dataset smaller if we extract the DOIs and PDF links. I did not check, but they should be the same. We can replace the DOIs with a small identifier, which will bring the file size of the main file down. It would mean two datasets instead of one.
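A minimal sketch of the "small identifier" idea, in Python with only the standard library. The column names (`timestamp`, `doi`) and the sample DOIs are made up for illustration and are not taken from the actual file. Each distinct DOI gets an integer id on first sight, the main table keeps only the id, and a second lookup table maps ids back to DOIs.

```python
def split_dois(rows, doi_field="doi"):
    """Split rows into a slim main table and a DOI lookup table.

    `rows` is an iterable of dicts (for example from csv.DictReader).
    `doi_field` is an assumed column name, not the real header.
    """
    doi_to_id = {}
    main_rows = []
    for row in rows:
        doi = row[doi_field]
        # Assign the next integer id the first time a DOI is seen.
        doi_id = doi_to_id.setdefault(doi, len(doi_to_id) + 1)
        slim = {k: v for k, v in row.items() if k != doi_field}
        slim["doi_id"] = doi_id
        main_rows.append(slim)
    # One row per distinct DOI, in order of first appearance.
    lookup = [{"doi_id": i, doi_field: d} for d, i in doi_to_id.items()]
    return main_rows, lookup


# Hypothetical sample rows, only to show the shape of the result.
sample = [
    {"timestamp": "t1", "doi": "10.1000/a"},
    {"timestamp": "t2", "doi": "10.1000/b"},
    {"timestamp": "t3", "doi": "10.1000/a"},
]
main_rows, lookup = split_dois(sample)
```

How much this saves depends on how often DOIs repeat. If most DOIs appear only once, the lookup table is nearly as large as the original column.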
You can enrich the dataset by using the rcrossref package (https://docs.ropensci.org/rcrossref) to retrieve journals, titles, etc. Maybe we can attempt to answer questions such as: are retrieved DOIs cited more?
Or use altmetrics (https://docs.ropensci.org/rAltmetric) to see what happens to these papers. A big issue is that we don't have the counterfactuals: articles not searched for on Sci-Hub in that period.
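For the Crossref enrichment suggested above, here is a rough sketch in Python that calls the public Crossref REST API directly instead of going through rcrossref. The helper names are hypothetical. The field names (`container-title`, `title`, `is-referenced-by-count`) are the ones Crossref returns under `message` for a work record. Nothing is fetched unless `fetch_work` is called.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen


def crossref_work_url(doi):
    """Build the Crossref REST API URL for a single DOI."""
    return "https://api.crossref.org/works/" + quote(doi)


def summarize_work(message):
    """Pick journal, title and citation count out of a Crossref work record."""
    journals = message.get("container-title") or []
    titles = message.get("title") or []
    return {
        "journal": journals[0] if journals else None,
        "title": titles[0] if titles else None,
        "cited_by": message.get("is-referenced-by-count"),
    }


def fetch_work(doi):
    """Fetch one record over the network (not executed here)."""
    with urlopen(crossref_work_url(doi)) as resp:
        return summarize_work(json.load(resp)["message"])
```

With around 4 million rows, looking up each distinct DOI once and caching the result would be the first thing to do. Citation counts alone would still not settle the counterfactual problem mentioned above.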
Dear Team,
Here are the first three years of download statistics from Sci-Hub. It would be interesting to see which journals are downloaded most, how many open access papers get downloaded just because the interface is so easy and efficient, and a lot more. Data link: https://t.co/7LW7Xs3heB?amp=1
Thank you