greenelab / library-access

Collecting data on whether library access to scholarly literature

Initial set of query DOIs from State of OA study #1

Closed dhimmel closed 6 years ago

dhimmel commented 6 years ago

A good starting set of DOIs would be those from the State of OA study, since it covers ~300,000 DOIs and will allow us to compare library access to oaDOI and Sci-Hub access. I'll comment once I've generated this list.

dhimmel commented 6 years ago

@publicus see state-of-oa-dois.tsv.xz with 290,120 DOIs that were assessed in the State of OA study. The oadoi_color column indicates access status according to the oaDOI utility. For prototyping, you can work with a subset of these DOIs... but once we have info on all of them, that will be enough for several analyses.
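For prototyping in R, something like this minimal sketch should load the compressed file (base R reads an xz connection directly, so no extra packages are needed):

```r
# Load the xz-compressed TSV of State of OA DOIs.
dois <- read.delim(xzfile("state-of-oa-dois.tsv.xz"), stringsAsFactors = FALSE)

# Inspect the columns, including the oadoi_color column described above.
str(dois)
```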

jglev commented 6 years ago

Great, thank you!

I have two methods at this point that should work for resolving the DOIs against the University's holdings. The first works right now, but would be harder to adapt to other universities/libraries, and only allows one DOI to be resolved at a time. The second uses an OCLC API, so it could work for many more libraries and should also allow looking up multiple DOIs at once. However, that second method isn't working as its documentation describes, so I'm waiting to hear back from the OCLC Support team before deciding which method to use.

jglev commented 6 years ago

Also, @dhimmel, to confirm, from the state-of-oa dataset, you're suggesting just looking at the Closed (and possibly Bronze) DOIs, correct? (Of the 290,120 rows in the dataset, I see 182,804 that are Closed, and 226,690 that are either Closed or Bronze).
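(For reference, a quick R sketch of how those tallies could be reproduced, assuming the data frame from the sketch above; the lowercase category labels "closed" and "bronze" are an assumption, since the actual labels may be cased differently:)

```r
# Tally DOIs by oaDOI access status.
table(dois$oadoi_color)

# The counts mentioned above (label casing is an assumption).
sum(dois$oadoi_color == "closed")                 # Closed only
sum(dois$oadoi_color %in% c("closed", "bronze"))  # Closed or Bronze
```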

dhimmel commented 6 years ago

I'm waiting to hear back from the OCLC Support team before deciding which method to use.

Cool. You can always open Work In Progress pull requests before an analysis is complete if you'd like to start getting feedback.

you're suggesting just looking at the Closed (and possibly Bronze) DOIs, correct?

I was thinking we'd query all 290,120 and filter at analysis time if desired. We don't yet know how the two methods will perform on these various types of articles. My guess is that universities subscribe to subscription journals, which will include all articles that are closed, green, bronze, or hybrid. I agree that there is less point in querying gold articles, since libraries don't subscribe to those journals... but for the sake of comparison to the Sci-Hub analyses, let's query them anyway if possible.

tamunro commented 6 years ago

Great to see this underway already! I think the "state of OA" sample may not be ideal:

dhimmel commented 6 years ago

I think the "state of OA" sample may not be ideal

Note that there are three DOI collections from the State of OA study:

As @tamunro points out, each set has its strengths and weaknesses. The 290,120 DOIs represent all DOIs across the three collections that are also in our Crossref catalog of scholarly literature. The reason I selected the State of OA collections is that they will allow us to compare access with oaDOI and to benefit from the oaDOI calls. Furthermore, the comparison will fit well with the current manuscript updates in https://github.com/greenelab/scihub-manuscript/pull/19.

Wouldn't much smaller samples suffice for a pilot test at least?

Yes. If we decide on exactly what we want to evaluate, we can get by with less. But if possible, I like the strategy of collecting as much data as possible and then analyzing with the ability to immediately investigate new directions, because the data is there. Remember, we're doing this computationally, so the difference between 100 and 100,000 DOIs may be small. But yes, we can downsize as a fallback option.

jglev commented 6 years ago

I haven't heard back from OCLC support yet (though it's only been a couple of days, so I don't mean to criticize them here), so I wrote and tested a downloader script today in Bash, which seems straightforward for this case. It uses my first approach, which involves querying an OpenURL resolver linked directly to the UPenn Library's holdings records. Essentially, it's an automated way of using the "PennText Article Finder" linked from the Library's main catalog page.

This approach will work well for UPenn, but won't be directly transferable to other libraries, so I hope that we can eventually get the second approach (using OCLC's OpenURL resolver) working consistently.

(OCLC is an organization that many libraries hook their holdings information into; hence, that same API could be used by librarians at different institutions more easily.)
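To give a rough idea of the first approach, here is a hedged R sketch of one OpenURL-style lookup; the resolver base URL below is hypothetical, and the exact parameters a given resolver expects may differ:

```r
library(httr)

# Hypothetical OpenURL resolver endpoint; a real institution's resolver
# (e.g., the one behind PennText at UPenn) would go here.
resolver_base <- "https://resolver.example.edu/openurl"

# Look up a single DOI with OpenURL 1.0-style key/value pairs, pausing
# between requests to stay polite to the server.
lookup_doi <- function(doi) {
  response <- GET(
    resolver_base,
    query = list(
      url_ver = "Z39.88-2004",
      rft_id  = paste0("info:doi/", doi)
    )
  )
  Sys.sleep(1)  # roughly one request per second
  content(response, as = "text", encoding = "UTF-8")
}

xml_response <- lookup_doi("10.1234/example")  # hypothetical DOI
```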

@tamunro is correct that the two methods could be compared to each other, but I'm not sure whether that would be useful, as both get their information from the same data source.

I agree with @tamunro that a sample will suffice -- I also have working code in R for calculating a Bayesian Credible Interval for this, instead of / in addition to a normal Confidence Interval. (Although I think that with binary data ("We have access" / "We don't have access") and no other parameters, a Credible Interval and a Confidence Interval come out the same.)

Having said all that, I also agree with @dhimmel that there's no difference between 1,000 DOIs and 100,000, except for time. The current method can only query one DOI at a time -- for the 182,804 Closed DOIs, that's 2.12 days of downloading at one per second (182804/60/60/24), or just under one day at 3 per second. So I think I'll start with the Closed DOIs, then move up to the Bronze DOIs, then the other tiers. Does that sound reasonable to you? In any case, we should be able to get all of the DOIs in this dataset by the end of the week (290120/60/60/24 = 3.36 days of downloading).

I don't want to run the downloader over the weekend in case it causes any problems on the server side, and so will start it on Monday.

jglev commented 6 years ago

To clarify, the downloader gets an XML file for each DOI. The XML contains a key called full_text_indicator, which is either true or false.
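A minimal sketch of extracting that key from one downloaded file with the xml2 package (the element name comes from the description above; the real responses may nest or namespace it differently):

```r
library(xml2)

# Parse one saved response and pull out the full-text indicator.
# The XPath ignores namespaces, since the exact document structure
# of the real responses is an assumption here.
doc  <- read_xml("example-response.xml")
node <- xml_find_first(doc, "//*[local-name() = 'full_text_indicator']")

has_full_text <- xml_text(node) == "true"
has_full_text
```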

dhimmel commented 6 years ago

Does that sound reasonable to you?

Yeah totally. I'd recommend opening a work-in-progress PR that downloads a small number of DOIs (like 10), so you can get some review before you do the multi-day download.

jglev commented 6 years ago

I've created a work-in-progress PR (#2), following your suggestion, with ~200 example DOIs covered : )

jglev commented 6 years ago

Incidentally, if I've calculated it correctly (code is here), a Bayesian analysis (a Bernoulli likelihood, i.e. data-generating function, with a flat beta-distribution prior) of these 200 DOIs (which are just the first 200 "closed" ones -- they're not randomly sampled, unless the dataset already happened to be randomized) indicates that the rate of full-text access is somewhere between 76% and 86%, with a most-likely value of 82%. That 95% Credible Interval will narrow with more data.
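For concreteness, that interval can be reproduced with the conjugate Beta posterior; here is a sketch assuming 164 of the 200 DOIs came back accessible (a count implied by the 82% mode, not taken from the actual tallies):

```r
# Bernoulli likelihood with a flat Beta(1, 1) prior yields a Beta posterior.
successes <- 164              # assumed count, implied by the 82% mode
failures  <- 200 - successes

posterior_alpha <- 1 + successes
posterior_beta  <- 1 + failures

# Posterior mode (most-likely access rate) and 95% credible interval.
(posterior_alpha - 1) / (posterior_alpha + posterior_beta - 2)  # 0.82
qbeta(c(0.025, 0.975), posterior_alpha, posterior_beta)         # ~0.76 to ~0.87
```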

jglev commented 6 years ago

I'm fixing something in my Git repo, so the repo linked from the PR isn't present at the moment. I'll fix that momentarily.

jglev commented 6 years ago

A new (fixed, and more expansive) PR is now up as #7.