greenelab / library-access


Accuracy analysis of full_text_indicator calls #15

Closed dhimmel closed 6 years ago

dhimmel commented 6 years ago

We should manually review calls for DOIs to see the accuracy of the full_text_indicator calls. I'd suggest randomly selecting 100 DOIs where full_text_indicator=False and 100 DOIs where full_text_indicator=True. Then we can navigate to the DOI URL while on Penn's network and see if full text access is available.

@publicus what do you think?

dhimmel commented 6 years ago

And perhaps view the DOIs when outside of Penn's network to see whether the article is paywalled at all.

jglev commented 6 years ago

As in our conversation in PR 13, I'm in agreement with this. This way we can assess the rate of false negatives (as well as false positives, though I don't anticipate those being prevalent) in the API we used.

(This is also something that we in the Library want to know about, because it points to what may be an issue somewhere in one of our systems, as noted in the comment linked above, whether at the publisher metadata level or at our OpenURL resolver.)

Steps that I see for doing this:

  1. Write a script (I'd default to R here for anything data-analysis-related, because I'm faster writing it; is that ok with you?) that takes a sample of ~100 DOIs.
    1. I'm thinking that this should be a stratified random sample -- thus, for example, the number of bronze DOIs sampled would be proportional to the number of bronze DOIs in the dataset. Do you agree?
    2. The script would save the list of DOIs as a column in a CSV file, with a second, blank column for recording 1 if we manually have access, and 0 if we don't.
  2. Open that spreadsheet, and start filling it in, from on Penn's campus to start (I agree that off-campus is also useful, but starting with on-campus makes sense to me, as users would, I think, have to be on-campus, VPNed in, or otherwise authenticated through the library website to get access to these articles through our recommendation system, anyway).
  3. Write a script that compares the manual 1s and 0s to the automated downloader 1s and 0s, creating a column indicating a false negative, and a column indicating a false positive.
    1. Take the code that I wrote for PR #8 (which I think is no longer needed for that use), and adapt it to run two analyses: one where the DV (dependent variable) is false-negative status, and one where the DV is false-positive status. So we would be estimating a Credible Interval around the "true rate" of false negatives and then false positives in the dataset.
    2. The analysis could take one of two forms, as I see it:
      1. Run an analysis for each of the OA colors separately.
      2. More elegantly, create a random-intercepts multi-level model, where each OA color group is allowed to have its own intercept (see the sketch after this list). That regression equation would look like this:
        • The individual DOI level (for DOI i in color group j):
          FalseNegative_ij = beta_0j + error
        • The OA color level (for color group j): beta_0j = gamma_00 + error
  4. We could then say, e.g., "We saw a rate of X% of access for the bronze DOIs. Further manual analysis indicated a false negative rate of between Y% and Z%, though, so the rate of library access for bronze DOIs is likely closer to X+Y% to X+Z%."
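
To make step 3.2.2 concrete, here is a rough, untested sketch of how that random-intercepts model could be fit in R, for example with the brms package (the data frame manual_checks and its columns false_negative and oa_color are placeholder names, not anything in the repo yet):

# Untested sketch of the random-intercepts model described above.
# `manual_checks`, `false_negative` (0/1), and `oa_color` are placeholder names.
library(brms)

fit <- brm(
  false_negative ~ 1 + (1 | oa_color),  # overall intercept plus a per-OA-color intercept
  family = bernoulli(),                 # logit link by default, so estimates are on the log-odds scale
  data = manual_checks
)

summary(fit)  # credible interval for the overall intercept (gamma_00)
ranef(fit)    # per-color intercept deviations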

@dhimmel, what are your thoughts about all that? Are there any other coauthors who would like to look over this analysis plan?

Also, I know that you were hoping for an answer this week. I need to work to finish a project with a deadline most of the rest of this week; if I devoted Monday the 11th to this, would that work for you?

dhimmel commented 6 years ago

I'm thinking that this should be a stratified random sample -- thus, for example, the number of bronze DOIs sampled would be proportional to the number of bronze DOIs in the dataset. Do you agree?

No. I think it should be stratified by penn access status (100 articles where full_text_indicator is true and 100 articles where full_text_indicator is false). Since full_text_indicator and oaDOI color are not independent, stratifying by oaDOI color could inject bias. As an extreme example, imagine all gold articles had full_text_indicator equal to true. Therefore, the stratification would require picking a certain number of gold articles where full_text_indicator was false, although none actually exist.

Don't place too much emphasis on the oaDOI colors for this analysis. As we've discussed in https://github.com/greenelab/scihub-manuscript/issues/36, they are themselves an automated metric with imperfect quality.

Open that spreadsheet, and start filling it in, from on Penn's campus to start (I agree that off-campus is also useful, but starting with on-campus makes sense to me, as users would, I think, have to be on-campus, VPNed in, or otherwise authenticated through the library website to get access to these articles through our recommendation system, anyway).

Yes. I think we will eventually want both: access from within Penn's network, but also access from outside as a control. If all articles are freely available off campus, then Penn's subscriptions aren't providing crucial access. We should limit the off-campus search to the publisher's site... i.e., following the DOI, is the full text available?

Can we start with a PR to select the 200 DOIs? Make sure it's deterministic. Should be only a few lines in either R or Python.
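
For example, in R it could be something along these lines (untested sketch; the input path, output name, and column names are placeholders):

# Untested sketch: deterministic, stratified selection of 200 DOIs
# (100 with full_text_indicator TRUE, 100 with FALSE).
library(dplyr)
library(readr)

set.seed(2017)  # fixed seed so the same 200 DOIs are selected on every run

calls <- read_tsv("data/penntext_full_text_calls.tsv")  # placeholder path

sample_200 <- calls %>%
  group_by(full_text_indicator) %>%
  sample_n(100) %>%
  ungroup() %>%
  select(doi, full_text_indicator) %>%
  mutate(full_text_indicator_manual = NA)  # blank column to fill in by hand

write_tsv(sample_200, "manual-doi-checks.tsv")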

dhimmel commented 6 years ago

Therefore, the stratification would require picking a certain number of gold articles where full_text_indicator was false, although none actually exist.

If the stratification is within class (full_text_indicator status), I'm okay with it because it won't have the mentioned problem. Although I think it will probably end up being unnecessary.

jglev commented 6 years ago

The latter (stratification within full_text_status) is what I had in mind, to clarify : )

And yes; I'll get a PR up for deterministically sampling on Monday.

dhimmel commented 6 years ago

See https://github.com/greenelab/library-access/pull/18 / https://github.com/greenelab/library-access/commit/46a529c6b4dc1017dfc90172e5090e10086488e2 for the manual classifications by @publicus with on- and off-campus access for 200 DOIs stratified by PennText status.

I looked into the results in this notebook, since that was the easiest way for me to think of how we want to look at this data. I'll hold off PRing this notebook in case @publicus wants to expand it or take it in another direction.

Anyway, here's what I think we can deduce:

  1. All open access articles (available off campus) were available on campus. Good!
  2. 29 of the 100 articles where PennText indicated no access were actually open access. We have previously speculated that PennText was designed primarily for toll access content, so this is not a big surprise.
  3. For toll access articles where PennText indicated no access, it was wrong 25.4% of the time. In other words, Penn actually had subscription access unbeknownst to PennText.
  4. For toll access articles where PennText indicated access, it was wrong 10.7% of the time. In other words, Penn couldn't access something PennText indicated it could.

@publicus is this your interpretation as well? What do you think of these results?

jglev commented 6 years ago

Thank you for getting these percentages, @dhimmel!

Just a quick note to say that I'll be meeting with my supervisor, who's the Library's metadata architect, this afternoon, to talk through this. I'll then come back and write up my thoughts here.

jglev commented 6 years ago

I spoke with my supervisor in a 1-1 meeting yesterday (just before campus shut down because of the weather), and brought this up in our Library Technology Services team meeting today, and have formed some thoughts:

First, @dhimmel, thank you again for running those numbers. Your Jupyter notebook looks good and correct to me.

It's also useful to me to note that, overall, PennText was wrong 26.5% of the time (in R, I got this with the following):

library(readr)  # read_delim() comes from readr

dataset <- read_delim(
    "./evaluate_library_access_from_output_tsv/manual-doi-checks.tsv",
    "\t",
    escape_double = FALSE,
    trim_ws = TRUE
)

# Proportion of DOIs where PennText disagrees with the manual on-campus check
# (i.e., PennText is wrong in either direction):
length(which(dataset$full_text_indicator_automated != dataset$full_text_indicator_manual_inside_campus)) / nrow(dataset)
# [1] 0.265

So, this obviously is a problem, on several levels:

It could also be the case that, as we have speculated, the OpenURL resolver on which PennText is built is just looking at journal subscription dates, and not the OA status of any particular article. It may also be that there are configuration issues with the OpenURL resolver. Or the errors could be coming from some combination of the above.

I intend to take this question, of why there are errors in PennText's OpenURL resolver, and where they're coming from, and make a more in-depth study of it. That seems to me to be out of the scope of this manuscript, so I'm noting it here just to say that this finding is unexpected and needs follow-up within the Libraries.

For this project, I see us having two main options:

  1. Add the "margin of error" around the percentage output from our existing PennText-derived dataset.
  2. Going back to an original comment by @tamunro, just use this sample of 200 DOIs, and do an analysis from it.
    • We could also take the manual-DOI-check facilitator script I wrote for this issue, and use it to add to the sample (e.g., if we wanted to bring it up to 500).
    • We could take the code from PR #8 and actually end up using it to estimate, from this smaller sample, what the access rate would be if we had an accurate census of all of the DOIs -- whether by calculating a traditional 95% Confidence Interval or, as in #8, applying the same basic idea from a Bayesian approach to get a 95% Credible Interval (a quick sketch of the latter is below).
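
For the simple (non-hierarchical) version, a 95% Credible Interval for an access rate can be read straight off a Beta posterior. An untested sketch, with made-up counts and a flat Beta(1, 1) prior:

# Untested sketch: 95% credible interval for an access rate, using a flat
# Beta(1, 1) prior and a binomial likelihood. The counts below are made up.
n_accessible <- 130  # hypothetical number of manually checked DOIs with access
n_total      <- 200

posterior_ci <- qbeta(c(0.025, 0.975), 1 + n_accessible, 1 + n_total - n_accessible)
round(posterior_ci, 3)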

Of those two options, I think that the latter has more promise. Though I think it's unfortunate that, if we go that route, all that API querying wouldn't be used! :slightly_frowning_face:

I also recognize that, as project lead, you've got a timeline you'd like to stick to. So regardless of which way we go forward, since either option above requires some additional work, I think it would be useful for us to talk through your timeline expectations, so that I can feel confident that we're on the same page.

dhimmel commented 6 years ago

It's also useful to me to note that, overall, PennText was wrong 26.5% of the time

For this measure, I think you should average the accuracy on PennText true and false, weighted by the overall prevalence of PennText true and false. This would undo the effect of stratification. But I don't expect the accuracy will change much. I agree that the level of inaccuracy makes the PennText calls a poor estimator of Penn's access.
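
Roughly, something like this untested sketch (where `manual` is the 200-DOI check file, `all_calls` is the full table of PennText calls, and I'm assuming both share a full_text_indicator_automated column):

# Untested sketch of the reweighting: per-stratum accuracy from the manual
# checks, weighted by each stratum's prevalence among all queried DOIs.
# `manual` and `all_calls` are placeholder data frames.
library(dplyr)

stratum_accuracy <- manual %>%
  group_by(full_text_indicator_automated) %>%
  summarize(accuracy = mean(
    full_text_indicator_automated == full_text_indicator_manual_inside_campus
  ))

prevalence <- all_calls %>%
  count(full_text_indicator_automated) %>%
  mutate(weight = n / sum(n))

stratum_accuracy %>%
  inner_join(prevalence, by = "full_text_indicator_automated") %>%
  summarize(weighted_accuracy = sum(accuracy * weight))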

I think the best option at this point is to proceed with additional manual checks. Perhaps a total of 500 DOIs. These should be randomly selected from all State of OA DOIs (i.e. a random subset of the DOIs which we queried PennText for). We could transfer somewhere between 100 and 200 of the existing manual calls.

Time is of the essence. We'd like to resubmit the manuscript in the next couple weeks. So let's focus exclusively at this point on getting 500 manual calls. @publicus, how about I open a PR to select the 500 DOIs, and then you take it from there?

Though I think it's unfortunate that, if we go that route, all that API querying wouldn't be used!

I agree. Although we can still report the PennText findings and data, perhaps in the methods or a supplemental figure. The difficulty of ascertaining library access using the library systems is an interesting finding in and of itself.

jglev commented 6 years ago

I think the best option at this point is to proceed with additional manual checks. Perhaps a total of 500 DOIs. These should be randomly selected from all State of OA DOIs (i.e. a random subset of the DOIs which we queried PennText for). We could transfer somewhere between 100 and 200 of the existing manual calls.

I'm on board with this. What are your thoughts re: counting the existing manual calls? Since the sample was random, I'm fine with using all 200. Or, perhaps, taking a random sample that's not stratified by PennText status, then randomly taking from the 200 calls in proportion to the non-stratified sample's PennText statuses. Or, if that sounds undesirable, I'd be fine just taking a new random sample of 500, JOINing any manual calls that happen to be in that new sample, and going from there.
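
(For that last option, the JOIN would just be something like the untested sketch below, where new_sample and existing_manual are placeholder data frames keyed by DOI:)

# Untested sketch: carry over any existing manual calls into a fresh sample of 500.
# `new_sample` and `existing_manual` are placeholder data frames keyed by `doi`.
library(dplyr)

sample_500 <- new_sample %>%
  left_join(
    existing_manual %>% select(doi, full_text_indicator_manual_inside_campus),
    by = "doi"
  )
# DOIs without an existing manual call come back as NA and still need to be checked by hand.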

If this is still going toward Figure 8b, it seems to me that it could make sense to stratify the sample by origin dataset (Web of Science, Unpaywall, Crossref) -- what are your thoughts on that? (If you see similar issues as you raised with stratifying the initial sample, I'm fine not stratifying.)

Finally, so that we're on the same page going in, what are your thoughts on the Confidence Interval / Credible Interval approach? If we are going towards Figure 8b, it seems to me that we'd either need to run the CI estimation for each data source (i.e., 3 runs of the analysis), or else make a random-intercepts model where DOIs are nested within data source. The latter is probably cleaner; but the former is easier, and probably not as much of a philosophical problem if we use the Bayesian approach (since it's frequentist Confidence Intervals that carry the NHST issues of Type-1 error).

Time is of the essence. We'd like to resubmit the manuscript in the next couple weeks. So let's focus exclusively at this point on getting 500 manual calls. @publicus, how about I open a PR to select the 500 DOIs, and then you take it from there?

Noted. Please do open the PR. By "open," do you mean that you'll take the sample of 500? (I'm asking to confirm; I'm unsure from your comment).

dhimmel commented 6 years ago

Please do open the PR. By "open," do you mean that you'll take the sample of 500?

Yep see https://github.com/greenelab/library-access/pull/19. Was able to migrate 178 of the existing 200 calls.

Since the sample was random, I'm fine with using all 200

We cannot use all 200 because in a sample of 500 that matches the PennText proportion across all DOIs, there are only 78 PennText false DOIs. But luckily we can reuse most.

If this is still going toward Figure 8b, it seems to me that it could make sense to stratify the sample by origin dataset (Web of Science, Unpaywall, Crossref) -- what are your thoughts on that?

I shifted towards combining all the DOIs from Web of Science, Unpaywall, and Crossref in the main text. I don't think splitting it added enough to warrant the additional complexity. We can still have the split figures as supplements.

I was planning on replacing Figure 8A with cell 6 here and Figure 8B with cell 22 here.

The Penn portion of these plots we'd update to use the results from these 500 manual calls.

what are your thoughts on the Confidence Interval / Credible Interval approach

It could be nice to report intervals. This is less of a priority to me than getting the calls, since we can always add it at a later analysis step. I'd like to learn more about the credible intervals from you at some point.

jglev commented 6 years ago

Just a note to say that I'm in and out of meetings throughout the day, but that I am continuing to work through the DOIs between those meetings.

jglev commented 6 years ago

When off-campus, https://doi.org/10.2307/3795274 gives me an option to "Read this item online for free by registering for a MyJSTOR account." So I'm counting it as having access. This is a note so that I won't forget, for transparency.

dhimmel commented 6 years ago

Read this item online for free by registering for a MyJSTOR account

My inclination would be to consider login-walled content not open. Creating an account is somewhat prohibitive.

jglev commented 6 years ago

This is the only DOI so far that's said that, so it's not a problem to go back and change what I recorded for it. I don't have strong feelings about it -- but if the original question is "can a user access the full text of the paper?" then it seems consistent to me to say that this satisfies that criterion, even if it's toward the tail end of prohibitiveness within the distribution of things that count as "access." To me, this is similar to how an article released openly but under a CC BY-NC-ND license is also restrictive in what it allows users to do with it, but still provides the full text. (In that latter case, I would feel more strongly about counting it as access, and am reasoning from that analogy here.)

Do you feel strongly about counting it as inaccessible?

jglev commented 6 years ago

A quick note that the following DOIs are also of this type:

https://doi.org/10.1001/jama.1965.03080200059025 possibly is, as well, though it's hidden behind an extra click. In that case, it says "Create a free personal account to download free article PDFs, sign up for alerts, and more," which is not necessarily about this particular PDF (it implies but doesn't state that as directly as the examples above), so, at least for now, I've recorded it as not having access.

dhimmel commented 6 years ago

Do you feel strongly about counting it as inaccessible?

I think we should. There are some types of login walls that should count as no access -- for example, if the sign-up process is laborious, or requires users to identify themselves, agree to legal conditions, sign up for spam, and so on. Because it's difficult to draw a line, I think it'll be most straightforward to not sign up for any accounts in order to get access.

Did you actually sign up for access? I can imagine that these systems sometimes have flaws, so that even with an account you still couldn't access those articles.

jglev commented 6 years ago

Because it's difficult to draw a line, I think it'll be most straightforward to not sign up for any accounts in order to get access.

I hear this. Based on it, I'm fine with changing the values in question manually. I'm glad that we had the discussion about it, as it does seem to me to be a decision we needed to make actively.

I didn't sign up for access, so your note re: system flaws may also be correct.

tamunro commented 6 years ago

I agree with categorizing JSTOR as non-open. It clearly doesn't meet the original definition:

free availability on the public internet, permitting any users to read, download, copy, distribute, print, search, or link to the full texts of these articles, crawl them for indexing, pass them as data to software, or use them for any other lawful purpose, without financial, legal, or technical barriers other than those inseparable from gaining access to the internet itself.

All you get if you register is streaming access, so that's disqualifying already. I'm registered with JSTOR, and I've used it a few times. And flawed it sure is:

1) Crappy streaming: not like Readcube (crisp images, smooth scrolling, find and copy words), but low-res images only, one at a time. Reading a long paper that way would be torture.

2) You only get access to three articles at a time, and you have to keep them for two weeks before you can remove one and get another. So even if you classified those three articles as gratis, the corpus as a whole is still closed. Completely unworkable for a literature review.

I also registered for AAAS's delayed access, but I have never had a success. The article is always mysteriously one of the ones that's not available. JAMA's app sounds better, but I haven't tried it.

jglev commented 6 years ago

@tamunro, I appreciated reading your reasoning; thank you!

tamunro commented 6 years ago

No worries @publicus. I'd be happy to help with the count too, if I don't need a Penn login. As for the undercounting of access in PennText that you mentioned above: although that is no doubt a huge pain in the neck right now, great find! I'm glad you're looking into it. It's a vivid illustration of how borked the current system is, and a big addition to the larger project.

dhimmel commented 6 years ago

Closed by work in several PRs, specifically:

https://github.com/greenelab/library-access/pull/20
https://github.com/greenelab/library-access/pull/21
https://github.com/greenelab/library-access/pull/25