And perhaps view the DOIs when outside of Penn's network to see whether the article is paywalled at all.
As in our conversation in PR 13, I agree with this. This way, we can assess the rate of false negatives (as well as false positives, though I don't anticipate those being prevalent) in the API we used.
(This is also something that we in the Library want to know about, because it points to what may be an issue somewhere in one of our systems, as noted in the comment linked above, whether at the publisher metadata level or at our OpenURL resolver.)
Steps that I see for doing this:
1) For each sampled DOI, manually record a 1 if we manually have access, and a 0 if we don't.
2) Compare the manual 1s and 0s to the automated downloader 1s and 0s, creating a column indicating a false negative, and a column indicating a false positive.
3) Model the false-negative rate with a multilevel model (a sketch is below). At level 1 (DOI i): FalseNegative_i = beta_0j + error. At level 2 (j): beta_0j = gamma_00 + error.
4) Report the estimated false-negative rate with an interval, e.g., from X+Y% to X+Z%.
@dhimmel, what are your thoughts about all that? Are there any other coauthors who would like to look over this analysis plan?
Also, I know that you were hoping for an answer this week. I need to spend most of the rest of this week finishing a project with a deadline; if I devoted Monday the 11th to this, would that work for you?
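Here's a minimal sketch, in R with lme4, of the kind of random-intercepts model I have in mind for step 3 above. The data frame and column names (manual_checks, false_negative, group_j) are placeholders, with group_j standing in for whatever level-2 unit j we choose:

# Sketch only: random-intercepts logistic model for the false-negative indicator.
# `manual_checks` is a placeholder data frame with a 0/1 column `false_negative`
# and a grouping factor `group_j` (a placeholder for the level-2 unit j).
library(lme4)

fit <- glmer(
  false_negative ~ 1 + (1 | group_j),  # level 1: intercept only; level 2: random intercept per group
  data = manual_checks,
  family = binomial
)

# gamma_00 is the fixed intercept (overall log-odds of a false negative);
# convert it to a rate with the inverse logit:
plogis(fixef(fit)[["(Intercept)"]])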
I'm thinking that this should be a stratified random sample -- thus, the number of bronze DOIs sampled would be proportional to the number of bronze DOIs in the dataset, for example. Do you agree?
No. I think it should be stratified by Penn access status (100 articles where full_text_indicator is true and 100 articles where full_text_indicator is false). Since full_text_indicator and oaDOI color are not independent, stratifying by oaDOI color could inject bias. As an extreme example, imagine all gold articles had full_text_indicator equal to true. Therefore, the stratification would require picking a certain number of gold articles where full_text_indicator was false, although none actually exist.
Don't place too much emphasis on the oaDOI colors for this analysis. As we've discussed in https://github.com/greenelab/scihub-manuscript/issues/36, they are themselves an automated metric with imperfect quality.
Open that spreadsheet and start filling it in, starting from on Penn's campus (I agree that off-campus is also useful, but starting with on-campus makes sense to me, as users would, I think, have to be on campus, VPNed in, or otherwise authenticated through the library website to get access to these articles through our recommendation system anyway).
Yes. I think we will eventually want both: access from within Penn's network, but also access outside as a control. If all articles are freely available off campus, then Penn's subscriptions aren't providing crucial access. We should limit the off-campus search to the publisher's site... i.e., following the DOI, is the full text available?
Can we start with a PR to select the 200 DOIs? Make sure it's deterministic. Should be only a few lines in either R or Python.
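For example, something like this in R (a sketch only; the file paths and the assumption of a one-row-per-DOI table with a full_text_indicator column are placeholders, not the repository's actual layout):

library(dplyr)
library(readr)

# Placeholder input: one row per DOI with a logical full_text_indicator column.
library_access <- read_tsv("penntext_full_text_indicators.tsv")

set.seed(0)  # fixed seed so the selection is deterministic
sampled_dois <- library_access %>%
  group_by(full_text_indicator) %>%  # stratify by the PennText call
  sample_n(100) %>%                  # 100 DOIs per stratum -> 200 total
  ungroup()

write_tsv(sampled_dois, "sampled-dois.tsv")  # placeholder output path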
Therefore, the stratification would require picking a certain number of gold articles where full_text_indicator was false, although none actually exist.
If the stratification is within class (full_text_indicator status), I'm okay with it because it won't have the mentioned problem. Although I think it will probably end up being unnecessary.
The latter (stratification within full_text_status) is what I had in mind, to clarify : )
And yes; I'll get a PR up for deterministically sampling on Monday.
See https://github.com/greenelab/library-access/pull/18 / https://github.com/greenelab/library-access/commit/46a529c6b4dc1017dfc90172e5090e10086488e2 for the manual classifications by @publicus with on- and off-campus access for 200 DOIs stratified by PennText status.
I looked into the results in this notebook, since that was the easiest way for me to think through how we want to look at this data. I'll hold off PRing this notebook in case @publicus wants to expand it or take it in another direction.
Anyway, here's what I think we can deduce:
@publicus is this your interpretation as well? What do you think of these results?
Thank you for getting these percentages, @dhimmel!
Just a quick note to say that I'll be meeting with my supervisor, who's the Library's metadata architect, this afternoon, to talk through this. I'll then come back and write up my thoughts here.
I spoke with my supervisor in a 1-1 meeting yesterday (just before campus shut down because of the weather), and brought this up in our Library Technology Services team meeting today, and have formed some thoughts:
First, @dhimmel, thank you again for running those numbers. Your Jupyter notebook looks good and correct to me.
It's also useful to me to note that, overall, PennText was wrong 26.5% of the time (in R, I got this with the following):
library(readr)

dataset <- read_delim(
  "./evaluate_library_access_from_output_tsv/manual-doi-checks.tsv",
  "\t",
  escape_double = FALSE,
  trim_ws = TRUE
)

# Percentage of time that PennText is wrong (in either direction):
length(which(dataset$full_text_indicator_automated != dataset$full_text_indicator_manual_inside_campus)) / nrow(dataset)
# [1] 0.265
So, this obviously is a problem, on several levels:
It could also be the case that, as we have speculated, the OpenURL resolver on which PennText is built is just looking at journal subscription dates, and not the OA status of any particular article. It may also be that there are configuration issues with the OpenURL resolver. Or the errors could be coming from some combination of the above.
I intend to take this question, of why there are errors in PennText's OpenURL resolver, and where they're coming from, and make a more in-depth study of it. That seems to me to be out of the scope of this manuscript, so I'm noting it here just to say that this finding is unexpected and needs follow-up within the Libraries.
For this project, I see us having two main options:
Of those two options, I think that the latter has more promise. Though I think it's unfortunate that, if we go that route, all that API querying wouldn't be used! :slightly_frowning_face:
I also recognize that, as project lead, you've got a timeline you'd like to stick to. So regardless of which way we go forward, since either option above requires some additional work, I think it would be useful (and I would feel good about it) for us to talk through your timeline expectations, so that I can feel confident that we're on the same page.
It's also useful to me to note that, overall, PennText was wrong 26.5% of the time
For this measure, I think you should average the accuracy on PennText true and false, weighted by the overall prevalence of PennText true and false. This would undo the effect of stratification. But I don't expect the accuracy will change much. I agree that the level of inaccuracy makes the PennText calls a poor estimator of Penn's access.
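For concreteness, a sketch of that weighting (column names follow your R snippet above; penntext_all is a placeholder for the table of PennText calls across all DOIs we queried, not just the 200 sampled):

library(dplyr)

# Accuracy within each PennText stratum of the 200 manually checked DOIs
# (column names as in the R snippet above).
stratum_accuracy <- dataset %>%
  group_by(full_text_indicator_automated) %>%
  summarize(accuracy = mean(full_text_indicator_automated == full_text_indicator_manual_inside_campus))

# Prevalence of PennText true/false across all queried DOIs
# (`penntext_all` is a placeholder for that full table).
prevalence <- penntext_all %>%
  count(full_text_indicator) %>%
  mutate(weight = n / sum(n))

# Weighted average of the per-stratum accuracies, undoing the stratification.
stratum_accuracy %>%
  inner_join(prevalence, by = c("full_text_indicator_automated" = "full_text_indicator")) %>%
  summarize(weighted_accuracy = sum(accuracy * weight))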
I think the best option at this point is to proceed with additional manual checks. Perhaps a total of 500 DOIs. These should be randomly selected from all State of OA DOIs (i.e. a random subset of the DOIs which we queried PennText for). We could transfer somewhere between 100 and 200 of the existing manual calls.
Time is of the essence. We'd like to resubmit the manuscript in the next couple weeks. So let's focus exclusively at this point on getting 500 manual calls. @publicus, how about I open a PR to select the 500 DOIs, and then you take it from there?
Though I think it's unfortunate that, if we go that route, all that API querying wouldn't be used!
I agree. Although we can still report the PennText findings and data, perhaps in the methods or a supplemental figure. The difficulty of ascertaining library access using the library systems is an interesting finding in and of itself.
I think the best option at this point is to proceed with additional manual checks. Perhaps a total of 500 DOIs. These should be randomly selected from all State of OA DOIs (i.e. a random subset of the DOIs which we queried PennText for). We could transfer somewhere between 100 and 200 of the existing manual calls.
I'm on board with this. What are your thoughts re: counting the existing manual calls? Since the sample was random, I'm fine with using all 200. Or, perhaps, taking a random sample that's not stratified by PennText status, then randomly taking from the 200 calls in proportion to the non-stratified sample's PennText statuses. Or, if that sounds undesirable, I'd be fine just taking a new random sample of 500, JOINing any manual calls that happen to be in that new sample, and going from there.
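For that last option, the JOIN itself could look like the sketch below (new_sample is a placeholder data frame for the 500 newly sampled DOIs, and I'm assuming the existing manual-calls file has a doi column):

library(dplyr)
library(readr)

# Existing manual calls (path as in the earlier R snippet); assumed to have a `doi` column.
manual_calls <- read_tsv("./evaluate_library_access_from_output_tsv/manual-doi-checks.tsv")

# `new_sample` is a placeholder data frame with a `doi` column for the new 500-DOI sample.
# A left join keeps all 500 rows and carries over any existing manual call;
# rows left with NA still need to be checked by hand.
new_sample %>%
  left_join(
    select(manual_calls, doi, full_text_indicator_manual_inside_campus),
    by = "doi"
  )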
If this is still going toward Figure 8b, it seems to me that it could make sense to stratify the sample by origin dataset (Web of Science, Unpaywall, Crossref) -- what are your thoughts on that? (If you see issues similar to those you raised with stratifying the initial sample, I'm fine not stratifying.)
Finally, so that we're on the same page going in, what are your thoughts on the Confidence Interval / Credible Interval approach? If we are going towards Figure 8b, it seems to me that we'd either need to run the CI estimation for each data source (i.e., 3 runs of analysis), or else make a random intercepts model where DOIs are nested within data source. The latter is probably cleaner; but the former is easier, and probably not as much of a philosophical problem (because Confidence Intervals carry the NHST issues of Type-1 error) if we use the Bayesian approach.
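If we do go the Bayesian route for a single proportion, the simplest version I have in mind is a Beta posterior; a sketch, with a placeholder count:

# Sketch: equal-tailed 95% credible interval for an access rate, using a
# uniform Beta(1, 1) prior, so the posterior is Beta(k + 1, n - k + 1).
n <- 500  # planned number of manual calls
k <- 350  # placeholder count of DOIs with access -- not a real result

qbeta(c(0.025, 0.975), shape1 = k + 1, shape2 = n - k + 1)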
Time is of the essence. We'd like to resubmit the manuscript in the next couple weeks. So let's focus exclusively at this point on getting 500 manual calls. @publicus, how about I open a PR to select the 500 DOIs, and then you take it from there?
Noted. Please do open the PR. By "open," do you mean that you'll take the sample of 500? (I'm asking to confirm; I'm unsure from your comment).
Please do open the PR. By "open," do you mean that you'll take the sample of 500?
Yep, see https://github.com/greenelab/library-access/pull/19. I was able to migrate 178 of the existing 200 calls.
Since the sample was random, I'm fine with using all 200
We cannot use all 200 because in a sample of 500 that matches the PennText proportion across all DOIs, there are only 78 PennText false DOIs. But luckily we can reuse most.
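To spell out the arithmetic as I understand it: matching the overall PennText proportions, the 500-DOI sample has about 78 PennText-false and 422 PennText-true slots. So at most 78 of the existing 100 PennText-false calls can carry over, while all 100 existing PennText-true calls fit, giving 100 + 78 = 178 reusable calls, which matches the number migrated above.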
If this is still going toward Figure 8b, it seems to me that it could make sense to stratify the sample by origin dataset (Web of Science, Unpaywall, Crossref) -- what are your thoughts on that?
I shifted towards combining all the DOIs from Web of Science, Unpaywall, and Crossref in the main text. I don't think splitting it added enough to warrant the additional complexity. We can still have the split figures as supplements.
I was planning on replacing Figure 8A with cell 6 here and Figure 8B with cell 22 here.
The Penn portion of these plots we'd update to use the results from these 500 manual calls.
what are your thoughts on the Confidence Interval / Credible Interval approach
It could be nice to report intervals. This is less of a priority to me than getting the calls, since we can always add it at a later analysis step. I'd like to learn more about the credible intervals from you at some point.
Just a note to say that I'm in and out of meetings throughout the day, but that I am continuing to work through the DOIs between those meetings.
When off-campus, https://doi.org/10.2307/3795274 gives me an option to "Read this item online for free by registering for a MyJSTOR account." So I'm counting it as having access. This is a note so that I won't forget, for transparency.
Read this item online for free by registering for a MyJSTOR account
My inclination would be to consider login-walled content not open. Creating an account is somewhat prohibitive.
This is the only DOI so far that's said that, so it's not a problem to go back and change what I recorded for it. I don't have strong feelings about it -- but if the original question is "can a user access the full text of the paper?" then it seems consistent to me to say that this satisfies that criterion, even if it's toward the tail end of prohibitiveness within the distribution of things that count as "access." To me, this is similar to how an article released openly but under a CC BY-NC-ND license is also prohibitive in what it allows users to do with it, but still provides the full text (in that latter case, I would feel more strongly about counting it as access, and I'm reasoning from that analogy here).
Do you feel strongly about counting it as inaccessible?
A quick note that the following DOIs are also of this type:
https://doi.org/10.1001/jama.1965.03080200059025 possibly is, as well, though it's hidden behind an extra click. In that case, it says "Create a free personal account to download free article PDFs, sign up for alerts, and more," which is not necessarily about this particular PDF (it implies but doesn't state that as directly as the examples above), so, at least for now, I've recorded it as not having access.
Do you feel strongly about counting it as inaccessible?
I think we should. There are some types of login walls that should count as no access: for example, if the sign-up process is laborious, requires users to identify themselves, agree to legal conditions, sign up for spam, et cetera. Because of the difficulty of drawing a line, I think it'll be most straightforward to not sign up for any accounts in order to get access.
Did you actually sign up for access? I can imagine that sometimes these systems have flaws, so that even with an account, you still couldn't access those articles.
Because of the difficulty of drawing a line, I think it'll be most straightforward to not sign up for any accounts in order to get access.
I hear this. Based on it, I'm fine with changing the values in question manually. I'm glad that we had the discussion about it, as it does seem to me to be a decision we needed to make actively.
I didn't sign up for access, so your note re: system flaws may also be correct.
I agree with categorizing JSTOR as non-open. It clearly doesn't meet the original definition:
free availability on the public internet, permitting any users to read, download, copy, distribute, print, search, or link to the full texts of these articles, crawl them for indexing, pass them as data to software, or use them for any other lawful purpose, without financial, legal, or technical barriers other than those inseparable from gaining access to the internet itself.
All you get if you register is streaming access, so that's disqualifying already. I'm registered with JSTOR, and I've used it a few times. And flawed it sure is:
1) Crappy streaming: not like Readcube (crisp images, smooth scrolling, find and copy words), but low-res images only, one at a time. Reading a long paper that way would be torture.
2) You only get access to three articles at a time, and you have to keep them for two weeks before you can remove one and get another. So even if you classified those three articles as gratis, the corpus as a whole is still closed. Completely unworkable for a literature review.
I also registered for AAAS's delayed access, but I've never had any success. The article is always mysteriously one of the ones that's not available. JAMA's app sounds better, but I haven't tried it.
@tamunro, I appreciated reading your reasoning; thank you!
No worries @publicus. I'd be happy to help with the count too, if I don't need a Penn login. On the undercounting of access in PennText you mentioned above, although that is no doubt a huge pain in the neck now, great find! I'm glad you're looking into it. It's a vivid illustration of how borked the current system is, and a big addition to the larger project.
Closed by work in several PRs, specifically:
https://github.com/greenelab/library-access/pull/20
https://github.com/greenelab/library-access/pull/21
https://github.com/greenelab/library-access/pull/25
We should manually review calls for DOIs to see the accuracy of the full_text_indicator calls. I'd suggest randomly selecting 100 DOIs where full_text_indicator=False and 100 DOIs where full_text_indicator=True. Then we can navigate to the DOI URL while on Penn's network and see if full text access is available. @publicus what do you think?