Adding accuracy analysis results

jglev commented 6 years ago

This is an in-progress PR -- a branch to push the results of the manual DOI checks as I add them.

jglev commented 6 years ago

I started going through the DOI sample this morning, and quickly realized that there are a couple of questions that we should agree on before I keep going through them. Specifically:

How "user-experience-focused" should I be when doing this?
I've seen one DOI so far where, when I went to the page that the DOI resolved to (the publisher page), I was told I didn't have access. But when I Googled the article title, I found a ScienceDirect page for the article, which did give me access. (This was one where PennText told me that I did have access.)
In that case, it makes sense to me to say that we have legal access. But, that may be going a step further than some users would, since the DOI page itself seemed to say that we don't have access.
Since the underlying question here is "Do we have legal access to the article from on-/off-campus," my inclination is to do whatever it takes to legally find the article: if the DOI page doesn't have it, then Google it, or sign in through the publisher page if necessary (if my Penn affiliation isn't automatically noted). That applies at least to the "on-campus" round of checks; for the "off-campus" round, I wouldn't do any authentication. I had assumed that being on-campus would always automatically grant access, but one DOI so far has made me question that. I'm now thinking that not all publishers automatically detect where the request is coming from.
Just so we're on the same page, I'll only mark that we have access to a record if I can actually get a PDF or HTML full-text document. There was one DOI so far that had a "Get PDF" button that looked like it indicated that full-text was available, but then took me to a paywall :confused: Do you have any objections to that metric?

@dhimmel, do you have additional thoughts on these two questions? If not, I'll keep at this throughout the day.

dhimmel commented 6 years ago

I've seen one DOI so far where, when I went to the page that the DOI resolved to (the publisher page), I was told I didn't have access. But when I Googled the article title, I found a ScienceDirect page for the article, which did give me access. (This was one where PennText told me that I did have access.)

Interesting. What's the DOI? Does the publisher's page link to ScienceDirect? My first inclination is to require the access to have resulted from DOI resolution and then following any necessary links. However, I can see how compilations that Penn subscribes to, such as ScienceDirect and JSTOR, could cause problems here. Can we see how many DOIs have these situations and then make a more systematic decision?

I'll only mark that we have access to a record if I can actually get a PDF or HTML full-text document.

Agreed!

jglev commented 6 years ago

The DOI is 10.1017/s1357729800051109. However, looking more closely now, I realize that the ScienceDirect article that came up is by the same authors, with the same first phrase of the title, but is actually a different article! I was skimming yesterday, and didn't even notice it today until the third reading. So, never mind about that. : P

jglev commented 6 years ago

This is a progress note so that I won't forget: the DOI 10.1017/s1357729800051109, which resolves to Cambridge University Press, does not give full-text access -- when I try to log in via Shibboleth to UPenn, the CUP site says that the login has failed. The manual version of PennText (linked from the CUP page under "Get access" -> "Check library catalog") seems to indicate that there is full-text access to an earlier DOI in the sample from CUP, but I couldn't figure out how to access it after 5+ minutes of trying, so I think it's reasonable that I marked full text as not accessible and moved on.

I went back to our actual XML data from the PennText backend, using this query:

SELECT * FROM dois_table
JOIN library_holdings_data ON
dois_table.database_id = library_holdings_data.doi_foreign_key
WHERE dois_table.doi = '10.1017/s1357729800051109'

And the data for that DOI does (correctly) indicate that there is no full-text available.
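
For anyone repeating this lookup for other DOIs, here is a minimal sketch of the same query run from Python with a bound parameter instead of a pasted-in DOI. It assumes the data sit in a SQLite database, which this thread does not confirm. The table and join-key names come from the query above; the `full_text_indicator` column and the demo rows are hypothetical and exist only to make the example self-contained.

```python
import sqlite3

# Same join as the query above, with the DOI supplied as a bound parameter.
QUERY = """
SELECT *
FROM dois_table
JOIN library_holdings_data
  ON dois_table.database_id = library_holdings_data.doi_foreign_key
WHERE dois_table.doi = ?
"""


def holdings_for_doi(connection, doi):
    """Return every joined row for one DOI."""
    cursor = connection.execute(QUERY, (doi,))
    return cursor.fetchall()


def demo_connection():
    """Build a tiny in-memory database (hypothetical schema and rows)."""
    connection = sqlite3.connect(":memory:")
    connection.executescript(
        """
        CREATE TABLE dois_table (database_id INTEGER PRIMARY KEY, doi TEXT);
        CREATE TABLE library_holdings_data (
            doi_foreign_key INTEGER,
            full_text_indicator TEXT  -- hypothetical column
        );
        """
    )
    connection.execute(
        "INSERT INTO dois_table VALUES (1, '10.1017/s1357729800051109')"
    )
    connection.execute("INSERT INTO library_holdings_data VALUES (1, 'no')")
    return connection
```

Calling `holdings_for_doi(demo_connection(), "10.1017/s1357729800051109")` returns the single joined demo row. Binding the DOI also sidesteps quoting problems with DOIs that contain parentheses or other punctuation.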

This is all to say that I've only gotten through two dozen DOIs, and I'm already confused on occasion about what I as a user have access to.

jglev commented 6 years ago

(I made the above note to note that CUP login isn't working, which is either a problem with CUP's site, or one that we in the library should look into. And to note, anecdotally, that in several cases already, I spent several minutes trying to figure out whether I have access to an article.)

jglev commented 6 years ago

Another progress note: our automated access checker is looking for just electronic access, I think I can confirm: The XML response for doi https://doi.org/10.1016/0306-2619(90)90086-s (using the query in the comment above, but with this new DOI) indicates that there is not full-text access, while the manual version of PennText (which uses the same system as our automated checker) indicates that there is access through LIBRA, which is the Library's off-site physical storage center.

dhimmel commented 6 years ago

@publicus yes! it's not always straightforward.

My inclination is to count any login-walled articles as inaccessible. If you're on Penn's network and it wants you to do some sort of laborious login or account setup, then it should be no access... agree?

our automated access checker is looking for just electronic access

Great!

jglev commented 6 years ago

it's not always straightforward.

Indeed! Goodness.

My inclination is to count any login-walled articles as inaccessible. If you're on Penn's network and it wants you to do some sort of laborious login or account setup, then it should be no access... agree?

Following my comment yesterday, I am still of the somewhat-opposite opinion, qualified by being within reason. Since we're trying to verify the validity of PennText saying that we do/don't have legal access, I'm fine with authenticating (e.g., to go through a Shibboleth login page if required). What I've found so far, though (I'm 36 DOIs in), is that every time Penn's network hasn't automatically triggered access, trying to log in manually through Penn hasn't granted access either. So, I think that the initial access page is a good rule-of-thumb indicator, which I think is in line with your comment, yes? But I am still following the idea of making a "good faith" effort to authenticate if it's apparent how to. Does that all sound reasonable to you, as well?

As in my note yesterday, though, this all applies to the on-campus / in-network series of checks. For the off-campus run, I fully agree with you, and won't authenticate at all.

dhimmel commented 6 years ago

But I am still following the idea of making a "good faith" effort to authenticate if it's apparent how to.

I guess if it requests for you to authorize your PennKey (linked from the DOI landing page), that could still be access. It doesn't really make sense since you're inside of Penn's network. However, if it's like create a personal account... then verify that your account uses a Penn email, I think that's out of scope. Or if it requires some sort of login that requires a librarian... that's inaccessible. We concur?

jglev commented 6 years ago

Agreed, yes. : )
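
The rule agreed on above can be written down as a small decision function. This is only a sketch of how the discussion reads, not code from the repository. The `Barrier` names and the `counts_as_access` function are hypothetical labels introduced for illustration.

```python
from enum import Enum


class Barrier(Enum):
    """What stood between the DOI landing page and the full text (hypothetical labels)."""

    NONE = "none"
    PENNKEY_LOGIN = "pennkey_login"        # institutional login linked from the landing page
    PERSONAL_ACCOUNT = "personal_account"  # create an account, verify a Penn email, etc.
    LIBRARIAN_LOGIN = "librarian_login"    # credentials only a librarian would have


def counts_as_access(full_text_obtained, barrier, on_campus):
    """Apply the rule discussed in this thread to one manual check."""
    # Only an actual PDF or HTML full-text document counts.
    if not full_text_obtained:
        return False
    if barrier is Barrier.NONE:
        return True
    # A PennKey login can still count on campus; off campus, no authentication is attempted.
    if barrier is Barrier.PENNKEY_LOGIN:
        return on_campus
    # Personal accounts and librarian-only logins are out of scope.
    return False
```

For example, `counts_as_access(True, Barrier.PENNKEY_LOGIN, on_campus=True)` is `True`, while the same check with `on_campus=False` is `False`.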

jglev commented 6 years ago

Happy new year!

Ok, I've completed and pushed the manual checks for all 200 DOIs on-campus. I've also set aside several hours tomorrow to do an off-campus check.

For clarity: The on-campus check was done using my laptop, which was hard-wired (i.e., via ethernet) into the network in the Van Pelt library.

jglev commented 6 years ago

As you'll see, there are quite a few false negatives (PennText reports no access, but going to the DOI link allows access). So, I'm guessing that our openURL resolver is not always tracking open-access articles (vs. overall subscription date ranges, e.g.). This is something I'll bring up with our dev. team in the Libraries, as it's an area to improve our services.

On that note, I'm thinking now that internally, I do want to see what trends this shows across the open access "colors" (even if color assignment was imperfect). Given these data, once I do that for my own use, would you be interested to see it here, as well? I'm just checking, since we left it that we wouldn't look by color for this project, but this is also more false negatives than I expected.

There are also false positives, but they're (just eyeballing it) a much lower percentage of cases.

My guess is that if this is happening with the openURL resolver that we use at UPenn, it's likely happening at other institutions, as well; either because of the software stack that we're using, or because this points to what is potentially a major difficulty in tracking what one actually has access to.

jglev commented 6 years ago

There were two DOIs I marked as "invalid." One (https://doi.org/10.17816/jowd6265-11) didn't resolve, and the other (https://doi.org/10.3892/or.2012.2190) redirected to a publisher home page (https://www.spandidos-publications.com/).
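
The two "invalid" cases differ in kind: one DOI never resolved, and the other resolved but landed on a publisher home page rather than an article. Here is a rough sketch of how those outcomes could be told apart from the final HTTP status and URL. The function name and the three labels are hypothetical, and treating an empty URL path as a home page is a simplifying assumption.

```python
from urllib.parse import urlsplit


def classify_resolution(status_code, final_url):
    """Label the outcome of following a DOI link.

    status_code: HTTP status of the final response, or None if the request failed.
    final_url: URL reached after following all redirects.
    """
    if status_code is None or status_code >= 400:
        return "unresolved"
    # A bare site root (no path) is assumed to be a publisher home page.
    if urlsplit(final_url).path.strip("/") == "":
        return "publisher_home_page"
    return "article_page"
```

With this heuristic, a successful response ending at `https://www.spandidos-publications.com/` is labeled `publisher_home_page`, which matches the second case above.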

dhimmel commented 6 years ago

As you'll see, there are quite a few false negatives (PennText reports no access, but going to the DOI link allows access). So, I'm guessing that our openURL resolver is not always tracking open-access articles (vs. overall subscription date ranges, e.g.).

I agree. I think these articles will mostly be available off-campus as well (i.e. open access).

I'm guessing the PennText tool is most focused on tracking access to toll-access content, so it's not a huge surprise it's unaware that it, by default, has access to much OA literature.

I'm just checking, since we left it that we wouldn't look by color for this project, but this is also more false negatives than I expected.

I'm interested in off-campus access as the control here. Let's focus the work in this repo on that definition of open access (rather than oaDOI colors). If you do look by oaDOI color, feel free to share a link here. I'd be interested to take a look, but would like to keep it separate for efficiency.

jglev commented 6 years ago

I'm guessing the PennText tool is most focused on tracking access to toll-access content

Agreed, exactly.

If you do look by oaDOI color, feel free to share a link here. I'd be interested to take a look, but would like to keep it separate for efficiency.

Sure! And to confirm, I'll work tomorrow to finish the off-campus checks before diving into any of the color-based analyses I mentioned.

jglev commented 6 years ago

For transparency: In order to achieve off-campus access, rather than actually leaving campus, I'm using sshuttle to create a transparent proxy, tunneled over SSH, back to my house (which is not on campus). I have confirmed that this works using DOI 10.1111/j.1550-7408.1962.tb02648.x: When the proxy is not engaged (and I am thus on the campus network), I have access to the article. When the proxy is engaged, I no longer have access to the article. Further, I've confirmed that my IP address is changing to my home IP when the proxy is engaged, by visiting http://ipv4.icanhazip.com/ in my browser.
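
The IP check described above was done by hand in a browser. A scripted version might look like the sketch below. The URL is the one mentioned above; the function names and the injectable `fetch` argument are hypothetical, added so the logic can be exercised without a network connection.

```python
import ipaddress
from urllib.request import urlopen

IP_ECHO_URL = "http://ipv4.icanhazip.com/"


def current_public_ip(fetch=None):
    """Return the public IPv4 address reported by the echo service.

    fetch: optional zero-argument callable returning the response text;
    by default the echo service is queried over HTTP.
    """
    if fetch is None:
        def fetch():
            with urlopen(IP_ECHO_URL, timeout=10) as response:
                return response.read().decode("ascii")
    # The service is expected to answer with the address followed by a newline.
    return ipaddress.ip_address(fetch().strip())


def proxy_changed_ip(ip_without_proxy, ip_with_proxy):
    """True if the public address differs once the proxy is engaged."""
    return ip_without_proxy != ip_with_proxy
```

The idea is to call `current_public_ip()` once before and once after engaging the proxy, then compare the two results with `proxy_changed_ip`.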

jglev commented 6 years ago

Ok, all DOIs are checked both on- and off-campus!

In the off-campus check, the DOIs I marked as "invalid" differed from the on-campus check: 10.17816/jowd6265-11 worked this time (it didn't in the on-campus check), but 10.3934/dcds.2016103 gave a server error.

jglev commented 6 years ago

So, concretely, what do we now want from these data? Does this look accurate to you?

Using the on-campus check, percentage of false-negatives from PennText
Using the on-campus check, percentage of false-positives from PennText
Comparing the on- and off-campus checks, percentage where on-campus had access but off-campus didn't
Comparing the on- and off-campus checks, percentage where on-campus did not have access but off-campus did
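
To make the four quantities concrete, here is a minimal sketch of how they could be computed. It assumes each checked DOI is a record with three booleans (`penntext`, `on_campus`, `off_campus`); those field names are hypothetical, not the column names in the pushed data. It also assumes each percentage is taken over all checked DOIs, which is one possible choice of denominator.

```python
def percentage(count, total):
    """Percentage of total, or 0.0 when there are no records."""
    return 100.0 * count / total if total else 0.0


def summarize(records):
    """Compute the four percentages listed above from per-DOI booleans."""
    total = len(records)
    false_negatives = sum(1 for r in records if not r["penntext"] and r["on_campus"])
    false_positives = sum(1 for r in records if r["penntext"] and not r["on_campus"])
    on_only = sum(1 for r in records if r["on_campus"] and not r["off_campus"])
    off_only = sum(1 for r in records if not r["on_campus"] and r["off_campus"])
    return {
        "penntext_false_negative_pct": percentage(false_negatives, total),
        "penntext_false_positive_pct": percentage(false_positives, total),
        "on_campus_only_pct": percentage(on_only, total),
        "off_campus_only_pct": percentage(off_only, total),
    }
```

Records marked "invalid" would need to be excluded or handled separately before calling `summarize`; that decision is left open here.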

jglev commented 6 years ago

It's also worth noting, I think, that some of the publisher pages are very confusing, to the point that I felt it necessary just now to redo the on-campus checks I'd completed before the New Year, because I've learned a lot in the last two days from looking through the publisher pages more. Wiley, for example, often shows a loading page after a user clicks the "Download PDF" button, before then redirecting to a paywall page.

dhimmel commented 6 years ago

So, concretely, what do we now want from these data? Does this look accurate to you?

Let's merge this PR first and calculate stats in a future PR.

The counts of the following categories should be sufficient (from https://github.com/greenelab/library-access/pull/17#discussion_r156785257):

Available on campus, not available off campus
Available on campus, available off campus
Not available on campus, not available off campus
Not available on campus, available off campus

This will be a sort of confusion matrix, but let's avoid that confusing terminology as well as TP/FP if we can :smile:
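
As a rough illustration of those four counts, the sketch below tallies (on-campus, off-campus) results with plain-language labels rather than TP/FP terms. It assumes each DOI has already been reduced to a pair of booleans; the function names are hypothetical.

```python
from collections import Counter


def access_category(on_campus, off_campus):
    """Return the plain-language category for one DOI."""
    on_label = "Available on campus" if on_campus else "Not available on campus"
    off_label = "available off campus" if off_campus else "not available off campus"
    return f"{on_label}, {off_label}"


def count_categories(pairs):
    """Tally (on_campus, off_campus) boolean pairs into the four categories."""
    return Counter(access_category(on, off) for on, off in pairs)
```

A `Counter` returns 0 for any category that never occurs, so all four counts can be reported even when one is empty.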

jglev commented 6 years ago

At this point, I think I've answered all outstanding comments for this PR. Does that seem correct to you, as well?

dhimmel commented 6 years ago

At this point, I think I've answered all outstanding comments for this PR. Does that seem correct to you

No: DOI 10.3892/or.2012.2190 is still marked as invalid. See https://github.com/greenelab/library-access/pull/18#discussion_r159744278... let's assess access for this article using the URL https://www.spandidos-publications.com/10.3892/or.2012.2190

jglev commented 6 years ago

On- and off-campus checks for DOI 10.3892/or.2012.2190 are now updated, as of 8f5871e16.

greenelab / library-access

Adding accuracy analysis results #18