Add dataset created from processing State of OA DOIs

jglev commented 6 years ago

This is an in-progress PR, which will eventually contain the dataset I'm downloading and processing from the State of OA DOIs list.

jglev commented 6 years ago

As of this writing, I've queried just over 106,000 DOIs. The full dataset has ~290,000, as I remember. I currently plan to leave the downloader running over the weekend.

jglev commented 6 years ago

I've committed the dataset (containing all 290,120 rows), as well as an RMarkdown file for generating a markdown-formatted table that can presumably be worked into your code for the figure you mentioned. How does this look to you?

Here's the current table output from that RMarkdown file:

oa_doi_color	no_access_percent	yes_access_percent	yes_access_rate	oa_color_total
bronze	17.79	82.21	36077	43886
closed	17.43	82.57	150934	182804
gold	3.37	96.63	22018	22786
green	9.78	90.22	23441	25981
hybrid	15.45	84.55	12397	14663

jglev commented 6 years ago

To make sure the table's column names make sense: for the OA "Bronze" level:

There were 43,886 DOIs in that category.
Of those 43,886, UPenn has access to 36,077.
That means that UPenn has access to 36,077 / 43,886 = 82.21% of the DOIs queried in that category.
It also means that UPenn does not have access to 100 - 82.21 = 17.79% of the DOIs queried in that category.
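
The same arithmetic can be checked for every row of the table. This is a minimal sketch, not the RMarkdown code from the PR: the counts are copied by hand from the table above, and the `counts` dictionary and the `access_percentages` helper are names made up for illustration.

```python
# Counts copied from the table above: (DOIs with access, total DOIs) per OA color.
counts = {
    "bronze": (36077, 43886),
    "closed": (150934, 182804),
    "gold": (22018, 22786),
    "green": (23441, 25981),
    "hybrid": (12397, 14663),
}


def access_percentages(with_access, total):
    """Return (no_access_percent, yes_access_percent), rounded to 2 places."""
    yes_percent = 100 * with_access / total
    return round(100 - yes_percent, 2), round(yes_percent, 2)


# Print one tab-separated row per OA color, in the table's column order.
for color, (with_access, total) in counts.items():
    no_percent, yes_percent = access_percentages(with_access, total)
    print(f"{color}\t{no_percent}\t{yes_percent}\t{with_access}\t{total}")
```

The five category totals also sum to 290,120, which matches the number of rows in the committed dataset.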

jglev commented 6 years ago

Ha, and just to double-check, I just now ran the RMarkdown file, and then ran length(which(original_dataset_with_oa_color_column$oadoi_color == "bronze")), to check from the original dataset itself that the number of Bronze DOIs is 43,886. To confirm, it is. : )

dhimmel commented 6 years ago

Great! Excited for these access calls and incorporating them into the Sci-Hub Manuscript.

data/library_coverage_xml_and_fulltext_indicators.tsv.xz still isn't tracked with LFS. Perhaps stop tracking it and then re-add it.

I did confirm that it contained 290121 lines:

curl --location --silent \
  https://github.com/publicus/library-access/raw/5f04fbcacbef4cefcba41c79b23d58294afc6b72/data/library_coverage_xml_and_fulltext_indicators.tsv.xz \
   | xzcat | wc --lines

So that's good.
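
A count of 290,121 lines would be consistent with one header line plus the 290,120 rows mentioned earlier, although that breakdown is an assumption here. The same check can be done from Python on a local copy of the file. The sketch below uses only the standard-library `lzma` module. `count_lines_xz` is a made-up helper name, and the demo file with its column names is a hypothetical stand-in for the real dataset.

```python
import lzma
import os
import tempfile


def count_lines_xz(path):
    """Count the lines in an xz-compressed text file without unpacking it to disk."""
    with lzma.open(path, "rt", encoding="utf-8") as handle:
        return sum(1 for _ in handle)


# Build a tiny stand-in file (hypothetical header and rows) and count its lines.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.tsv.xz")
    with lzma.open(demo_path, "wt", encoding="utf-8") as handle:
        handle.write("doi\tindicator\n")          # header line
        handle.write("10.0000/example-1\t1\n")    # placeholder row
        handle.write("10.0000/example-2\t0\n")    # placeholder row
    demo_count = count_lines_xz(demo_path)

print(demo_count)  # header + 2 rows
```

For the real file, one would pass the path of the downloaded `.tsv.xz` to `count_lines_xz` and subtract one for the header, if there is one.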

There's still this problematic line in .gitignore:

./library_coverage_xml_and_fulltext_indicators.db*

Can you change it to

data/library_coverage_xml_and_fulltext_indicators.db

jglev commented 6 years ago

I've made a new commit to remove and re-add data/library_coverage_xml_and_fulltext_indicators.tsv.xz. Has the tracking of that file been solved in f2d98d9? I'm having trouble telling.

Re: the line in .gitignore, removing the wildcard will cause git to prompt users to add library_coverage_xml_and_fulltext_indicators.db-shm and library_coverage_xml_and_fulltext_indicators.db-wal, which are created whenever the database is opened (because write-ahead logging is turned on). That seems undesirable to me -- does it seem desirable to you, though?

I've been thinking more about what the table I posted above is telling us, and have a few thoughts to discuss / figure out together, so I'll type those up next...

dhimmel commented 6 years ago

Both XZ files are now tracked with LFS. See "Git LFS file not shown" under Files Changed.

It's wrong (although possible) to track a file that's ignored. How about:

data/library_coverage_xml_and_fulltext_indicators.db*
!data/library_coverage_xml_and_fulltext_indicators.db.xz

I think that should track the XZ file and ignore the others (see https://git-scm.com/docs/gitignore)
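
To illustrate how the two proposed lines interact, here is a deliberately simplified model of the "last matching pattern wins, `!` re-includes" rule from the gitignore documentation. It is not git's implementation: it uses Python's `fnmatch` and ignores directory rules, anchoring, and `**`. `is_ignored` is a made-up name. On a real checkout, `git check-ignore -v <path>` gives the authoritative answer.

```python
from fnmatch import fnmatchcase


def is_ignored(path, patterns):
    """Simplified model: later patterns override earlier ones; '!' negates."""
    ignored = False
    for pattern in patterns:
        negated = pattern.startswith("!")
        glob = pattern[1:] if negated else pattern
        if fnmatchcase(path, glob):
            ignored = not negated
    return ignored


# The two .gitignore lines proposed above, in order.
patterns = [
    "data/library_coverage_xml_and_fulltext_indicators.db*",
    "!data/library_coverage_xml_and_fulltext_indicators.db.xz",
]

base = "data/library_coverage_xml_and_fulltext_indicators.db"
for suffix in ("", "-shm", "-wal", ".xz"):
    print(base + suffix, "ignored" if is_ignored(base + suffix, patterns) else "not ignored")
```

Under this model the database and its `-shm` and `-wal` companions are ignored, while the `.db.xz` file is not. The order matters: with the two lines swapped, the wildcard would match last and the `.xz` file would be ignored again.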

jglev commented 6 years ago

I spoke with my supervisor this afternoon about the table I posted above, and we came out of our conversation with several questions about the data, and what to draw from them. From our conversation, there are two big points that I think are important to note:

The table shows how much the Library's catalog says users have access to, which is not necessarily the same thing as what users do have access to.

As an example: Our results indicate that the Library's system would tell users that they have access to 82.21% of the "bronze" DOIs -- but by definition, all bronze DOIs should be available to users, since they're openly accessible through the publisher's website. (A similar point applies to gold and green DOIs.)

We can take an example DOI from that remaining bronze 17.79%, 10.1002/2013JD021255. If we go directly in a web browser to doi.org/10.1002/2013JD021255, we get the publisher's webpage for the article, which does have full-text (at least from my system as I write this, on Penn's campus). If a user goes to the Library's search tool, though (click here, then click on "Penn Text Article Finder" at the bottom of the page), and enters 10.1002/2013JD021255 in the DOI field, she'll get this page, which does not reflect that full-text access.

The process of resolving a DOI, comparing it to a list of journal subscriptions, and then figuring out whether full text is available is complicated, and could break down at any of several steps, including:

Something wrong with the metadata the publisher supplies about the article
The metadata from the publisher was correct, but isn't now (e.g., with bronze DOIs, the DOI may have been free in the past, but the publisher has since locked it down).
Something wrong with the services used by the intermediary the Library uses to resolve DOIs.
Something wrong with Penn Text Search itself.

This is all to say that it's not yet apparent where that 17.79% disconnect comes from. It could also be the case that some of the DOIs themselves don't resolve (as an additional issue alongside those enumerated above).

Similarly, the "green" row of the table shows the percentage of DOIs that the Library has access to through the publisher website.

This is a smaller point to note: it differs slightly from how the State of OA authors (page 6) define "green": "Toll-access on the publisher page, but there is a free copy in an OA repository."

So, what to take from this:

I think there are two main points to keep in mind as we incorporate this into the manuscript:

Rhetorically, the emphasis here should be more on the experience of the user than on the Library's access itself. In cases like the DOI above, the user does have legal access to the article, but the Library's system tells her that there is no access. That could have implications for users seeking that DOI from alternative sources, including Sci-Hub.
With the system that we queried (which is what the Penn Text Search frontend uses, hence point 1 above), as with any system, there is going to be some rate of false negatives (and maybe even false positives).
One way that I could look into this point is by taking a couple of hours, taking a small sample of DOIs from each category (e.g., a few dozen), manually resolving the DOI, and recording whether a user has access. Then, I could use the rate estimator that I wrote for PR #8 to get an interval around the rate of false negatives. I'd be willing to do this -- it seems useful for clarifying what these data actually tell us.
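
As a rough illustration of what an interval around a false-negative rate could look like, the sketch below computes a Wilson score interval for a binomial proportion. This is a swapped-in standard method, not the rate estimator from PR #8, whose details are not shown in this thread. The function name `wilson_interval` and the sample numbers are hypothetical.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 <= successes <= n:
        raise ValueError("successes must be between 0 and n")
    p_hat = successes / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


# Hypothetical sample: 36 DOIs flagged "no access" were checked by hand,
# and 5 of them turned out to be accessible (i.e., false negatives).
low, high = wilson_interval(5, 36)
print(f"false-negative rate: {5 / 36:.3f}, interval: ({low:.3f}, {high:.3f})")
```

With only a few dozen DOIs per category the interval will be wide, so this mainly helps decide whether a larger manual sample is worth the time.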

In any case, these seem like things to note explicitly in the write-up wherever these data get incorporated.

Does this all make sense as I'm writing it here? @dhimmel, are there thoughts that you have around this?

jglev commented 6 years ago

Oh, I see a place where we may have been talking past each other: ./library_coverage_xml_and_fulltext_indicators.db* (i.e., in the top-level directory of this repo, from git's perspective) is where the untracked database gets saved by our downloader script.

That database, which is untracked, then gets copied in compressed format into data/library_coverage_xml_and_fulltext_indicators.db.xz, which is tracked.

Thus, the .gitignore line ./library_coverage_xml_and_fulltext_indicators.db* shouldn't be affecting the data/ copy of the database. But if it is, I should then add a new line, !data/library_coverage_xml_and_fulltext_indicators.db.xz, as you suggested, correct?

dhimmel commented 6 years ago

@publicus I agree with your commentary above. Can you repost it in a new issue, since this PR isn't the ideal place for that discussion? Coincidentally, I just opened #15 about manually investigating certain calls.

dhimmel commented 6 years ago

I see a place where we may have been talking past each other

Got it. I didn't realize the database was in the top-level directory. It really would make the most sense in the data directory? Would it be difficult to move the db location? If not, could we do that in this PR?

Otherwise, this all looks good.

dhimmel commented 6 years ago

Got it. I didn't realize the database was in the top-level directory. It really would make the most sense in the data directory? Would it be difficult to move the db location? If not, could we do that in this PR?

Actually I think we should do this in a separate PR that will be quick after merging this one. Will merge.
