Closed jglev closed 6 years ago
As of this writing, I've queried just over 106,000 DOIs. The full dataset has ~290,000, as I remember. I currently plan to leave the downloader running over the weekend.
I've committed the dataset (containing all 290,120 rows), as well as an RMarkdown file for generating a markdown-formatted table that can presumably be worked into your code for the figure you mentioned. How does this look to you?
Here's the current table output from that RMarkdown file:
oa_doi_color | no_access_percent | yes_access_percent | yes_access_rate | oa_color_total |
---|---|---|---|---|
bronze | 17.79 | 82.21 | 36077 | 43886 |
closed | 17.43 | 82.57 | 150934 | 182804 |
gold | 3.37 | 96.63 | 22018 | 22786 |
green | 9.78 | 90.22 | 23441 | 25981 |
hybrid | 15.45 | 84.55 | 12397 | 14663 |
To make sure the table's column names make sense, take the OA "bronze" level as an example:
yes_access_percent: yes_access_rate / oa_color_total = 36,077 / 43,886 = 82.21% of the DOIs queried in that category.
no_access_percent: 100 - 82.21 = 17.79% of the DOIs queried in that category.
Ha, and just to double-check, I just now ran the RMarkdown file, and then ran `length(which(original_dataset_with_oa_color_column$oadoi_color == "bronze"))` to check from the original dataset itself that the number of Bronze DOIs is 43,886. To confirm, it is. : )
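As a quick cross-check of the bronze row, the two percentages can be recomputed from the counts in the table. This is only a sketch using `awk` for the arithmetic; the variable names are made up for illustration and do not come from the RMarkdown file.

```shell
# Counts copied from the "bronze" row of the table above.
yes_access=36077
total=43886

# awk does the floating-point arithmetic; printf rounds to two decimals.
yes_percent=$(awk -v y="$yes_access" -v t="$total" 'BEGIN { printf "%.2f", 100 * y / t }')
no_percent=$(awk -v y="$yes_access" -v t="$total" 'BEGIN { printf "%.2f", 100 * (t - y) / t }')

echo "yes_access_percent: $yes_percent"
echo "no_access_percent:  $no_percent"
```

Swapping in the counts from another row should reproduce that row's percentages in the same way.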
Great! Excited for these access calls and incorporating them into the Sci-Hub Manuscript.
`data/library_coverage_xml_and_fulltext_indicators.tsv.xz` still isn't tracked with LFS. Perhaps stop tracking it and then re-add it.
I did confirm that it contained 290121 lines:
curl --location --silent \
https://github.com/publicus/library-access/raw/5f04fbcacbef4cefcba41c79b23d58294afc6b72/data/library_coverage_xml_and_fulltext_indicators.tsv.xz \
| xzcat | wc --lines
So that's good.
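The count of 290121 lines is one more than the 290,120 rows mentioned earlier, which would be consistent with a single header line in the TSV. That is an assumption on my part, not something checked here. A minimal sketch of counting data rows separately from the header, using a tiny invented sample rather than the real file:

```shell
# Tiny stand-in for the decompressed TSV: one header line plus three rows.
# The column names and DOIs here are invented for illustration.
sample=$(printf 'doi\tfull_text\n10.1/a\t1\n10.1/b\t0\n10.1/c\t1\n')

# Total lines, including the header (tr strips padding some wc versions add).
total_lines=$(printf '%s\n' "$sample" | wc -l | tr -d ' ')

# Data rows only: skip the first line before counting.
data_rows=$(printf '%s\n' "$sample" | tail -n +2 | wc -l | tr -d ' ')

echo "lines: $total_lines, data rows: $data_rows"
```

Against the real file, the same `tail -n +2` step could be placed between `xzcat` and `wc` in the pipeline above.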
There's still this problematic line in `.gitignore`:
./library_coverage_xml_and_fulltext_indicators.db*
Can you change it to:
data/library_coverage_xml_and_fulltext_indicators.db
I've made a new commit to remove and re-add `data/library_coverage_xml_and_fulltext_indicators.tsv.xz`. Has the tracking of that file been solved in f2d98d9? I'm having trouble telling.
Re: the line in `.gitignore`, removing the wildcard will cause `git` to prompt users to add `library_coverage_xml_and_fulltext_indicators.db-shm` and `library_coverage_xml_and_fulltext_indicators.db-wal`, which are created whenever the database is opened (because write-ahead logging is turned on). That seems undesirable to me -- does it seem desirable to you, though?
I've been thinking more about what the table I posted above is telling us, and have a few thoughts to discuss / figure out together, so I'll type those up next...
Both XZ files are now tracked with LFS. See "Git LFS file not shown" under Files Changed.
It's wrong (although possible) to track a file that's ignored. How about:
data/library_coverage_xml_and_fulltext_indicators.db*
!data/library_coverage_xml_and_fulltext_indicators.db.xz
I think that should track the XZ file and ignore the others (see https://git-scm.com/docs/gitignore)
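One way to check the two lines above is to try them in a throwaway repository. This is a sketch under stated assumptions: the scratch directory and the empty placeholder files are invented for the test, and only the file names come from this thread.

```shell
# Scratch repository so nothing in a real checkout is touched.
repo=$(mktemp -d)
cd "$repo" || exit 1
git init --quiet

mkdir data
db=data/library_coverage_xml_and_fulltext_indicators.db

# The two proposed .gitignore lines: ignore everything starting with the
# database name, then re-include the compressed copy.
printf '%s\n' "${db}*" '!'"${db}.xz" > .gitignore

# Empty stand-ins for the database, its write-ahead-log companions, and the XZ copy.
touch "${db}" "${db}-shm" "${db}-wal" "${db}.xz"

# List untracked files that are NOT ignored.
visible=$(git ls-files --others --exclude-standard)
echo "$visible"
```

If the patterns behave as intended, the listing should contain `.gitignore` and the `.db.xz` file, but not the `.db`, `.db-shm`, or `.db-wal` files.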
I spoke with my supervisor this afternoon about the table I posted above, and we came out of our conversation with several questions about the data, and what to draw from them. From our conversation, there are two big points that I think are important to note:
As an example: Our results indicate that the Library's system would tell users that they have access to 82.21% of the "bronze" DOIs -- but by definition, all bronze DOIs should be available to users, since they're openly accessible through the publisher's website. (A similar point applies to gold and green DOIs.)
We can take an example DOI from that remaining bronze 17.79%, 10.1002/2013JD021255. If we go directly in a web browser to doi.org/10.1002/2013JD021255, we get the publisher's webpage for the article, which does have full-text (at least from my system as I write this, on Penn's campus). If a user goes to the Library's search tool, though (click here, then click on "Penn Text Article Finder" at the bottom of the page), and enters 10.1002/2013JD021255 in the DOI field, she'll get this page, which does not reflect that full-text access.
The process of resolving a DOI, comparing it to a list of journal subscriptions, and then figuring out whether full text is available is complicated, and could break down at any of several steps, including:
This is all to say that it's not yet apparent where that 17.79% disconnect comes from. It could also be the case that some of the DOIs themselves don't resolve (as an additional issue alongside those enumerated above).
This is a smaller element to note; it's slightly different from what the State of OA authors (page 6) defined "hybrid" as: "Toll-access on the publisher page, but there is a free copy in an OA repository."
I think there are two main points to keep in mind as we incorporate this into the manuscript:
In any case, these seem like things to note explicitly in the write-up wherever these data get incorporated.
Does this all make sense as I'm writing it here? @dhimmel, are there thoughts that you have around this?
Oh, I see a place where we may have been talking past each other: `./library_coverage_xml_and_fulltext_indicators.db*` (i.e., the top-level directory of this repo, from `git`'s perspective) is where the untracked database gets saved by our downloader script.
That database, which is untracked, then gets copied in compressed format into `data/library_coverage_xml_and_fulltext_indicators.db.xz`, which is tracked.
Thus, the `.gitignore` line `./library_coverage_xml_and_fulltext_indicators.db*` shouldn't be affecting the `data/` copy of the database. But if it is, I should then add a new line, `!data/library_coverage_xml_and_fulltext_indicators.db.xz`, as you suggested, correct?
@publicus I agree with your commentary above. Can you repost it in a new issue, since this PR isn't the ideal place for that discussion? Coincidentally, I just opened #15 about manually investigating certain calls.
> I see a place where we may have been talking past each other

Got it. I didn't realize the database was in the top-level directory. It really would make the most sense in the data directory? Would it be difficult to move the db location? If not, could we do that in this PR?
Otherwise, this all looks good.
> Got it. I didn't realize the database was in the top-level directory. It really would make the most sense in the data directory? Would it be difficult to move the db location? If not, could we do that in this PR?

Actually, I think we should do this in a separate PR that will be quick after merging this one. Will merge.
This is an in-progress PR, which will eventually contain the dataset I'm downloading and processing from the State of OA DOIs list.