greenelab / library-access

Collecting data on whether library access to scholarly literature is available

Adding full text true rate estimator #8

Closed jglev closed 6 years ago

jglev commented 6 years ago

This is a work-in-progress PR which depends on (i.e., is branched from) PR #7 (which adds an API query script and some example data). It adds an analysis script, written in R, for performing a Bayesian estimation of the "true" rate of full-text access to DOIs like those in the dataset.

I was planning to wait to add this until #7 is merged, but since I've written the code and was playing with it this morning, I figured I'd put it up here now.

The Bayesian analysis follows one of the most basic examples in the literature -- it uses a Bernoulli likelihood (data-generating) function with a flat beta-distributed prior (i.e., Beta(1, 1)). Our binary full-text access indicator data are essentially the same type as coin-flip data, which makes this textbook example easy to apply.
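For context, a flat Beta(1, 1) prior is conjugate to the Bernoulli likelihood, so the posterior has a closed form; that makes the Stan results easy to sanity-check. A minimal sketch in R (not the PR's actual script), using a made-up indicator vector in place of the real data:

```r
# With a flat Beta(1, 1) prior and a Bernoulli likelihood, the posterior is
# Beta(1 + successes, 1 + failures), so it can be computed in closed form.
# `full.text.indicator` is a made-up stand-in for the real access column.
full.text.indicator <- c(1, 1, 0, 1, 0, 1, 1, 1, 0, 1)

successes <- sum(full.text.indicator)
failures  <- length(full.text.indicator) - successes

posterior.alpha <- 1 + successes
posterior.beta  <- 1 + failures

# Posterior mean estimate of the "true" full-text access rate
posterior.mean <- posterior.alpha / (posterior.alpha + posterior.beta)

# 95% equal-tailed credible interval
credible.interval <- qbeta(c(0.025, 0.975), posterior.alpha, posterior.beta)
```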

Portions of this code (the model for Stan, the Bayesian estimation software) are from a BSD-3-Clause-licensed example (the author calls it the "new BSD license"), as cited in the script's copyright section and above where the code itself is used.

I've written this to comply with the Google R Style Guide cited by the Greene Lab Onboarding documentation.

Todo

jglev commented 6 years ago

To clarify, since this PR inherits all of the commits from #7, the one new file in this new PR is estimate_true_rate_of_fulltext_access_bayesian_approach.R.

jglev commented 6 years ago

One more explanatory note: The script currently draws data from both of the locations used in PR #7: it gets full-text access information from the SQLite database, and joins that to DOI open-access "colors" from the original TSV dataset. This will allow subsetting the data by color, should we eventually want to.
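As a rough sketch of that two-source join (the file, table, and column names below are illustrative, not the repo's actual schema):

```r
library(DBI)
library(RSQLite)
library(readr)
library(dplyr)

con <- dbConnect(SQLite(), "library_access.sqlite")

# Full-text access indicators from the SQLite database
access.data <- dbGetQuery(
  con,
  "SELECT doi, full_text_indicator FROM dois_table"
)
dbDisconnect(con)

# Open-access colors from the original TSV dataset
oa.colors <- read_tsv("state_of_oa_dois.tsv")

# Join on DOI, enabling subsetting by OA color later
combined.data <- inner_join(access.data, oa.colors, by = "doi")
```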

Beyond simply subsetting to more-closed DOIs (e.g., 'closed' and 'bronze'), one useful extension would be to add OA color as a predictor in the model. I'm not sure whether that's a useful approach in the larger context of this project and its manuscript, so I'm just putting it out here for discussion before implementing it.

jglev commented 6 years ago

Finally, the Lab onboarding documentation notes that "We write code for our analyses in Python or R, which allows everyone in the lab to know two languages and understand analytical code." Unlike Python (with SQLAlchemy), I don't think R has an object-based interface for SQL, so the script includes a basic, but raw, SQL query. Hopefully this is in keeping with the spirit of the onboarding documentation (I'm not sure whether you consider SQL a separate language), as I don't think there's another way in R to access the data in the SQLite database.

dhimmel commented 6 years ago

Yes. SQL is fine! Although you may want to check out dbplyr:

The goal of dbplyr is to automatically generate SQL for you so that you’re not forced to use it. However, SQL is a very large language and dbplyr doesn’t do everything.

It looks like this package (github) makes dplyr work with database backends.
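For example, a hypothetical dbplyr version of the script's query might look like the following (table and column names are again illustrative):

```r
library(DBI)
library(RSQLite)
library(dplyr)
library(dbplyr)

con <- dbConnect(SQLite(), "library_access.sqlite")

dois <- tbl(con, "dois_table")  # lazy table reference; nothing is pulled yet

fulltext.rates <- dois %>%
  select(doi, full_text_indicator) %>%
  collect()  # runs the generated SQL and returns a tibble

# show_query(dois) prints the SQL that dbplyr generates behind the scenes
dbDisconnect(con)
```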

I was imagining we'd want to extract a TSV from the databases that omits the api_response column. This way it's really easy to read in the access data. My thinking is we should have a PR that converts the DB to a TSV and that should come before further analysis of the data. Then the data analyses can read the TSV, which should be more development-friendly.
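A minimal sketch of that conversion step, under the same hypothetical naming assumptions as above:

```r
# Read the table from SQLite, drop the bulky api_response column, and
# write a TSV for downstream analyses. Names are illustrative.
library(DBI)
library(RSQLite)
library(dplyr)
library(readr)

con <- dbConnect(SQLite(), "library_access.sqlite")
full.table <- dbReadTable(con, "dois_table")
dbDisconnect(con)

full.table %>%
  select(-api_response) %>%
  write_tsv("library_coverage.tsv")
```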

I'm going to hold off on reviewing this PR until we merge #7. It's not clear to me that we want to be doing these analyses in this repo as opposed to greenelab/scihub... if we're computing intervals for the library access coverage, we should probably also do that for the Sci-Hub coverage.

jglev commented 6 years ago

I apologize for my delay in responding -- I was at a conference this week and returned yesterday evening.

As always, thanks for your comments. I hadn't heard about dbplyr; it looks like a good tool for the toolkit. For now, I'll leave the SQL, since you wrote it's ok by Lab guidelines, but am open to incorporating dbplyr in the future.

To confirm that I've understood, does the following match what you'd want to see in a PR before accepting it?

Split this PR into two PRs:

  1. One PR on this repo, to convert data from the database into a TSV, with the columns from the original TSV plus a full_text_indicator column
  2. One PR on greenelab/scihub, to do the actual estimation of full-text access

Responding to your final sentence: I added the estimation step because of @tamunro's original suggestion to sample, which alluded to computing a Confidence Interval; I agree with both steps. (As noted here in a comment in the code, a Confidence Interval should come out, in this case, to be the same as a Bayesian Credible Interval; I think it's beneficial for the scientific literature to prefer the latter, so I used it here.) As I understand it, since the Sci-Hub coverage analysis was essentially a census rather than a sample, an interval estimate wouldn't be necessary for the Sci-Hub aspect of the paper. Does that understanding jibe with yours?
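As a quick numeric illustration of that interval claim (with made-up counts, not our data): both intervals are beta quantiles with nearly identical parameters, so they come out very close:

```r
# Made-up counts for illustration only.
successes <- 85
n <- 100

# 95% exact (Clopper-Pearson) confidence interval, as reported by binom.test
confidence.interval <- binom.test(successes, n)$conf.int

# 95% equal-tailed credible interval under a flat Beta(1, 1) prior
credible.interval <- qbeta(c(0.025, 0.975), 1 + successes, 1 + n - successes)

# The two differ only slightly (the Clopper-Pearson bounds use
# Beta(successes, n - successes + 1) and Beta(successes + 1, n - successes)
# quantiles), and the gap shrinks as n grows.
```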

dhimmel commented 6 years ago

Split this PR into two PRs

Yep, I think we're on the same page.

Regarding the second PR, I was thinking we'd add UPenn's coverage as bars in Figure 8B. Since these measure coverage on samples of DOIs, intervals would be appropriate. Also Figure 2A could use the same interval.

Let's save the design discussion for these intervals for later. It'll probably make the most sense for me to make the updates to these figures. However, I'm happy to take advice on calculating the intervals. I haven't used BCIs before and would be open to using them if there is a conceptual advantage and the implementation is straightforward.

jglev commented 6 years ago

Ah, thanks for pointing me to Figure 8B in the manuscript. I misunderstood, so to update what I said before: I now think (and this seems to agree with what you wrote earlier) that the Credible Interval or Confidence Interval analysis is not necessary for the library records, given that they'll be incorporated into that figure (or one like it).

If, however, the idea is to extrapolate from this and make inferences about DOI access beyond this dataset, I agree that we'd want to do the analysis on both the library data and the Sci-Hub data that will go into that figure. In that case, the R script I have here (which I'll eventually transfer to a PR on the greenelab/scihub repo) is straightforward to use; it just needs a line added to the settings section to subset for different levels of access (Closed, Hybrid, Green, etc.).
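For instance, assuming the joined data frame described earlier and a hypothetical oadoi_color column, that subsetting line would be something like:

```r
# `combined.data` and `oadoi_color` are the hypothetical names used in the
# earlier sketches; filter() is from dplyr.
library(dplyr)

closed.and.bronze <- combined.data %>%
  filter(oadoi_color %in% c("closed", "bronze"))
```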