greenelab / covid19-review

A collaborative review of the emerging COVID-19 literature. Join the chat here:
https://gitter.im/covid19-review/community

Making our Data More Accessible (for Internal and External Use) #172

Closed rdvelazquez closed 4 years ago

rdvelazquez commented 4 years ago

Is your feature request related to a problem? Please describe. This review paper has multiple sources of information, but that information isn't easily accessible in a programmatic way.

It would be nice to include this information in the topic-first paper tracking that @rando2 proposed in #144, but the information from our project isn't currently in an accessible form.

Describe the solution you'd like Create a dataset of the papers we've looked at, with information like the following:

A TSV file with general paper columns (DOI, title, author, date) and project-specific columns (issue links, PR links, paper links) would probably be fine. Maybe also include an Excel file for easy download so those less technically inclined aren't scared away.

I've already started looking at getting some of this information, and it doesn't look too hard.

Describe alternatives you've considered

Additional context Relates to #96, #144, and #163.

rdvelazquez commented 4 years ago

@rando2 and anyone else: Let me know your thoughts on this. If we think it's something that would be useful, I'll work on opening a pull request with the code to generate a comprehensive TSV file on each Travis build. We could then look at how to sync this with #144, #163, and any other sources of info.

rando2 commented 4 years ago

Hi @rdvelazquez, this is amazing because I spent all last night trying to figure out how to do this, but I use APIs approximately once every five years, so everything I know appears to be out of date and I did not make a ton of progress!

I had cloned @cornhundred's repository and my plan for today was to try to use PyGithub to pull the issues to make a TSV linking our issue numbers with the DOIs that @cornhundred's code pulls. This would also allow us to note where there is a Mount Sinai Immunology review in the appendix.
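For reference, roughly what I was attempting, as a minimal PyGithub sketch (the token is a placeholder, and the regex is a rough heuristic that assumes DOIs appear somewhere in the issue body):

```python
import csv
import re

from github import Github  # PyGithub

# Rough DOI heuristic; real issue bodies may need more careful parsing.
DOI_PATTERN = re.compile(r"10\.\d{4,9}/[^\s\"<>]+")

g = Github("YOUR_GITHUB_TOKEN")  # placeholder; a token avoids the low anonymous rate limit
repo = g.get_repo("greenelab/covid19-review")

with open("issue_dois.tsv", "w", newline="") as handle:
    writer = csv.writer(handle, delimiter="\t")
    writer.writerow(["issue_number", "title", "doi"])
    for issue in repo.get_issues(state="all"):
        if issue.pull_request is not None:
            # The API returns PRs as issues too; skip them here.
            continue
        match = DOI_PATTERN.search(issue.body or "")
        writer.writerow([issue.number, issue.title, match.group(0) if match else ""])
```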

There is also some chat on Gitter about data sources that pull COVID-19-related papers from other journals: https://covid-19.cochrane.org/ and https://github.com/Aitslab/corona

As you describe above, it would be amazing if we could match the issue numbers & PRs with DOIs in a centralized location, like what @cornhundred has done for Mount Sinai Immunology.

I would be very happy to offer back-up support if you have a clear vision for this and want to take the lead, and I am wholeheartedly in favor of doing this in Python. I had originally wanted to do something very similar to what @cornhundred has set up but got stuck trying to parse what I pulled from bioRxiv! I think it might also be useful to filter by subject area to give suggested reading lists for each topic group (like what I was trying to initialize in #144), but this is the first step and very important to get right! I really like @agitter's idea of including a table in the methods pointing to each HTML page, if that ends up working out.

cornhundred commented 4 years ago

Let me know if you all have any questions; I'm happy to help 😊

rdvelazquez commented 4 years ago

Sounds good! The GitHub API was a bit of a pain at first, but once I got going it was pretty nice. I just make HTTP GET requests, get JSON back, and parse from there.
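In sketch form, it's basically this (unauthenticated here, so GitHub's anonymous rate limit applies; an auth token raises it):

```python
import requests

# Page through every issue in the repo (the REST API returns PRs here too).
url = "https://api.github.com/repos/greenelab/covid19-review/issues"
params = {"state": "all", "per_page": 100, "page": 1}
issues = []
while True:
    response = requests.get(url, params=params)
    response.raise_for_status()
    page = response.json()
    if not page:
        break
    issues.extend(page)
    params["page"] += 1

for issue in issues:
    labels = [label["name"] for label in issue["labels"]]
    print(issue["number"], issue["title"], labels)
```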

I'll open up a work-in-progress pull request sometime today or tomorrow for getting our data (DOI, issue links, PR links, covid19-review paper links) into a TSV.

From there it would probably be best to:

  1. Agree on a standard form for data from other sources (maybe TSVs with DOI, title, and date, plus other info)
  2. Create a few functions for joining data in these forms together (try to match on DOI; if that doesn't work, fuzzy match on title/date; see the sketch after this list)
  3. Create a master dataset of all the info that we want
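For step 2, a minimal sketch of what I mean (the file and column names are hypothetical, and difflib stands in for whatever fuzzy matcher we settle on):

```python
from difflib import SequenceMatcher

import pandas as pd

# Hypothetical inputs: both TSVs have "doi" and "title" columns.
ours = pd.read_csv("our_papers.tsv", sep="\t")
theirs = pd.read_csv("external_source.tsv", sep="\t")

# Step 2a: exact join on DOI.
merged = ours.merge(theirs, on="doi", how="left", suffixes=("", "_ext"))

# Step 2b: fuzzy title match for rows the DOI join missed.
def best_title_match(title, candidates, threshold=0.9):
    """Return the most similar candidate title, or None if nothing clears the bar."""
    scored = [(SequenceMatcher(None, title.lower(), c.lower()).ratio(), c)
              for c in candidates]
    score, match = max(scored, default=(0.0, None))
    return match if score >= threshold else None

candidates = theirs["title"].dropna().tolist()
unmatched = merged["title_ext"].isna() & merged["title"].notna()
merged.loc[unmatched, "fuzzy_title_match"] = merged.loc[unmatched, "title"].apply(
    lambda t: best_title_match(t, candidates)
)
```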

I'll take a crack at 1!

cgreene commented 4 years ago

@rdvelazquez : not sure if you've seen https://github.com/greenelab/covid19-review/pull/163, but we're bringing the Mt Sinai reviews into an appendix as well :)

rando2 commented 4 years ago

@rdvelazquez Excellent! I can start looking at the various sources for DOIs to see a) how many sources we need, and b) how hard they will be to standardize.

agitter commented 4 years ago

I love this idea.

I've already been trying to go through new paper issues as they're created to standardize the Manubot citation, using the DOI whenever possible. I can continue helping with that and could work through (some) old issues if that's needed to help with cross-referencing.

@rdvelazquez I don't know where you would need it, but I wanted to point out that Manubot outputs citations.tsv and references.json to the git repo's output branch after all of its reference processing. It looks like you've already made a lot of progress using the GitHub API directly. In some cases, I've found it faster to use GitHub Actions workflows that wrap GitHub API calls, but those don't provide an advantage if the API is already working for you.

rdvelazquez commented 4 years ago

Thanks @agitter. The references.json is very helpful. I'm pulling the id field from there and am able to provide a link to the citation in the HTML covid19-review paper.
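Roughly what that looks like (a sketch; the #ref-&lt;id&gt; anchor format is my assumption about the rendered HTML and worth double-checking):

```python
import requests

# references.json lands on the repo's output branch after each Manubot build.
url = "https://raw.githubusercontent.com/greenelab/covid19-review/output/references.json"
references = requests.get(url).json()

# Map DOI -> Manubot citation id, then build a link into the rendered paper.
doi_to_id = {ref["DOI"].lower(): ref["id"] for ref in references if "DOI" in ref}

doi = "10.1159/000507423"
if doi in doi_to_id:
    # Assumes the rendered HTML anchors each bibliography entry as #ref-<id>.
    print(f"https://greenelab.github.io/covid19-review/#ref-{doi_to_id[doi]}")
```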

One thing that may be helpful regarding paper issues would be adding tags (which we can get from the GitHub API). Things like "reviewed" and "needs-review" come to mind.

rando2 commented 4 years ago

@rdvelazquez I can definitely add labels! I'm trying to think of how to phrase it so people feel welcome to continue contributing even if something has already been looked at by two people. Maybe something like "Reviews Needed" (if <2) and "Additional Reviews Welcome" (if >= 2)?

rdvelazquez commented 4 years ago

That looks good to me. If you wanted to get more nuanced, you could add tags like "pre-print", "First Review", "Second Review", and "External Review" (if Mt Sinai has reviewed it) and then triage based on these, but that might be getting too detailed. Your suggestion above seems good to me.

rando2 commented 4 years ago

Hi @rdvelazquez, the tags should be in place now: for now, "Needs Review" and "Additional Reviews Welcome". Note: I didn't compare against the Mount Sinai Immunology list because I figured you'd rather have the tags to work with quickly! Linking them up with the Appendix would take a bit longer.
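For anyone who wants to pull issues by tag programmatically, a minimal sketch against the REST API (label names must match exactly; URL-encoding is handled by `params`):

```python
import requests

# The issues endpoint accepts a comma-separated `labels` filter.
url = "https://api.github.com/repos/greenelab/covid19-review/issues"
params = {"labels": "Needs Review", "state": "open", "per_page": 100}
for issue in requests.get(url, params=params).json():
    print(issue["number"], issue["title"])
```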

Here's what I found about different datasets:

Category 1: Traditional Publishing

In terms of traditional publishing datasets, there are pros and cons to everything I looked at (guided by @SonjaAits's resources).

Category 2: Preprints

These two lists above cover just traditional publishing venues. @cornhundred already has a way to pull from bioRxiv and medRxiv. I do think there have been some relevant preprints coming out on chemRxiv as well, but starting with these two should get us a huge percentage of the new material coming out!
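For what it's worth, I believe bioRxiv exposes a dedicated COVID-19 collection covering both servers; a minimal sketch, with the endpoint and field names being my best guess and worth verifying:

```python
import requests

# bioRxiv's COVID-19 collection (covers bioRxiv and medRxiv),
# paginated by cursor in steps of roughly 30.
preprints = []
cursor = 0
while cursor < 90:  # demo: stop after a few pages
    data = requests.get(f"https://api.biorxiv.org/covid19/{cursor}").json()
    batch = data.get("collection", [])
    if not batch:
        break
    preprints.extend(batch)
    cursor += len(batch)

for item in preprints[:5]:
    print(item.get("rel_doi"), item.get("rel_site"), item.get("rel_title"))
```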

Category 3: Clinical Trials

The other main category that might be of interest is clinical trial registrations. cochrane.org has a REST API set up, but I couldn't figure out how to access the covid-19.cochrane.org/ URL (and cochrane.org/covid-19 and similar variants don't work). I don't know if anyone with more API experience has alternative ideas! Instead, it seems like we could just use the ClinicalTrials.gov API to pull records with COVID-19 as a search term.
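Something like this against the study_fields endpoint, with parameter and field names taken from its docs as I understand them (treat it as a sketch):

```python
import requests

url = "https://clinicaltrials.gov/api/query/study_fields"
params = {
    "expr": "COVID-19",
    "fields": "NCTId,BriefTitle,OverallStatus",
    "min_rnk": 1,
    "max_rnk": 100,  # this endpoint returns records by rank range
    "fmt": "json",
}
data = requests.get(url, params=params).json()
for study in data["StudyFieldsResponse"]["StudyFields"]:
    # Each field comes back as a list, hence the [0]s.
    print(study["NCTId"][0], study["BriefTitle"][0])
```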

My vote would be to integrate LitCovid with @cornhundred's script to get started, assuming matching a PMID to a DOI is a straightforward problem?

agitter commented 4 years ago

> I assume it's feasible to convert DOI to PMID given that manubot is able to work with both

The citation information Manubot extracts will often include both. If you cite by DOI, often it can obtain the PMID as well. If you cite by PMID, in most cases it can obtain a DOI. There are many cases where there is no PMID for a valid DOI and some cases where there is no DOI for a valid PMID.

If we're trying to use Manubot to automatically do cross-referencing of citations to the same article that appears multiple times in the review manuscript, it won't automatically resolve different types of identifiers to the same article. Citing doi:10.1159/000507423 and pmid:32259829 will be treated as distinct identifiers even though they are the same manuscript.

If we want to use Manubot in a script to convert from PMID to DOI as @rando2 suggested, I can help with that.
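A minimal sketch of such a script, shelling out to the manubot CLI and assuming its default output is a CSL JSON array:

```python
import json
import subprocess

def pmid_to_doi(pmid):
    """Ask Manubot for the CSL record of a PMID and read off the DOI, if present."""
    result = subprocess.run(
        ["manubot", "cite", f"pmid:{pmid}"],
        capture_output=True, check=True, text=True,
    )
    csl_items = json.loads(result.stdout)
    return csl_items[0].get("DOI")

# The pair from the example above: pmid:32259829 <-> doi:10.1159/000507423
print(pmid_to_doi("32259829"))
```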

> I do think there have been some relevant preprints coming out on chemRxiv as well

I've been checking chemRxiv ~weekly. We won't miss much by ignoring it at this point. There are tens of computational studies predicting candidate drugs, which I personally find interesting, but without any in vitro follow-up they may be outside the current scope of this review.

rdvelazquez commented 4 years ago

@rando2 that sounds great. Thanks for tagging all those issues!

@agitter it's good to know that Manubot can do some of the conversion work for us. I saw your comments on the PR and will look into using Manubot for that. Thanks for pointing that out!

agitter commented 4 years ago

@rdvelazquez let me know what features you're interested in. I could save you time digging through the Manubot docs or code.

rdvelazquez commented 4 years ago

Thanks @agitter! The docs for Manubot were actually pretty nice. I did have one question that I left in your PR review (https://github.com/greenelab/covid19-review/pull/174#discussion_r406812648)

rdvelazquez commented 4 years ago

@rando2 and @agitter I'm planning to fix a few small issues with the script that generates the sources_cross_reference.tsv file. Other than that, things are wrapped up on my end.

I'm more than happy to help with #212 and/or #235 in the future (certainly where they intersect with the sources_cross_reference.tsv script/file, but also in other areas), and I'll keep an eye on those issues to see what direction you all decide to go.