outbreak-info / outbreak.info-resources

A curated repository of metadata of resources on COVID-19 and SARS-CoV-2
MIT License

[PROTOCOL] Crawl / manually curate / map Protocol resources into individual metadata records based off of schema #19

Open flaneuse opened 4 years ago

flaneuse commented 4 years ago

Using https://docs.google.com/spreadsheets/d/1FFyRhI5TeUb-B4t50HRZYF_XngRt_mfAIz82UQ3geFk/edit#gid=0

andrewsu commented 4 years ago

In addition to any protocol records in the gdoc linked above, will also want to automate searches in protocols.io (API: https://apidoc.protocols.io/) and Nature Protocols (API: https://dev.springernature.com/)

andrewsu commented 4 years ago

For example, a protocols.io search might look like this: https://www.protocols.io/api/v3/protocols?filter=%22public%22&key=%22covid-19%22

We would want to parse that into a JSON file that is a list of dicts, where the top-level keys are the DOIs and the remaining structure corresponds to the data schema defined in #3...

sundaram-covid19-biohack commented 4 years ago

@andrewsu, thank you for the follow-up.
Happy to work on a data parser for this and other resources. I will review the info cited above (in this issue) and follow up with questions.

sundaram-covid19-biohack commented 4 years ago

Hi @andrewsu

If I understand correctly, you're seeking a software solution for:

  1. the retrieval of protocol data (JSON format) from protocols.io and from Nature Protocols
  2. the transformation of the retrieved datasets (JSON payload) into a Bioschemas-compliant JSON-LD encoding with the DOIs as the top-level keys

Please advise whether there exist (requirements) specification documents that outline:

  1. the mapping from protocols.io JSON to Bioschemas JSON-LD
  2. the mapping from Nature Protocols JSON to Bioschemas JSON-LD

This will help inform the software implementation. Please let me know if I've misunderstood. Happy to connect on Slack to discuss in person if easier.

Thank you, Jay

flaneuse commented 4 years ago

@sundaram-covid19-biohack that sounds exactly correct. We have a prototype of a protocols schema written, but we have yet to do the mapping from protocols.io/Nature Protocols to this schema.

I would suggest:

  1. Pull the protocol metadata from protocols.io and Nature Protocols using the terms in #18 (and any others you think are relevant; note we want to focus on COVID-19 and not SARS/MERS/other coronaviruses at this point).
  2. Figure out what keys are available in the .json document and write a dictionary for protocols.io and a dictionary for Nature Protocols to map those fields to our schema. We'd be happy to review that mapping and/or help you generate it, if you can give us the object keys from each source.
  3. Execute the mapping
  4. DOI is a good ID key, as you suggest
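
The mapping step suggested above (steps 2 and 3) could be sketched roughly as follows. This is a minimal illustration, not the project's code: the source field names in `PROTOCOLSIO_TO_SCHEMA` (`title`, `uri`, `doi`) and the helper `map_record` are assumptions made for the example, and the real keys would have to be read off the protocols.io JSON.

```python
# Hypothetical mapping from source keys to schema keys. The source keys
# here are assumptions; the real ones come from inspecting the JSON.
PROTOCOLSIO_TO_SCHEMA = {
    "title": "name",
    "uri": "url",
    "doi": "identifier",
}


def map_record(source_record, mapping):
    """Rename the mapped source keys; unmapped source keys are dropped."""
    return {
        schema_key: source_record[source_key]
        for source_key, schema_key in mapping.items()
        if source_key in source_record
    }


# Made-up record, used only to show the shape of the transformation.
sample = {"title": "Example protocol", "uri": "example-protocol", "views": 3}
print(map_record(sample, PROTOCOLSIO_TO_SCHEMA))
```

Keeping one such dictionary per source (one for protocols.io, one for Nature Protocols) means the mapping can be reviewed separately from the code that applies it.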

Let us know if you have any questions. Will defer to @andrewsu on connecting on Slack.

sundaram-covid19-biohack commented 4 years ago

@flaneuse Thank you for the clarification.

I've implemented the first program for retrieving the protocols.io JSON.

Am planning to fork this code-base and then commit the first Python program (for retrieving the protocols.io search results in JSON format) along with a control file (covid_terms.txt).

Please advise if you prefer a different method for sharing the code.

FYI - pseudocode:

for term in covid_terms:
  url = 'https://www.protocols.io/api/v3/protocols?filter="public"&key="' + term + '"'
  outfile = outdir + term.replace(' ', '')
  wget(url, outfile)

sundaram-covid19-biohack commented 4 years ago

@flaneuse Please provide a similar URL for the Nature Protocols.

sundaram-covid19-biohack commented 4 years ago

@flaneuse I've attached a reformatted (added newlines and indentation) protocols.io JSON for "covid19".

Please review and advise which fields/values should be extracted and written to the bioschemas JSON-LD file.

Thank you Jay

protocols.io.covid19.reformatted.json.gz

flaneuse commented 4 years ago

@sundaram-covid19-biohack that sounds great. For now, feel free to throw your code into a separate folder; we haven't thought about organization yet but will organize in the future. In general, pull requests from your fork should work nicely for review.

For Nature Protocols, you'll need to use the Springer Nature Metadata API. I'm in the middle of something atm so I don't have a URL off the top of my head, but you'll need to both specify the query and limit it just to protocols (the API includes journal articles, book chapters, protocols but we only want protocols).

We'll take a look at the mapping in a bit. Thanks!

gkarthik commented 4 years ago

@sundaram-covid19-biohack Thank you for taking a crack at this! As @flaneuse said, pull requests would be convenient to review.

The fields in the protocols schema prototype have comments that may help in deciding the mapping between protocols.io and the schema. Once you have a mapping that you think works, please submit a pull request and we can review it.

flaneuse commented 4 years ago

@sundaram-covid19-biohack is the .json-ld a random subset of 2 results, or all results? Seems like there should be more than 2 protocols.

andrewsu commented 4 years ago

> Will defer to @andrewsu on connecting on Slack.

Let's try to stick to GitHub issues as much as possible in the interest of persistence and transparency. If real-time communication becomes important to work through particularly sticky issues, we can use Slack or web conferencing...

> Please advise whether there exist (requirements) specification documents that outline:
>
>   1. the mapping from protocols.io JSON to Bioschemas JSON-LD
>   2. the mapping from Nature Protocols JSON to Bioschemas JSON-LD

And just to reiterate what @gkarthik and @flaneuse wrote above, it would be great if you could take an initial crack at this mapping! Thanks @sundaram-covid19-biohack!

sundaram-covid19-biohack commented 4 years ago

@gkarthik : for the record- I've derived the following 40 terms (and corresponding comments) from the protocols schema prototype and will use those to inform my derivation of a mapping from the protocols.io search results JSON content.

sundaram-covid19-biohack commented 4 years ago

@gkarthik - my first attempt at deriving a mapping is here.
While I am happy to take the first pass at them, would like to request that an SME derive the mappings.

I can see that there is some author metadata available in the protocols.io JSON.
I'll work on that next. Per the protocols schema prototype, it looks like there is only support for one author (see line 52). Should the software preferentially select the first author?

The script that uses the mapping file to actually extract the data from the protocols.io JSON has already been implemented.

sundaram-covid19-biohack commented 4 years ago

@flaneuse : Sorry I missed your previous comment.
Not sure why there are only 2 results.
That is what the software retrieved.
I can confirm that when you navigate to the same URL in the browser- only 2 results are available.

flaneuse commented 4 years ago

Thanks @sundaram-covid19-biohack -- I wasn't sure if those were the results from the first query terms or all the query terms.

flaneuse commented 4 years ago

> Per the protocols schema prototype, it looks like there is only support for one author (see line 52). Should the software preferentially select the first author?

@sundaram-covid19-biohack The owl:cardinality: many attribute on author indicates that author should be an array of Authors, so the mapping file should map the list of authors scraped from protocols.io into an author array.

andrewsu commented 4 years ago

> The script that uses the mapping file to actually extract the data from the protocols.io JSON has already been implemented.

@sundaram-covid19-biohack can you post the output of that script for us to look at?

sundaram-covid19-biohack commented 4 years ago

@flaneuse Thank you for the pointer. Will adjust accordingly. @andrewsu Planning to resume the effort a couple of hours from now.
Will post the requested output at that time.
Will also provide a readme that outlines how the software should be installed and executed.

sundaram-covid19-biohack commented 4 years ago

@andrewsu sample invocation:

python extract_terms_from_protocols_io_json.py --protocols_json_file COVID19.reformatted.json \
--protocols_schema_mapping_file mapping/protocols_schema_mapping.txt \
--verbose

The two required command-line parameters, --protocols_json_file and --protocols_schema_mapping_file, reference files committed to the repo.

If --outfile is not specified, a default will be assigned.
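
For readers without access to the script, the command-line interface described above might look something like this argparse sketch. The four flag names come from this thread; the helper names and the default output file pattern (`<input stem>_extracted_data.json`) are assumptions, since the thread does not say how the default is chosen.

```python
import argparse
import os


def build_arg_parser():
    """Build a parser with the flags mentioned in this thread."""
    parser = argparse.ArgumentParser(
        description="Extract schema terms from a protocols.io JSON file."
    )
    parser.add_argument("--protocols_json_file", required=True)
    parser.add_argument("--protocols_schema_mapping_file", required=True)
    parser.add_argument("--outfile", default=None)
    parser.add_argument("--verbose", action="store_true")
    return parser


def resolve_outfile(args):
    """Return --outfile if given, else a name derived from the input file.

    The naming pattern used for the default is an assumption of this sketch.
    """
    if args.outfile:
        return args.outfile
    stem = os.path.splitext(os.path.basename(args.protocols_json_file))[0]
    return stem + "_extracted_data.json"


if __name__ == "__main__":
    cli_args = build_arg_parser().parse_args()
    print(resolve_outfile(cli_args))
```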

Output file attached (I keep gzipping these files because Github does not allow .json files to be attached.)

Note that this does not yet include support for extracting the author data.

COVID19_extracted_data.json.gz

sundaram-covid19-biohack commented 4 years ago

@andrewsu - Please see the corrected output (attached).
@flaneuse - This includes the authors list. Will work on the author affiliation next.

COVID19_extracted_data_v2.json.gz

flaneuse commented 4 years ago

Thanks @sundaram-covid19-biohack. Do you have a list of all the Object keys found in all the results you retrieved? (sorry if I missed this) If so, we can help with the mapping table you're generating. It'd be good to do the full join of all the properties available from protocols.io with the properties in our schema, to be able to see what protocols.io has that we don't, and what we would like that they don't have. Thanks!
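
One way to produce the key list and the "full join" requested here is sketched below. It is an illustration only: `collect_keys` and `full_join` are hypothetical helpers, and only top-level keys are considered, so nested properties would need extra handling.

```python
def collect_keys(records):
    """Return the set of top-level keys seen in any record."""
    keys = set()
    for record in records:
        keys.update(record.keys())
    return keys


def full_join(source_keys, schema_keys, mapping):
    """Pair source and schema properties, with None where one side is missing.

    `mapping` maps source key -> schema key for the pairs already decided.
    """
    rows = []
    mapped_schema_keys = set()
    for source_key in sorted(source_keys):
        schema_key = mapping.get(source_key)
        if schema_key is not None:
            mapped_schema_keys.add(schema_key)
        rows.append((source_key, schema_key))
    # Schema properties with no source counterpart come last.
    for schema_key in sorted(set(schema_keys) - mapped_schema_keys):
        rows.append((None, schema_key))
    return rows
```

Rows of the form `(source_key, None)` show what the source has that the schema lacks, and rows of the form `(None, schema_key)` show what the schema wants that the source does not provide.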

sundaram-covid19-biohack commented 4 years ago

@flaneuse - Please see attached output.
Contains the author.affiliation list.
Please let me know if I interpreted the protocols bioschema specification properly. I suspect that the expectation is for the authors info to be a list of dictionaries with keys 'name' and 'affiliation', as opposed to the result the software is currently producing.

COVID19_extracted_data_v3.json.gz

sundaram-covid19-biohack commented 4 years ago

Here is the list of Object keys (and corresponding rdfs:comment values) that I derived from the protocols schema prototype.

I used that list to guide my manual inspection of the protocols.io results JSON in order to identify candidate keys.

For the bioschema object keys that did have a candidate in the protocols.io output JSON, I provided a corresponding jsonpath search parameter in this mapping file.

The first column is the bioschema key/term and the second column is the json search parameter.

Note that no value in the second column means that I could not identify a viable candidate key in the protocols.io results JSON.
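
A rough sketch of how a two-column mapping file like this could be read and applied is shown below. Several things here are assumptions: the columns are taken to be tab-separated, the helper names are made up, and plain dotted paths (for example `creator.name`) stand in for the jsonpath expressions used in the actual file, which would need a jsonpath library to evaluate.

```python
def parse_mapping(lines):
    """Parse 'schema_key<TAB>search_path' lines into a dict.

    A missing or empty second column is stored as None, meaning no
    candidate key was identified in the source JSON.
    """
    mapping = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        schema_key, _, path = line.partition("\t")
        mapping[schema_key.strip()] = path.strip() or None
    return mapping


def lookup(record, path):
    """Follow a dotted path through nested dicts; None if any step is missing."""
    value = record
    for part in path.split("."):
        if not isinstance(value, dict) or part not in value:
            return None
        value = value[part]
    return value


def extract(record, mapping):
    """Return schema_key -> value for every mapping entry that resolves."""
    extracted = {}
    for schema_key, path in mapping.items():
        if path is None:
            continue  # no candidate key in the source for this term
        value = lookup(record, path)
        if value is not None:
            extracted[schema_key] = value
    return extracted
```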

I think that file (mapping/protocols_schema_mapping.txt) contains the info you seek. (Please advise if I've misunderstood.)

Thus, moving forward- one would just need to update that particular mapping file (mapping/protocols_schema_mapping.txt) to affect the behavior of the data extractor program for the protocols.io JSON results.

Then, for each new/different JSON results type (e.g.: Nature Protocols), one would just need to define a new mapping file (e.g.: mapping/nature_protocols_schema_mapping.txt - not yet defined).

I can demonstrate this once I figure out how to retrieve the Nature Protocols JSON. Will try to resume in a couple of hours.

sundaram-covid19-biohack commented 4 years ago

Also, the log file will indicate if the software could not find a key and/or value.

andrewsu commented 4 years ago

Thank you @sundaram-covid19-biohack, looking good! Two requested changes please...

First, can you reformat the output slightly? Currently you have this (abbreviated):

[
  {
    "url": "sars-cov-2-genome-sequencing-using-long-pooled-amp-befyjbpw",
    "name": "SARS-CoV-2 Genome Sequencing Using Long Pooled Amplicons on Illumina Platforms",
    "identifier": "dx.doi.org/10.17504/protocols.io.befyjbpw"
  },
  {
    "url": "sars-cov-2-virus-plaque-assays-biosafety-level-3-bdtni6me",
    "name": "SARS-CoV-2 virus plaque assays [Biosafety Level 3]",
    "identifier": "dx.doi.org/10.17504/protocols.io.bdtni6me"
  }
]

which should be updated to this:

{
  "dx.doi.org/10.17504/protocols.io.befyjbpw": {
    "url": "sars-cov-2-genome-sequencing-using-long-pooled-amp-befyjbpw",
    "name": "SARS-CoV-2 Genome Sequencing Using Long Pooled Amplicons on Illumina Platforms"
  },
  "dx.doi.org/10.17504/protocols.io.bdtni6me": {
    "url": "sars-cov-2-virus-plaque-assays-biosafety-level-3-bdtni6me",
    "name": "SARS-CoV-2 virus plaque assays [Biosafety Level 3]"
  }
}
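
The reshaping requested above, from a list of records to a dict keyed by DOI, could be done along these lines. This is a sketch under the assumption that every record carries an `identifier` field; `key_by_identifier` is a hypothetical helper name.

```python
def key_by_identifier(records, id_key="identifier"):
    """Turn a list of records into a dict keyed by each record's identifier."""
    keyed = {}
    for record in records:
        record = dict(record)  # copy so the caller's data is left untouched
        identifier = record.pop(id_key)
        if identifier in keyed:
            raise ValueError(f"duplicate identifier: {identifier}")
        keyed[identifier] = record
    return keyed


records = [
    {
        "url": "sars-cov-2-virus-plaque-assays-biosafety-level-3-bdtni6me",
        "name": "SARS-CoV-2 virus plaque assays [Biosafety Level 3]",
        "identifier": "dx.doi.org/10.17504/protocols.io.bdtni6me",
    }
]
print(key_by_identifier(records))
```

Raising on a duplicate identifier is a design choice of this sketch; silently overwriting or merging would also be possible, but would hide the case where two search terms return the same protocol with different content.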

Second, to prevent overloading the protocols.io server, can you add in a sleep delay in between API calls? Default to 5 seconds please.
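
A delay between API calls could be added roughly as follows. The URL pattern is the one from the example search earlier in this thread; everything else (`build_url`, `fetch_url`, `dump_terms`, and passing the fetch and sleep functions as parameters) is an assumption of this sketch, and any authentication the API may require is not covered.

```python
import time
import urllib.parse
import urllib.request

API_URL = "https://www.protocols.io/api/v3/protocols"


def build_url(term):
    """Build the search URL, quoting the values as in the example above."""
    query = urllib.parse.urlencode({"filter": '"public"', "key": f'"{term}"'})
    return f"{API_URL}?{query}"


def fetch_url(url):
    """Download the raw response body for one URL."""
    with urllib.request.urlopen(url) as response:
        return response.read()


def dump_terms(terms, delay_seconds=5.0, fetch=fetch_url, sleep=time.sleep):
    """Fetch each search term in turn, pausing between consecutive requests."""
    results = {}
    for index, term in enumerate(terms):
        if index > 0:
            sleep(delay_seconds)  # no pause is needed before the first call
        results[term] = fetch(build_url(term))
    return results
```

Taking `fetch` and `sleep` as parameters keeps the default behavior (real requests, 5-second pause) while letting the loop be exercised without touching the protocols.io server.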

We may also ask you to change one thing with how the program is called, but just checking with the team on that...

sundaram-covid19-biohack commented 4 years ago

> Thank you @sundaram-covid19-biohack, looking good! Two requested changes please...
>
> First, can you reformat the output slightly? Currently you have this (abbreviated):

@andrewsu - oversight on my part. Will fix that next.

> Second, to prevent overloading the protocols.io server, can you add in a sleep delay in between API calls? Default to 5 seconds please.

I will make it configurable with a default of 5 seconds. Please note that the protocols_to_bioschemas.py executable does not make the request to the resource servers. It is the retrieve*.py executables that send the requests. I'll update them to use the (configurable) 5-second rule.

> We may also ask you to change one thing with how the program is called, but just checking with the team on that...

Sure, do let me know.

I will set it up so that folks can just execute a shell script e.g.:

./protocols_to_bioschemas.sh

or

bash protocols_to_bioschemas.sh

or something along those lines.

sundaram-covid19-biohack commented 4 years ago

Here is the shell script.

You can execute it like this:

./protocols_to_bioschemas.sh

sundaram-covid19-biohack commented 4 years ago

@andrewsu - the output looks like this now:

{
  "dx.doi.org/10.17504/protocols.io.befyjbpw": {
    "url": "sars-cov-2-genome-sequencing-using-long-pooled-amp-befyjbpw",
    "name": "SARS-CoV-2 Genome Sequencing Using Long Pooled Amplicons on Illumina Platforms",
    "status": 2,
    "author.name": [
      "John-Sebastian Eden",
      "Eby Sim"
    ],
    "author.affiliation": [
      "Westmead Institute for Medical Research; University of Sydney",
      "University of Sydney; Centre for Infectious Diseases and Microbiology - Public Health; NSW Health Pathology - ICPMR"
    ]
  },
  "dx.doi.org/10.17504/protocols.io.bdtni6me": {
    "url": "sars-cov-2-virus-plaque-assays-biosafety-level-3-bdtni6me",
    "name": "SARS-CoV-2 virus plaque assays [Biosafety Level 3]",
    "status": 2,
    "author.name": [
      "tenOever Lab"
    ],
    "author.affiliation": [
      "Icahn School of Medicine at Mount Sinai"
    ]
  }
}

andrewsu commented 4 years ago

@sundaram-covid19-biohack great, thank you for the quick updates. To better integrate this with our existing infrastructure for scheduling and automation please also make these changes:

  1. put the two scripts into a submodule (a folder with __init__.py) with a name that reflects the data source (in this case, protocols.io)
  2. rename protocolsio_json_retriever.py as dumper.py
  3. rename extract_terms_from_protocols_io_json.py as parser.py
  4. update the parser script to include an entry function (load_data(input_file_or_input_dir)) that returns JSON objects as a generator (better) or a list
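
The entry function asked for in item 4 might be shaped like the sketch below. It assumes the input files hold DOI-keyed dicts like the output shown earlier in this thread, and it copies the DOI into an `identifier` field on each yielded record; both points are assumptions, not something the thread specifies.

```python
import json
import os


def load_data(input_file_or_input_dir):
    """Yield one JSON object per protocol record.

    Accepts either a single JSON file or a directory of .json files.
    """
    if os.path.isdir(input_file_or_input_dir):
        paths = sorted(
            os.path.join(input_file_or_input_dir, name)
            for name in os.listdir(input_file_or_input_dir)
            if name.endswith(".json")
        )
    else:
        paths = [input_file_or_input_dir]

    for path in paths:
        with open(path, encoding="utf-8") as handle:
            records = json.load(handle)
        for doi, record in records.items():
            # Carrying the DOI in "identifier" is an assumption of this sketch.
            yield {"identifier": doi, **record}
```

Because the function contains `yield`, calling it returns a generator, so records are read one file at a time rather than all being loaded up front.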

Again, thank you for all your efforts on this!

andrewsu commented 4 years ago

oh, and regarding the author notation, yes, ideally the author info would be represented like this:

{
  "dx.doi.org/10.17504/protocols.io.befyjbpw": {
    "url": "sars-cov-2-genome-sequencing-using-long-pooled-amp-befyjbpw",
    "name": "SARS-CoV-2 Genome Sequencing Using Long Pooled Amplicons on Illumina Platforms",
    "status": 2,
    "author": [
      {
        "name": "John-Sebastian Eden",
        "affiliation": "Westmead Institute for Medical Research; University of Sydney"
      },
      {
        "name": "Eby Sim",
        "affiliation": "University of Sydney; Centre for Infectious Diseases and Microbiology - Public Health; NSW Health Pathology - ICPMR"
      }
    ]
  },
  "dx.doi.org/10.17504/protocols.io.bdtni6me": {
    "url": "sars-cov-2-virus-plaque-assays-biosafety-level-3-bdtni6me",
    "name": "SARS-CoV-2 virus plaque assays [Biosafety Level 3]",
    "status": 2,
    "author": [
      {
        "name": "tenOever Lab",
        "affiliation": "Icahn School of Medicine at Mount Sinai"
      }
    ]
  }
}
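
Converting the earlier parallel `author.name` / `author.affiliation` lists into the nested form shown above could look like this. It is a sketch that assumes the two lists are aligned by position; `nest_authors` is a hypothetical helper name.

```python
def nest_authors(record):
    """Replace 'author.name' / 'author.affiliation' lists with an 'author' list."""
    record = dict(record)  # copy so the caller's data is left untouched
    names = record.pop("author.name", [])
    affiliations = record.pop("author.affiliation", [])
    if len(names) != len(affiliations):
        raise ValueError("author names and affiliations differ in length")
    record["author"] = [
        {"name": name, "affiliation": affiliation}
        for name, affiliation in zip(names, affiliations)
    ]
    return record


flat = {
    "name": "SARS-CoV-2 virus plaque assays [Biosafety Level 3]",
    "status": 2,
    "author.name": ["tenOever Lab"],
    "author.affiliation": ["Icahn School of Medicine at Mount Sinai"],
}
print(nest_authors(flat))
```
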

sundaram-covid19-biohack commented 4 years ago

@andrewsu I plan to resume this effort this weekend.
My planned next steps:

andrewsu commented 4 years ago

Hi @sundaram-covid19-biohack just wanted to check if you'd made any progress on the protocol crawler? No problem if your time constraints mean you can't contribute at the moment -- we can try to find someone to pick up where you left off. Just let us know please. Thanks!

sundaram-covid19-biohack commented 4 years ago

Looking into how to implement support for retrieving protocol data from Nature Protocols.

sundaram-covid19-biohack commented 4 years ago

FYI -

  1. have established an API key
  2. have reached out to Springer to get confirmation on how to properly constrain based on journal and journalid query parameters

andrewsu commented 4 years ago

Jay, have you looked at items 3 and 4 on the list above? Getting protocols.io wrapped up will allow us to move forward with later steps. We can add in other protocol providers later... Thanks!

flaneuse commented 4 years ago

Update: protocols.io complete; Nature Protocols hasn't been incorporated yet