Open flaneuse opened 4 years ago
In addition to any protocol records in the gdoc linked above, will also want to automate searches in protocols.io (API: https://apidoc.protocols.io/) and Nature Protocols (API: https://dev.springernature.com/)
For example, a protocols.io search might look like this: https://www.protocols.io/api/v3/protocols?filter=%22public%22&key=%22covid-19%22
We would want to parse that into a JSON file that is a list of dicts, where the top-level keys are the DOIs
and the remaining structure corresponds to the data schema defined in #3...
@andrewsu ,
Thank you for the follow-up.
Happy to work on a data parser for this and other resources.
I will review the info cited above (in this issue) and follow-up with questions.
Hi @andrewsu
If I understand correctly, you're seeking a software solution for:
Please advise whether there exist (requirements) specification documents that outline:
This will help inform the software implementation. Please let me know if I've misunderstood. Happy to connect on Slack to discuss in person if easier.
Thank you, Jay
@sundaram-covid19-biohack that sounds exactly correct. We have a prototype of a protocols schema written, but we have yet to do the mapping from protocols.io/Nature Protocols to this schema.
I would suggest:
Let us know if you have any questions. Will defer to @andrewsu on connecting on Slack.
@flaneuse Thank you for the clarification.
I've implemented the first program for retrieving the protocols.io JSON.
Am planning to fork this code-base and then commit the first Python program (for retrieving the protocols.io search results in JSON format) along with a control file (covid_terms.txt).
Please advise if you have a preferred method for sharing the code.
FYI - pseudocode:
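import urllib.request
from urllib.parse import quote

for term in covid_terms:  # search terms read from covid_terms.txt
    url = 'https://www.protocols.io/api/v3/protocols?filter=%22public%22&key=%22' + quote(term) + '%22'
    outfile = outdir + term.replace(' ', '')
    urllib.request.urlretrieve(url, outfile)  # download, as wget would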
@flaneuse Please provide a similar URL for the Nature Protocols.
@flaneuse I've attached a reformatted (added newlines and indentation) protocols.io JSON for "covid19".
Please review and advise which fields/values should be extracted and written to the bioschemas JSON-LD file.
Thank you Jay
@sundaram-covid19-biohack that sounds great. For now, feel free to throw your code into a separate folder; we haven't thought about organization yet but will organize in the future. In general, pull requests from your fork should work nicely for review.
For Nature Protocols, you'll need to use the Springer Nature Metadata API. I'm in the middle of something atm so I don't have a URL off the top of my head, but you'll need to both specify the query and limit it just to protocols (the API includes journal articles, book chapters, protocols but we only want protocols).
We'll take a look at the mapping in a bit. Thanks!
@sundaram-covid19-biohack Thank you for taking a crack at this! As @flaneuse said, pull requests would be convenient to review.
The fields in the protocols schema prototype have comments that may help in deciding the mapping between protocols.io and the schema. Once you have a mapping that you think works, please submit a pull request and we can review it.
@sundaram-covid19-biohack is the .json-ld a random subset of 2 results, or all results? Seems like there should be more than 2 protocols.
Will defer to @andrewsu on connecting on Slack.
Let's try to stick to Github issues as much as possible in the interest of persistence and transparency. If real-time communication becomes important to work through particularly sticky issues, we can use slack or web conferencing...
Please advise whether there exist (requirements) specification documents that outline:
1) the mapping from protocols.io JSON to Bioschemas JSON-LD
2) the mapping from Nature Protocols JSON to Bioschemas JSON-LD
And just to reiterate what @gkarthik and @flaneuse wrote above, it would be great if you could take an initial crack at this mapping! Thanks @sundaram-covid19-biohack!
@gkarthik : for the record- I've derived the following 40 terms (and corresponding comments) from the protocols schema prototype and will use those to inform my derivation of a mapping from the protocols.io search results JSON content.
@gkarthik - my first attempt at deriving a mapping is here.
While I am happy to take the first pass at them, would like to request that an SME derive the mappings.
I can see that there is some author metadata available in the protocols.io JSON.
I'll work on that next.
Per the protocols schema prototype, it looks like there is only support for one author (see line 52).
Should the software preferentially select the first author?
The script that uses the mapping file to actually extract the data from the protocols.io JSON has already been implemented.
@flaneuse : Sorry I missed your previous comment.
Not sure why there are only 2 results.
That is what the software retrieved.
I can confirm that when you navigate to the same URL in the browser- only 2 results are available.
Thanks @sundaram-covid19-biohack -- I wasn't sure if those were the results from the first query terms or all the query terms.
Per the protocols schema prototype, it looks like there is only support for one author (see line 52). Should the software preferentially select the first author?
@sundaram-covid19-biohack The owl:cardinality: many attribute off of author indicates that author should be an array of Authors, so the mapping file should map the list of authors scraped from protocols.io into an author array.
The script that uses the mapping file to actually extract the data from the protocols.io JSON has already been implemented.
@sundaram-covid19-biohack can you post the output of that script for us to look at?
@flaneuse Thank you for the pointer. Will adjust accordingly.
@andrewsu Planning to resume the effort a couple of hours from now.
Will post the requested output at that time.
Will also provide a readme that outlines how the software should be installed and executed.
@andrewsu sample invocation:
python extract_terms_from_protocols_io_json.py --protocols_json_file COVID19.reformatted.json \
--protocols_schema_mapping_file mapping/protocols_schema_mapping.txt \
--verbose
The following two required command-line parameters reference files committed to the repo: --protocols_json_file --protocols_schema_mapping_file
If --outfile is not specified, a default will be assigned.
Output file attached (I keep gzipping these files because Github does not allow .json files to be attached.)
Note that this does not yet include support for extracting the author data.
@andrewsu - Please see the corrected output (attached). @flaneuse - This includes the authors list. Will work on the author affiliation next. COVID19_extracted_data_v2.json.gz
Thanks @sundaram-covid19-biohack. Do you have a list of all the Object keys found in all the results you retrieved? (sorry if I missed this) If so, we can help with the mapping table you're generating. It'd be good to do the full join of all the properties available from protocols.io with the properties in our schema, to be able to see what protocols.io has that we don't, and what we would like that they don't have. Thanks!
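One way to approach the full join described here, as a rough sketch only: collect every key path seen in the retrieved results, then compare that set against a schema-to-source mapping. The function names, the dotted-path notation, and the sample field names (title, doi, authors, stats) are hypothetical and are not taken from the protocols.io API or from the committed code.

```python
def collect_keys(obj, prefix=""):
    """Return the set of dotted key paths found anywhere in a JSON-like object."""
    keys = set()
    if isinstance(obj, dict):
        for key, value in obj.items():
            path = f"{prefix}.{key}" if prefix else key
            keys.add(path)
            keys |= collect_keys(value, path)
    elif isinstance(obj, list):
        # Lists do not add a path segment; their items share the parent path.
        for item in obj:
            keys |= collect_keys(item, prefix)
    return keys


def full_join(source_keys, mapping):
    """Split keys into mapped pairs, schema-only keys, and source-only keys.

    mapping: schema key -> source path, or None when no candidate was found.
    """
    mapped_paths = {path for path in mapping.values() if path}
    return {
        "mapped": sorted((key, path) for key, path in mapping.items() if path),
        "schema_only": sorted(key for key, path in mapping.items() if not path),
        "source_only": sorted(source_keys - mapped_paths),
    }


# Hypothetical sample data, for illustration only.
results = [
    {"title": "X", "doi": "d", "authors": [{"name": "A"}]},
    {"title": "Y", "stats": {"views": 3}},
]
mapping = {"name": "title", "identifier": "doi",
           "author.name": "authors.name", "license": None}
report = full_join(collect_keys(results), mapping)
```

The schema_only entries are the properties the schema wants but the source lacks, and source_only lists what the source offers that the schema does not yet use.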
@flaneuse - Please see attached output.
Contains the author.affiliation list.
Please let me know if I interpreted the protocols bioschema specification properly.
Suspecting that the expectation is for the authors info to be a list of dictionaries with keys 'name' and 'affiliation', as opposed to the result the software is currently producing.
COVID19_extracted_data_v3.json.gz
Here is the list of Object keys (and corresponding rdfs:comment values) that I derived from the protocols schema prototype.
I used that list to guide my manual inspection of the protocols.io results JSON in order to identify candidate keys.
For the bioschema object keys that did have a candidate in the protocols.io output JSON, I provided a corresponding jsonpath search parameter in this mapping file.
The first column is the bioschema key/term and the second column is the json search parameter.
Note that no value in the second column means that I could not identify a viable candidate key in the protocols.io results JSON.
I think that file (mapping/protocols_schema_mapping.txt) contains the info you seek. (Please advise if I've misunderstood.)
Thus, moving forward- one would just need to update that particular mapping file (mapping/protocols_schema_mapping.txt) to affect the behavior of the data extractor program for the protocols.io JSON results.
Then, for each new/different JSON results type (e.g.: Nature Protocols), one would just need to define a new mapping file (e.g.: mapping/nature_protocols_schema_mapping.txt - not yet defined).
I can demonstrate this once I figure out how to retrieve the Nature Protocols JSON. Will try to resume in a couple of hours.
Also, the log file will indicate if the software could not find a key and/or value.
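To illustrate the mapping-file approach described above, here is a minimal sketch, not the committed implementation. It assumes the two columns are whitespace-separated and substitutes a simple dotted-path lookup for a real jsonpath library. The function names and the sample field names (title, doi, authors) are hypothetical.

```python
def load_mapping(text):
    """Parse mapping text: schema key, then an optional source path per line."""
    mapping = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)
        # An empty second column means no candidate key was identified.
        mapping[parts[0]] = parts[1].strip() if len(parts) > 1 else None
    return mapping


def lookup(record, path):
    """Follow a dotted path through dicts; lists fan out over their items."""
    values = [record]
    for step in path.split("."):
        next_values = []
        for value in values:
            if isinstance(value, list):
                next_values.extend(
                    item[step] for item in value
                    if isinstance(item, dict) and step in item
                )
            elif isinstance(value, dict) and step in value:
                next_values.append(value[step])
        values = next_values
    return values


def extract(record, mapping):
    """Return (extracted fields, schema keys with no value) for one record."""
    extracted, missing = {}, []
    for key, path in mapping.items():
        found = lookup(record, path) if path else []
        if not found:
            missing.append(key)  # a real run would write this to the log file
            continue
        extracted[key] = found[0] if len(found) == 1 else found
    return extracted, missing


# Hypothetical mapping and record, for illustration only.
mapping = load_mapping("name title\nidentifier doi\nauthor.name authors.name\nlicense\n")
record = {"title": "X", "doi": "d", "authors": [{"name": "A"}, {"name": "B"}]}
extracted, missing = extract(record, mapping)
```

Under this design, supporting another source such as Nature Protocols would only require a second mapping text, with the extraction code left unchanged.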
Thank you @sundaram-covid19-biohack, looking good! Two requested changes please...
First, can you reformat the output slightly? Currently you have this (abbreviated):
[
  {
    "url": "sars-cov-2-genome-sequencing-using-long-pooled-amp-befyjbpw",
    "name": "SARS-CoV-2 Genome Sequencing Using Long Pooled Amplicons on Illumina Platforms",
    "identifier": "dx.doi.org/10.17504/protocols.io.befyjbpw"
  },
  {
    "url": "sars-cov-2-virus-plaque-assays-biosafety-level-3-bdtni6me",
    "name": "SARS-CoV-2 virus plaque assays [Biosafety Level 3]",
    "identifier": "dx.doi.org/10.17504/protocols.io.bdtni6me"
  }
]
which should be updated to this:
{
  "dx.doi.org/10.17504/protocols.io.befyjbpw": {
    "url": "sars-cov-2-genome-sequencing-using-long-pooled-amp-befyjbpw",
    "name": "SARS-CoV-2 Genome Sequencing Using Long Pooled Amplicons on Illumina Platforms"
  },
  "dx.doi.org/10.17504/protocols.io.bdtni6me": {
    "url": "sars-cov-2-virus-plaque-assays-biosafety-level-3-bdtni6me",
    "name": "SARS-CoV-2 virus plaque assays [Biosafety Level 3]"
  }
}
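The requested reshaping, from a list of records to a dict keyed by DOI, could be sketched roughly as follows. The function name is hypothetical. The "identifier" field name comes from the sample output above.

```python
def key_by_identifier(records, id_field="identifier"):
    """Turn a list of records into a dict keyed by each record's identifier."""
    keyed = {}
    for record in records:
        record = dict(record)        # copy so the caller's data is untouched
        doi = record.pop(id_field)   # the identifier becomes the top-level key
        keyed[doi] = record          # a repeated identifier overwrites the earlier entry
    return keyed


records = [
    {
        "url": "sars-cov-2-genome-sequencing-using-long-pooled-amp-befyjbpw",
        "name": "SARS-CoV-2 Genome Sequencing Using Long Pooled Amplicons on Illumina Platforms",
        "identifier": "dx.doi.org/10.17504/protocols.io.befyjbpw",
    },
    {
        "url": "sars-cov-2-virus-plaque-assays-biosafety-level-3-bdtni6me",
        "name": "SARS-CoV-2 virus plaque assays [Biosafety Level 3]",
        "identifier": "dx.doi.org/10.17504/protocols.io.bdtni6me",
    },
]
keyed = key_by_identifier(records)
```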
Second, to prevent overloading the protocols.io server, can you add in a sleep delay in between API calls? Default to 5 seconds please.
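A configurable delay between API calls might look something like the sketch below. The function and parameter names are hypothetical, and the actual HTTP request is passed in as a callable so the pacing logic stays separate from how the search is fetched.

```python
import time


def fetch_all(terms, fetch, delay_seconds=5.0, sleep=time.sleep):
    """Call fetch(term) for each term, pausing between consecutive calls.

    fetch: any callable that performs the request for one search term.
    sleep: injectable for testing; defaults to time.sleep.
    """
    results = {}
    for index, term in enumerate(terms):
        if index > 0:
            sleep(delay_seconds)  # pause before every call except the first
        results[term] = fetch(term)
    return results
```

The delay falls only between calls, so a single-term run incurs no wait.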
We may also ask you to change one thing with how the program is called, but just checking with the team on that...
Thank you @sundaram-covid19-biohack, looking good! Two requested changes please...
First, can you reformat the output slightly? Currently you have this (abbreviated):
@andrewsu - oversight on my part. will fix that next.
Second, to prevent overloading the protocols.io server, can you add in a sleep delay in between API calls? Default to 5 seconds please.
I will make it configurable with default 5 seconds. Please note that the protocols_tobioschemas.py executable does not make the request to the resource servers. It is the retrieve*.py executables that will send the requests. I'll update them to use the (configurable) 5 seconds rule
We may also ask you to change one thing with how the program is called, but just checking with the team on that...
Sure, do let me know.
I will set it up so that folks can just execute a shell script e.g.:
./protocols_to_bioschemas.sh
or
bash protocols_to_bioschemas.sh
or something along those lines.
@andrewsu - the output looks like this now:
{
  "dx.doi.org/10.17504/protocols.io.befyjbpw": {
    "url": "sars-cov-2-genome-sequencing-using-long-pooled-amp-befyjbpw",
    "name": "SARS-CoV-2 Genome Sequencing Using Long Pooled Amplicons on Illumina Platforms",
    "status": 2,
    "author.name": [
      "John-Sebastian Eden",
      "Eby Sim"
    ],
    "author.affiliation": [
      "Westmead Institute for Medical Research; University of Sydney",
      "University of Sydney; Centre for Infectious Diseases and Microbiology - Public Health; NSW Health Pathology - ICPMR"
    ]
  },
  "dx.doi.org/10.17504/protocols.io.bdtni6me": {
    "url": "sars-cov-2-virus-plaque-assays-biosafety-level-3-bdtni6me",
    "name": "SARS-CoV-2 virus plaque assays [Biosafety Level 3]",
    "status": 2,
    "author.name": [
      "tenOever Lab"
    ],
    "author.affiliation": [
      "Icahn School of Medicine at Mount Sinai"
    ]
  }
}
@sundaram-covid19-biohack great, thank you for the quick updates. To better integrate this with our existing infrastructure for scheduling and automation please also make these changes:
1) [...] (__init__.py) with a name that reflects the data source (in this case, protocols.io)
2) protocolsio_json_retriever.py as dumper.py
3) extract_terms_from_protocols_io_json.py as parser.py
4) [...] (load_data(input_file_or_input_dir)) that returns JSON objects as a generator (better) or a list
Again, thank you for all your efforts on this!
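A load_data function that returns a generator could be sketched along these lines. It assumes the input is the DOI-keyed JSON shown earlier in this thread, either one file or a directory of .json files. The "_id" field is a hypothetical name for carrying the DOI into each yielded object and is not something the thread specifies.

```python
import json
import os


def load_data(input_file_or_input_dir):
    """Yield one JSON object per protocol from a file or a directory of files."""
    if os.path.isdir(input_file_or_input_dir):
        paths = sorted(
            os.path.join(input_file_or_input_dir, name)
            for name in os.listdir(input_file_or_input_dir)
            if name.endswith(".json")
        )
    else:
        paths = [input_file_or_input_dir]

    for path in paths:
        with open(path, encoding="utf-8") as handle:
            keyed = json.load(handle)  # assumed shape: {doi: {field: value, ...}}
        for doi, record in keyed.items():
            doc = {"_id": doi}  # hypothetical field holding the DOI
            doc.update(record)
            yield doc
```

Because the function yields records one at a time, a caller can iterate over them without building the full list of documents, though each input file is still read whole.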
oh, and regarding the author notation, yes, ideally the author info would be represented like this:
{
  "dx.doi.org/10.17504/protocols.io.befyjbpw": {
    "url": "sars-cov-2-genome-sequencing-using-long-pooled-amp-befyjbpw",
    "name": "SARS-CoV-2 Genome Sequencing Using Long Pooled Amplicons on Illumina Platforms",
    "status": 2,
    "author": [
      {
        "name": "John-Sebastian Eden",
        "affiliation": "Westmead Institute for Medical Research; University of Sydney"
      },
      {
        "name": "Eby Sim",
        "affiliation": "University of Sydney; Centre for Infectious Diseases and Microbiology - Public Health; NSW Health Pathology - ICPMR"
      }
    ]
  },
  "dx.doi.org/10.17504/protocols.io.bdtni6me": {
    "url": "sars-cov-2-virus-plaque-assays-biosafety-level-3-bdtni6me",
    "name": "SARS-CoV-2 virus plaque assays [Biosafety Level 3]",
    "status": 2,
    "author": [
      {
        "name": "tenOever Lab",
        "affiliation": "Icahn School of Medicine at Mount Sinai"
      }
    ]
  }
}
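Converting the parallel author.name and author.affiliation lists into this author array could be done roughly as below. The function name is hypothetical, and the sketch assumes the two lists are aligned by position, as they appear to be in the earlier output.

```python
def merge_author_fields(record):
    """Replace parallel author.name / author.affiliation lists with an author array."""
    merged = {
        key: value for key, value in record.items()
        if key not in ("author.name", "author.affiliation")
    }
    names = record.get("author.name", [])
    affiliations = record.get("author.affiliation", [])
    authors = []
    for index, name in enumerate(names):
        author = {"name": name}
        # Assumes positional alignment; a missing affiliation is simply omitted.
        if index < len(affiliations):
            author["affiliation"] = affiliations[index]
        authors.append(author)
    merged["author"] = authors
    return merged


record = {
    "name": "SARS-CoV-2 virus plaque assays [Biosafety Level 3]",
    "status": 2,
    "author.name": ["tenOever Lab"],
    "author.affiliation": ["Icahn School of Medicine at Mount Sinai"],
}
merged = merge_author_fields(record)
```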
@andrewsu I plan to resume this effort this weekend.
My planned next steps:
address items 3 & 4 in your last comment posted in this thread
work on support for Nature Protocols
Hi @sundaram-covid19-biohack just wanted to check if you'd made any progress on the protocol crawler? No problem if your time constraints mean you can't contribute at the moment -- we can try to find someone to pick up where you left off. Just let us know please. Thanks!
Looking into how to implement support for retrieving protocol data from Nature Protocols.
FYI -
Jay, have you looked at items 3 and 4 on the list above? Getting protocols.io wrapped up will allow us to move forward with later steps. We can add in other protocol providers later... Thanks!
Update: protocols.io complete; Nature Protocols hasn't been incorporated yet
Using https://docs.google.com/spreadsheets/d/1FFyRhI5TeUb-B4t50HRZYF_XngRt_mfAIz82UQ3geFk/edit#gid=0