Closed: marius-mather closed this issue 5 months ago.
Thank you! @marius-mather, @matentzn, could you please share your thoughts on this subject?
Yes, I agree that we should operate solely on TSV files for the updating process. @anitacaron is working on a permanent solution to the table problem, but it will take until the end of November to get a working prototype. In the meantime, I can teach you, @sabrinatoro, how to use cogs.
I think one thing we will have to let go of is the (admittedly convenient) colouring schemes of our tables. That is very hard to maintain in a table that is maintained half manually and half programmatically.
@sabrinatoro is it feasible to have two separate templates: one for the DADIS sync and one for manual curation?
@matentzn

> two separate templates: one for the DADIS sync and one for manual curation?

I think we could make it happen. But I would like to discuss this in more detail, especially the connection between the DADIS spreadsheet and the ontology's source of truth (the manually curated DADIS data).
I now have a working Python script that can take the dadistransbound.tsv file as input, match it to DADIS, and output it to a new TSV file. I've tried to write it in a way where it can be rerun in future, but still need to look into the details of how to do that in an ontology workflow (including installing Python dependencies and storing secrets like the DADIS API key).
To run the script, you can do:

```shell
python find_dadis_transboundary_ids.py \
  --input_filename data/dadistransbound.tsv \
  --output_filename example_transboundary_ids.tsv \
  --dadis_api_key $DADIS_API_KEY
```
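For context, the core matching step could be sketched roughly as below. This is a simplified illustration, not the actual script: the real version queries the DADIS API for its records, and the column names used here (`vbo_id`, `breed_name`) are assumptions about the TSV layout.

```python
import csv
import io


def match_transboundary_ids(tsv_text, dadis_records):
    """Add a dadis_transboundary_id column to a TSV by matching breed
    names against DADIS records. Here dadis_records is a plain
    name -> id dict; the real script fetches these from the DADIS API."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    rows = list(reader)
    # Index DADIS records by lower-cased name for case-insensitive matching
    by_name = {name.lower(): dadis_id for name, dadis_id in dadis_records.items()}
    for row in rows:
        row["dadis_transboundary_id"] = by_name.get(row["breed_name"].lower(), "")
    # Write the result back out as TSV with the new column appended
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames + ["dadis_transboundary_id"],
                            delimiter="\t")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()
```

Rows with no DADIS match get an empty ID rather than being dropped, so the output stays row-for-row aligned with the input.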
Example output from the script is here: example_transboundary_ids.xlsx. NOTE: this can't be copy-pasted into the Google Sheets version. I encountered multiple rows for VBO:0000991 (Icelandic Pony) in the input spreadsheet and had to drop the duplicate to process the file, so the order of entries will probably be slightly out of sync with the input.
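For finding duplicates like the one mentioned above, something along these lines could flag every ID that occurs on more than one row (the `vbo_id` column name is an assumption about the TSV layout):

```python
import csv
import io
from collections import Counter


def find_duplicate_ids(tsv_text, id_column="vbo_id"):
    """Return the sorted list of IDs appearing on more than one row."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    counts = Counter(row[id_column] for row in reader)
    return sorted(i for i, n in counts.items() if n > 1)
```

Running this over the input first would answer whether VBO:0000991 is the only duplicate or one of several.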
@marius-mather awesome!
> including installing Python dependencies and storing secrets like the DADIS API key
Two simple solutions are:
> I encountered multiple rows for VBO:0000991 (Icelandic Pony) in the input spreadsheet
Is this the only duplicate row? If so, @sabrinatoro can you just drop one of them manually to be sure we are dropping the right thing?
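On the secrets point: one common pattern (an assumption about what the final workflow will use, consistent with the `$DADIS_API_KEY` usage above) is to read the key from an environment variable and fail fast when it is missing, so the key never needs to be committed to the repository:

```python
import os
import sys


def get_dadis_api_key():
    """Read the DADIS API key from the environment.

    Exits with an error message if the variable is unset, so a
    misconfigured CI job fails immediately rather than mid-run."""
    key = os.environ.get("DADIS_API_KEY")
    if not key:
        sys.exit("DADIS_API_KEY is not set; export it before running the script.")
    return key
```

In GitHub Actions, the value would typically come from a repository secret exposed as an environment variable.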
Greetings, all,

I have read through everything above and am very impressed with, and very grateful for, the progress that has been made. I am very happy to do the manual checking of “duplicates”, if that would be helpful.

Best wishes,
Frank
@matentzn I've updated the transboundary matching script after discussing it with Sabrina - it should now be adding the `dadis_transboundary_id` for all entries where this is possible, and outputting a modified TSV with the `dadis_transboundary_id` column added (if this column needs a label in the second row, that should be easy to add). It should be ready to review.
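If the second-row label does turn out to be needed, appending a column while keeping that row intact might look like the sketch below. It assumes the TSV follows the ROBOT-template convention where row 1 holds column names and row 2 holds template strings or labels; the column and label values shown are placeholders.

```python
import csv
import io


def add_column_with_label(tsv_text, column, label, default=""):
    """Append a column to a ROBOT-template-style TSV whose second row
    holds template labels rather than data (an assumption about the
    file layout)."""
    rows = list(csv.reader(io.StringIO(tsv_text), delimiter="\t"))
    rows[0].append(column)    # header row gets the new column name
    rows[1].append(label)     # template/label row gets its label
    for row in rows[2:]:      # data rows get a default value
        row.append(default)
    out = io.StringIO()
    csv.writer(out, delimiter="\t").writerows(rows)
    return out.getvalue()
```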
In terms of automating this workflow, the questions I have would be mainly around the Python dependencies: the script needs `pandas>=2.0`, `pydantic>=2.0`, and `requests`, but I can write a specific `requirements.txt` or `environment.yml` file if needed.

I've added a script that does the same matching process for the VBO breed-country entries - this one is actually easier to do, for whatever reason. The script can be run with:
```shell
python find_dadis_local_ids.py \
  --input_filename data/dadisbreedcountry.tsv \
  --output_filename example_local_ids.tsv \
  --dadis_api_key $DADIS_API_KEY
```
I've attached the example output as an Excel file; it currently finds a DADIS match for 14,537/15,143 of the VBO breed-country entries in dadisbreedcountry.tsv.
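As a sanity check on numbers like 14,537/15,143, the coverage could be recomputed directly from the output TSV along these lines (the `dadis_local_id` column name is an assumption based on the script name):

```python
import csv
import io


def match_coverage(tsv_text, column="dadis_local_id"):
    """Return (matched, total) row counts, where a row counts as
    matched if its ID column is non-empty."""
    rows = list(csv.DictReader(io.StringIO(tsv_text), delimiter="\t"))
    matched = sum(1 for row in rows if row.get(column, "").strip())
    return matched, len(rows)
```

Re-running this after each sync would make it easy to see whether the match rate improves or regresses over time.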
Work in progress: I'm starting on this now, but have a few questions/issues to resolve before I have anything solid.
@sabrinatoro The main question I need to resolve is what the input and output data needs to look like. I understand that the spreadsheet hosted on Google Sheets is the source of truth for the ontology. However, for a few reasons, it would be much easier to just read a flat spreadsheet like https://github.com/monarch-initiative/vertebrate-breed-ontology/blob/master/src/ontology/components/dadistransbound.tsv, and output another flat spreadsheet, in Excel or TSV format.
In particular, it's not easy to update the existing Google Sheets document via a Python script while preserving formatting etc.
Is it viable to write update/synchronization scripts that read the TSV files and output a modified TSV file? If not, can you point me to examples in other ontologies of updating Google Sheets data that I can work from?