monarch-initiative / vertebrate-breed-ontology

https://monarch-initiative.github.io/vertebrate-breed-ontology/

DADIS synchronization workflow #121

Closed marius-mather closed 5 months ago

marius-mather commented 8 months ago

Work in progress: I'm starting on this now, but I have some questions/issues to resolve before I have anything solid.

@sabrinatoro The main question I need to resolve is what the input and output data needs to look like. I understand that the spreadsheet hosted on Google Sheets is the source of truth for the ontology. However, for a few reasons, it would be much easier to just read a flat spreadsheet like https://github.com/monarch-initiative/vertebrate-breed-ontology/blob/master/src/ontology/components/dadistransbound.tsv, and output another flat spreadsheet, in Excel or TSV format.

In particular, it's not easy to update the existing Google Sheets document via a Python script while preserving formatting, etc.

Is it viable to write update/synchronization scripts that read the TSV files and output a modified TSV file? If not, can you point to examples in other ontologies for updating Google Sheets data that I can work off?
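For reference, a minimal sketch of the flat TSV-in/TSV-out pattern described above, using only the standard library. Column names such as `vbo_id` and `dadis_transboundary_id` are hypothetical placeholders here, not necessarily the actual template headers:

```python
import csv

def update_tsv(input_path: str, output_path: str) -> None:
    """Read a flat TSV, add/fill a column, and write a modified TSV.

    A sketch only: the real script would fill the new column by
    matching each row against DADIS rather than leaving it blank.
    """
    with open(input_path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f, delimiter="\t")
        rows = list(reader)
        fieldnames = list(reader.fieldnames)

    # Append the new column if the input doesn't already have it.
    if "dadis_transboundary_id" not in fieldnames:
        fieldnames.append("dadis_transboundary_id")

    for row in rows:
        # Placeholder for the actual DADIS matching logic.
        row.setdefault("dadis_transboundary_id", "")

    with open(output_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames, delimiter="\t")
        writer.writeheader()
        writer.writerows(rows)
```

Because this round-trips plain TSV, it sidesteps the Google Sheets formatting problem entirely.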

sabrinatoro commented 8 months ago

Thank you @marius-mather! @matentzn, could you please share your thoughts on this subject? Thank you!

matentzn commented 8 months ago

Yes, I agree that we should operate solely on TSV files for the updating process. @anitacaron is working on a permanent solution to the table problem, but it will take till the end of November to get a working prototype. In the meantime, I can teach you @sabrinatoro to use cogs.

I think one thing we will have to let go of is the (admittedly convenient) colouring scheme of our tables. It is very hard to maintain colouring in a table that is half maintained manually and half programmatically.

@sabrinatoro is it feasible to have two separate templates: one for the DADIS sync and one for manual curation?

sabrinatoro commented 8 months ago

@matentzn

> two separate templates: one for the DADIS sync and one for manual curation?

I think we could make it happen. But I would like to discuss this in more detail, especially the connection between the DADIS spreadsheet and the ontology source of truth (the DADIS data that was manually curated).

marius-mather commented 8 months ago

I now have a working Python script that can take the dadistransbound.tsv file as input, match it against DADIS, and output the result to a new TSV file. I've tried to write it in a way that allows it to be rerun in the future, but I still need to look into the details of how to do that in an ontology workflow (including installing Python dependencies and storing secrets like the DADIS API key).

To run the script, you can do:

python find_dadis_transboundary_ids.py \
    --input_filename data/dadistransbound.tsv \
    --output_filename example_transboundary_ids.tsv \
    --dadis_api_key $DADIS_API_KEY

Example output from the script is here: example_transboundary_ids.xlsx. NOTE: this can't be copy-pasted directly into the Google Sheets version: I encountered multiple rows for VBO:0000991 (Icelandic Pony) in the input spreadsheet and had to drop the duplicate to process the file, so the order of entries will probably be slightly out of sync with the input.
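A pre-flight duplicate check along these lines could flag rows like the VBO:0000991 pair before any matching is attempted (a sketch only; the `vbo_id` column name is an assumption about the template header):

```python
import csv
from collections import Counter

def find_duplicate_ids(tsv_path: str, id_column: str = "vbo_id") -> list[str]:
    """Return the IDs that appear on more than one row of the TSV."""
    with open(tsv_path, newline="", encoding="utf-8") as f:
        counts = Counter(
            row[id_column] for row in csv.DictReader(f, delimiter="\t")
        )
    return [vbo_id for vbo_id, n in counts.items() if n > 1]
```

Running this before the sync would let a curator resolve duplicates up front instead of the script silently dropping one.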

matentzn commented 8 months ago

@marius-mather awesome!

> including installing Python dependencies and storing secrets like the DADIS API key

Two simple solutions are:

  1. We use GitHub Actions secrets and implement the DADIS pipeline as a GitHub Action. I can do that when we are ready.
  2. We just say in the editors' SOP that the DADIS API key must be provided in the run.sh.env file (ODK's environment file) for this to work.
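For either option, the script side only needs the key to be present in the environment. A small sketch of reading it defensively, using the same DADIS_API_KEY variable name as the command-line examples in this thread:

```python
import os
import sys

def get_dadis_api_key() -> str:
    """Read the DADIS API key from the environment, failing early
    with a clear message if it is missing."""
    key = os.environ.get("DADIS_API_KEY")
    if not key:
        sys.exit("DADIS_API_KEY is not set; add it to run.sh.env "
                 "(ODK) or configure it as a GitHub Actions secret.")
    return key
```

This keeps the key out of the repository and works identically whether it comes from run.sh.env or from a CI secret.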

> I encountered multiple rows for VBO:0000991 (Icelandic Pony) in the input spreadsheet

Is this the only duplicate row? If so, @sabrinatoro can you just drop one of them manually to be sure we are dropping the right thing?

franknic commented 8 months ago

Greetings, all. I have read through everything above and am very impressed with, and very grateful for, the progress that has been made. I am very happy to do the manual checking of “duplicates”, if that would be helpful. Best wishes, Frank

marius-mather commented 5 months ago

@matentzn I've updated the transboundary matching script after discussing it with Sabrina. It should now add the dadis_transboundary_id for all entries where this is possible, and output a modified TSV with the dadis_transboundary_id column added (if this column needs a label in the second row, that should be easy to add).

It should be ready to review.
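Regarding the optional second-row label: in a ROBOT-template-style TSV, row 2 holds the template strings, so a new column needs an entry there as well. A rough sketch assuming that layout (the column and label strings passed in would be whatever the VBO template actually uses; the example below is hypothetical):

```python
import csv

def add_second_row_label(tsv_path: str, out_path: str,
                         column: str, label: str) -> None:
    """Set the row-2 (template) cell for the named column in a TSV
    whose second row carries template/label strings."""
    with open(tsv_path, newline="", encoding="utf-8") as f:
        rows = list(csv.reader(f, delimiter="\t"))
    idx = rows[0].index(column)  # locate the column in the header row
    # Pad the template row out to the new column if it is shorter.
    while len(rows[1]) <= idx:
        rows[1].append("")
    rows[1][idx] = label
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        csv.writer(f, delimiter="\t").writerows(rows)
```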

In terms of automating this workflow, the questions I have would be:

marius-mather commented 5 months ago

I've added a script that does the same matching process for the VBO breed-country entries - this one is actually easier to do, for whatever reason. The script can be run with:

python find_dadis_local_ids.py \
    --input_filename data/dadisbreedcountry.tsv \
    --output_filename example_local_ids.tsv \
    --dadis_api_key $DADIS_API_KEY

I've attached the example output as an Excel file; the script currently finds a DADIS match for 14,537 of the 15,143 VBO breed-country entries in dadisbreedcountry.tsv (about 96%).

example_local_ids.xlsx
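A quick way to recompute a coverage figure like 14,537/15,143 straight from the output TSV might look like this (a sketch; the `dadis_local_id` column name is an assumption about the output format):

```python
import csv

def match_rate(tsv_path: str, id_column: str = "dadis_local_id") -> tuple[int, int]:
    """Return (matched, total) row counts, where a row counts as
    matched if its DADIS ID cell is non-empty."""
    with open(tsv_path, newline="", encoding="utf-8") as f:
        rows = list(csv.DictReader(f, delimiter="\t"))
    matched = sum(1 for row in rows if row.get(id_column, "").strip())
    return matched, len(rows)
```

Tracking this number across reruns would make it easy to spot regressions in the matching logic.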