monarch-initiative / vertebrate-breed-ontology

https://monarch-initiative.github.io/vertebrate-breed-ontology/

Strategy for regularly updating VBO from DAD-IS #25

Open franknic opened 2 years ago

franknic commented 2 years ago

For new term requests, please provide the following information:

Preferred term label

(e.g., Asplenia)

Synonyms

(e.g., Absent spleen)

Textual definition

the definition should be understandable even for non-specialists. Include a PubMed ID to refer to any relevant article that provides additional information about the suggested term.

Suggested parent term

Please look in the hierarchy in a browser such as OLS

Attribution

If you would like a nanoattribution, please indicate your ORCID id

franknic commented 2 years ago

Greetings, Sabrina! I have opened this new issue partly to show that I have made some progress in navigating GitHub :), and partly because we do have to give some thought on how to update VBO from DAD-IS. I have recently learned from our main DAD-IS contact, Gregoire Leroy, that any National Coordinator can make changes at any time, and all such changes are automatically incorporated into the database, and hence into downloads. If I understand correctly, this means that we will have to come up with a strategy for comparing regular downloads of the Excel spreadsheet, looking for any changes; and then incorporate those changes into VBO. This sounds to me like a non-trivial challenge! There is no immediate urgency for this. Indeed, fixing the remaining weird characters and incorporating dogs and cats are far more important in the immediate future.

sabrinatoro commented 1 year ago

@franknic This is correct, we need to have a strategy to:

  1. get the latest version of DAD-IS, and figure out what was changed compared to the last list (so we know what was changed compared to what is in VBO);
  2. set up a workflow where this process is done automatically;
  3. set up a workflow to figure out how updates are added to the VBO system (which currently is based on spreadsheets).

This is a huge discussion and will require a lot of technical help.

Something to also discuss is whether DAD-IS would be willing to adopt VBO. If they do, this synchronization process will be 1000 times smoother and more importantly less error-prone.

franknic commented 1 year ago

Thank you very much, Sabrina!! I had forgotten that I actually created this issue last May!! Your suggested strategy is really helpful: far better than I could have come up with! And yes, a major aim for us is to be able to show our DADIS colleagues that VBO has so many advantages that they will be convinced to adopt VBO. I'm sure we can rise to this challenge, which will be a really important criterion for the success of VBO. I've tried to add Gregoire to this issue, but can't see how to do it, because his name doesn't appear in the drop-down list. Would you please add him? His id is LeroyGregoire

LeroyGregoire commented 1 year ago

Hello,

Can you elaborate on what exactly adopting VBO would mean from a DAD-IS perspective?

Gregoire

sabrinatoro commented 1 year ago

Thank you for the question @LeroyGregoire. One thing that would be extremely helpful (especially for synchronization) is if the entries in DADIS had a permanent unique identifier (i.e. an identifier that will not change over time). This way we can more easily keep track of an entry in DADIS and its correspondent in VBO. If DADIS entries have permanent unique identifiers, it will be much easier to identify the new entries and the entries that have been modified (in contrast to having to write complicated scripts to determine which entry in DADIS corresponds to which VBO entry), and to synchronize with VBO. If DADIS does not currently use permanent unique identifiers, it would be simpler to use the VBO ids, since they already exist. You would then not need to create new ones.
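
To illustrate why permanent identifiers simplify synchronization, here is a minimal Python sketch of comparing two downloads keyed on such an identifier. The field name `dadis_id` and the example rows are invented for illustration; they are not taken from DADIS.

```python
def diff_snapshots(old_rows, new_rows, key="dadis_id"):
    """Compare two downloads, assuming every row has a permanent id under `key`."""
    old = {row[key]: row for row in old_rows}
    new = {row[key]: row for row in new_rows}
    return {
        # ids present only in the newer download
        "added": sorted(new.keys() - old.keys()),
        # ids present only in the older download
        "removed": sorted(old.keys() - new.keys()),
        # ids present in both, but with at least one field that differs
        "changed": sorted(k for k in old.keys() & new.keys() if old[k] != new[k]),
    }


# Invented example rows (hypothetical ids and names).
old_rows = [
    {"dadis_id": "a1", "name": "Breed A", "country": "DEU"},
    {"dadis_id": "b2", "name": "Breed B", "country": "DNK"},
]
new_rows = [
    {"dadis_id": "a1", "name": "Breed A (renamed)", "country": "DEU"},
    {"dadis_id": "c3", "name": "Breed C", "country": "NIC"},
]

result = diff_snapshots(old_rows, new_rows)
print(result)
```

Without a stable key, the "changed" case cannot be told apart from one removal plus one addition, which is where name-matching scripts would otherwise be needed.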

LeroyGregoire commented 1 year ago

Dear Sabrina, Thanks for the clarification! Currently, DAD-IS does not use unique identifiers either for National Breed Populations or Transboundary breeds. I will discuss the two options with our IT team for our meeting next week.

Sincerely

Gregoire

franknic commented 1 year ago

Thank you, Gregoire, It may be helpful to note that the current spreadsheets from which each version of VBO is generated are viewable at https://docs.google.com/spreadsheets/d/1KKJWPCY5jR72IukfX9OOyHK9L8TH2Qs4hcmgT-0kR6U/edit?usp=sharing. (I can't get this URL to work from the hyperlink, but it does work if you cut and paste the URL) The first sheet is labelled "ROBOT: ncbitransbound" and the second sheet is labelled "ROBOT: ncbibreeds". The VBO ids are in the first column of each sheet.

sabrinatoro commented 1 year ago

@matentzn and Marius. The following might help: data from DADIS can be found here: https://docs.google.com/spreadsheets/d/1KKJWPCY5jR72IukfX9OOyHK9L8TH2Qs4hcmgT-0kR6U/edit#gid=896730834

- Sheet: ROBOT: ncbitransbound
- Sheet: ROBOT: ncbibreeds

If needed, we can re-create these columns to be used specifically for synchronization.

Some transboundary breeds are in the dog breed spreadsheet: https://docs.google.com/spreadsheets/d/1kRvsbIDJtX40I41366FQcsw_5HErzcERoQwdRnKaB60/edit#gid=0

sabrinatoro commented 1 year ago

Additional information can be found in the shared folder: https://drive.google.com/drive/folders/1MslHFZUpvgZBfS9oQ0ZqlkzntGGL7Lb6

franknic commented 1 year ago

Thank you very much for all you have done, and are doing, on this, Sabrina! It is much appreciated.

franknic commented 1 year ago

Hi Sabrina, Marius has produced the first results of matching VBO entries to DADIS entries that now include their numerical ids. It would be really helpful if we could continue this conversation here, with Marius also involved. Would you please register Marius (mmat6620)?

matentzn commented 1 year ago

@franknic I already tried to add Marius, but I can't find him. https://github.com/mmat6620 does not exist. I am pretty sure this is him: https://github.com/marius-mather. Can you confirm?

franknic commented 1 year ago

Thank you, Nico. Yes, that looks like it. Having got it wrong before, I'm cc'ing Marius in this message. Marius: can you please confirm?



marius-mather commented 1 year ago

Hi Nico, yes, this is my account on github.com (mmat6620 is on our internal university GitHub).

franknic commented 1 year ago

Thanks, Marius. I was looking in the wrong place! And thanks, Nico.


matentzn commented 1 year ago

@marius-mather I gave you write access!

franknic commented 1 year ago

Thank you very much, Nico. We will have some feedback very soon.

franknic commented 1 year ago

Marius has made great progress in matching VBO with DADIS at the transboundary level. Two questions for Gregoire have arisen.

  1. When Marius interrogates DADIS, is the information he is interrogating exactly the same as the information on the DADIS website at that time and in a metadata download done at that time?
  2. So far as I can tell from Marius’ first matchings, more than two hundred transboundary breeds that existed in DADIS in Dec 2021 (and hence are included in VBO) have been deleted in the last 18 months. However, when Marius compared DADIS national breed populations with transboundary breed names that no longer exist in DADIS, it seems that some of the breeds that previously had a transboundary name are still listed in DADIS as existing in more than one country, e.g. Grauvieh is no longer a cattle transboundary name but the cattle breed Grauvieh is still listed as existing as a local breed in DEU and in DNK; Criollo is no longer a horse transboundary name but the horse breed Criollo is still listed as existing as a local breed in each of DEU, DOM, GTM, NIC, and SLV. I appreciate that I may have misunderstood something. And I understand that the information provided by National Coordinators would tend to concentrate on the country level.

LeroyGregoire commented 1 year ago

Dear Frank,

  1. The website and download tools (by the way, we will change the wording "metadata" in the next few weeks, as it is improper) are supposed to be synchronized on the fly.

  2. This is linked to an old issue that we had (and already discussed) regarding the fact that some national breed populations had their "transboundary name" field filled with a name not linked to the transboundary breed list. In April 2022, following a process carried out mostly in collaboration with the European Focal Point, we updated the transboundary breed list and cleaned the field, for instance when a given name was not in the list and was provided by only one country. As the process was automated, we may have made some mistakes. The example of Grauvieh is complex. Grauvieh is the German name for Grey, and may actually correspond to completely different breeds (Tyrolian Grey or Hungarian Grey, for instance). Therefore we do not know exactly which breed the German population corresponds to (probably Tyrolian, which would therefore be added as transboundary), and even less so for the Danish national population. This gives an idea of the complexity of the situation.

My suggestion in that matter would be to use the API that was developed to get the updated transboundary name list. Our IT colleagues will provide you with the necessary token on request.

I hope it helps

Gregoire

franknic commented 1 year ago

Thank you very much, Gregoire. The synchronisation of website and download is very helpful. I do recall the previous discussion of the challenge with transboundary breeds, and your above explanation is also greatly appreciated. Marius' downloads were at the beginning of June. Would these have captured the updated transboundary list? We shall ponder the transboundary challenges, and in the meantime, Marius will concentrate on synchronising VBO breed-country entries with DADIS national breed population entries. With thanks Frank

franknic commented 10 months ago

With substantial progress having been made in the synchronisation strategy, it will be useful to have the strategy included in this issue. From a Sabrina email of 20 July 2023:

Dear all, Katie and I met with Nico and discussed the VBO synchronization with DADIS. We all agreed on the following plan (I will also add these into GitHub issues). There are 3 steps:

STEP 1: Connect DADIS record with VBO id.

Goal: for all VBO terms originating from DADIS, add DADIS ids to the corresponding VBO ids.

To do:

  a. In the spreadsheets, add a column for the DADIS id, and add a column to indicate when a VBO term is “no longer in DADIS”.
  b. When a DADIS correspondent can be found for a VBO term, add the DADIS id in the “DADIS id” column.
  c. When no DADIS correspondent can be found for a VBO term originally from DADIS: report to the curation team. The curation team will review the list manually and determine whether:
     - the record has disappeared from DADIS; in this case, we will add a “no longer in DADIS” annotation.
     - the record in DADIS has changed; in this case, the correct DADIS id will be added to the correct VBO term.

When Step 1 is completed, please inform the VBO team, so we can make sure that everyone has a chance to review before moving to the next step.
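
As a rough illustration of Step 1, here is a minimal Python sketch that fills a "DADIS id" column where a correspondent is found and collects the remaining VBO terms for the curation team. Matching on a normalised (species, breed, country) key is an assumption made for this example, not necessarily how the actual matching will be done. All field names, ids, and rows are invented.

```python
def _norm(text):
    """Normalise a label for comparison: trim whitespace and lower-case."""
    return text.strip().lower()


def link_dadis_ids(vbo_rows, dadis_rows):
    """Return (rows with a 'dadis_id' column, VBO ids needing manual review)."""
    # Index DADIS records by a normalised (species, breed, country) key.
    index = {
        (_norm(r["species"]), _norm(r["breed"]), _norm(r["country"])): r["dadis_id"]
        for r in dadis_rows
    }
    linked, needs_review = [], []
    for row in vbo_rows:
        key = (_norm(row["species"]), _norm(row["breed"]), _norm(row["country"]))
        dadis_id = index.get(key, "")
        linked.append({**row, "dadis_id": dadis_id})  # step 1b: fill the column
        if not dadis_id:
            needs_review.append(row["vbo_id"])  # step 1c: report to curators
    return linked, needs_review


# Invented example rows (hypothetical ids and labels).
vbo_rows = [
    {"vbo_id": "VBO:EXAMPLE1", "species": "Cattle", "breed": "Breed A", "country": "DEU"},
    {"vbo_id": "VBO:EXAMPLE2", "species": "Horse", "breed": "Breed B", "country": "NIC"},
]
dadis_rows = [
    {"dadis_id": "example-id-1", "species": "cattle", "breed": "breed a ", "country": "DEU"},
]

linked, needs_review = link_dadis_ids(vbo_rows, dadis_rows)
print(needs_review)
```

The curators would still decide, for each id in `needs_review`, whether the record has disappeared from DADIS or has merely changed.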

STEP 2: Add all the new DADIS records that are not in VBO.

Goal: add new breeds that are not represented in VBO.

To do:

  a. Add a new row with the DADIS record.
  b. When we are ready to work on this, let’s talk about what information we should bring into VBO (e.g. name, country, origin, ...), and where to add it (i.e. having specific columns dedicated to the DADIS information).

STEP 3: Update VBO record when DADIS has updated the record.

WE ARE NOT READY FOR THIS STEP. Let’s discuss when we are ready for this.

To do:

  a. Determine what information we want to update based on DADIS data.
  b. Create a clear SOP to specify in the ontology whether we use DADIS data or have overwritten it with manually curated data.

franknic commented 10 months ago

As planned, Marius and Imke and Frank met this morning, to discuss steps 1a and 1b of the strategy.

DADIS has breed ids as follows:

  a. Transboundary: x-yy, where x and y are digits corresponding, respectively, to species and transboundary breed within species, e.g. Dexter = 7-76, where 7 indicates species = cattle and Dexter is breed 76 within cattle.
  b. Breed-country: species id as for transboundaries, plus xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx, which is a unique alphanumeric id for each breed-country (appears to be a random combination of digits and alpha characters for each breed-country, i.e. uninformative).

Marius has done the following matching:

  a. 1055/1677 VBO transboundaries match to a DADIS id
  b. 14733/15148 VBO breed-countries match to a DADIS id

Add DADIS ids to the corresponding VBO ids:

  a. Marius will add the following columns to the black DADIS section of the following update sheets:
     - "UPDATE DADIS Transboundary" sheet of the file update DADIS: i. DADIS species id; ii. DADIS breed-within-species id
     - "UPDATE DADIS country-breed" sheet of the file update DADIS: i. DADIS species id; ii. DADIS breed-country id
     - For the very small number of VBO dog breeds with a matching DADIS id, Marius will also add relevant columns to the "UPDATE dog breeds" sheet of the file update dog breeds vbo
  b. Marius will then populate the new columns with relevant matching DADIS id information.

Generate a list of DADIS breed ids that do not correspond to a VBO id. This list will be examined manually.

Please feel free to change this strategy if I have misunderstood or misrepresented anything.
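
The two id formats described above can be checked with a short Python sketch. The patterns are inferred only from the description in this comment (digits-hyphen-digits for transboundary ids, an 8-4-4-4-12 alphanumeric string for breed-country ids), so they are assumptions and not an official DADIS specification. The function names are hypothetical.

```python
import re

# Assumed shape of a transboundary id: <species digits>-<breed digits>, e.g. 7-76.
TRANSBOUNDARY_RE = re.compile(r"(\d+)-(\d+)")
# Assumed shape of a breed-country id: 8-4-4-4-12 alphanumeric groups.
BREED_COUNTRY_RE = re.compile(
    r"[0-9A-Za-z]{8}-[0-9A-Za-z]{4}-[0-9A-Za-z]{4}-[0-9A-Za-z]{4}-[0-9A-Za-z]{12}"
)


def parse_transboundary_id(value):
    """Split an id such as '7-76' into (species_id, breed_within_species_id)."""
    match = TRANSBOUNDARY_RE.fullmatch(value.strip())
    if match is None:
        raise ValueError(f"not a transboundary id: {value!r}")
    return int(match.group(1)), int(match.group(2))


def is_breed_country_id(value):
    """Return True if the value has the 8-4-4-4-12 alphanumeric shape."""
    return BREED_COUNTRY_RE.fullmatch(value.strip()) is not None


species_id, breed_id = parse_transboundary_id("7-76")  # Dexter, per the comment above
print(species_id, breed_id)
# The id below is made up purely to exercise the pattern.
print(is_breed_country_id("0a1b2c3d-4e5f-6a7b-8c9d-0e1f2a3b4c5d"))
```

Keeping the species part and the breed part as separate values mirrors the two columns ("DADIS species id" and "DADIS breed-within-species id") planned for the update sheets.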

marius-mather commented 5 months ago

Hi all, I think we just need a final check before doing the initial sync of VBO terms to DADIS ids - I merged the code for it a little while ago but needed to collate some information before running it - it looks like Nico may have already attempted to run it but there may be some final bugs to work out - it doesn't look like the results were committed correctly.

Basically, we have two scripts, one for matching the transboundary ids and one for the breed-country entries.

The transboundary script should update dadistransbound.tsv, adding a dadis_transboundary_id column. Output from running the script on the current dadistransbound.tsv is here:

vbo_transboundary_ids.xlsx

(note: Excel is very keen to turn DADIS's IDs, which are of the form "1-3", into dates, but this should not affect the ontology pipeline)

The breed-country script should update dadisbreedcountry.tsv and add three columns: dadis_breed_id, dadis_transboundary_id, and dadis_update_date. Example output from the current TSV is here:

vbo_local_ids.xlsx

While doing this we can also output a spreadsheet of the DADIS entries that we have not found a match for in VBO:

dadis_unmatched.xlsx
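
As a sketch of how such an unmatched list can be derived, the following Python snippet keeps the DADIS entries whose id does not appear in the `dadis_breed_id` column of the VBO table. It is only an illustration and not the code of the actual scripts. The `id` field on the DADIS side and all example values are invented.

```python
def unmatched_dadis(dadis_rows, vbo_rows, vbo_column="dadis_breed_id", dadis_key="id"):
    """Return DADIS rows whose id is not referenced by any VBO row."""
    # Collect every non-empty DADIS id already recorded against a VBO term.
    matched = {row[vbo_column] for row in vbo_rows if row.get(vbo_column)}
    return [row for row in dadis_rows if row[dadis_key] not in matched]


# Invented example data (hypothetical ids and names).
vbo_rows = [
    {"vbo_id": "VBO:EXAMPLE1", "dadis_breed_id": "id-001"},
    {"vbo_id": "VBO:EXAMPLE2", "dadis_breed_id": ""},  # no DADIS match recorded
]
dadis_rows = [
    {"id": "id-001", "name": "Breed A"},
    {"id": "id-002", "name": "Breed B"},
]

unmatched = unmatched_dadis(dadis_rows, vbo_rows)
print([row["id"] for row in unmatched])
```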

If this looks good to everyone I can work out what's happening with the GitHub Action and trigger it to update the actual ontology files.

franknic commented 5 months ago

Thank you very much, Marius and Nico. Imke and I will have a look at the outputs and will report back asap.

sabrinatoro commented 5 months ago

Thank you Marius. I will take a look, but I want to bring your attention to a potential issue: When I open the files you shared, the dadis_transboundary_id is displayed as dates for most of the entries (this is something that Excel does automatically). Is there a way we can make sure this is not happening? Could this create problems down the line?

I am tagging @matentzn to make sure he saw the message from Marius.

marius-mather commented 5 months ago

Yes, I had noticed that Excel was displaying these as dates - I don't think it will be an issue since the TSV file just contains the original ID, like 4-23. If we think it could cause issues I can probably do some extra quoting around that column to force it to be treated as a string.
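
A quick way to confirm this is to read the TSV with a tool that does not guess types. The sketch below uses Python's standard `csv` module, which returns every field as a plain string, so ids like 4-23 are never interpreted as dates. The in-memory sample and the VBO ids in it are invented; only the `dadis_transboundary_id` column name comes from this thread.

```python
import csv
import io

# Invented sample standing in for the real TSV file.
tsv_text = (
    "vbo_id\tdadis_transboundary_id\n"
    "VBO:EXAMPLE1\t4-23\n"
    "VBO:EXAMPLE2\t1-3\n"
)

# csv.DictReader performs no type inference: every value stays a str.
reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
ids = [row["dadis_transboundary_id"] for row in reader]
print(ids)
```

If pandas is used instead, passing `dtype=str` to `read_csv` keeps the column as text in the same way.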

franknic commented 5 months ago

Imke and Frank reporting back: We have now looked at all three outputs, and understand what is being done, which all looks fine, except for the date issue, but we accept your advice on that, Marius. It is great to have DADIS breed ids now in VBO files. The unmatched file contains a relatively large number of entries (594) but this is largely a reflection of the reality that VBO hasn't picked up any changes in DADIS since we started years ago. So, this list will have to be handled manually. But future lists will be much shorter, and the differences may be able to be captured electronically to some extent. But this is in the future. At our next VBO monthly meeting, we can discuss how best to handle the unmatched entries.

marius-mather commented 5 months ago

OK, I'm going to go ahead and run the initial sync now - let me know if there are any issues with the changes it makes to the TSV files.

marius-mather commented 5 months ago

The updated files are in the pull request here, for review: #149

matentzn commented 5 months ago

I am ooo this week, will check Monday!

matentzn commented 5 months ago

I added a few comments that should be fixed, but nothing wild.

Just a cool tool in case you don't know it:

pip install daff
daff table1.tsv table2.tsv --output diff.html