Investigate VBO term labels

sabrinatoro commented 1 year ago

Right now, to ensure unique labels, we create labels by concatenating:

"breed name/country/species" (format: Breed, Country; Species) for local breeds
"breed name/species" (format: Breed; Species) for transboundary breeds

We have had a lot of feedback that this concatenation is not great (i.e. everyone hates it), but we couldn't find an alternative that would both allow less cumbersome labels and ensure the uniqueness of the labels.

Investigate the following:

For transboundary breeds: use only the transboundary name
- if there is duplication, add the species name
For children of transboundary breeds: use breed name + country
- if there is duplication, add the species name
For local breeds: use only breed name
- if there is duplication, then use breed name+country
  - if there is duplication, then use breed name+country+species
    - if there is duplication, use name+country+transboundary name instead of species *** (only 4 terms, these might need to be merged or reviewed)
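The cascade above could be automated roughly as follows. This is only a sketch under assumptions: the field names (`id`, `kind`, `name`, `country`, `species`), the `FORMATS` table, and the example rows are invented for illustration and are not taken from the VBO spreadsheets; uniqueness is checked across all terms at once; and the last fallback (name+country+transboundary name) is left out, so any label still duplicated at the longest format is simply returned as is for manual review.

```python
from collections import Counter

# Candidate label formats per category, tried from shortest to longest.
# The separators follow the current "Breed, Country; Species" convention.
FORMATS = {
    "transboundary": ["{name}", "{name}; {species}"],
    "transboundary_child": ["{name}, {country}", "{name}, {country}; {species}"],
    "local": ["{name}", "{name}, {country}", "{name}, {country}; {species}"],
}


def assign_labels(terms):
    """Return {term id: label}, escalating duplicated labels to longer formats."""
    level = {t["id"]: 0 for t in terms}
    while True:
        labels = {
            t["id"]: FORMATS[t["kind"]][level[t["id"]]].format(**t) for t in terms
        }
        counts = Counter(labels.values())
        escalated = False
        for t in terms:
            tid = t["id"]
            has_longer_format = level[tid] + 1 < len(FORMATS[t["kind"]])
            if counts[labels[tid]] > 1 and has_longer_format:
                level[tid] += 1
                escalated = True
        if not escalated:
            # Nothing left to escalate; remaining duplicates need manual review.
            return labels


# Hypothetical rows, for illustration only.
terms = [
    {"id": "t1", "kind": "transboundary", "name": "Angora", "country": "", "species": "Goat"},
    {"id": "t2", "kind": "transboundary", "name": "Angora", "country": "", "species": "Rabbit"},
    {"id": "t3", "kind": "local", "name": "Cyprus", "country": "Cyprus", "species": "Cattle"},
    {"id": "t4", "kind": "local", "name": "Cyprus", "country": "Cyprus", "species": "Ass"},
    {"id": "t5", "kind": "local", "name": "Boran", "country": "Kenya", "species": "Cattle"},
]
labels = assign_labels(terms)
for term_id, label in labels.items():
    print(term_id, label)
```

Note that every term sharing a duplicated label is escalated together, so no term keeps the short label at the expense of the others.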

sabrinatoro commented 1 year ago

Investigation revealed:

TRANSBOUNDARY BREEDS (total = 1683 terms)

using only breed name:
- 1560 unique labels
- 123 duplicated labels
  - for these 123 terms, used name+species: no duplication

BREEDS (total = 15142 terms)

Children of transboundary: (8121 terms)

using breed name+country (which made sense, since these are instances of the transboundary breed reported in a specific country):
- 7865 unique labels
- 256 duplicated labels
  - for these 256 terms, used name+country+species: no duplication (apart from 4 duplicated terms, which might be an error in the data and are not relevant to this issue)

local breeds (not child of transboundary): (7021 terms)

using breed name:
- 6128 unique labels
- 893 duplicated labels
  - for these 893 terms, used name+country:
    - 463 unique labels
    - 430 duplicated labels
      - for these 430 terms, used name+country+species: no duplication
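As a quick sanity check on the counts above, the short script below verifies that each unique/duplicated split adds up to the number of terms entering that step, and tallies how many terms would still end up with the longest label format. The dictionary layout and variable names are illustrative, and the overall total assumes the transboundary terms are counted separately from the 15142 breed terms.

```python
# Counts copied from the investigation: (unique, duplicated) at each step.
counts = {
    "transboundary": {"total": 1683, "steps": [(1560, 123)]},
    "transboundary_child": {"total": 8121, "steps": [(7865, 256)]},
    "local": {"total": 7021, "steps": [(6128, 893), (463, 430)]},
}


def check(group):
    """Check each split is consistent; return terms needing the longest format."""
    remaining = group["total"]
    for unique, duplicated in group["steps"]:
        assert unique + duplicated == remaining
        remaining = duplicated
    return remaining


# The two breed categories should add up to the reported breed total.
assert counts["transboundary_child"]["total"] + counts["local"]["total"] == 15142

longest = {kind: check(group) for kind, group in counts.items()}
total_terms = sum(group["total"] for group in counts.values())
still_long = sum(longest.values())
print(longest)
print(f"{still_long} of {total_terms} terms keep the longest format")
```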

In summary, using the logic above we could reduce the number of "cumbersome" labels in the ontology, but there is no way to completely avoid labels created by concatenating name/country/species. However:

1- the process above is laborious, especially since we are working in spreadsheets
2- it will be difficult to maintain consistency in the way we label terms (this might not be a big problem)
3- it will make adding new terms much more difficult and laborious: every time we add new terms, we would have to go through the whole logic again to check for duplication (see 1-)

I know that every single person who has reviewed the ontology had a strong issue with the label format. However, all we could do is reduce the number of labels with the "weird" format. Doing so is a lot of work and, more importantly, it would create many opportunities for mistakes and inconsistency. Therefore I am reluctant to go there.

That being said:

we could review the characters we are using (e.g. would parentheses be better than semicolon?)
could we add the "most common name" as a special synonym or an alternative name (ie an annotation that would be more than a synonym, but could be duplicated between terms)?

sabrinatoro commented 1 year ago

@nicolevasilevsky @matentzn @cmungall Could you please share your opinion on this issue? Do you think that limiting (not eliminating) the awkward label format is worth the laborious work? Also, do you have suggestions about:

a different format that would avoid characters such as semicolons
an annotation that would be more than an exact synonym, but could be duplicated and would allow users to find their term "easily"

Thank you!

zhilianghu commented 1 year ago

About breed naming issues:

Actual animal breed names can be more complicated than the cases where country names are involved, given the multi-country, multi-language, across discipline issues.

In the 1980s I did field investigations as part of the state's livestock breed resources survey, so I have had some exposure to these problems.

I humbly ask whether we can have a standard "scientific breed name" plus a field to hold "alternative or custom breeds names" (which can be multiple)? In QTL data curation work we brought together multiple trait ontologies to cover what we encounter, and yet we still have to add a "reported trait name" from the literature we get data from. That is a fact of life, although the research community has been advised to use ontology terms. You need to make connections; I guess this is one of the things VBO is for.

Zhiliang


sabrinatoro commented 1 year ago

Discussion with Chris Mungall:

We should automate: if name is unique, apply label; if name is not unique, concatenate
Characters: it might be better to use parentheses instead of semicolon
"common name" : should be added as an exact synonym with a new synonym type property.
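A minimal sketch of what this automation could look like is given below. Everything in it is an assumption made for illustration: the row fields, the parenthesised format `Name (Country, Species)`, the shape of the synonym record, and the example rows are not taken from the VBO pipeline.

```python
from collections import Counter


def build_terms(rows):
    """Apply the plain name when it is unique, otherwise concatenate."""
    name_counts = Counter(row["name"] for row in rows)
    terms = {}
    for row in rows:
        if name_counts[row["name"]] == 1:
            label = row["name"]
        else:
            # Transboundary rows have no country, so empty parts are skipped.
            parts = [p for p in (row.get("country"), row["species"]) if p]
            label = f'{row["name"]} ({", ".join(parts)})'
        terms[row["id"]] = {
            "label": label,
            # The common name may repeat across terms, unlike the label.
            "synonyms": [
                {"value": row["common_name"], "scope": "exact", "type": "most common name"}
            ],
        }
    return terms


# Hypothetical rows, for illustration only.
rows = [
    {"id": "r1", "name": "Cyprus", "country": "Cyprus", "species": "Cat", "common_name": "Cyprus"},
    {"id": "r2", "name": "Cyprus", "country": "Cyprus", "species": "Cattle", "common_name": "Cyprus"},
    {"id": "r3", "name": "Angora", "country": "", "species": "Goat", "common_name": "Angora"},
]
for term_id, term in build_terms(rows).items():
    print(term_id, term["label"])
```

Because the uniqueness check runs over the whole table each time, adding a new breed only requires re-running the script rather than reviewing labels by hand.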

franknic commented 1 year ago

Sabrina and Chris: all those suggestions are excellent and should be very helpful.

Zhiliang:

Regarding "multi-country, multi-language, across discipline issues", can you suggest any relevant information that is not already included in DADIS?
Regarding "scientific breed name", can you explain how this concept could actually be put into practice, especially given the lack of scientific basis of almost all breeds?
Regarding "alternative or custom breeds names", can you suggest a better strategy than simply using the DADIS fields 'most common name' and 'other name', knowing that the latter contains multiple names?

sabrinatoro commented 1 year ago

I have investigated several options for new names. The issues with all options are long-term maintenance (every time we have a new breed, we would have to review all labels) and consistency. In addition, when adding the cat breeds, it became clear that the species name is needed in the label, both to distinguish between species and to distinguish breeds from non-breed entities (e.g. "Cyprus" could be a breed of cat, ass, cattle,..., but also a country). As a consequence, we will keep the original labeling strategy for terms. The "most common name" was added as a synonym, with a "most common name" synonym type.

monarch-initiative / vertebrate-breed-ontology