obophenotype / ncbitaxon

Build for NCBITaxon
BSD 3-Clause "New" or "Revised" License

All environmental samples have an exact synonym "environmental samples" #56

Closed cmungall closed 3 days ago

cmungall commented 2 years ago

E.g. http://purl.obolibrary.org/obo/NCBITaxon_743727

Even though our transform generally does not alter the source, I think in this case we need to filter this.

hrshdhgd commented 2 years ago

Issue arises from another project

jamesaoverton commented 2 years ago

That repo must be private, because I get a 404.

cmungall commented 2 years ago

That's a repo we use for some internal text mining projects, where we may have corpora that people don't want shared yet. There isn't much context there, other than that any time a text says "environmental sample" it produces false-positive matches against hundreds of "taxa", because that is the exact synonym that was assigned.

This would mark the first time we don't have a 100% isomorphic translation from the source, so it brings up various questions which are raised here:

In this case I would argue that an exception is justified, but we need a process for making these kinds of decisions for this ontology. I will send an email to obo-taxonomy and gather other feedback.
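
To illustrate the false-positive problem described in this comment, here is a minimal sketch of a dictionary-based matcher. It is an assumption about how such a text-mining lookup might work, not the code used in the internal project; `build_synonym_index`, `match_taxa`, and the sample data are hypothetical.

```python
from collections import defaultdict


def build_synonym_index(synonyms):
    """Map each lowercased exact synonym to the set of taxon IDs that carry it."""
    index = defaultdict(set)
    for taxon_id, synonym in synonyms:
        index[synonym.lower()].add(taxon_id)
    return index


def match_taxa(text, index):
    """Return every taxon ID whose exact synonym occurs as a substring of the text."""
    lowered = text.lower()
    hits = set()
    for synonym, taxon_ids in index.items():
        if synonym in lowered:
            hits |= taxon_ids
    return hits


# Illustrative data only: two of the many taxa that share the same exact synonym.
SYNONYMS = [
    ("NCBITaxon:743727", "environmental samples"),
    ("NCBITaxon:33858", "environmental samples"),
    ("NCBITaxon:9606", "human"),
]

INDEX = build_synonym_index(SYNONYMS)
HITS = match_taxa("DNA was extracted from environmental samples.", INDEX)
print(sorted(HITS))
```

A single mention of the phrase returns every taxon that shares the synonym, which is why one shared exact synonym can produce hundreds of spurious matches.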

nataled commented 2 years ago

Agree that 'environmental samples' should be an exception. That's not even a synonym for any actual taxon. If deleting it altogether is too much, then perhaps the information could be captured in a comment (or other mechanism). Another alternative is to change it from exact to related. I come across the same issue (for actual synonyms) in my automated processing of UniProtKB for PRO. For these, I examine the whole of what's going to be imported, detect synonyms that duplicate either other synonyms or other labels, and mark them accordingly (can use related or broad; not sure which works best for the problem being resolved by the change in algorithm).
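
The duplicate-detection approach described above could be sketched roughly as follows. This is not the PRO pipeline itself; the data layout, the function name `demote_duplicate_synonyms`, and the choice of "related" as the demoted scope are assumptions made for illustration.

```python
from collections import Counter


def demote_duplicate_synonyms(terms, demoted_scope="related"):
    """Demote exact synonyms that collide with a name on another term.

    `terms` maps a term ID to {"label": str, "synonyms": [(text, scope), ...]}.
    A new mapping is returned; the input is left unchanged.
    """
    # Count in how many terms each name (label or synonym) appears.
    counts = Counter()
    for term in terms.values():
        names = {term["label"].lower()}
        names |= {text.lower() for text, _scope in term["synonyms"]}
        counts.update(names)

    result = {}
    for term_id, term in terms.items():
        new_synonyms = []
        for text, scope in term["synonyms"]:
            # A name used by more than one term cannot be an exact synonym.
            if scope == "exact" and counts[text.lower()] > 1:
                scope = demoted_scope
            new_synonyms.append((text, scope))
        result[term_id] = {"label": term["label"], "synonyms": new_synonyms}
    return result


# Hypothetical input: two terms share a synonym, one term has a unique synonym.
TERMS = {
    "T1": {"label": "environmental samples <diatoms>",
           "synonyms": [("environmental samples", "exact")]},
    "T2": {"label": "environmental samples <fungi>",
           "synonyms": [("environmental samples", "exact")]},
    "T3": {"label": "Homo sapiens", "synonyms": [("human", "exact")]},
}

CLEANED = demote_duplicate_synonyms(TERMS)
```

Passing `demoted_scope="broad"` would apply the other option mentioned above; the sketch does not settle which scope works better.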

jamesaoverton commented 2 years ago

I'd prefer that they fix this upstream. If that's not in the cards, then I'm fine with us excluding this particular synonym in the OWL we generate.

bpeters42 commented 2 years ago

pmidford commented 2 years ago

I support demoting this to either an annotation or better yet a comment. @bpeters42 has a point about synonyms that aren't exact, but that's a different issue - human is a sloppy synonym for H. sapiens, but it is still a synonym. "Environmental sample" isn't a synonym or even really about the taxon as a whole. It's something else, perhaps metadata about individual(s) in the taxon or collection events.

bpeters42 commented 2 years ago

@pmidford, I actually have zero problems with 'human' being an exact synonym of 'homo sapiens'; my problem is that, at the same time, the parent taxon 'homo' has the exact synonym 'humans'. That conflates singular vs. plural with a class hierarchy, and leads to craziness (Homo heidelbergensis being a kind of humans, but not a kind of human?)

Completely agree with you that 'environmental sample' is even worse. I was trying to point out that the 'exact synonyms' are not only problematic for 'environmental samples' and the like which are at the edges of what the NCBI taxonomy cares about, but also for organisms at the core of classical taxonomy, like homo sapiens.

And to be a bit more precise with what I mean by 'more loose', I thought 'alternative label'.

cmungall commented 2 years ago

Thanks for your comments

Let's start with an NCBI request - I nominate @bpeters42 or @fbastian since you both have existing relationships. @hrshdhgd can go ahead and make a PR, but we will hold off on merging it until we are sure that NCBI won't remove it.

jamesaoverton commented 2 years ago

We looked more closely into this, and the problem is probably on our end.

We get our data from https://ftp.ncbi.nih.gov/pub/taxonomy/taxdmp.zip. One of the tables in there is names.dmp, and this is how it's described in the taxdump_readme.txt:

names.dmp
---------
Taxonomy names file has these fields:

    tax_id                  -- the id of node associated with this name
    name_txt                -- name itself
    unique name             -- the unique variant of this name if name not unique
    name class              -- (synonym, common name, ...)

Here's an example row from names.dmp:

    tax_id       33858
    name_txt     environmental samples
    unique name  environmental samples <diatoms,phylum Bacillariophyta>
    name class   scientific name

We want our rdfs:labels to be unique, so we use the unique name if it's present. But then we also create a synonym from the 'name_txt', and that might not be the right thing to do.

This is the relevant bit of code: https://github.com/obophenotype/ncbitaxon/blob/master/src/ncbitaxon.py#L258
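
For readers who don't want to open the link, the behavior described above can be sketched as follows. This is not the code at the linked line; `parse_names_row`, `label_and_synonyms`, and the `keep_name_txt_synonym` flag are hypothetical names, and skipping the synonym is only one possible fix. The sketch assumes the taxdump convention of `\t|\t` between fields and `\t|` at the end of each row.

```python
def parse_names_row(line):
    """Split one names.dmp line into its four fields."""
    line = line.rstrip("\n")
    if line.endswith("\t|"):
        line = line[:-2]  # drop the row terminator
    tax_id, name_txt, unique_name, name_class = line.split("\t|\t")
    return {
        "tax_id": tax_id,
        "name_txt": name_txt,
        "unique_name": unique_name,
        "name_class": name_class,
    }


def label_and_synonyms(row, keep_name_txt_synonym):
    """Pick a label for a 'scientific name' row and decide on a synonym.

    The unique name is used as the label when present. With
    keep_name_txt_synonym=True, the non-unique name_txt is also emitted
    as a synonym, which is the behavior that causes the problem.
    """
    label = row["unique_name"] or row["name_txt"]
    synonyms = []
    if keep_name_txt_synonym and row["name_txt"] != label:
        synonyms.append(row["name_txt"])
    return label, synonyms


EXAMPLE_LINE = (
    "33858\t|\tenvironmental samples\t|\t"
    "environmental samples <diatoms,phylum Bacillariophyta>\t|\t"
    "scientific name\t|\n"
)

ROW = parse_names_row(EXAMPLE_LINE)
OLD = label_and_synonyms(ROW, keep_name_txt_synonym=True)
NEW = label_and_synonyms(ROW, keep_name_txt_synonym=False)
```

With the flag off, the label stays unique and the shared "environmental samples" string is no longer attached to the term as a synonym.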

I'm too tired right now, but I'll come back to this tomorrow.

cmungall commented 3 days ago

This issue was fixed by @hrshdhgd 2 years ago, so I am closing it.