Closed cmungall closed 4 months ago
That repo must be private, because I get a 404.
Issue arises from another project
That's a repo we use for some internal text mining projects, where we may have corpora that people don't want shared yet. There isn't much context there, other than that any time a text says "environmental sample" it produces false-positive matches to hundreds of "taxa", because that is the exact synonym that was assigned.
This would mark the first time we don't have a 100% isomorphic translation from the source, so it brings up various questions which are raised here:
In this case I would argue that an exception is justified,
but we need a process for making these kinds of decisions for this ontology. I will send an email to obo-taxonomy and gather other feedback.
Agree that 'environmental samples' should be an exception. That's not even a synonym for any actual taxon. If deleting it altogether is too much, then perhaps the information could be captured in a comment (or other mechanism). Another alternative is to change it from exact to related. I come across the same issue (for actual synonyms) in my automated processing of UniProtKB for PRO. For these, I examine the whole of what's going to be imported, detect synonyms that duplicate either other synonyms or other labels, and mark them accordingly (can use related or broad; not sure which works best for the problem being resolved by the change in algorithm).
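To make the duplicate check concrete, here is a minimal sketch of the idea, not the actual PRO pipeline. The function name, the input shape, and the example term IDs and labels (`T1`, `T2`, `T3`) are all made up for illustration, and the choice of RELATED over BROAD is just an assumption.

```python
from collections import defaultdict


def demote_duplicate_synonyms(terms):
    """Re-scope exact synonyms that collide with another term's label or synonym.

    `terms` maps a term ID to {"label": str, "exact_synonyms": [str]}.
    Returns a dict mapping each term ID to a list of (synonym, scope) pairs.
    This input shape is hypothetical, chosen only for the example.
    """
    # Record which terms use each string (case-insensitively),
    # either as a label or as an exact synonym.
    owners = defaultdict(set)
    for term_id, term in terms.items():
        owners[term["label"].lower()].add(term_id)
        for synonym in term["exact_synonyms"]:
            owners[synonym.lower()].add(term_id)

    scoped = {}
    for term_id, term in terms.items():
        scoped[term_id] = [
            # A string claimed by more than one term cannot be "exact".
            (synonym, "EXACT" if len(owners[synonym.lower()]) == 1 else "RELATED")
            for synonym in term["exact_synonyms"]
        ]
    return scoped


# Illustrative data only; IDs and labels are not real NCBITaxon entries.
example_terms = {
    "T1": {"label": "environmental samples <group A>",
           "exact_synonyms": ["environmental samples"]},
    "T2": {"label": "environmental samples <group B>",
           "exact_synonyms": ["environmental samples"]},
    "T3": {"label": "Homo sapiens", "exact_synonyms": ["human"]},
}
rescoped = demote_duplicate_synonyms(example_terms)
```

With this input, the shared "environmental samples" string is demoted on both `T1` and `T2`, while "human" stays exact on `T3` because no other term claims it.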
I'd prefer that they fix this upstream. If that's not in the cards, then I'm fine with us excluding this particular synonym in the OWL we generate.
I support demoting this to either an annotation or better yet a comment. @bpeters42 has a point about synonyms that aren't exact, but that's a different issue - human is a sloppy synonym for H. sapiens, but it is still a synonym. "Environmental sample" isn't a synonym or even really about the taxon as a whole. It's something else, perhaps metadata about individual(s) in the taxon or collection events.
@pmidford , I actually have zero problems with 'human' being an exact synonym of 'homo sapiens'; my problem is that, at the same time, the parent taxon 'homo' has the exact synonym 'humans'. Which conflates singular vs. plural with a class hierarchy, and leads to craziness (Homo heidelbergensis being a kind of humans, but not a kind of human?)
Completely agree with you that 'environmental sample' is even worse. I was trying to point out that the 'exact synonyms' are not only problematic for 'environmental samples' and the like which are at the edges of what the NCBI taxonomy cares about, but also for organisms at the core of classical taxonomy, like homo sapiens.
And to be a bit more precise with what I mean by 'more loose', I thought 'alternative label'.
Thanks for your comments
Let's start with an NCBI request - I nominate @bpeters42 or @fbastian since you both have existing relationships. @hrshdhgd can go ahead and make a PR, but we will hold off on merging it until we are sure that NCBI won't remove it.
We looked more closely into this, and the problem is probably on our end.
We get our data from https://ftp.ncbi.nih.gov/pub/taxonomy/taxdmp.zip. One of the tables in there is names.dmp, and this is how it's described in the taxdump_readme.txt:
names.dmp
---------
Taxonomy names file has these fields:
tax_id -- the id of node associated with this name
name_txt -- name itself
unique name -- the unique variant of this name if name not unique
name class -- (synonym, common name, ...)
Here's an example row from names.dmp:
tax_id | name_txt | unique name | name class |
---|---|---|---|
33858 | environmental samples | environmental samples <diatoms,phylum Bacillariophyta> | scientific name |
We want our rdfs:labels to be unique, so we use the unique name if it's present. But then we also create a synonym from the 'name_txt', and that might not be the right thing to do.
This is the relevant bit of code: https://github.com/obophenotype/ncbitaxon/blob/master/src/ncbitaxon.py#L258
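As a rough illustration of the labelling logic described above, here is a minimal sketch. It is not the code in ncbitaxon.py: the function names and the `emit_ambiguous` flag are hypothetical, and dropping the synonym when a unique name is present is only one possible fix. It relies on names.dmp fields being separated by tab, pipe, tab, with each row ending in tab, pipe.

```python
def parse_names_row(line):
    """Split one names.dmp row into (tax_id, name_txt, unique_name, name_class)."""
    line = line.rstrip("\n")
    # Each row ends with a trailing "\t|" terminator; drop it before splitting.
    if line.endswith("\t|"):
        line = line[:-2]
    tax_id, name_txt, unique_name, name_class = line.split("\t|\t")
    return tax_id, name_txt, unique_name, name_class


def label_and_synonyms(name_txt, unique_name, emit_ambiguous=False):
    """Choose an rdfs:label and decide whether name_txt also becomes a synonym.

    emit_ambiguous=True mimics the behaviour described in this thread
    (name_txt is kept as a synonym even when it is not unique).
    emit_ambiguous=False is an assumed alternative that skips it.
    """
    if not unique_name:
        # name_txt is already unique, so it is the label and needs no synonym.
        return name_txt, []
    if emit_ambiguous:
        return unique_name, [name_txt]
    return unique_name, []


# The example row from names.dmp shown above.
row = ("33858\t|\tenvironmental samples\t|\t"
       "environmental samples <diatoms,phylum Bacillariophyta>\t|\t"
       "scientific name\t|\n")
tax_id, name_txt, unique_name, name_class = parse_names_row(row)
label, synonyms = label_and_synonyms(name_txt, unique_name)
```

With the default flag, taxon 33858 keeps its disambiguated label and gets no bare "environmental samples" synonym; passing `emit_ambiguous=True` brings the synonym back.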
I'm too tired right now, but I'll come back to this tomorrow.
This issue was fixed by @hrshdhgd 2 years ago, am closing
E.g. http://purl.obolibrary.org/obo/NCBITaxon_743727
Even though our transform generally does not alter the source, I think in this case we need to filter this.