iobis / Project-team-Genetic-Data

Developing guidelines for adding sequence data to OBIS

NCBI taxonomy as a taxonomic authority #5

Open SSuominen1 opened 3 years ago

SSuominen1 commented 3 years ago

Could NCBI taxonomy IDs be useful as taxonomic links? Where will these be added?

dianalg commented 3 years ago

To expand on this: To the best of my knowledge, right now, the scientificName field can only contain a Linnaean scientific name that matches on WoRMS, or an OTU identifier from BOLD or UNITE. However, many people working with DNA-derived occurrences obtain scientific names from NCBI taxonomy. Can we discuss why these names (or their associated taxonomy IDs) are not acceptable as scientificName values? Since the NCBI taxonomy seems to be a standard in this field of study, wouldn't we want to accommodate that? Right now, I'm handling this by going to progressively higher taxonomic ranks (genus --> family --> order, etc.) until I find a term that matches on WoRMS. But I'm not sure to what degree this maintains the integrity of the original data.

Secondly, records may have associated NCBI taxonomy IDs and/or GenBank IDs. If these are not acceptable in the scientificName and associated columns, where could they be included?

claudenozeres commented 3 years ago

I am interested in hearing more about this. The NCBI Taxonomy Browser shows Linnaean scientific names, so these should be available for use. Then again, these names should also match in WoRMS, so they will be available even if the 'source' is not NCBI? Is the issue then that, under scientificNameID, one would like to use the NCBI Taxonomy ID instead of the AphiaID? Does GBIF allow other sources for scientificNameID, and only OBIS requires the AphiaID?

Example (hyperlinks on OBIS site): Macoma calcarea on OBIS, https://obis.org/taxon/141580
Aphia ID: urn:lsid:marinespecies.org:taxname:141580
BOLD ID: 70992
NCBI ID: 1421134

For scientificNameID, then, one would use urn:lsid:marinespecies.org:taxname:141580, or it could be NCBI:txid1421134. But OBIS will only permit the WoRMS AphiaID--is this correct?

I imagine the challenge is if NCBI has names that are NOT available (or correct?) on WoRMS. In that case, is it a matter of updates between the two? Example with the related species, Limecola petalum: https://obis.org/taxon/880026. It is not linked to NCBI because WoRMS does not show an NCBI link for this name, only for the older name Macoma petalum:
http://www.marinespecies.org/aphia.php?p=taxdetails&id=397131#links
https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?mode=Info&id=425103&lvl=3&lin=f&keep=1&srchmode=1&unlock

Regarding the first example by @dianalg: there is a discussion about a verbatimScientificName term at https://github.com/tdwg/dwc/issues/181. One could use an available field to record the NCBI term if it does not match specifically in WoRMS. I have to review my notes; there have been other discussions on this.

The second point seems similar: we need to identify which fields to use for other ID codes.

claudenozeres commented 3 years ago

Suggestions that were recently made to me:
- If a name is not in WoRMS, one can use the free-text field 'identificationRemarks': https://dwc.tdwg.org/terms/#identification
- If a name is not in WoRMS, one can inform WoRMS and it will be added.
- If it is not a published/recognized name, it can go on the 'Annotated List' of OBIS with an explanation of why it is not available.

This paper was for image-based identifications, but may be of relevance for genetic data, too: https://www.frontiersin.org/articles/10.3389/fmars.2021.620702/full

albenson-usgs commented 3 years ago

> But OBIS will only permit the WoRMS AphiaID--is this correct?

Yes, OBIS only accepts a WoRMS LSID in scientificNameID. GBIF does not have this requirement. You can use any name but it is matched to the GBIF Backbone Taxonomy.

claudenozeres commented 3 years ago

Thanks @albenson-usgs. I am curious about usage, and so am now looking at 2 random marine examples on GBIF: 1) Somniosus microcephalus and 2) Leptasterias polaris. I note that (apart from 1 record with a dyntaxa.se LSID) only the OBIS records have a scientificNameID, and it holds the WoRMS LSID. Most records on GBIF (those not coming from OBIS) do not use scientificNameID, but rather taxonID (which is not an LSID).

Would it be acceptable to fill in taxonID with NCBI taxon code (or BOLD BIN--also used often), in addition to scientificNameID? That way, a record would have the WoRMS taxon name, but also information on the potential genetic identifier (BINs and NCBI IDs do not always match 1:1 with WoRMS LSIDs :)

claudenozeres commented 3 years ago

Note, see my test queries of 2 marine species on GBIF here: https://doi.org/10.15468/dl.ffm32b https://doi.org/10.15468/dl.4p6qph

albenson-usgs commented 3 years ago

> Would it be acceptable to fill in taxonID with NCBI taxon code (or BOLD BIN--also used often), in addition to scientificNameID?

Good question, and actually I would extend it to ask: can we use the NCBI taxon code instead of the WoRMS LSID when there is no match in WoRMS? (I think that's Diana's question.) This would be a question for the OBIS Steering Group. Further, it has recently come to my attention that OBIS may not be using scientificNameID correctly; the term is defined as "An identifier for the nomenclatural (not taxonomic) details of a scientific name." This is only tangentially related to this topic but may factor into how we form our recommendations.

kpitz commented 3 years ago

If WoRMS IDs are preferred over NCBI IDs, it would be very useful to have a look-up table linking these two standards. I worry that searching by a name in isolation, such as a genus, might accidentally return the WoRMS ID of an organism that has the same name but belongs to a totally different lineage from the NCBI sequence you matched. There are so many records that we can't manually check them all, so we rely on more automated searches and tools. If NCBI IDs are acceptable instead of WoRMS IDs, that would be much easier for us to use across our datasets.
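One way to reduce the homonym risk described above in an automated pipeline is to accept a name match only when its higher classification is consistent with the NCBI lineage of the sequence hit. The sketch below is only an illustration: the function `pick_consistent_match`, the record layout, the genus "Examplea", and the IDs 900001 and 900002 are all invented for this example and do not come from WoRMS or NCBI. A real check would also need a mapping between the two sources, since they do not always use the same names for the same higher ranks (NCBI uses Metazoa where WoRMS uses Animalia, for example).

```python
from typing import Any, Iterable, Mapping, Optional, Sequence


def pick_consistent_match(
    ncbi_lineage: Iterable[str],
    candidates: Sequence[Mapping[str, Any]],
) -> Optional[Mapping[str, Any]]:
    """Return the one candidate whose classification shares a name with the
    NCBI lineage; return None when zero or several candidates qualify."""
    lineage = {name.lower() for name in ncbi_lineage}
    consistent = [
        cand for cand in candidates
        if lineage & {name.lower() for name in cand["classification"]}
    ]
    # Refuse to guess: an ambiguous or empty result is left for manual review.
    return consistent[0] if len(consistent) == 1 else None


# Invented data: two homonymous hits for the made-up genus "Examplea".
candidates = [
    {"aphia_id": 900001, "classification": ["Animalia", "Mollusca", "Bivalvia"]},
    {"aphia_id": 900002, "classification": ["Plantae", "Rhodophyta"]},
]
ncbi_lineage = ["Eukaryota", "Metazoa", "Mollusca", "Bivalvia"]

match = pick_consistent_match(ncbi_lineage, candidates)
print(match["aphia_id"] if match else "needs manual review")
```

Returning None for ambiguous cases keeps the automated step conservative, so only the unclear names need a manual check.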

dianalg commented 3 years ago

Yes @claudenozeres, in the past I have used vernacularName in the way you're proposing to use verbatimScientificName. But I don't know that that's really acceptable/best practice...

As @claudenozeres said and @albenson-usgs clarified, my real issue is whether an NCBI name/code could be used if there is no matching name on WoRMS. So, for example, I have the name "phototrophic eukaryote" as the assigned taxon in many rows of an eDNA dataset. This has a matching name and taxonomy ID on NCBI, but is a non-Linnaean term that does not match on WoRMS. But OBIS requires a WoRMS-approved name in the scientificName column.

Right now, my options are 1) work my way up the taxonomic tree until I get a rank for "phototrophic eukaryote" that matches on WoRMS or 2) remove the record before submitting the data to OBIS. Following strategy 1, I'm putting "Biota" in the scientificName column, which from what I can tell is WoRMS's accepted name for anything that's alive. I'm also putting the associated Aphia ID in the taxonID column, and "phototrophic eukaryote" in the vernacularName column.

That said, it seems like this kind of issue will be really common for genetically-derived data. And my workaround (strategy 1 above) does run the risk that @kpitz is describing.
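For reference, strategy 1 above (walking up the lineage until a name matches) can be sketched roughly as follows. This is a minimal illustration, not an OBIS or WoRMS tool: `first_worms_match` is a hypothetical helper, the `lookup` argument stands in for whatever WoRMS name-matching step is actually used, and the rank labels in the example lineage are placeholders. The only real values are the name "Biota" and its Aphia ID 1, taken from this thread.

```python
from typing import Callable, Optional, Sequence, Tuple


def first_worms_match(
    lineage: Sequence[Tuple[str, str]],
    lookup: Callable[[str], Optional[int]],
) -> Optional[Tuple[str, str, int]]:
    """Return (rank, name, aphia_id) for the most specific matching name.

    `lineage` must be ordered from the most specific name to the least
    specific one; `lookup` returns an Aphia ID, or None when there is no match.
    """
    for rank, name in lineage:
        aphia_id = lookup(name)
        if aphia_id is not None:
            return rank, name, aphia_id
    return None


# Stand-in for a real WoRMS lookup: a tiny local table (assumption).
known_names = {"Biota": 1}

# Rank labels here are illustrative placeholders only.
lineage = [("no rank", "phototrophic eukaryote"), ("root", "Biota")]

print(first_worms_match(lineage, known_names.get))
```

Keeping the matched rank in the result makes it easy to report how far up the tree each record had to go, which is one way to judge how much of the original identification was lost.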

claudenozeres commented 3 years ago

Regarding the look-up table mentioned by @kpitz: that would be an important tool and would help with adoption of WoRMS, so I would push for further work between WoRMS and NCBI, because I can see conflicts and a lack of attention there. Going forward, if this problem becomes increasingly common and has no easy or satisfactory fix, an alternative would be not to use OBIS+WoRMS but to publish on GBIF with the taxonomy of choice. I think the first option is valuable if it leads to stronger connections and updates between resources, namely WoRMS, NCBI, and BOLD (links exist but are not very solid at the moment). This is similar to how OBIS improved vastly with names once it adopted WoRMS as its taxonomic backbone, instead of continuing on its own.

claudenozeres commented 3 years ago

I recommend Dhugal Lindsay et al. 2017 for an interesting summary of issues with occurrence datasets and genetically identified taxa. https://www.tandfonline.com/doi/full/10.1080/17451000.2016.1268261. They highlight several issues to be improved for sequence data on biodiversity portals.

pieterprovoost commented 3 years ago

Thanks everyone for your valuable feedback. While we intend to keep WoRMS as our taxonomic backbone (as the NCBI disclaimer states: "the NCBI taxonomy database is not an authoritative source for nomenclature or classification"), I'm going to discuss with WoRMS and the taxonomy task team to see if we can come up with some recommendations for using NCBI and BOLD identifiers and correct use of scientificNameID, taxonID, and taxonConceptID. We should be able to come up with a technical solution to match records to our backbone using alternate identifiers in case a WoRMS LSID is not available.

Note that the WoRMS API has an endpoint to get an Aphia record by NCBI ID, for example: https://www.marinespecies.org/rest/AphiaRecordByExternalID/94237?type=ncbi. I'm not sure how complete this is.
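As a rough illustration of how that endpoint could be used in a script, here is a sketch that builds the request URL shown above and fetches the record with the Python standard library. The helper names are made up for this example, and treating an empty response body as "no linked Aphia record" is an assumption about the service's behaviour, not something confirmed in this thread.

```python
import json
from typing import Any, Optional
from urllib.parse import quote, urlencode
from urllib.request import urlopen

WORMS_REST = "https://www.marinespecies.org/rest"


def aphia_by_external_id_url(external_id: Any, id_type: str = "ncbi") -> str:
    """Build the AphiaRecordByExternalID URL for an external identifier."""
    query = urlencode({"type": id_type})
    return f"{WORMS_REST}/AphiaRecordByExternalID/{quote(str(external_id))}?{query}"


def fetch_aphia_record(ncbi_taxid: Any, timeout: float = 30.0) -> Optional[dict]:
    """Fetch the Aphia record linked to an NCBI taxonomy ID.

    Returns None when the response body is empty (assumed here to mean that
    no link exists). HTTP errors are raised by urlopen and not handled.
    """
    with urlopen(aphia_by_external_id_url(ncbi_taxid), timeout=timeout) as resp:
        body = resp.read()
    return json.loads(body) if body else None


# Building the URL needs no network access.
print(aphia_by_external_id_url(94237))
```

Calling `fetch_aphia_record(94237)` would perform the actual request. Since the completeness of the NCBI links is unknown, a None result should probably be read as "no link recorded" rather than "taxon not in WoRMS".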

Somewhat related: https://github.com/gbif/doc-publishing-dna-derived-data/issues/35

dianalg commented 3 years ago

Thanks, @pieterprovoost, some recommendations around this issue would be great to start. Do you have any sense of when we might expect those?

pieterprovoost commented 3 years ago

After discussing with WoRMS, we propose the following:

So for @dianalg's example I would propose this:

| term | value |
| --- | --- |
| scientificName | Biota |
| scientificNameID | urn:lsid:marinespecies.org:taxname:1 |
| taxonConceptID | NCBI:txid1899546 |
| identificationRemarks | phototrophic eukaryote |

Hopefully this is a workable solution.
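For anyone scripting this, the proposed mapping could be generated along these lines. This is a sketch only: `dwc_taxon_fields` is a hypothetical helper, and the choice to fill identificationRemarks only when the original name differs from the matched name is an assumption made for the example, not part of the proposal.

```python
def dwc_taxon_fields(original_name, ncbi_taxid, matched_name, aphia_id):
    """Build the Darwin Core taxon fields for one record.

    `matched_name` and `aphia_id` are the WoRMS name and Aphia ID the record
    was matched to; `original_name` and `ncbi_taxid` come from the NCBI hit.
    """
    fields = {
        "scientificName": matched_name,
        "scientificNameID": f"urn:lsid:marinespecies.org:taxname:{aphia_id}",
        "taxonConceptID": f"NCBI:txid{ncbi_taxid}",
    }
    # Assumption: keep the original assignment only when it was replaced.
    if original_name != matched_name:
        fields["identificationRemarks"] = original_name
    return fields


record = dwc_taxon_fields(
    original_name="phototrophic eukaryote",
    ncbi_taxid=1899546,
    matched_name="Biota",
    aphia_id=1,
)
for term, value in record.items():
    print(term, value)
```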

Finally, please note that we are not outright rejecting records without WoRMS LSID, but they may get flagged as not being linked to the taxonomic backbone. It would be a shame if people decide not to publish to OBIS at all due to this requirement. Making data findable and accessible should be the priority, even if interoperability is not perfect.

@bart-v @leenvandepitte

albenson-usgs commented 3 years ago

> We believe taxonConceptID is the appropriate field for OTUs, BOLD BINs, NCBI taxonomy identifiers, etc. We are aware that there are recommendations to add OTUs in scientificName, but that is not what the term was intended for (also see here and here).

Is anyone going to bring this to TDWG for discussion with the broader community?

I note that both of the issues Pieter links to are closed and are in GBIF-only discussion areas. I think we would all benefit from wider community input on how to move forward with this.

claudenozeres commented 3 years ago

I agree with @albenson-usgs: we need to inform, alert, and hear from TDWG or the broader community. We raised these matters for OBIS, but they also apply to others (who may not be aware of them). @pieterprovoost's summary with example is very useful.