Open SSuominen1 opened 3 years ago
To expand on this: To the best of my knowledge, right now, the scientificName field can only contain a Linnaean scientific name that matches on WoRMS, or an OTU identifier from BOLD or UNITE. However, many people working with DNA-derived occurrences obtain scientific names from NCBI taxonomy. Can we discuss why these names (or their associated taxonomy IDs) are not acceptable as scientificName values? Since the NCBI taxonomy seems to be a standard in this field of study, wouldn't we want to accommodate that? Right now, I'm handling this by going to progressively higher taxonomic ranks (genus --> family --> order, etc.) until I find a term that matches on WoRMS. But I'm not sure to what degree this maintains the integrity of the original data.
Secondly, records may have associated NCBI taxonomy IDs and/or GenBank IDs. If these are not acceptable in the scientificName and associated columns, where could they be included?
I am interested in hearing more about this. The NCBI Taxonomy Browser shows Linnaean scientific names, so these should be available for use. Then again, these names should also match in WoRMS, so they will be available even if the 'source' is not NCBI? Is the issue then that, under scientificNameID, one would like to use the NCBI Taxonomy ID instead of the AphiaID? Does GBIF allow other sources for scientificNameID, and only OBIS requires the AphiaID? Example (hyperlinks on OBIS site): Macoma calcarea on OBIS: https://obis.org/taxon/141580 Aphia ID: urn:lsid:marinespecies.org:taxname:141580; BOLD ID: 70992; NCBI ID: 1421134
For scientificNameID then, one would use urn:lsid:marinespecies.org:taxname:141580, or it could be NCBI:txid1421134. But OBIS will only permit the WoRMS AphiaID--is this correct?
I imagine the challenge is if NCBI has names that are NOT available (or correct?) on WoRMS. In that case, it is a matter of updates between the two? Example with the related species, Limecola petalum. https://obis.org/taxon/880026 Not linked to NCBI because WoRMS does not show NCBI for this name, but for older name Macoma petalum: http://www.marinespecies.org/aphia.php?p=taxdetails&id=397131#links https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?mode=Info&id=425103&lvl=3&lin=f&keep=1&srchmode=1&unlock
Regarding the first example by @dianalg - there is discussion here on a verbatimScientificName https://github.com/tdwg/dwc/issues/181 Could use an available field to record NCBI term if not matching specifically in WoRMS. Have to review notes--there have been other discussions on this.
Secondly, seems similar--need to identify what are the fields to use for other ID codes.
Suggestions that were recently made to me: If a name is not in WoRMS, you can use the field 'identificationRemarks' (free-text) https://dwc.tdwg.org/terms/#identification If it is not in WoRMS, you can inform WoRMS and it will be added. If it is not a published/recognized name, it can go on the 'Annotated List' of OBIS with an explanation of why it is not available. This paper was for image-based identifications, but may be of relevance for genetic data, too: https://www.frontiersin.org/articles/10.3389/fmars.2021.620702/full
> But OBIS will only permit the WoRMS AphiaID--is this correct?
Yes, OBIS only accepts a WoRMS LSID in scientificNameID. GBIF does not have this requirement. You can use any name but it is matched to the GBIF Backbone Taxonomy.
Thanks @albenson-usgs. I am curious about usage, so I looked at 2 random marine examples on GBIF: 1) Somniosus microcephalus and 2) Leptasterias polaris. I note that (apart from 1 record with a dynatax.se LSID) only OBIS records have a scientificNameID, with a WoRMS LSID. Most records on GBIF that do not come from OBIS use taxonID (which is not an LSID) rather than scientificNameID.
Would it be acceptable to fill in taxonID with NCBI taxon code (or BOLD BIN--also used often), in addition to scientificNameID? Thus, have the WoRMS taxon name, but also information on the potential genetic identifier (BINs and NCBI do not always match 1:1 with WoRMS LSID :)
Note, see my test queries of 2 marine species on GBIF here: https://doi.org/10.15468/dl.ffm32b https://doi.org/10.15468/dl.4p6qph
> Would it be acceptable to fill in taxonID with NCBI taxon code (or BOLD BIN--also used often), in addition to scientificNameID?
Good question, and actually I would extend it to ask: can we use the NCBI taxon code instead of the WoRMS LSID when there is no match in WoRMS? (I think that's Diana's question.) This would be a question for the OBIS Steering Group. Further, it's recently come to my attention that OBIS may not be using scientificNameID correctly; the term is defined as "An identifier for the nomenclatural (not taxonomic) details of a scientific name." This is only tangentially related to this topic but may factor into how we form our recommendations.
If WoRMS IDs are preferred over NCBI IDs, it would be very useful to have a look-up table linking these two standards. I worry that searching by a name, like a genus, in isolation might accidentally give you the WoRMS ID of an organism with the same name but from a totally different lineage than the NCBI sequence you matched. There are so many records that we can't manually check them all, so we rely on more automated searches and tools. If NCBI IDs are acceptable instead of WoRMS IDs, that would be much easier for us to use across our datasets.
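The homonym risk described here could be screened automatically by comparing the higher ranks of the NCBI lineage with those of the WoRMS candidate before accepting a name match. The following is only a minimal sketch: the function name, the choice of ranks, and the taxon names in the example are all hypothetical placeholders, and since the two classifications can legitimately disagree, a conflict should be read as a flag for manual review rather than proof of a wrong match.

```python
def conflicting_ranks(ncbi_lineage, worms_lineage,
                      ranks=("phylum", "class", "order", "family")):
    """Return (rank, ncbi_name, worms_name) for each shared rank that disagrees.

    Both arguments are plain dicts mapping a rank to a name. How they are
    obtained (NCBI dump, WoRMS download, API calls) is left to the caller.
    """
    conflicts = []
    for rank in ranks:
        ncbi_name = ncbi_lineage.get(rank)
        worms_name = worms_lineage.get(rank)
        # Only compare ranks that are present in both classifications.
        if ncbi_name and worms_name and ncbi_name.casefold() != worms_name.casefold():
            conflicts.append((rank, ncbi_name, worms_name))
    return conflicts


# Placeholder names, not real taxa: the same genus name found under two phyla.
ncbi = {"phylum": "Phylum-A", "class": "Class-A", "genus": "Examplea"}
worms = {"phylum": "Phylum-B", "genus": "Examplea"}
print(conflicting_ranks(ncbi, worms))
```

An empty result means no shared rank disagreed; it does not confirm that the two records refer to the same taxon.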
Yes @claudenozeres, I have used vernacularName in the way you're proposing using verbatimScientificName in the past. But I don't know that that's really acceptable/best practice...
As @claudenozeres said and @albenson-usgs clarified, my real issue is whether an NCBI name/code could be used if there is no matching name on WoRMS. So, for example, I have the name "phototrophic eukaryote" as the assigned taxon in many rows of an eDNA dataset. This has a matching name and taxonomy ID on NCBI, but is a non-Linnaean term that does not match on WoRMS. But OBIS requires a WoRMS-approved name in the scientificName column.
Right now, my options are 1) work my way up the taxonomic tree until I get a rank for "phototrophic eukaryote" that matches on WoRMS or 2) remove the record before submitting the data to OBIS. Following strategy 1, I'm putting "Biota" in the scientificName column, which from what I can tell is WoRMS's accepted name for anything that's alive. I'm also putting the associated Aphia ID in the taxonID column, and "phototrophic eukaryote" in the vernacularName column.
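Strategy 1 can be sketched as a loop over the lineage, from the assigned name upward, that stops at the first name WoRMS knows. This is an illustration only: `first_worms_match` and `worms_lookup` are hypothetical names, the lookup is a local stand-in for a real WoRMS name match, and the example lineage was not retrieved from NCBI. Recording how many steps were climbed is one way to keep track of how much resolution was lost.

```python
def first_worms_match(names, worms_lookup):
    """Walk `names` (most specific first) and return the first WoRMS match.

    `worms_lookup` maps a name to its AphiaID; here it stands in for a
    real WoRMS name-matching step (an assumption of this sketch).
    """
    for steps_up, name in enumerate(names):
        if name in worms_lookup:
            return {
                "scientificName": name,
                "aphiaID": worms_lookup[name],
                "ranksClimbed": steps_up,  # 0 means the assigned name matched
            }
    return None


# Toy lookup holding only Biota (AphiaID 1); a real lookup would be far larger.
worms_lookup = {"Biota": 1}
# Illustrative lineage, ending in Biota as the catch-all fallback.
names = ["phototrophic eukaryote", "Eukaryota", "Biota"]
match = first_worms_match(names, worms_lookup)
print(match)
```

A large `ranksClimbed` value could be used to single out records where the submitted name says much less than the original assignment.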
That said, it seems like this kind of issue will be really common for genetically-derived data. And my work around (strategy 1 above) does run the risk that @kpitz is describing.
Regarding the look-up table mentioned by @kpitz, that would be an important tool and would help with adoption of WoRMS, so I would push for further work between the two, because I can see conflicts and a lack of attention. Going forward, if this becomes increasingly common and is not easy or satisfactory, an alternative would be not to use OBIS+WoRMS, but to publish on GBIF with the taxonomy of choice. I think the first option is valuable if it leads to stronger connections and updates between resources, namely WoRMS, NCBI, and BOLD (links exist but are not very solid at the moment). This is similar to how OBIS's names vastly improved once it adopted WoRMS as its taxonomic backbone instead of continuing on its own.
I recommend Dhugal Lindsay et al. 2017 for an interesting summary of issues with occurrence datasets and genetically identified taxa. https://www.tandfonline.com/doi/full/10.1080/17451000.2016.1268261. They highlight several issues to be improved for sequence data on biodiversity portals.
Thanks everyone for your valuable feedback. While we intend to keep WoRMS as our taxonomic backbone (as the NCBI disclaimer states: "the NCBI taxonomy database is not an authoritative source for nomenclature or classification"), I'm going to discuss with WoRMS and the taxonomy task team to see if we can come up with some recommendations for using NCBI and BOLD identifiers and correct use of scientificNameID, taxonID, and taxonConceptID. We should be able to come up with a technical solution to match records to our backbone using alternate identifiers in case a WoRMS LSID is not available.
Note that the WoRMS API has an endpoint to get an Aphia record by NCBI ID, for example: https://www.marinespecies.org/rest/AphiaRecordByExternalID/94237?type=ncbi. I'm not sure how complete this is.
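For anyone wanting to try this endpoint in bulk, here is a minimal Python sketch using only the standard library. The URL pattern is taken from the example above; the helper names are hypothetical, and treating an empty response body as "no match" is an assumption about how the service responds, not documented behaviour confirmed in this thread.

```python
import json
from urllib.parse import quote, urlencode
from urllib.request import urlopen

WORMS_REST = "https://www.marinespecies.org/rest"


def aphia_by_external_id_url(external_id, id_type="ncbi"):
    """Build the AphiaRecordByExternalID URL for an external identifier."""
    path_id = quote(str(external_id), safe="")
    return f"{WORMS_REST}/AphiaRecordByExternalID/{path_id}?{urlencode({'type': id_type})}"


def fetch_aphia_record(external_id, id_type="ncbi", timeout=30):
    """Return the parsed JSON record, or None if the body is empty.

    Assumption: an unmatched identifier yields an empty body. HTTP errors
    raised by urlopen are left to propagate to the caller.
    """
    with urlopen(aphia_by_external_id_url(external_id, id_type), timeout=timeout) as resp:
        body = resp.read()
    return json.loads(body) if body else None


print(aphia_by_external_id_url(94237))
```

Calling `fetch_aphia_record(94237)` performs the network request; a batch job would probably want caching and rate limiting around it.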
Somewhat related: https://github.com/gbif/doc-publishing-dna-derived-data/issues/35
Thanks, @pieterprovoost, some recommendations around this issue would be great to start. Do you have any sense of when we might expect those?
After discussing with WoRMS, we propose the following:
- [...] scientificNameID. There's certainly value in providing an NCBI taxonomy identifier, but NCBI explicitly states that they are not a taxonomic authority. There's also something to be said about the quality of identifications in/based on NCBI (see Lindsay et al. as mentioned by @claudenozeres) but that's probably another discussion.
- scientificNameID is the appropriate field for WoRMS LSIDs as these refer to names, not taxon concepts.
- We believe taxonConceptID is the appropriate field for OTUs, BOLD BINs, NCBI taxonomy identifiers, etc. We are aware that there are recommendations to add OTUs in scientificName, but that is not what the term was intended for (also see here and here).
- [...] taxonConceptID.
So for @dianalg's example I would propose this:
term | value |
---|---|
scientificName | Biota |
scientificNameID | urn:lsid:marinespecies.org:taxname:1 |
taxonConceptID | NCBI:txid1899546 |
identificationRemarks | phototrophic eukaryote |
Hopefully this is a workable solution.
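As a rough illustration of the mapping in the table, the sketch below assembles the four fields for one record. The function name and its parameters are hypothetical; the LSID and `NCBI:txid` formats are copied from the examples in this thread, and the rule of filling identificationRemarks only when the original name differs from the WoRMS name is an assumption of this sketch.

```python
def build_taxon_fields(original_name, worms_name, aphia_id, ncbi_taxid=None):
    """Assemble Darwin Core taxon fields following the proposal in the table."""
    record = {
        "scientificName": worms_name,
        "scientificNameID": f"urn:lsid:marinespecies.org:taxname:{aphia_id}",
    }
    if ncbi_taxid is not None:
        # The NCBI identifier goes in taxonConceptID, not scientificNameID.
        record["taxonConceptID"] = f"NCBI:txid{ncbi_taxid}"
    if original_name != worms_name:
        # Keep the originally assigned name so it is not lost.
        record["identificationRemarks"] = original_name
    return record


row = build_taxon_fields("phototrophic eukaryote", "Biota", 1, ncbi_taxid=1899546)
for term, value in row.items():
    print(f"{term} | {value}")
```

With the values from the table, this reproduces the four rows shown above.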
Finally, please note that we are not outright rejecting records without WoRMS LSID, but they may get flagged as not being linked to the taxonomic backbone. It would be a shame if people decide not to publish to OBIS at all due to this requirement. Making data findable and accessible should be the priority, even if interoperability is not perfect.
@bart-v @leenvandepitte
> We believe taxonConceptID is the appropriate field for OTUs, BOLD BINs, NCBI taxonomy identifiers, etc. We are aware that there are recommendations to add OTUs in scientificName, but that is not what the term was intended for (also see here and here).
Is anyone going to bring this to TDWG for discussion with the broader community?
I note that both of the issues Pieter links to are closed and are in GBIF only discussion areas. I think we would all benefit from wider community input on how to move forward with this.
I agree with @albenson-usgs --need to inform/alert/hear from TDWG or broader community. We raised these matters for OBIS, but are applicable to others (and may not be aware). @pieterprovoost's summary with example is very useful.
Could NCBI taxonomy IDs be useful as taxonomic links? Where will these be added?