globalbioticinteractions / nomer

maps identifiers and names to other identifiers and names
GNU General Public License v3.0

accepting suggestions on how to handle "doubtful" name relations #100

Open jhpoelen opened 2 years ago

jhpoelen commented 2 years ago

@seltmann @jhammock @katjaschulz

Some taxonomic resources keep track of doubtful names, in addition to supporting synonyms and accepted name relations. (see #96)

Any suggestions on what language to use to express doubtful names?

KatjaSchulz commented 2 years ago

I have come across several providers using "doubtful" as a value for dwc:taxonomicStatus. I also sometimes see similar values like "ambiguous" or "unchecked". As far as I know, there is no universally accepted controlled vocabulary for dwc:taxonomicStatus, so there isn't a standardized way to express that the taxonomic status of the taxon is unresolved. Therefore, it's up to the data user to decide whether to use these names and the data associated with them. In some cases, the "doubtful" names are a mess and are best ignored entirely. In other cases, these names are still in use and perfectly suitable for data exchange; they just haven't been reviewed to the standard required by the source. In the latter case, you can treat them as accepted names, but you may choose to include the "doubtful" label in the dwc:taxonRemarks field.

jhpoelen commented 2 years ago

@KatjaSchulz thanks for responding.

I'll opt for accepted name until claimed otherwise.

Would you agree that an "unchecked" value in taxon status sounds like a qualifier of some other claim:

e.g., unchecked accepted name

or

unchecked synonym

?

KatjaSchulz commented 2 years ago

I think the three labels generally mean the same thing. A name is not yet considered accepted, because it has not been checked, there are doubts about its status, or its status is known to be ambiguous. Since these names are not mapped to accepted names, you can't treat them as synonyms, so your only option is to toss them or to treat them in the same fashion as accepted names. COL has a taxonomic status value of "provisionally accepted" which captures the spirit of treating a name like an accepted name when you're really not sure that's appropriate.
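
The approach described above can be pictured as a small lookup that folds provider-specific status strings into a few working categories. The following is a minimal Python sketch, not part of Nomer or Darwin Core: the category names and the function `normalize_status` are hypothetical, and the status strings are only the ones mentioned in this thread.

```python
# Hypothetical working categories; not a Darwin Core or Nomer vocabulary.
ACCEPTED = "accepted"
SYNONYM = "synonym"
PROVISIONAL = "provisionally accepted"

# Status strings mentioned in this thread, lower-cased for matching.
STATUS_CATEGORIES = {
    "accepted": ACCEPTED,
    "synonym": SYNONYM,
    "doubtful": PROVISIONAL,
    "ambiguous": PROVISIONAL,
    "unchecked": PROVISIONAL,
}


def normalize_status(raw_status):
    """Return (category, remark) for a dwc:taxonomicStatus value.

    For provisional names the original value is returned as a remark,
    so it can be carried along (for example in dwc:taxonRemarks).
    """
    key = (raw_status or "").strip().lower()
    category = STATUS_CATEGORIES.get(key)
    if category is None:
        # Unknown or empty values are left for the data user to decide on.
        return None, raw_status
    remark = raw_status if category == PROVISIONAL else None
    return category, remark
```

Keeping the original value as a remark means the decision to treat a name like an accepted name stays visible and reversible.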

jtmiller28 commented 1 year ago

Hey @jhpoelen, I'm interested in comparing some of these "doubtful" relations in my dataset. Is there a modification I can make in properties to pull these relations?

jhpoelen commented 1 year ago

@jtmiller28 neat to hear that you are interested in dubious things ; )

I can help make this visible, and I don't think you can do that using property files just yet. I like the idea though!

How do you imagine these doubtful relations to show up in your results? Can you give an example?

jtmiller28 commented 1 year ago

Since I've been doing a very deep dive into taxonomic resolution at different levels, I plan to make a table comparison using my data to show how certain decisions affect overall resolution differences in large-scale occurrence datasets. This will compare processing verbatim data vs provider data, inclusion of fuzzy mapping vs exclusion, and, with this topic, hopefully certainty in designation status.

I'm overall curious as to how these "Doubtful", as well as the "Unchecked", names contribute to my data's name resolution using World Flora Online. It seems to me that quantitative stats might allow us to get a better idea of their overall presence in large datasets and whether their magnitude of effect is considerable for final resolution.

When I was doing some quality control on my resolved names with multiple mappings, I happened to notice that one of the names I was considering remapping would remap to a "Doubtful" name (see Ilex california, http://www.worldfloraonline.org/taxon/wfo-0001265295).

Initially I thought not to include it since it seems uncertain anyway; however, I realized that wouldn't be a uniform methodology (since it's quite probable that other "Doubtful" names are mapped into my dataset). This poses the question of whether I should trust these names enough to incorporate them into the analysis when there is uncertainty surrounding them. WFO hasn't made it clear what makes a plant designation "ambiguous", which maps to the "Doubtful" relation, other than that they derive it from a source (in my example, Tropicos), so unfortunately I can't fully determine what is safe and what is problematic to use here. Therefore, showing how much it could affect my data seems to be the next best option.

echo -e "\tIlex california" | nomer append wfo

returns

Ilex california HAS_ACCEPTED_NAME WFO:0001265295 Ilex california species Angiosperms | Aquifoliales | Aquifoliaceae | Ilex | Ilex california WFO:9949999999 | WFO:9000000023 | WFO:7000000041 | WFO:4000018994 | WFO:0001265295 phylum | order | family | genus | species http://www.worldfloraonline.org/taxon/wfo-0001265295

jtmiller28 commented 1 year ago

Having a chat with Alan Elliott from World Flora Online over email to get some clarifications on "Unchecked" and "Doubtful/Ambiguous". Unchecked indicates the record has not yet gone through scrutiny by taxonomic experts or is experiencing conflicting designations. The term doubtful is exactly the same as their designation of ambiguous. Ambiguous records originate from records that cannot be resolved due to one or more of the following: poor description, no voucher specimen available to confirm, or no illustration/media of voucher specimen. These designation types are nevertheless maintained as names, since it's up to the curator whether to use them for their collection.

For large-scale research questions utilizing a broad range of taxa, this seems to present an issue since it suggests they are less trackable names that are nested in ambiguity. I'll include them in the table, but I am leaning toward excluding doubtful names, in particular when I am unable to provide an expert opinion on the specimen in question. Unchecked names are more of a grey area; I will probably use the innocent-till-proven-guilty strategy for these names.

Looking forward to what comes from Issue #114. For curators in particular who want to understand whether to use a name when it falls into a grey area of unchecked or ambiguous/doubtful, this sounds like a great direction.

jhpoelen commented 1 year ago

@jtmiller28 thanks for sharing your notes! Given what you know now, how would you like to have Nomer behave when reporting on name associations captured in WFO?

jtmiller28 commented 1 year ago

I would probably suggest making it more visible, as possibly a second relation field or taxonRemarks as suggested by KatjaSchulz. It would be helpful for it to be clear what category your names fall into in order to isolate dubious names for your data/collection.

jtmiller28 commented 1 year ago

@jhpoelen Hi Jorrit, I was thinking of this again today after a labmate was doing a manual deep dive into World Flora Online's unchecked names. Could this conceivably be added to Nomer's properties call? It would be very informative for creating datasets with multiple levels of assurance, i.e., one with only expert-reviewed names, one accepting some error from unchecked names, and one looking at how doubtful names affect occurrence data pulls.

The way I see this is instead of the following functionality:

echo -e "\tHyospathe chiriqui" | nomer append wfo

returns:

Hyospathe chiriqui HAS_ACCEPTED_NAME WFO:0001316767 Hyospathe chiriqui Schaedtler species Hyospathe chiriqui WFO:0001316767 species http://www.worldfloraonline.org/taxon/wfo-0001316767

HAS_ACCEPTED_NAME should be swapped with IS_AMBIGUOUS, leaving it up to the user to accept/remove these records based upon their knowledge rather than confirming it to HAS_ACCEPTED_NAME status.

Thanks!

jhpoelen commented 1 year ago

@jtmiller28 thanks for sharing your suggestion to swap HAS_ACCEPTED_NAME with IS_AMBIGUOUS. I was wondering whether HAS_UNCHECKED_NAME or HAS_AMBIGUOUS_NAME would work also. To me IS_AMBIGUOUS sounds like an attribute of a thing, instead of some relation between two things.

Curious to hear your thoughts.

jtmiller28 commented 1 year ago

Hi Jorrit, I would personally suggest HAS_AMBIGUOUS_NAME for this particular scenario, as the name is not actually unchecked. The name is checked and has been defined by WFO to be doubtful due to lacking a type specimen and/or enough description to justify its name. Unchecked names are still present in WFO; however, those are names that haven't received any taxonomic scrutiny as defined by WFO's standards. I currently treat these as innocent till proven guilty (as a large number of names are this way), but that may need to change. Good to see we can pull out the unchecked field; I'll take a look into how many of these are actually present in NA plant data.
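
If the proposed relation labels were available, the "multiple levels of assurance" datasets mentioned earlier could be separated with a few lines of Python. This is only a sketch under assumptions: HAS_ACCEPTED_NAME is taken from the output quoted in this thread, HAS_UNCHECKED_NAME and HAS_AMBIGUOUS_NAME are the proposals discussed here rather than existing Nomer output, the relation is assumed to be the third tab-separated column (as in the examples above), and the tier names and `split_by_assurance` are hypothetical.

```python
from collections import defaultdict

# HAS_ACCEPTED_NAME appears in the output quoted in this thread; the other
# two labels are proposals from this discussion. Tier names are hypothetical.
TIERS = {
    "HAS_ACCEPTED_NAME": "reviewed",
    "HAS_UNCHECKED_NAME": "unchecked",
    "HAS_AMBIGUOUS_NAME": "ambiguous",
}


def split_by_assurance(tsv_text, relation_column=2):
    """Group tab-separated rows by the tier of their relation label.

    Assumes the relation sits in the third column (index 2); rows with
    any other relation, or with too few columns, land in "other".
    """
    tiers = defaultdict(list)
    for line in tsv_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        fields = line.split("\t")
        relation = fields[relation_column] if len(fields) > relation_column else ""
        tiers[TIERS.get(relation, "other")].append(fields)
    return dict(tiers)
```

Each tier can then be written out as its own dataset, which keeps the decision to accept or drop dubious names with the user.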

jhpoelen commented 1 year ago

@jtmiller28 thanks for commenting! WFO has some status "Unchecked" in their data. However, I can't find anything related to ambiguous. Any suggestions on how to distinguish between the two?

Tracking a recent copy of WFO using preston track http://104.198.143.165/files/WFO_Backbone/_WFOCompleteBackbone/WFO_Backbone.zip yielded alias

preston alias
<http://104.198.143.165/files/WFO_Backbone/_WFOCompleteBackbone/WFO_Backbone.zip> <http://purl.org/pav/hasVersion> <hash://sha256/25a35248d3820cdf323331272e07f6fe9b25942fa9e7efe0ee7969c7f7033ada> <urn:uuid:6f1d6de0-b8ab-4472-b2d4-8d357b8d3ba3> .

and using this specific versioned WFO with content id hash://sha256/25a35248d3820cdf323331272e07f6fe9b25942fa9e7efe0ee7969c7f7033ada

helped locate the record associated with the name you referenced earlier:

preston cat 'line:zip:hash://sha256/25a35248d3820cdf323331272e07f6fe9b25942fa9e7efe0ee7969c7f7033ada!/classification.csv!/L1,L1287047'\
 | mlr --itsvlite --oxtab cat
taxonID                  wfo-0001316767
scientificNameID         urn:lsid:ipni.org:names:77241622-1
localID                  471504-wcs
scientificName           "Hyospathe chiriqui"
taxonRank                species
parentNameUsageID        
scientificNameAuthorship Schaedtler
family                   Arecaceae
subfamily                
tribe                    
subtribe                 
genus                    Hyospathe
subgenus                 
specificEpithet          chiriqui
infraspecificEpithet     
verbatimTaxonRank        species
nomenclaturalStatus      
namePublishedIn          "Hamburger Garten- Blumenzeitung 31: 168. 1875 (1875)"
taxonomicStatus          Unchecked
acceptedNameUsageID      
originalNameUsageID      
nameAccordingToID        
taxonRemarks             "Source in seed data: wcvp Updated comments from Null to https://wcsp.science.kew.org/namedetail.do?name_id=471504, information provided by April 15 2021"
created                  2022-04-16
modified                 2022-08-15
references               
source                   "The Arecaceae TEN"
majorGroup               A
tplID                    

jtmiller28 commented 1 year ago

Interesting... that does pose a problem. Curious as to how the field Unchecked is showing up, is that how it appears in the wfo backbone verbatim? I guess my question is whether the preston pull is processed at all, since they have to have some way of knowing it's part of this ambiguous category: WFO's site search lets you select between the 4 options of Accepted Name, Ambiguous, Synonym, or Unchecked. http://www.worldfloraonline.org/search?query=

jhpoelen commented 1 year ago

Curious as to how the field Unchecked is showing up, is that how it appears in the wfo backbone verbatim?

Yes, the record shared is as it appears in the resource provided by WFO. The only processing done on the record is by mlr --itsvlite --oxtab cat which rotates the table from wide to long form for the specific record.

jhpoelen commented 1 year ago

And yes, it does appear that WFO labeled the name Hyospathe chiriqui as ambiguous for their status (http://www.worldfloraonline.org/search?query=Hyospathe+chiriqui).

jhpoelen commented 1 year ago

For some reason, I can only find a single ambiguous name, and it does not appear to be Hyospathe chiriqui.

$ unzip -p Complete_WFO_Backbone.zip classification.txt  | cut -f20 | sort | uniq -c | sort -nr
 672629 Synonym
 384841 Accepted
 257127 Unchecked
   3347 heterotypicSynonym
    539 homotypicSynonym
      1 synonym
      1 ambiguous
      1 

$ unzip -p Complete_WFO_Backbone.zip classification.txt  | grep ambiguous
wfo-4000003338  17370850-1  0   ×Asplenicystopteris genus   wfo-7000000051  P.Fourn.    Aspleniaceae                    Quatre Fl. France 10, nomen (1934).     ambiguous               More details could be found in <a href=http://www.theplantlist.org/1.1/browse/P/Woodsiaceae/Asplenicystopteris/ >The Plant List v.1.1.</a>          http://www.theplantlist.org/1.1/browse/P/Woodsiaceae/Asplenicystopteris/            http://www.theplantlist.org/1.1/browse/P/Woodsiaceae/Asplenicystopteris/

Any chance you can ask around and figure out what is going on? I am not sure how they come up with the "ambiguous" state listed on their website.
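
The tally above depends on the column position (`cut -f20`) and treats "Synonym" and "synonym" as different values. A minimal Python sketch of the same count, looking the column up by header name and folding case, could look like the following. It assumes the archive member is a UTF-8, tab-separated file with a header row containing `taxonomicStatus`; the function name `count_taxonomic_status` is hypothetical.

```python
import csv
import io
import zipfile
from collections import Counter


def count_taxonomic_status(zip_path, member="classification.txt",
                           column="taxonomicStatus"):
    """Count case-folded values of one column in a TSV file inside a zip.

    Assumes a UTF-8, tab-separated file with a header row; undecodable
    bytes are replaced rather than raising an error.
    """
    counts = Counter()
    with zipfile.ZipFile(zip_path) as archive:
        with archive.open(member) as raw:
            text = io.TextIOWrapper(raw, encoding="utf-8",
                                    errors="replace", newline="")
            reader = csv.DictReader(text, delimiter="\t",
                                    quoting=csv.QUOTE_NONE)
            for row in reader:
                status = (row.get(column) or "").strip().lower()
                counts[status or "(empty)"] += 1
    return counts
```

For example, `count_taxonomic_status("Complete_WFO_Backbone.zip").most_common()` would list the statuses from most to least frequent, with the differently-cased spellings of a status merged into one entry.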

jhpoelen commented 1 year ago

According to the species landing page, the taxonomicStatus "ambiguous" originated via

The record derives from wcvp which reports it as a Doubtful name (record 471504-wcs )

http://www.worldfloraonline.org/taxon/wfo-0001316767

jhpoelen commented 1 year ago

Going down the rabbit hole . . . looking for the provenance (or origin) of the wcvp (or, via https://powo.science.kew.org/about-wcvp, World Checklist of Vascular Plants). . .

et voila - taxonomic status . . . not ambiguous, but "unplaced"

preston track http://sftp.kew.org/pub/data-repositories/WCVP/wcvp_dwca.zip\
 | preston dwc-stream\
 | grep Hyospathe\
 | grep chiriqui\
 | jq .

yielded

{
  "http://www.w3.org/ns/prov#wasDerivedFrom": "line:zip:hash://sha256/5e87e89a920bb6e4dff0e9afca18f976d74be945c45fd4e91b3347a396d358be!/wcvp_taxon.csv!/L513391",
  "http://www.w3.org/1999/02/22-rdf-syntax-ns#type": "http://rs.tdwg.org/dwc/terms/Taxon",
  "http://rs.tdwg.org/dwc/text/id": "471504",
  "http://rs.tdwg.org/dwc/terms/namePublishedIn": "Hamburger Garten- Blumenzeitung 31: 168 (1875)",
  "http://rs.tdwg.org/dwc/terms/originalNameUsageID": null,
  "http://rs.tdwg.org/dwc/terms/dynamicProperties": "{\"powoid\":\"77241622-1\",\"lifeform\":\"\",\"climate\":\"\",\"homotypicsynonym\":\"\",\"hybridformula\":\"\",\"reviewed\":\"Y\"}",
  "http://rs.tdwg.org/dwc/terms/genus": "Hyospathe",
  "http://rs.tdwg.org/dwc/terms/parentNameUsageID": "101295",
  "http://rs.tdwg.org/dwc/terms/taxonID": "471504",
  "http://rs.tdwg.org/dwc/terms/scientificNameAuthorship": "Schaedtler",
  "http://rs.tdwg.org/dwc/terms/taxonRank": "Species",
  "http://rs.tdwg.org/dwc/terms/scientificNameID": "ipni:77241622-1",
  "http://rs.tdwg.org/dwc/terms/specificEpithet": "chiriqui",
  "http://purl.org/dc/terms/references": "https://powo.science.kew.org/taxon/urn:lsid:ipni.org:names:77241622-1",
  "http://rs.tdwg.org/dwc/terms/acceptedNameUsageID": null,
  "http://rs.tdwg.org/dwc/terms/scientificName": "Hyospathe chiriqui",
  "http://rs.tdwg.org/dwc/terms/taxonomicStatus": "Unplaced",
  "http://rs.tdwg.org/dwc/terms/nomenclaturalStatus": null,
  "http://rs.tdwg.org/dwc/terms/infraspecificEpithet": null,
  "http://rs.tdwg.org/dwc/terms/family": "Arecaceae",
  "http://rs.tdwg.org/dwc/terms/taxonRemarks": "Costa Rica"
}
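
The `grep | jq` step above can also be sketched in Python when only a few fields are of interest. This assumes one JSON object per input line, as the grep pipeline above implies, with Darwin Core terms as full IRIs (as in the record shown); the helper name `summarize_record` is hypothetical.

```python
import json

DWC = "http://rs.tdwg.org/dwc/terms/"
PROV_DERIVED_FROM = "http://www.w3.org/ns/prov#wasDerivedFrom"


def summarize_record(json_line):
    """Pick name, status, id and provenance out of one JSON record line.

    Missing keys yield None rather than raising an error.
    """
    record = json.loads(json_line)
    return {
        "scientificName": record.get(DWC + "scientificName"),
        "taxonomicStatus": record.get(DWC + "taxonomicStatus"),
        "taxonID": record.get(DWC + "taxonID"),
        "source": record.get(PROV_DERIVED_FROM),
    }
```

Keeping the `wasDerivedFrom` value next to the status makes it possible to trace a label such as "Unplaced" back to the exact line of the versioned source it came from.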

jhpoelen commented 1 year ago

see also, https://powo.science.kew.org/taxon/urn:lsid:ipni.org:names:77241622-1

jtmiller28 commented 1 year ago

Yes, I can ask around. I've emailed with them before and they're responsive, so hopefully we can figure this out.

jhpoelen commented 1 year ago

@jtmiller28 excellent! Please do note the comments I made just now.

jhpoelen commented 1 year ago

Also, note that https://powo.science.kew.org/about-wcvp accessed on 2023-05-04 states:

[...] More recently it has become the default taxonomic backbone of World Flora Online (WFO) and we have been working closely with some of the Taxonomic Expert Networks (TEN) to use the WCVP data and platform as the basis for managing the taxonomic data of their TEN. [...]

jhpoelen commented 1 year ago

For what it is worth, it appears that WFO keeps some records of source data (see attached screenshot). However, I am not clear how they keep track of specific versions of resources as they contributed to specific taxonomic records in WFO.

image

rogerhyam commented 1 year ago

Alan Elliott pointed this discussion out to me as I'm working on the WFO Plant List technical side. We thought it would be useful to share our workings.

Some background for the uninitiated.

We (RBG Edinburgh) took over responsibility for the Plant List (taxonomic backbone) for the WFO from last year to free up the guys at Missouri to work on content for the portal. We built a new system to handle the list and in doing that we introduced a somewhat more rigorous data model. At the moment what you see through the portal will be somewhat confusing because it continues to use some of the old terminology.

Data on the plant list is exposed via a web service (here https://list.worldfloraonline.org/). It is also published to Zenodo and GBIF ChecklistBank on a six monthly cycle. The web service is used to power the plant list section of the WFO portal here (https://wfoplantlist.org/plant-list/).

We have a taxonomic editor we have built called Rhakhis that the Taxonomic Expert Networks can use to edit data directly.

Taxonomic and nomenclatural status

We are working on some overarching documentation (work in progress!) but it is all in github and you can see it being created here.

https://plant-list-docs.rbge.info/concepts.html

We have a strong separation of nomenclature vs taxonomy and so separation of nomenclatural status and taxonomic status.

https://plant-list-docs.rbge.info/concepts.html#taxonomic-status--role

I was very keen to eliminate "doubtful" and "unchecked" for which I couldn't extract good definitions. Doubtful about what exactly? How doubtful? Who doubts? Unchecked is usually an editorial status not a decision. i.e. a property of the system not the name. We use "unknown" because we may have "checked" but failed to resolve it. We all suffer from a reluctance to admit ignorance :)

We introduced the notion of "deprecated" names.

https://plant-list-docs.rbge.info/concepts.html#more-on-deprecation

This is all just what we are doing to build our list of all plants. Other people may have other requirements but obviously as we are not using "doubtful" we don't think it is a good idea to use it.

We have made the progress we have made mainly because we have ignored animals! This is very much focussed on the botanical code of nomenclature.

Anyhow I hope this is useful.

jtmiller28 commented 1 year ago

Thanks @rogerhyam! I'm glad the ambiguity surrounding unchecked and doubtful designations is being disentangled into more informative categories. Sounds like in the future we'll be able to see these designations, which will be really helpful for research applications.

@jhpoelen Would it be interesting/worthwhile to capture that snapshot (https://wfoplantlist.org/plant-list/) as a catalog for Nomer as it's a stable version of checked names? We can still use the current catalog portal, but it contains ambiguity and requires filtering past the level that most users would know (i.e. properties and removal of Unchecked names).

jhpoelen commented 1 year ago

@rogerhyam @jtmiller28 thanks for sharing your useful insights!

Would it be interesting/worthwhile to capture that snapshot (https://wfoplantlist.org/plant-list/) as a catalog for Nomer as it's a stable version of checked names?

That depends on folks like you, who'd actually use the resource for fast and versioned name alignment. Do you imagine this in place of the existing wfo? Is there a way to check whether your previous issues would be resolved by adopting the WFO plant list?

WUlate commented 1 year ago

Please use https://files.worldfloraonline.org/Files/WFO_Backbone/_WFOCompleteBackbone/WFO_Backbone.zip instead of the previous http://104.198.143.165/files/WFO_Backbone/_WFOCompleteBackbone/WFO_Backbone.zip

jhpoelen commented 1 year ago

@WUlate thanks for the suggestion to update the WFO backbone endpoint for Nomer. Please see https://github.com/globalbioticinteractions/nomer/commit/a8000e9033aa249c8e006bf90f47bfe737fdf1d3 for updated configuration. Changes should be included in the next Nomer corpus/tool versions.

jhpoelen commented 1 year ago

Still curious about what @jtmiller28 and @rogerhyam have to say about:

Would it be interesting/worthwhile to capture that snapshot (https://wfoplantlist.org/plant-list/) as a catalog for Nomer as it's a stable version of checked names?

That depends on folks like you, who'd actually use the resource for fast and versioned name alignment. Do you imagine this in place of the existing wfo? Is there a way to check whether your previous issues would be resolved by adopting the WFO plant list?

jtmiller28 commented 1 year ago

Looking into it currently to show the difference. I'll restructure my pull to include this status information and write an SQL script to count how many unique records in North America are 'Unchecked' names. My suspicion is that the number will be high enough to warrant concern.

jhpoelen commented 1 year ago

@jtmiller28 thanks for responding. I'd be happy to add support, especially when the taxonomic resource would be used: with use, we can tease out bugs and mature the feature. Curious to hear about your findings.

WUlate commented 1 year ago

In case you're interested, we keep the snapshots of the WFO Taxonomic Backbone at:

https://files.worldfloraonline.org/Files/WFO_Backbone/_WFOCompleteBackbone/archive

And we're uploading the WFO Taxonomic Backbone every 6 months to Zenodo DOI 10.5281/zenodo.7460141 where you can also find the versioned TB:

https://zenodo.org/search?q=conceptrecid:"7460141"&sort=-version&all_versions=True