globalbioticinteractions / nomer

maps identifiers and names to other identifiers and names
GNU General Public License v3.0

gn-parse makes designations into hybrids, tripping nomer to fail in resolution #107

Open jtmiller28 opened 2 years ago

jtmiller28 commented 2 years ago

It appears that gn-parse, when used within nomer, treats specificEpithets that begin with '×' as hybrids and reformats them, which results in a failure to match.

Ex. echo -e "\tCylindropuntia ×fosbergii\t(C.B.Wolf) Rebman, M.A.Baker & Pinkava" | nomer replace gn-parse | nomer append wfo

Outputs: Cylindropuntia × fosbergii (C.B.Wolf) Rebman, M.A.Baker & Pinkava NONE Cylindropuntia × fosbergii

If done manually, without using gn-parse, the mapping is successful.

Ex. echo -e "\tCylindropuntia ×fosbergii" | nomer append wfo

Outputs: Cylindropuntia ×fosbergii HAS_ACCEPTED_NAME WFO:0000632373 Cylindropuntia ×fosbergii species Cylindropuntia ×fosbergii WFO:0000632373 species http://www.worldfloraonline.org/taxon/wfo-0000632373

jhpoelen commented 2 years ago

@jtmiller28 thanks for sharing these details.

Using your example, I found that gn-parse appears to insert a whitespace after the `×` character.

$ echo -e "\tCylindropuntia ×fosbergii\t(C.B.Wolf) Rebman, M.A.Baker & Pinkava" | nomer append gn-parse
[main] INFO org.globalbioticinteractions.nomer.match.TermMatcherRegistry - using matcher [gn-parse]
    Cylindropuntia ×fosbergii   (C.B.Wolf) Rebman, M.A.Baker & Pinkava  SAME_AS     Cylindropuntia × fosbergii                  

before/after:

Cylindropuntia ×fosbergii (before gn-parse)
Cylindropuntia × fosbergii (after gn-parse)
jhpoelen commented 2 years ago

And, it appears that WFO reports the name as Cylindropuntia ×fosbergii. Currently, Nomer does exact matches on nomer append wfo in an effort to separate pre-processing (e.g., name normalization/parsing) from matching. Perhaps this example shows that the formatting of taxonomic names for hybrids differs between the "gn-parse"-universe and the WFO-universe.

@dimus @qgroom I was wondering whether you have any insights into the rules for writing hybrids in botanical nomenclature and how they are applied.

jhpoelen commented 2 years ago

@jtmiller28 note that it appears that the gbif parser also adds the whitespace -

$ echo -e "\tCylindropuntia ×fosbergii" | nomer append gbif-parse
[main] INFO org.globalbioticinteractions.nomer.match.TermMatcherRegistry - using matcher [gbif-parse]
    Cylindropuntia ×fosbergii   SAME_AS     Cylindropuntia × fosbergii              
jhpoelen commented 2 years ago

@jtmiller28 My main question is: why does World Flora Online (WFO) use Cylindropuntia ×fosbergii, whereas two pretty well-known name parsers (globalnames, gbif parser) inject a whitespace between the × and the name fosbergii?

A possible (silly) workaround to facilitate matching against WFO, using sed and without adding built-in name interpretation to Nomer's WFO matcher, can be:

$ echo -e "\tCylindropuntia × fosbergii" | sed 's/× /×/g' | nomer append wfo
    Cylindropuntia ×fosbergii   HAS_ACCEPTED_NAME   WFO:0000632373  Cylindropuntia ×fosbergii   species     Angiosperms | Caryophyllales | Cactaceae | Cylindropuntia | Cylindropuntia ×fosbergii   WFO:9949999999 | WFO:9000000088 | WFO:7000000098 | WFO:4000010332 | WFO:0000632373  phylum | order | family | genus | species   http://www.worldfloraonline.org/taxon/wfo-0000632373    
jtmiller28 commented 2 years ago

@jhpoelen Yes, I was talking to some plant systematists about this the other day, and it does seem to be a proper and commonly used way to name hybrids: genus × specificEpithet. Your insight that WFO has its own way of attaching the '×' to the specificEpithet is definitely what causes the issue.

I do see that workaround being effective. Since this is more of a catalogue peculiarity, it should probably be brought to their attention for a more complete mapping list. I'll keep that in mind when working with the data.

jhpoelen commented 2 years ago

@jtmiller28 thanks for sharing the outcomes of your discussions with experts on this. Do you think it would make WFO more usable if we "fix" the funny WFO-provided names by inserting the whitespace on indexing / parsing?

jhpoelen commented 2 years ago

Also, there surely must be a reason why WFO chose to adopt this naming convention. Do you have any idea why?

jtmiller28 commented 2 years ago

Honestly not too sure! My fear is that applying a large-scale fix with regular expressions might just trip other unforeseen issues. I'm not an expert in sed yet, so I wouldn't know whether it might trip in other instances where a genus is prefixed by ×, for example: ×Aegilotriticum requienii.

As for why they do it this way, I'm still unsure, since plants have many more unique cases in hybrids (the biological species concept is difficult to justify in many cases). But my lab's PIs are actually meeting with some WFO representatives this Monday over in the EU, so I'm building a list of questions about the nature of their catalogue to ask and hopefully gain some insight on. This will definitely be part of it.
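
On the genus-level concern above, here is a minimal sketch of a narrower variant of the sed workaround. It only collapses the space when the × is followed by a space and a lowercase letter (i.e. an epithet), so capitalized genus-level hybrids are left untouched. The function name collapse_hybrid_space is hypothetical, and the rule assumes epithets always start with a lowercase ASCII letter.

```shell
# Hypothetical helper (not part of nomer): collapse "× " into "×"
# only when a lowercase epithet follows the hybrid sign.
collapse_hybrid_space() {
  sed -E 's/× ([a-z])/×\1/g'
}

# Species-level hybrid: the space is removed.
printf '%s\n' 'Cylindropuntia × fosbergii' | collapse_hybrid_space
# → Cylindropuntia ×fosbergii

# Genus-level hybrid: unchanged, with or without a space after ×.
printf '%s\n' '×Aegilotriticum requienii' | collapse_hybrid_space
printf '%s\n' '× Aegilotriticum requienii' | collapse_hybrid_space
```

The output of collapse_hybrid_space could then be piped into nomer append wfo, as in the workaround earlier in this thread.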

jhpoelen commented 2 years ago

@jtmiller28 I would very much like to hear the backstory on the WFO hybrid naming conventions. Thanks for being thorough on this. Name formatting is pretty important, I think, especially for integrating lots of datasets.

dimus commented 2 years ago

Whitespace: "In the wild" names exist with and without a space between the hybrid sign and the specific epithet. I think both gnparser and the gbif parser decided on having a space for the same reason: it helps to prevent a human or OCR from mistakenly converting the hybrid sign into a letter 'x', creating a misspelling that cannot be fixed easily. The situation also gets more complicated when the hybrid sign is completely omitted, or replaced with 'X' or 'x' characters.

There are 3 "canonical forms" in gnparser.

"Stemmed" -- it removes all hybrid signs and suffixes from specific epithets and it should be used for lexical reconciliation.

"Simple" -- it preserves suffixes, but removes hybrid signs. It can be used for calculating editing distances and for less accurate lexical reconciliation.

"Full" -- with hybrid signs and other details, it is usually used for displaying a name without authorship.

I do lexical reconciliation by comparing stemmed canonical form of input with stemmed canonical forms of aggregated names. This way hybrid signs do not prevent matching. Sadly, I do think that preprocessing has to be included, or some matches will not happen.
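
As a rough illustration of why comparing forms without hybrid signs helps, here is a minimal shell sketch. It is not gnparser's implementation and does no stemming of suffixes; it only removes the × sign (and one following space, if present) before comparing two names. The function names strip_hybrid_sign and same_without_hybrid_sign are hypothetical.

```shell
# Hypothetical helpers (not gnparser): drop the hybrid sign and an
# optional following space, then compare the remaining strings.
strip_hybrid_sign() {
  sed -E 's/× ?//g'
}

same_without_hybrid_sign() {
  [ "$(printf '%s\n' "$1" | strip_hybrid_sign)" = "$(printf '%s\n' "$2" | strip_hybrid_sign)" ]
}

# The WFO-style and parser-style spellings compare equal once the sign is gone.
if same_without_hybrid_sign 'Cylindropuntia ×fosbergii' 'Cylindropuntia × fosbergii'; then
  echo 'match'
fi
```

With the sign removed on both sides, the spacing difference between WFO and the parsers no longer prevents a match.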

Things below can be scratched :) Hm, I imagine the problem is with old Scala gnparser @jhpoelen:

http://parser.globalnames.org/?format=html&names=Cylindropuntia+%C3%97fosbergii&with_details=on

The current parser, written in Go, does parse hybrids correctly. There have been a lot of improvements in parsing quality since the Scala version.

Names with hybrid signs have a lot of variations. Quite often people use 'x' and 'X' characters instead of the UTF-8 multiplication character; sometimes they add spaces, sometimes not. Also there are "notho" names, where hybrids are implied.
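
To make a couple of these variations concrete, here is a hedged sketch that rewrites a standalone 'x', 'X' or '×' token between a genus and a lowercase epithet into the unspaced ×epithet form that WFO uses in this thread. It is an assumption-laden illustration, not a rule taken from nomer, gnparser or WFO: the function name normalize_hybrid_marker is hypothetical, it expects the genus at the start of the line, and it does not handle genus-level hybrids, omitted hybrid signs or "notho" names.

```shell
# Hypothetical helper: normalize "Genus x epithet", "Genus X epithet" and
# "Genus × epithet" to "Genus ×epithet". A space is required after x/X so
# that epithets that merely begin with the letter x are left alone.
normalize_hybrid_marker() {
  sed -E \
    -e 's/^([A-Z][a-z]+) [xX] ([a-z])/\1 ×\2/' \
    -e 's/^([A-Z][a-z]+) × ([a-z])/\1 ×\2/'
}

printf '%s\n' 'Cylindropuntia x fosbergii' | normalize_hybrid_marker
# → Cylindropuntia ×fosbergii

# Hypothetical epithet starting with the letter x: not treated as a hybrid.
printf '%s\n' 'Cylindropuntia xanthocarpa' | normalize_hybrid_marker
# → Cylindropuntia xanthocarpa
```

Because a bare 'x' between two words is ambiguous in free text, a rewrite like this is probably best applied only to fields already known to hold scientific names.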