Open jtmiller28 opened 2 years ago
@jtmiller28 thanks for sharing these details.
Using your example, I found that gn-parse appears to insert a whitespace after the `×` character.
$ echo -e "\tCylindropuntia ×fosbergii\t(C.B.Wolf) Rebman, M.A.Baker & Pinkava" | nomer append gn-parse
[main] INFO org.globalbioticinteractions.nomer.match.TermMatcherRegistry - using matcher [gn-parse]
Cylindropuntia ×fosbergii (C.B.Wolf) Rebman, M.A.Baker & Pinkava SAME_AS Cylindropuntia × fosbergii
before/after:
Cylindropuntia ×fosbergii (before gn-parse)
Cylindropuntia × fosbergii (after gn-parse)
It also appears that WFO reports the name as Cylindropuntia ×fosbergii. Currently, Nomer does exact matching in nomer append wfo, in an effort to separate pre-processing (e.g., name normalization/parsing) from matching. Perhaps this example shows that hybrid names are formatted differently in the "gn-parse" universe and the WFO universe.
@dimus @qgroom I was wondering whether you have any insights into the rules for formatting hybrid names in botanical nomenclature.
@jtmiller28 note that the gbif parser appears to add the whitespace as well:
$ echo -e "\tCylindropuntia ×fosbergii" | nomer append gbif-parse
[main] INFO org.globalbioticinteractions.nomer.match.TermMatcherRegistry - using matcher [gbif-parse]
Cylindropuntia ×fosbergii SAME_AS Cylindropuntia × fosbergii
@jtmiller28 My main question is: why does World Flora Online use Cylindropuntia ×fosbergii, whereas two pretty well-known name parsers (globalnames, gbif parser) inject a whitespace between the × and the name fosbergii?
A possible (silly) workaround to facilitate matching against WFO, using sed and without adding built-in name interpretation to Nomer's WFO matcher, could be:
$ echo -e "\tCylindropuntia × fosbergii" | sed 's/× /×/g' | nomer append wfo
Cylindropuntia ×fosbergii HAS_ACCEPTED_NAME WFO:0000632373 Cylindropuntia ×fosbergii species Angiosperms | Caryophyllales | Cactaceae | Cylindropuntia | Cylindropuntia ×fosbergii WFO:9949999999 | WFO:9000000088 | WFO:7000000098 | WFO:4000010332 | WFO:0000632373 phylum | order | family | genus | species http://www.worldfloraonline.org/taxon/wfo-0000632373
@jhpoelen Yes, I was talking to some plant systematists about this the other day, and writing hybrids as genus x specificEpithet does seem to be a proper, commonly used form. Your insight that WFO has its own way of attaching the 'x' to the specificEpithet is definitely what causes the issue.
I do see that workaround being effective. Considering this is more of a catalogue peculiarity, it should probably be brought to their attention for a more complete mapping list. I'll keep that in mind when working with the data.
@jtmiller28 thanks for sharing the outcome of your conversations with experts on this. Do you think it would make WFO more usable if we "fix" the funny WFO-provided names by inserting the whitespace on indexing / parsing?
Also, there surely must be a reason why WFO chose to adopt this naming convention. Do you have any idea why?
Honestly, not too sure! My fear is that applying a large-scale fix might just trip other unforeseen issues in regular expressions. I'm not an expert in sed yet, so I wouldn't know whether it might trip in other instances where a genus is prefixed by x, for example: ×Aegilotriticum requienii .
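To make the concern about genus-level hybrids concrete, here is a minimal sketch of a stricter variant of the sed workaround. The function name `collapse_epithet_hybrid_space` is hypothetical and not part of Nomer. It assumes the hybrid sign is the UTF-8 `×` character and that an epithet-level hybrid sign is always preceded by a letter (the end of the genus) and a space.

```shell
#!/bin/sh
# Hypothetical helper (not part of Nomer): remove the space after "×" only
# when the sign follows a word, i.e. when it sits between genus and epithet.
# A leading "×" (genus-level hybrid) has no preceding letter and is left alone.
collapse_epithet_hybrid_space() {
  printf '%s\n' "$1" | sed -E 's/([[:alpha:]]) × /\1 ×/g'
}

collapse_epithet_hybrid_space 'Cylindropuntia × fosbergii'   # epithet-level hybrid
collapse_epithet_hybrid_space '×Aegilotriticum requienii'    # genus-level, no space
collapse_epithet_hybrid_space '× Aegilotriticum requienii'   # genus-level, with space
```

Only the first name is rewritten (to Cylindropuntia ×fosbergii). Note that the plain `s/× /×/g` also leaves ×Aegilotriticum requienii unchanged, because that name contains no "× " sequence. The two patterns differ only for input such as "× Aegilotriticum requienii", which the plain version would collapse and this one would not.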
As for why they do it this way... I'm still unsure, since plants have many more unique cases in hybrids (the biological species concept is difficult to justify in many cases). But my lab's PIs are actually meeting with some WFO representatives this Monday over in the EU, so I'm building a list of questions about the nature of their catalogue to hopefully gain some insight, and this will definitely be part of it.
@jtmiller28 I would very much like to hear the backstory on the WFO hybrid naming conventions. Thanks for being thorough on this. Name formatting is pretty important, I think, especially for integrating lots of datasets.
Whitespace: "In the wild" names exist with and without a space between the hybrid sign and the specific epithet. I think both gnparser and the gbif parser decided on having a space for the same reason: it helps to prevent the hybrid sign from being converted by mistake into a letter 'x' by a human or OCR, creating a misspelling that cannot be fixed easily. The situation also gets more complicated when the hybrid sign is completely omitted, or replaced with 'X' or 'x' characters.
There are 3 "canonical forms" in gnparser.
"Stemmed" -- it removes all hybrid signs and suffixes from specific epithets and it should be used for lexical reconciliation.
"Simple" -- it preserves suffixes, but removes hybrid signs. It can be used for calculating edit distances and for less accurate lexical reconciliation.
"Full" -- with hybrid signs and other details, it is usually used for displaying a name without authorship.
I do lexical reconciliation by comparing stemmed canonical form of input with stemmed canonical forms of aggregated names. This way hybrid signs do not prevent matching. Sadly, I do think that preprocessing has to be included, or some matches will not happen.
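The idea of comparing forms with the hybrid sign removed can be sketched roughly as follows. This is not gnparser's implementation: it only drops the `×` sign and any whitespace after it, and does no stemming or authorship removal. The function names `strip_hybrid_sign` and `same_without_hybrid_sign` are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: drop every "×" together with the whitespace after it.
# A crude stand-in for a canonical form without hybrid signs.
strip_hybrid_sign() {
  printf '%s\n' "$1" | sed -E 's/×[[:space:]]*//g'
}

# Succeeds when two names are identical once hybrid signs are ignored.
same_without_hybrid_sign() {
  [ "$(strip_hybrid_sign "$1")" = "$(strip_hybrid_sign "$2")" ]
}

# The WFO spelling and the parser spelling compare equal this way.
if same_without_hybrid_sign 'Cylindropuntia ×fosbergii' 'Cylindropuntia × fosbergii'; then
  echo 'match'
else
  echo 'no match'
fi
```

With both spellings reduced to Cylindropuntia fosbergii, the comparison no longer depends on whether a catalogue puts a space after the hybrid sign.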
Things below can be scratched :) Hm, I imagine the problem is with the old Scala gnparser @jhpoelen:
http://parser.globalnames.org/?format=html&names=Cylindropuntia+%C3%97fosbergii&with_details=on
The current parser, written in Go, does parse hybrids correctly. There have been a lot of improvements in parsing quality since the Scala version.
Names with hybrid signs have a lot of variations. Quite often people use 'x' and 'X' characters instead of the UTF-8 multiplication character; sometimes they add spaces, sometimes not. Also there are "notho" names, where hybrids are implied.
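As a rough illustration of one of these variations, a standalone 'x' or 'X' token could be rewritten to the multiplication sign before parsing or matching. This is only a sketch under a strong assumption: a lone 'x'/'X' surrounded by whitespace (or at the start of the name) is always a hybrid sign. The function name `normalize_hybrid_sign` is hypothetical. An 'x' written directly against the epithet cannot be told apart from an epithet that simply starts with 'x', so that case is deliberately left untouched.

```shell
#!/bin/sh
# Hypothetical helper: replace a standalone "x" or "X" token with "× ".
# Letters that are part of a word (e.g. the X in Xanthium) are not matched,
# because the pattern requires whitespace right after the x/X.
normalize_hybrid_sign() {
  printf '%s\n' "$1" | sed -E 's/(^|[[:space:]])[xX][[:space:]]+/\1× /g'
}

normalize_hybrid_sign 'Cylindropuntia x fosbergii'
normalize_hybrid_sign 'X Aegilotriticum requienii'
normalize_hybrid_sign 'Xanthium strumarium'
```

The first two names come out with `× ` in place of the letter, and the third is returned unchanged.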
It appears that using gn-parse within nomer will assume that specificEpithets that begin with 'x' are hybrids, resulting in a failure to match.
Ex. echo -e "\tCylindropuntia ×fosbergii\t(C.B.Wolf) Rebman, M.A.Baker & Pinkava" | nomer replace gn-parse | nomer append wfo
Outputs: Cylindropuntia × fosbergii (C.B.Wolf) Rebman, M.A.Baker & Pinkava NONE Cylindropuntia × fosbergii
If done manually, without using gn-parse, the mapping is successful:
Ex. echo -e "\tCylindropuntia ×fosbergii" | nomer append wfo
Outputs: Cylindropuntia ×fosbergii HAS_ACCEPTED_NAME WFO:0000632373 Cylindropuntia ×fosbergii species Cylindropuntia ×fosbergii WFO:0000632373 species http://www.worldfloraonline.org/taxon/wfo-0000632373