gnames / gnparser

GNparser normalises scientific names and extracts their semantic elements.
MIT License

Parsing genus names with more than one hyphen #203

Closed tobymarsden closed 2 years ago

tobymarsden commented 2 years ago

Parsing fails with the genus Prunus-lauro-cerasus. Though this is a synonym, it does appear in the literature so parsing would be helpful, and I can't see any prohibitions in the ICBN against more than one hyphen in a genus name.

dimus commented 2 years ago

Good catch, @tobymarsden. I would like this rule to be a bit more strict.

I checked the gnverifier names with ripgrep: `rg "^([\p{L}]+-[\p{L}]+){2,}.*?\b" all-names-2021-11-14.txt`. It looks like there is nothing reasonable with more than 2 dashes, and only these genera seem to be 'real enough' (with various capitalizations):

Iulo-eido-coprolites
Johnson-sea-linkia
Para-bary-thelphusa
Para-lio-thelphusa
Para-peri-thelphusa
Prunus-lauro-cerasus
Tsugo-piceo-picea

I see nothing useful with 3 or more dashes.
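The search above can be approximated in Go as a minimal sketch. It assumes the names are already loaded into a slice, and it uses a simplified pattern (letter runs joined by two or more hyphens at the start of the line) rather than the exact ripgrep expression. The names `multiHyphenCandidates` and `firstWordHyphens` are hypothetical helpers, not part of gnparser.

```go
package main

import (
	"regexp"
	"strings"
)

// multiHyphenWord matches a leading word made of letter runs joined by
// two or more hyphens, e.g. "Prunus-lauro-cerasus". This is a simplified
// stand-in for the ripgrep pattern quoted in the thread.
var multiHyphenWord = regexp.MustCompile(`^\p{L}+(?:-\p{L}+){2,}`)

// multiHyphenCandidates returns the name-strings whose first word
// contains at least two hyphens between runs of letters.
func multiHyphenCandidates(names []string) []string {
	var res []string
	for _, n := range names {
		if multiHyphenWord.MatchString(n) {
			res = append(res, n)
		}
	}
	return res
}

// firstWordHyphens counts the hyphens in the first whitespace-separated
// word of a name-string, which helps separate 2-dash from 3+-dash cases.
func firstWordHyphens(name string) int {
	fields := strings.Fields(name)
	if len(fields) == 0 {
		return 0
	}
	return strings.Count(fields[0], "-")
}
```

Calling `multiHyphenCandidates` on a list that contains `Prunus-lauro-cerasus officinalis` and `Abies alba` keeps only the first entry, and `firstWordHyphens` can then be used to check whether any candidate has three or more dashes.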

Searching with `rg "\b[a-z]([a-z]*-[a-z]*){2,}.*?\b" all-names-2021-11-14.txt` gives quite a few 2-dash specific epithets, and there are even a few that seem to be real when I search for 3 dashes or more with `rg "\b[a-z]([a-z]*-[a-z]*){3,}.*?\b" all-names-2021-11-14.txt`.

~~So I am on the fence about this one. It seems that allowing up to 2 dashes would keep most false positives unparsed, but would also ignore 2 epithets that have more than 2 dashes. Let me talk to our botanists and zoologists on Monday.~~

I recalled that we already had this conversation about epithets with our taxonomists and, as a result, multi-dashes are allowed there. So I think for genera it makes sense to limit them to 2 dashes for now and, if the need arises, allow multi-dashes later. What do you think, @tobymarsden?
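The rule proposed here (at most 2 dashes in a genus, no limit for specific epithets) can be illustrated with a small Go sketch. This is not gnparser's actual implementation, which is not shown in this thread. The constant `maxGenusHyphens` and the functions `validHyphenation`, `genusHyphensOK`, and `epithetHyphensOK` are hypothetical names used only for illustration.

```go
package main

import "strings"

// maxGenusHyphens is the limit proposed in the thread for genus names.
const maxGenusHyphens = 2

// validHyphenation reports whether word uses hyphens only between
// non-empty parts and has no more than maxHyphens of them.
// A negative maxHyphens means "no upper limit".
func validHyphenation(word string, maxHyphens int) bool {
	if word == "" ||
		strings.HasPrefix(word, "-") ||
		strings.HasSuffix(word, "-") ||
		strings.Contains(word, "--") {
		return false
	}
	n := strings.Count(word, "-")
	return maxHyphens < 0 || n <= maxHyphens
}

// genusHyphensOK applies the stricter genus rule (up to 2 dashes).
func genusHyphensOK(genus string) bool {
	return validHyphenation(genus, maxGenusHyphens)
}

// epithetHyphensOK applies the looser epithet rule (any number of dashes).
func epithetHyphensOK(epithet string) bool {
	return validHyphenation(epithet, -1)
}
```

With these helpers, `genusHyphensOK("Prunus-lauro-cerasus")` is accepted, while a genus-like word with three dashes is rejected. Keeping the limit in one constant makes it easy to relax later if real genera with more dashes turn up.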

tobymarsden commented 2 years ago

@dimus Thanks for the explanation! And on your weekend, too. I've updated the PR to accept up to two dashes for genera.

dimus commented 2 years ago

Looks good, @tobymarsden. I am going to add a couple more tests after the merge.

tobymarsden commented 2 years ago

@dimus Amazing - thanks! Now that Kew parses I'll check World Flora 😂