gnames / gnparser

GNparser normalises scientific names and extracts their semantic elements.
MIT License

Parsing hyphenated genus names starting with a 2-letter segment #205

Closed: tobymarsden closed this issue 2 years ago

tobymarsden commented 2 years ago

Parsing fails for genera that start with a 2-letter segment, e.g. Le-monniera.

dimus commented 2 years ago

I think it is a good feature. I do have a concern though. GNparser serves not only as a parser, but also as a sort of 'linter' which should break on strings that are entered as a scientific name by mistake.

If you check GNverifier name-strings for names with 2 letters before a dash, most of the results are junk. So I propose limiting 2-letter prefixes to a hardcoded subset and disallowing anything else. If more names show up later, they can be added to the list. A similar approach already exists for 2-letter generic names. From the file below, it looks like only these "prefixes" occur in the wild:

De-
Eu-
Le-
Ne-

2char-dash.txt
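
A minimal sketch of the allow-list idea described above, written as standalone Go rather than as gnparser's actual grammar code. The names `allowedPrefixes`, `hasTwoLetterPrefix` and `prefixAllowed` are hypothetical and exist only for illustration; the four prefixes come from the list above. All inputs other than Le-monniera are illustrative strings, not names taken from the dataset.

```go
package main

import (
	"fmt"
	"unicode"
)

// allowedPrefixes is the hard-coded allow-list of 2-letter prefixes
// (De-, Eu-, Le-, Ne-). New prefixes would be added here if they show up.
var allowedPrefixes = map[string]bool{
	"De": true,
	"Eu": true,
	"Le": true,
	"Ne": true,
}

// hasTwoLetterPrefix reports whether the genus starts with exactly two
// letters followed by a hyphen and at least one more character.
func hasTwoLetterPrefix(genus string) bool {
	r := []rune(genus)
	return len(r) > 3 &&
		r[2] == '-' &&
		unicode.IsLetter(r[0]) &&
		unicode.IsLetter(r[1])
}

// prefixAllowed returns false only for genus strings that have a
// 2-letter hyphenated prefix which is not in the allow-list.
// Strings without such a prefix are outside the scope of this check.
func prefixAllowed(genus string) bool {
	if !hasTwoLetterPrefix(genus) {
		return true
	}
	prefix := string([]rune(genus)[:2])
	return allowedPrefixes[prefix]
}

func main() {
	// "Xy-zus" is a made-up string standing in for junk input.
	for _, g := range []string{"Le-monniera", "Xy-zus", "Monniera"} {
		fmt.Printf("%s: %v\n", g, prefixAllowed(g))
	}
}
```

The check is case-sensitive on purpose in this sketch, since genus names start with a capital letter; a real implementation inside a parser grammar would likely look quite different.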

tobymarsden commented 2 years ago

@dimus Thanks!

I've now completed parsing all of the Kew names and indeed it turns out that Le-monniera was the only one like this gnparser struggled with. This means that, excluding six names which are wrong in the source data, gnparser will parse all 1,197,503 names in the Kew dataset once this issue and #203 are resolved.

I'll update the PR to special-case these four prefixes.

dimus commented 2 years ago

> I've now completed parsing all of the Kew names and indeed it turns out that Le-monniera was the only one like this gnparser struggled with.

Great news, @tobymarsden! I closed this with dc67aaf, but by mistake referenced the PR instead of the issue in the comment. Making a release now.