Closed tobymarsden closed 2 years ago
I think it is a good feature. I do have a concern though. GNparser serves not only as a parser, but also as a sort of 'linter' which should break on strings that are entered as a scientific name by mistake.
If you check GNverifier name-strings for names with 2 letters before a dash, most of the results are junk. So I propose limiting 2-letter prefixes to a hardcoded subset and disallowing anything else. If more names show up later, they can be added to the list. A similar approach already exists for 2-letter generic names. From the file below, it looks like only these "prefixes" occur in the wild:
De-
Eu-
Le-
Ne-
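To illustrate the proposal, here is a minimal Go sketch of an allowlist check for 2-letter hyphenated prefixes. It is not gnparser's actual implementation: the names `allowedPrefixes` and `acceptsHyphenPrefix` are hypothetical, and the real parser is grammar-based, so the check would probably live elsewhere. Only the four prefixes are taken from the list above.

```go
package main

import "strings"

// allowedPrefixes is a hypothetical hardcoded allowlist. The four entries are
// the 2-letter prefixes listed above; new ones would be added here as they
// turn up in real data.
var allowedPrefixes = map[string]struct{}{
	"De": {},
	"Eu": {},
	"Le": {},
	"Ne": {},
}

// acceptsHyphenPrefix reports whether a genus string passes the 2-letter
// prefix rule. Strings without a hyphen, or with a first segment that is not
// exactly 2 bytes long, are outside the scope of this rule and pass through.
// The length check counts bytes, which assumes ASCII genus names.
func acceptsHyphenPrefix(genus string) bool {
	prefix, _, found := strings.Cut(genus, "-")
	if !found || len(prefix) != 2 {
		return true // not a 2-letter-prefix case; other rules decide
	}
	_, ok := allowedPrefixes[prefix]
	return ok
}
```

With this sketch, `acceptsHyphenPrefix("Le-monniera")` is accepted while a junk string such as `"Xy-zus"` is rejected. The match is case-sensitive, so `"le-monniera"` is rejected as well.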
@dimus Thanks!
I've now completed parsing all of the Kew names, and indeed it turns out that Le-monniera was the only one like this that gnparser struggled with. This means that (excluding six names which are wrong in the source data) once this issue and #203 are resolved, gnparser will parse all 1,197,503 names in the Kew dataset.
I'll update the PR to special-case these four prefixes.
> I've now completed parsing all of the Kew names and indeed it turns out that Le-monniera was the only one like this gnparser struggled with.
Great news @tobymarsden! Closed this with dc67aaf, but put the PR instead of the issue in the comment by mistake. Making a release now.
Parsing fails for genera that start with a 2-letter segment, e.g. Le-monniera.