Closed by tobymarsden 2 years ago
Hold on. There is something odd here.
Hyacinthoides non-scripta was reported as one of these cases, but the current version of the online parser (v1.5.5) already resolves it correctly (quality 1).
But the other names @tobymarsden mentions now get quality 4 (unparsed tails). What explains this different behaviour of gnparser with similar epithets?
For these specific names, I guess we need a look-ahead for '-'.
non\b can be the last word in a name string, a word followed by a space, or a word followed by some other non-letter (",", ".", ":", etc.).
There is a broader situation where names like "Aus bus (non Linnaeus)" would benefit from properly parsed "non", but it can be addressed in a separate issue.
@dimus considering the absence of lookarounds in golang's regex, this is ugly but appears to work:
var notesRe = regexp.MustCompile(
`(?i)\s+((environmental|samples|species\s+group|species\s+complex|clade|group|author|nec|vide|fide)\b|non[^a-zA-Z-]).*$`,
)
Have I missed anything?
(non is already in the lastWordJunkRe regex, so I am ignoring that case here.)
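To make the behaviour of the proposed pattern easier to check, here is a minimal sketch that runs it against a few name strings. The stripNotes helper and the "Aus bus" inputs are made up for illustration and are not part of gnparser; only the regex itself comes from this thread.

```go
package main

import (
	"fmt"
	"regexp"
)

// notesRe is the pattern proposed in this thread: "non" only starts a
// trailing note when the next character is neither a letter nor a hyphen.
var notesRe = regexp.MustCompile(
	`(?i)\s+((environmental|samples|species\s+group|species\s+complex|clade|group|author|nec|vide|fide)\b|non[^a-zA-Z-]).*$`,
)

// stripNotes is a hypothetical helper (an assumption, not gnparser API):
// it removes the matched trailing note and returns the rest of the string.
func stripNotes(name string) string {
	return notesRe.ReplaceAllString(name, "")
}

func main() {
	names := []string{
		"Hyacinthoides non-scripta", // hyphen after "non": left intact
		"Aus bus non Linnaeus",      // space after "non": tail removed
		"Aus bus non: Linnaeus",     // punctuation after "non": tail removed
		"Aus bus non",               // "non" at end of string: not matched here
	}
	for _, name := range names {
		fmt.Printf("%q -> %q\n", name, stripNotes(name))
	}
}
```

The last input is left unchanged by this pattern because `[^a-zA-Z-]` needs a character to consume; that is the case left to lastWordJunkRe.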
Yes, let's try it this way. It looks like lookahead is not included for performance reasons.
Currently, names such as Hyacinthoides non-scripta have to be special-cased because non is a stopword.
There are also a bunch of these names which are not currently handled:
The most conservative way of handling this would be to change the non stopword into non\s. This would retain the current behavior for inputs such as Xiphipops fisheri (non Snyder, 1904) but allow epithets starting with non- to be parsed.
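As a rough illustration of the difference, the sketch below compares a word-boundary stopword (non\b) with the proposed whitespace form (non\s) on the two names from this issue. The two regexes and the helper functions are stand-ins written for this example; how gnparser actually implements its stopword check is not shown in this thread.

```go
package main

import (
	"fmt"
	"regexp"
)

// Hypothetical stand-ins for the stopword check (assumptions for this
// example, not gnparser's real implementation).
var (
	nonBoundaryRe = regexp.MustCompile(`\bnon\b`) // "-" counts as a word boundary
	nonSpaceRe    = regexp.MustCompile(`\bnon\s`) // requires whitespace after "non"
)

// hitsBoundaryStopword reports whether the non\b form fires on name.
func hitsBoundaryStopword(name string) bool {
	return nonBoundaryRe.MatchString(name)
}

// hitsSpaceStopword reports whether the non\s form fires on name.
func hitsSpaceStopword(name string) bool {
	return nonSpaceRe.MatchString(name)
}

func main() {
	names := []string{
		"Hyacinthoides non-scripta",
		"Xiphipops fisheri (non Snyder, 1904)",
	}
	for _, name := range names {
		fmt.Printf("%q: non\\b=%t non\\s=%t\n",
			name, hitsBoundaryStopword(name), hitsSpaceStopword(name))
	}
}
```

With non\b the hyphenated epithet is treated as a stopword hit, because the position between "n" and "-" is a word boundary; with non\s only the "(non Snyder, 1904)" case fires. Note that non\s by itself would not match a "non" at the very end of a string.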