gnames / gnparser

GNparser normalises scientific names and extracts their semantic elements.
MIT License

Epithets starting with `non` are not parsed correctly #211

Closed: tobymarsden closed this issue 2 years ago

tobymarsden commented 2 years ago

Currently names such as Hyacinthoides non-scripta have to be special-cased because non is a stopword.

There are also a bunch of these names which are not currently handled:

Artocarpus altilis var. non-seminiferus
Artocarpus incisus var. non-seminiferus
Asarum maculatum var. non-maculatum
Asarum versicolor var. non-versicolor
Hyacinthus non-scriptus
Hylomenes non-scripta
Grossularia non-scripta
Scilla non-scripta subsp. hispanica
Usteria non-scripta
Anthericum non-ramosum
Anthericum non-scriptum
Endymion non-scriptus
Streptanthera cuprea var. non-picta
Scilla non-scripta subsp. cernua
Torreya grandis f. non-apiculata
Rosa ×pouzinii subsp. nonhispida
Cotoneaster non-shan
Ribes non-scriptum

The most conservative way of handling this would be to change the non stopword into non\s -- this would retain the current behavior in the case of inputs such as Xiphipops fisheri (non Snyder, 1904) but allow epithets starting with non- to be parsed.
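To make the difference concrete, here is a minimal sketch comparing a word-boundary match with the whitespace-based one on names from this issue. The two patterns and the helper names are hypothetical and only illustrate the idea; they are not gnparser's actual stopword rules.

```go
package main

import (
	"fmt"
	"regexp"
)

// Hypothetical patterns for illustration only; gnparser's real rules differ.
var (
	nonWordRe  = regexp.MustCompile(`\bnon\b`) // "non" as a whole word, so "non-" also matches
	nonSpaceRe = regexp.MustCompile(`\bnon\s`) // "non" followed by whitespace only
)

// matchesNonWord reports whether s contains "non" delimited by word boundaries.
func matchesNonWord(s string) bool { return nonWordRe.MatchString(s) }

// matchesNonSpace reports whether s contains "non" followed by whitespace.
func matchesNonSpace(s string) bool { return nonSpaceRe.MatchString(s) }

func main() {
	names := []string{
		"Hyacinthoides non-scripta",
		"Xiphipops fisheri (non Snyder, 1904)",
		"Rosa ×pouzinii subsp. nonhispida",
	}
	for _, n := range names {
		fmt.Printf("%q word-boundary=%t whitespace=%t\n",
			n, matchesNonWord(n), matchesNonSpace(n))
	}
}
```

The word-boundary pattern fires on both `non-scripta` and `(non Snyder, 1904)`, while the whitespace pattern fires only on the latter, which is the behavior the proposal is after.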

abubelinha commented 2 years ago

Hold on. There is something odd here.

Hyacinthoides non-scripta was reported as one of these cases, but the current version of the online parser (v1.5.5) already resolves it correctly (quality 1).

But the other names @tobymarsden mentions are now getting quality 4 (unparsed tails). What's the explanation for gnparser behaving differently with such similar epithets?

dimus commented 2 years ago

For these specific names I guess we need a look-ahead for '-'. As a stopword, non\b can be the last word in a name string, a word followed by a space, or a word followed by some other non-letter (,, ., : etc.).

There is a broader situation where names like "Aus bus (non Linnaeus)" would benefit from properly parsed "non", but it can be addressed in a separate issue.

tobymarsden commented 2 years ago

@dimus considering the absence of lookarounds in golang's regex, this is ugly but appears to work:

var notesRe = regexp.MustCompile(
    `(?i)\s+((environmental|samples|species\s+group|species\s+complex|clade|group|author|nec|vide|fide)\b|non[^a-zA-Z-]).*$`,   
)

Have I missed anything?

(non is already in the lastWordJunkRe regex so ignoring that here).
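As a rough check, the regex above can be tried against a few inputs. In this sketch `stripNotes` is a hypothetical helper, not a gnparser function, and `Aus bus non Linnaeus, 1758` and `Aus bus nec Cus` are made-up inputs in the spirit of the placeholder names used earlier in the thread.

```go
package main

import (
	"fmt"
	"regexp"
)

// notesRe is copied from the comment above.
var notesRe = regexp.MustCompile(
	`(?i)\s+((environmental|samples|species\s+group|species\s+complex|clade|group|author|nec|vide|fide)\b|non[^a-zA-Z-]).*$`,
)

// stripNotes is a hypothetical helper: it removes the tail matched by notesRe.
func stripNotes(name string) string {
	return notesRe.ReplaceAllString(name, "")
}

func main() {
	inputs := []string{
		"Hyacinthoides non-scripta",          // epithet starting with "non-"
		"Scilla non-scripta subsp. hispanica", // same, with an infraspecific rank
		"Rosa ×pouzinii subsp. nonhispida",    // "non" followed directly by a letter
		"Aus bus non Linnaeus, 1758",          // made-up input: "non" used as a note
		"Aus bus nec Cus",                     // made-up input: "nec" used as a note
	}
	for _, in := range inputs {
		fmt.Printf("%q -> %q\n", in, stripNotes(in))
	}
}
```

The first three names pass through unchanged, and the last two are cut back to `Aus bus`. Because `non[^a-zA-Z-]` consumes one character after `non`, it cannot match `non` at the very end of the string; that case is left to lastWordJunkRe, as noted above.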

dimus commented 2 years ago

Yes, let's try it this way. It looks like lookahead is not included in Go's regexp package for performance reasons.
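For reference, the limitation is easy to confirm: Go's `regexp` package uses RE2 syntax, which guarantees linear-time matching and rejects lookaround at compile time. The sketch below uses a hypothetical `compiles` helper and an illustrative lookahead variant of the pattern; neither is part of gnparser.

```go
package main

import (
	"fmt"
	"regexp"
)

// compiles is a hypothetical helper that reports whether Go's regexp
// package accepts the given pattern.
func compiles(pattern string) bool {
	_, err := regexp.Compile(pattern)
	return err == nil
}

func main() {
	// Lookahead form: check the next character without consuming it.
	fmt.Println(compiles(`\s+non(?=[^a-zA-Z-])`)) // false: lookahead is unsupported
	// Workaround from this thread: consume the next character instead.
	fmt.Println(compiles(`\s+non[^a-zA-Z-]`)) // true
}
```

Consuming the extra character is harmless here because the pattern goes on to match the rest of the string with `.*$` anyway.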