gnames / gnparser

GNparser normalises scientific names and extracts their semantic elements.
MIT License

`Crisia romanica Zágoršek` does not parse correctly #259

Closed dimus closed 8 months ago

dimus commented 8 months ago

@LocoDelAssembly found that the parser does not recognize the author in `Crisia romanica Zágoršek`. The reason is the `š`, a letter formed by combining the base character `s` with a separate diacritic character. This is a problem for all characters created this way.
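To make the difference visible, here is a minimal Python sketch (not gnparser code; the helper name `describe` is made up for illustration) that compares the two ways of encoding `š`. It uses escape sequences so the code points survive copy and paste.

```python
import unicodedata

# Two spellings of the same visible letter.
precomposed = "\u0161"   # LATIN SMALL LETTER S WITH CARON, one code point
decomposed = "s\u030c"   # "s" followed by COMBINING CARON, two code points


def describe(text):
    """Return (code point, Unicode name, is_combining) for each character."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.combining(ch) != 0)
        for ch in text
    ]


print(describe(precomposed))
print(describe(decomposed))
# The strings render alike but are not equal code point by code point.
print(precomposed == decomposed)
```

A grammar that only lists precomposed letters as valid author characters would plausibly stop at the combining mark in the second form, which matches the behaviour described here.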

LocoDelAssembly commented 8 months ago

@dimus it turns out the example had already been partially tweaked. The "á" should actually be "á" (an `a` followed by a combining acute accent).

Actual name is Crisia romanica Zágoršek Silye & Szabó 2008.

[Edit] Forgot to clarify that such accented 'a' is also unsupported.

dimus commented 8 months ago

Thanks @LocoDelAssembly, this is almost as nasty a problem as Cyrillic letters inserted into a name (which also happens). I wonder if it makes sense to normalize all strings before parsing (https://unicode.org/reports/tr15/).
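As a rough illustration of that idea, the sketch below applies NFC normalization (one of the forms defined in the report linked above) to the decomposed name before it would be handed to a parser. It is written in Python with the standard `unicodedata` module for brevity and is not gnparser's implementation; `normalize_name` and `has_cyrillic` are hypothetical helper names.

```python
import unicodedata


def normalize_name(name):
    """Compose base letters and combining marks into single code points (NFC)."""
    return unicodedata.normalize("NFC", name)


def has_cyrillic(name):
    """Report whether any character's Unicode name marks it as Cyrillic."""
    return any(unicodedata.name(ch, "").startswith("CYRILLIC") for ch in name)


# Decomposed input: "a" + U+0301 and "s" + U+030C.
raw = "Crisia romanica Za\u0301gors\u030cek"
clean = normalize_name(raw)

print(len(raw), len(clean))                         # the normalized string is shorter
print(clean == "Crisia romanica Z\u00e1gor\u0161ek")
print(has_cyrillic("Crisi\u0430 romanica"))         # U+0430 is a Cyrillic "a"
```

Normalization only composes characters that have a precomposed form. It does not turn Cyrillic look-alikes into Latin letters, so that case would still need a separate check such as the one sketched in `has_cyrillic`.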