gnames / gnparser

GNparser normalises scientific names and extracts their semantic elements.
MIT License

certain kinds of accents break the parsing process #186

Closed (abubelinha closed 2 years ago)

abubelinha commented 3 years ago

Compare these three versions of a fake quadrinomial taxon name, and the results returned by gnparser:

Galega officinalis (L.) L'Hèr. subsp. mackayana (O'Flannagan) Mc Inley var. petiolata (È. Neé) Brüch.
Result: cardinality=4, quality=1

Galega officinalis (L.) L´Hèr. subsp. mackayana (O'Flannagan) Mc Inley var. petiolata (È. Neé) Brüch.
Result: cardinality=2, quality=4

Galega officinalis (L.) L`Hèr. subsp. mackayana (O'Flannagan) Mc Inley var. petiolata (È. Neé) Brüch.
Result: cardinality=2, quality=4

The first author (L'Hèr.) is sometimes written with the wrong mark between L and H, as in the second and third examples, which use an acute accent (´) and a grave accent (`) instead of an apostrophe ('). Those variants break the parsing process: gnparser detects the names as binomials instead of quadrinomials and returns a long unprocessed tail.

It would be great if gnparser could detect those ` and ´ characters and replace them with ' before parsing.
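
Until something like that exists in the parser, the replacement can be done on the caller's side before the name is passed to gnparser. The following is only a minimal sketch of that idea: `normalize_apostrophes` and `APOSTROPHE_LOOKALIKES` are hypothetical names of my own, not part of gnparser, and the sketch assumes that a standalone ` or ´ never has a legitimate meaning in a name string.

```python
# Hypothetical pre-processing step, not part of gnparser itself.
# Assumption: a standalone acute (U+00B4) or grave (U+0060) character in a
# scientific name string is always a mistyped apostrophe.
APOSTROPHE_LOOKALIKES = {
    "\u00b4": "'",  # acute accent
    "\u0060": "'",  # grave accent (backtick)
}

# Translation table built once and reused for every name.
_TABLE = str.maketrans(APOSTROPHE_LOOKALIKES)


def normalize_apostrophes(name: str) -> str:
    """Replace apostrophe look-alikes with a plain ASCII apostrophe."""
    return name.translate(_TABLE)


if __name__ == "__main__":
    variants = [
        "Galega officinalis (L.) L\u00b4H\u00e8r. subsp. mackayana",
        "Galega officinalis (L.) L\u0060H\u00e8r. subsp. mackayana",
    ]
    for variant in variants:
        print(normalize_apostrophes(variant))
```

Only the standalone characters are touched. Precomposed accented letters such as è or È are different code points and stay as they are, so L´Hèr. and L`Hèr. both become L'Hèr. while the rest of the name is unchanged.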