gnames / gnparser

GNparser normalises scientific names and extracts their semantic elements.
MIT License

Space after 'Mc' added to Authorship field #213

Closed: jar398 closed this issue 2 years ago

jar398 commented 2 years ago

I was hoping to reconstruct a canonicalized scientificName from CanonicalFull + Authorship, but gnparser inserts a space after 'Mc' in author names that start with 'Mc', so the reconstructed name no longer matches the input. I don't know whether this gnparser behavior is intentional, so perhaps this is not a bug.

gnparser "Aus bus McDonald 2021"
Id,Verbatim,Cardinality,CanonicalStem,CanonicalSimple,CanonicalFull,Authorship,Year,Quality
5ce4e466-8583-5288-b8ad-4cfdd9edd724,Aus bus McDonald 2021,2,Aus bus,Aus bus,Aus bus,Mc Donald 2021,2021,1
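
To make the mismatch concrete, here is a minimal Python sketch (not part of gnparser) that reads the CSV output above and joins CanonicalFull and Authorship. The helper name `rebuild_name` is hypothetical and only used for illustration.

```python
import csv
import io

# CSV output copied verbatim from the gnparser run above.
GNPARSER_CSV = """\
Id,Verbatim,Cardinality,CanonicalStem,CanonicalSimple,CanonicalFull,Authorship,Year,Quality
5ce4e466-8583-5288-b8ad-4cfdd9edd724,Aus bus McDonald 2021,2,Aus bus,Aus bus,Aus bus,Mc Donald 2021,2021,1
"""


def rebuild_name(row):
    """Hypothetical helper: join CanonicalFull and Authorship, skipping empty parts."""
    parts = [row["CanonicalFull"], row["Authorship"]]
    return " ".join(part for part in parts if part)


rows = list(csv.DictReader(io.StringIO(GNPARSER_CSV)))
for row in rows:
    rebuilt = rebuild_name(row)
    # Report whether the rebuilt name round-trips to the verbatim input.
    print(f"{row['Verbatim']!r} -> {rebuilt!r} (match: {rebuilt == row['Verbatim']})")
```

With the output shown above, the rebuilt string is "Aus bus Mc Donald 2021", which differs from the verbatim "Aus bus McDonald 2021".
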

tobymarsden commented 2 years ago

This behavior is specified in the tests, but if it's not intentional then this change in grammar.peg "fixes" it:

- CapAuthorWord <- AuthorUpperChar AuthorLowerChar*
+ CapAuthorWord <- AuthorUpperChar (AuthorLowerChar / AuthorUpperChar)*

The only tests it breaks are the ones that show e.g. McDunnough being split into two author words. It might have unintended consequences that are not currently tested for, but figuring that out is some way beyond my talents...
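
As a rough illustration of what the rule change does, the two versions of CapAuthorWord can be approximated with regular expressions in Python. This is only a sketch: it assumes AuthorUpperChar and AuthorLowerChar behave like `[A-Z]` and `[a-z]`, which is a simplification of whatever character classes the real grammar.peg defines, and it does not exercise the actual parser.

```python
import re

# Assumed stand-ins for the PEG rules (ASCII only, for illustration):
#   old: CapAuthorWord <- AuthorUpperChar AuthorLowerChar*
#   new: CapAuthorWord <- AuthorUpperChar (AuthorLowerChar / AuthorUpperChar)*
OLD_CAP_AUTHOR_WORD = re.compile(r"[A-Z][a-z]*")
NEW_CAP_AUTHOR_WORD = re.compile(r"[A-Z][a-zA-Z]*")


def author_words(pattern, text):
    """Return the consecutive author words that the pattern finds in text."""
    return pattern.findall(text)


for name in ("McDonald", "McDunnough"):
    old = author_words(OLD_CAP_AUTHOR_WORD, name)
    new = author_words(NEW_CAP_AUTHOR_WORD, name)
    print(f"{name}: old rule -> {old}, new rule -> {new}")
```

Under these assumptions the old rule stops at the second capital letter, which yields two words that are later joined with a space, while the new rule consumes the whole name as one word. The relaxed pattern also accepts any other mixed-case or all-caps run as a single word, which is the kind of side effect the existing tests may not cover.
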

dimus commented 2 years ago

@jar398 this is not intentional, another good catch!