gnames / gnparser

GNparser normalises scientific names and extracts their semantic elements.
MIT License

Relax parsing of 'nudum' and 'non' #209

Closed tobymarsden closed 2 years ago

tobymarsden commented 2 years ago

I present this for discussion. It may be hopelessly naïve, but restricting the preprocessing of non to instead non, and of nudum to nomen nudum, lets us remove the special-casing of e.g. Hyacinthoides non-scripta, Stilifolium nudum, etc., and avoids adding well over a hundred more.

I like the elegance here, but if it's also going to parse a huge chunk of junk, I am of course very happy to add special cases instead...

dimus commented 2 years ago

I think you are totally right about nudum. I'm not sure why I picked nudum by itself as a terminator; it was a mistake. However, nomen nudum does not always appear in this form: in the wild there are also "nom. nudum" and "nom.nudum". I suspect those would cause an unparsed tail anyway.
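For illustration, the three spellings mentioned above could be recognised with a single pattern. This is only a sketch: `nomenNudumRe` and `CutAtNomenNudum` are hypothetical names, not part of gnparser, and the real preprocessing may work quite differently.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// nomenNudumRe is a hypothetical pattern (not gnparser's own). It matches
// "nomen nudum", "nom. nudum" and "nom.nudum", but not a bare "nudum".
var nomenNudumRe = regexp.MustCompile(`\b(?:nomen\s+|nom\.\s*)nudum\b`)

// CutAtNomenNudum returns the text before the first annotation match and
// reports whether an annotation was found. Without a match the input is
// returned unchanged.
func CutAtNomenNudum(s string) (string, bool) {
	loc := nomenNudumRe.FindStringIndex(s)
	if loc == nil {
		return s, false
	}
	return strings.TrimSpace(s[:loc[0]]), true
}

func main() {
	names := []string{
		"Aus bus nomen nudum",
		"Aus bus nom. nudum",
		"Aus bus nom.nudum",
		"Stilifolium nudum", // a real epithet, must stay intact
	}
	for _, n := range names {
		head, found := CutAtNomenNudum(n)
		fmt.Printf("%q -> %q (annotation found: %t)\n", n, head, found)
	}
}
```

With a pattern of this shape, a bare nudum epithet such as the one in Stilifolium nudum is left alone, because the match requires a preceding nomen or nom.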

dimus commented 2 years ago

non is a more complicated case. I suspect it can actually be quite useful to parse it, for example in cases like

Xiphipops fisheri (non Snyder, 1904)

Can you remove the non cases from the PR? I think they require their own issue and more thought.
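As a rough sketch of why non might deserve its own handling, a parenthesised "non" annotation like the one in the example can be told apart from a hyphenated epithet such as non-scripta. The names `nonAuthRe` and `ExcludedAuthorship` are hypothetical and only illustrate the idea; nothing here reflects how gnparser treats these strings.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// nonAuthRe is a hypothetical pattern: an opening parenthesis, the word
// "non", whitespace, then anything up to the closing parenthesis.
var nonAuthRe = regexp.MustCompile(`\(\s*non\s+([^()]+)\)`)

// ExcludedAuthorship returns the text following "non" inside parentheses
// and reports whether such an annotation was present.
func ExcludedAuthorship(s string) (string, bool) {
	m := nonAuthRe.FindStringSubmatch(s)
	if m == nil {
		return "", false
	}
	return strings.TrimSpace(m[1]), true
}

func main() {
	for _, n := range []string{
		"Xiphipops fisheri (non Snyder, 1904)",
		"Hyacinthoides non-scripta", // "non" is part of the epithet here
	} {
		auth, found := ExcludedAuthorship(n)
		fmt.Printf("%q -> %q (annotation found: %t)\n", n, auth, found)
	}
}
```

The hyphenated epithet is not matched because the pattern requires both the parentheses and whitespace after non.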

tobymarsden commented 2 years ago

@dimus Yes, you're right! In fact swapping in nomen\s+nudum for nudum does nothing because nomen is a stopword anyway, so I've removed it entirely.
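The redundancy described above can be shown with a simplified model in which the name is cut at the earliest match of any stop pattern. Everything in this sketch (`stopNomen`, `stopNomenNudum`, `CutAt`) is hypothetical and stands in for gnparser's actual preprocessing, which is not reproduced here.

```go
package main

import (
	"fmt"
	"regexp"
)

// Hypothetical stop patterns for illustration only.
var (
	stopNomen      = regexp.MustCompile(`\snomen\b`)
	stopNomenNudum = regexp.MustCompile(`\snomen\s+nudum\b`)
)

// CutAt truncates s at the earliest match of any of the given patterns.
// If nothing matches, s is returned unchanged.
func CutAt(s string, res ...*regexp.Regexp) string {
	cut := len(s)
	for _, re := range res {
		if loc := re.FindStringIndex(s); loc != nil && loc[0] < cut {
			cut = loc[0]
		}
	}
	return s[:cut]
}

func main() {
	for _, n := range []string{
		"Aus bus nomen nudum",
		"Aus bus nomen novum",
		"Stilifolium nudum",
	} {
		withNomen := CutAt(n, stopNomen)
		withBoth := CutAt(n, stopNomen, stopNomenNudum)
		fmt.Printf("%q: nomen only %q, both %q, same: %t\n",
			n, withNomen, withBoth, withNomen == withBoth)
	}
}
```

Under this model, every match of the longer pattern begins where the shorter nomen pattern also matches, so adding it never moves the cut point.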

This PR is now nudum only, and I'll open an issue for non. Thanks!