gnames / gnparser

GNparser normalises scientific names and extracts their semantic elements.
MIT License
38 stars 4 forks source link

Support cultivar epithets #173

Closed tobymarsden closed 3 years ago

tobymarsden commented 3 years ago

This PR introduces support for cultivar epithets in the context of uninomials, binomials, infraspecies, named hybrids and hybrid formulae.

It supports input including:

Sarracenia flava "Maxima"
Leucocasia gigantea 'Jack's Giant'
Sarracenia alata cv Black Tube
Sarracenia alata ’Black Tube’ x Sarracenia flava cv. Maxima

It doesn't support cultivar groups yet.

A few notes:

I'm not very familiar with Go and it's been many years since I used PEG (!) so while we're going to put this into production here, I do understand if you don't consider it ready for primetime.

That said, if you let me know what I've missed or any problems I've caused, I'd be happy to find time to give it another pass.

dimus commented 3 years ago

Impressive @tobymarsden! @havardo is also interested in cultivar support, and I was giving it a thought too. In my understanding cultivar naming can be quite flexible in practice, and I wonder if a solution would be to create a new peg file for cultivars specifically, and add a flag to gnparser that will clearly state that the parsing is of cultivar names. What do you think? I am worried that trying to mix cultivars, ICZN, ICN, bactrial codes all together would create ambiguous situations and increase the number of parsing mistakes.

tobymarsden commented 3 years ago

@dimus Possibly!

I like the idea of a flag to enable cultivar parsing, though I'm not familiar enough with the libraries to know whether it can be done without causing a maintenance nightmare. Parsing cultivars as presented here uses most or all of the existing grammar.

I'm pretty sure it's possible to cause parsing errors with a scientific name containing multiple apostrophes, for example, though I couldn't come up with any plausible tests to show it -- I reckon it would currently end up in the unparsed tail virtually all of the time.

havardo commented 3 years ago

Wonderful news @tobymarsden. Agree with @dimus regarding a potential flag approch. It sounds reasonable to make cultivar support an "opt in" capability if that can help on the overall design and performance.

We mainly deal with botanical taxonomy and cultivars are omnipresent, even in datasets that you would not expect to see them. So for our situation, we would allways opt in. Look forward to trying it out :)

tobymarsden commented 3 years ago

@dimus @havardo I took a quick look into adding a flag. It seems that pointlander/peg may have some issues supporting multiple peg files in the same package. That's resolvable, but I'm not sure compiling multiple grammars is the best approach anyway as I don't see how to avoid massive rule duplication?

One option would be to have a subgrammar file (e.g. cultivars.peg) with a build step that merges it into the main grammar.peg file before compilation, but then the flag is a compile-time option rather than runtime, which is barely helpful. Testing could get "interesting", plus it's creating some footguns for the future especially if more subgrammars come along later.

It would be great if there were a way to allow conditional rules at runtime within pointlander/peg but that would be... quite a thing.

A lo-fi approach would be to enable cultivar parsing in all cases but increase test coverage of non-cultivar names in the outlying cases where there is the potential for collision with the cultivar rules, and at least document these.

(I benchmarked parsing performance and couldn't detect any difference at all between the branches over 1000 runs on my MacBook Air.)

Any thoughts? Thanks for everything!

dimus commented 3 years ago

@tobymarsden @havardo, I think if cultivar parsing does cover the most common cases and does not try to cover every nook and cranny one PEG grammar might be enough. An additional flexibility can be achieved after PEG, when abstract syntax tree is processed for the output, at this point a 'cultivar flag' might be of help.

tobymarsden commented 3 years ago

@dimus Yep! I can definitely add a flag into the tree processing to disable adding the cultivars into the output. I'll update the PR.

As far as I can tell the only parsing errors that should hit non-cultivar names are when two or more apostrophes are included in the input -- I might be missing something, honestly there isn't enough caffeine in the world, but I think that should the edgiest of edge cases.

Happily, I suspect there's no way to get too clever regarding the ICNCP anyway given that epithets, group names (and the group marker itself!) can be in any language. Even the simplest capitalization rules become nonsensical when you're dealing with cultivars in Chinese...

tobymarsden commented 3 years ago

The flag --disable_cultivars is added to the PR along with some tests... as it doesn't affect the grammar parsing, just the AST processing and hence the output I've made it opt-out, but of course I can make it opt-in if you prefer.

dimus commented 3 years ago

opt in would probably have more sense, as most of the users, at least for now, are more interested in ICN, BC or ICZN codes. May be '-C' '-cultivar'?

tobymarsden commented 3 years ago

What about including the cultivar in the optional details section regardless, which shouldn't have any impact on people currently using the library as it's purely additive, but only include it in the normalized/canonical names if the --cultivar option is enabled? The library's default output would remain unchanged in that case but it will be clearer that there is fuller support for botanical names.

Otherwise I'd propose adding a new quality warning when names with cultivars are parsed without the option enabled. Without that, in the default case the cultivar name data will be lost: currently they'll get an unparsed tail warning but that will disappear if the cultivar epithets are recognized but suppressed from the output.

Thanks!

dimus commented 3 years ago

Stemming of genera in hybrid formulas is a bug, thank you for spotting and fixing it @tobymarsden

dimus commented 3 years ago

What about including the cultivar in the optional details section regardless, which shouldn't have any impact on people currently using the library as it's purely additive, but only include it in the normalized/canonical names if the --cultivar option is enabled? The library's default output would remain unchanged in that case but it will be clearer that there is fuller support for botanical names.

I like the idea of keeping canonical, normalized forms and cardinality depentent on the flag and providing cultivar details with or without the flag.

Otherwise I'd propose adding a new quality warning when names with cultivars are parsed without the option enabled. Without that, in the default case the cultivar name data will be lost: currently they'll get an unparsed tail warning but that will disappear if the cultivar epithets are recognized but suppressed from the output.

So if --canonical is not set, there is a level 2 warning "Cultivar name", and if flag is enabled the warning is not set? I think it makes sense.

dimus commented 3 years ago

In case of Ligusticum sinense cv 'chuanxiong' S.H. Qiu & et al. the authorship goes into uparsed tail currently. Is it ok for cultivars? I am going to ask around for ICNCP today, it does not look like there is a free download of its text at https://www.ishs.org/scripta-horticulturae/international-code-nomenclature-cultivated-plants-ninth-edition

tobymarsden commented 3 years ago

I like the idea of keeping canonical, normalized forms and cardinality depentent on the flag and providing cultivar details with or without the flag.

Great! I'll amend.

So if --canonical is not set, there is a level 2 warning "Cultivar name", and if flag is enabled the warning is not set? I think it makes sense.

OK! I'll implement.

In case of Ligusticum sinense cv 'chuanxiong' S.H. Qiu & et al. the authorship goes into uparsed tail currently. Is it ok for cultivars? I am going to ask around for ICNCP today, it does not look like there is a free download of its text at https://www.ishs.org/scripta-horticulturae/international-code-nomenclature-cultivated-plants-ninth-edition

I haven't added support for cultivar authorship (though uninomial and binomial authorship is retained) -- the ICNCP doesn't recommend adding cultivar authorship though it's permitted. It seems to be very rare in practice, but I don't see why it should present any particular problems. I can see if it's trivial to implement without breaking anything.

The ICNCP 9th edition is available at https://www.ishs.org/sites/default/files/static/ScriptaHorticulturae_18.pdf

dimus commented 3 years ago

I went through your PR, looks great so far @tobymarsden!

If authorship is not recommended, we can skip it for now then. Thank you for the link to ICNCP

dimus commented 3 years ago

To make testing simpler I suggest to make a separate file for cultivars, so they are tested fully with --cultivar flag

tobymarsden commented 3 years ago

@dimus PR updated:

Thanks for your perseverance as we heave this towards the line!

dimus commented 3 years ago

great @tobymarsden! Let me merge it with master, and I will start adding missing parts, docs, openapi stuff, web option. When all is settled I'll tag a new version