gnames / gnparser

GNparser normalises scientific names and extracts their semantic elements.
MIT License

Option to retain diacritics? #208

Closed tobymarsden closed 2 years ago

tobymarsden commented 2 years ago

Would you be open to a PR which adds an option, disabled by default, to disable transliteration of diacritics? For my use case I'd strongly prefer that they were all retained, at least in genera and epithets.

abubelinha commented 2 years ago

@tobymarsden could you post some examples of names where you would use this option, and show its effect on the results?

I guess I could be interested in using it but I don't really understand Go code. @abubelinha

dimus commented 2 years ago

@tobymarsden I would also be interested to understand the use case. ICZN does not allow diacritics, while in ICN there is a quite obscure permission to use 'é' in some very specific cases.

tobymarsden commented 2 years ago

@dimus @abubelinha

I'm building two related things:

  1. a programmatic interface to multiple "trusted" (but sometimes conflicting!) sources of plant data -- for example, Plants of the World Online, World Flora Online, Red List, CITES.
  2. a system for maintaining data on living botanical collections, which accepts names as input from the user in order to, for example, accession new material and associate it with a taxon and then a name.

For (1), we have two issues:

a) We have some names from a data source such as World Flora Online which contain diaereses, e.g. Hieracium kalsoeënse. The use of diaereses is permitted under the ICN. Transliterating it to e is reasonable as the mark doesn't change the spelling, and this is useful for matching purposes. But leaving it as-is (particularly for display purposes) is also reasonable because a reliable source included it and the ICN allows it. A flag allows the user to make the judgement according to their use case.

b) There are other names such as Anthurium gudiñoi which are not valid under the ICN, but they were still referenced somewhere notable -- in this case, on a type specimen sheet in the herbarium at Missouri. Normalizing the name in every respect other than transliteration would still be useful, though I can't get particularly exercised about it. Similarly with Senecio nordenskjöldii -- there are many more sources which reference Senecio nordenskjoldii than Senecio nordenskjoeldii, so the transliteration to oe doesn't help here when matching names.

The main thing here is that we're working with "trusted" data, not cleaning up junk. When getting names from a somewhat authoritative source, we want to avoid providing an interpretation as far as possible. There may be a few problem names parsed, but that needs to be solved further up the stack, so to speak, and not in our parsing phase.

Use case (2) is similar -- in our system, the user input is to be respected, at least where diaereses are concerned. If they want to refer to names like Hieracium kalsoeënse, that's fine and we need to be able to normalize that without removing the diaeresis, which would be overstepping.

To sum up, I'm ambivalent about having an option to disable transliteration entirely, though this is the simplest to implement and there's an argument that it's helpful when dealing with names from normally reliable sources. However, I do really need to be able to retain diaereses from the source data. (We'll actually end up using both -- matching on a version with no diaereses but normalizing for display with the original marks.)
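A minimal sketch of the "use both" idea, for illustration only. `nameForms`, `newNameForms` and `foldDiaereses` are hypothetical names, not part of gnparser, and the code assumes the input uses precomposed (NFC) characters.

```go
package main

import (
	"fmt"
	"strings"
)

// foldDiaereses is a hypothetical helper, not part of gnparser. It replaces
// the two diaeresis-bearing letters discussed here with their base letters.
// It assumes precomposed (NFC) input.
var foldDiaereses = strings.NewReplacer("ë", "e", "ï", "i")

// nameForms holds the two representations of one name.
type nameForms struct {
	Display  string // original marks retained, for display
	MatchKey string // diaereses removed, for matching across sources
}

// newNameForms keeps the name untouched for display and derives a
// diaeresis-free key for matching.
func newNameForms(name string) nameForms {
	return nameForms{Display: name, MatchKey: foldDiaereses.Replace(name)}
}

func main() {
	f := newNameForms("Hieracium kalsoeënse")
	fmt.Println(f.Display)  // Hieracium kalsoeënse
	fmt.Println(f.MatchKey) // Hieracium kalsoeense
}
```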

dimus commented 2 years ago

@tobymarsden I think I understood, so yes, let's add the flag.

So in this case you would need to keep diacritics in the normalized version, the canonical forms (stemmed included), and the details?

Your mention of different transliterations is also a valid concern. I think we can talk about it at #201

dimus commented 2 years ago

"The use of diaereses is permitted under the ICN."

My bad, I did not double-check in the code and trusted my memory incorrectly: it is not 'é' but the diaeresis 'ë'.

tobymarsden commented 2 years ago

@dimus Thanks!

I think it does need to be everywhere, yes.

On reflection, I'm thinking that making the option "preserve diaereses" would be more conservative and less of a departure for gnparser, as these are referenced in the ICN. It is more complex, of course, because it means transliterating everything that doesn't match (I think) [aeiou][ëï], but I could give it a shot and you can see what you think.
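A rough sketch of what a "preserve diaereses" rule could look like, assuming the [aeiou][ëï] pattern above: ë and ï are kept only when they directly follow a plain vowel, and every other marked letter is transliterated. The `plain` table and the function names are hypothetical and cover only the letters mentioned in this thread; this is not gnparser's actual transliteration table.

```go
package main

import (
	"fmt"
	"strings"
)

// plain is an illustrative transliteration table (an assumption, not
// gnparser's own table). It assumes precomposed (NFC) input.
var plain = map[rune]string{
	'ë': "e", 'ï': "i", 'é': "e", 'ñ': "n",
	'ä': "ae", 'ö': "oe", 'ü': "ue",
}

func isVowel(r rune) bool {
	switch r {
	case 'a', 'e', 'i', 'o', 'u':
		return true
	}
	return false
}

// preserveDiaereses keeps ë and ï when they follow a plain vowel and
// transliterates every other letter found in the table.
func preserveDiaereses(s string) string {
	var b strings.Builder
	var prev rune
	for _, r := range s {
		repl, marked := plain[r]
		switch {
		case (r == 'ë' || r == 'ï') && isVowel(prev):
			b.WriteRune(r) // matches [aeiou][ëï]: keep the mark
		case marked:
			b.WriteString(repl)
		default:
			b.WriteRune(r)
		}
		prev = r
	}
	return b.String()
}

func main() {
	fmt.Println(preserveDiaereses("Hieracium kalsoeënse")) // Hieracium kalsoeënse
	fmt.Println(preserveDiaereses("Anthurium gudiñoi"))    // Anthurium gudinoi
}
```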

I can only find a tiny handful of examples where other diacritics have been used in non-junk names anyway.

dimus commented 2 years ago

There are definitely legacy names with diacritics, and other inconsistencies. For example, AlgaeBase has a few names with capitalized epithets for "patronyms".

One possible solution is to keep diaereses in the normalized and canonical full forms and, maybe, in canonical simple, but remove them from canonical stemmed. Also keep them in the details.

I used to preserve ë for all parsed names, but it is not compatible with ICZN names, so now I remove it.

dimus commented 2 years ago

Another complication is names with Latinized German words, where the umlaut characters ö, ä and ü are supposed to change the spelling during transliteration, so I think only ë is safe.

And of course, some people follow the rules of transliteration and others don't, so we have several alternative spellings for legacy names with diacritics.
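To illustrate the alternative-spellings problem, here is a small sketch that lists the spellings a name with umlauts may appear under: the original, the expanded form (ö to oe) and the stripped form (ö to o). `spellingVariants` and both replacers are hypothetical helpers, not gnparser functions, and precomposed (NFC) input is assumed.

```go
package main

import (
	"fmt"
	"strings"
)

// Two illustrative conventions for the umlaut characters (assumptions,
// not gnparser's tables).
var (
	expandUmlauts = strings.NewReplacer("ä", "ae", "ö", "oe", "ü", "ue")
	stripUmlauts  = strings.NewReplacer("ä", "a", "ö", "o", "ü", "u")
)

// spellingVariants returns the distinct spellings of a name: as given,
// with umlauts expanded, and with umlauts stripped.
func spellingVariants(name string) []string {
	candidates := []string{
		name,
		expandUmlauts.Replace(name),
		stripUmlauts.Replace(name),
	}
	seen := make(map[string]bool)
	var out []string
	for _, c := range candidates {
		if !seen[c] {
			seen[c] = true
			out = append(out, c)
		}
	}
	return out
}

func main() {
	for _, v := range spellingVariants("Senecio nordenskjöldii") {
		fmt.Println(v)
	}
}
```

For Senecio nordenskjöldii this prints the original spelling, then Senecio nordenskjoeldii, then Senecio nordenskjoldii, so a matcher could try all three against its sources.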