gnames / gnparser

GNparser normalises scientific names and extracts their semantic elements.
MIT License

Infraspecific rank "f." not parsed correctly under certain conditions #147

Closed havardo closed 3 years ago

havardo commented 3 years ago

Parsing the name Cymbalaria muralis G.Gaertn, B.Mey. & Schreb. f. toutonii (A.Chev.) Cuf. does not appear to identify "f." as a rank.

However, if "f." is replaced with "var.", parsing succeeds: Cymbalaria muralis G.Gaertn, B.Mey. & Schreb. var. toutonii (A.Chev.) Cuf.

Examples using the rank form, but without authors, seem to work fine: Picea glauca var. albertiana f. conica

BTW: Great project !!!

dimus commented 3 years ago

This happens because the f. in Cymbalaria muralis G.Gaertn, B.Mey. & Schreb. f. toutonii (A.Chev.) Cuf. might mean either forma or filius, and there is no way to know which one it is.

https://github.com/gnames/gnparser#names-with-filius-icn-code

dimus commented 3 years ago

I am closing this issue, as I do not have a good algorithm for distinguishing forma from filius in such cases. Fortunately, such occasions are pretty rare, and gnparser issues a warning when it happens. So I guess the solution for now is to filter results by that warning and adjust them manually.
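
The filtering step could look roughly like the sketch below. It is only an illustration: the `result` type, the `needsReview` helper, and the warning text used as a marker are hypothetical stand-ins, not gnparser's actual output structure or wording, so the marker should be replaced by whatever warning your gnparser version really emits.

```go
package main

import (
	"fmt"
	"strings"
)

// result is a hypothetical, simplified stand-in for one parsed record.
// It is NOT gnparser's real output type.
type result struct {
	Verbatim string
	Warnings []string
}

// needsReview returns the records that carry at least one warning
// containing marker (compared case-insensitively).
func needsReview(results []result, marker string) []result {
	var out []result
	m := strings.ToLower(marker)
	for _, r := range results {
		for _, w := range r.Warnings {
			if strings.Contains(strings.ToLower(w), m) {
				out = append(out, r)
				break // one matching warning is enough
			}
		}
	}
	return out
}

func main() {
	// The warning strings here are placeholders chosen for the example.
	parsed := []result{
		{Verbatim: "Rosa banksiae R.Br. f. lutescens Voss", Warnings: []string{"ambiguous f."}},
		{Verbatim: "Picea glauca var. albertiana f. conica"},
	}
	for _, r := range needsReview(parsed, "ambiguous f.") {
		fmt.Println("review manually:", r.Verbatim)
	}
}
```

Matching on a substring rather than the full message keeps the filter tolerant of small wording changes between versions.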

havardo commented 3 years ago

Thanks for the swift response @dimus, and a fair point. The code of nomenclature is certainly flawed in this case. There might be a statistical argument for favouring forma over filius in cases like this, but as you say, they are rare.

havardo commented 3 years ago

Hi @dimus, I was just wondering if the principle below might work to distinguish forma from filius?

On the basis that all infraspecific names are lower case and not punctuated, we can say:

If an f. is followed by lowercase text, and that text is neither a recognized rank nor followed by a punctuation mark (accidental lowercase authorship), we can assume that the f. represents the rank forma and not filius.

As you say, these cases are rare.
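
A minimal sketch of this principle follows, assuming simple whitespace tokenisation. The names `knownRanks`, `isEpithetLike`, and `classifyF` are hypothetical and not part of gnparser, and the rank list is deliberately short and illustrative. The rule is a guess rather than a proof, so a result of "forma" should still be treated as ambiguous.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// knownRanks is a small illustrative set of rank markers (an assumption,
// not gnparser's own list).
var knownRanks = map[string]bool{
	"var.": true, "subsp.": true, "ssp.": true,
	"f.": true, "fo.": true, "subvar.": true, "subf.": true,
}

// isEpithetLike reports whether tok consists only of lowercase letters
// (hyphens allowed), carries no punctuation, and is not a rank marker.
func isEpithetLike(tok string) bool {
	if tok == "" || knownRanks[tok] {
		return false
	}
	for _, r := range tok {
		if r != '-' && !unicode.IsLower(r) {
			return false
		}
	}
	return true
}

// classifyF returns one guess ("forma" or "filius") for every "f." token
// in name, in order of appearance.
func classifyF(name string) []string {
	toks := strings.Fields(name)
	var guesses []string
	for i, tok := range toks {
		// Treat "f.)" inside an authorship bracket like "f.".
		if strings.TrimRight(tok, "),;") != "f." {
			continue
		}
		if i+1 < len(toks) && isEpithetLike(toks[i+1]) {
			guesses = append(guesses, "forma")
		} else {
			guesses = append(guesses, "filius")
		}
	}
	return guesses
}

func main() {
	for _, n := range []string{
		"Sanguinaria canadensis L. f. multiplex (E.H.Wilson) Weath.",
		"Amelanchier arborea f. hirsuta (Michx. f.) Fernald",
	} {
		fmt.Println(n, "=>", classifyF(n))
	}
}
```

The check looks only at the single token after f., which keeps the rule cheap to apply but means it cannot separate a true forma from an author abbreviation that happens to be followed by a lowercase epithet.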

dimus commented 3 years ago

@havardo, as I understand it, we already have this. Can you give an example?

For example, in the 'filius' section of the tests there is Amelanchier arborea f. hirsuta (Michx. f.) Fernald, where both meanings of f. are present. I think the problem with forma and filius arises in cases where even a human cannot say which it is without prior knowledge.

havardo commented 3 years ago

Thanks for taking the time to look into this a bit further @dimus. The examples below are all forma and adhere to the principle described above.

The text following f. is lowercase and not punctuated (hence neither an author nor a rank).

The parser reports them all as ambiguous:

Sanguinaria canadensis L. f. multiplex (E.H.Wilson) Weath.
Rosa banksiae R.Br. f. lutescens Voss
Prunus cerasifera Ehrh. f. stipitata Bregadze
Cupressus obtusa (Siebold & Zucc.) F.Muell. f. formosana (Hayata) Clinton-Baker

dimus commented 3 years ago

For example, if we look at Sanguinaria canadensis L. f. multiplex (E.H.Wilson) Weath.

Without additional information, how can we conclude that f. means Sanguinaria canadensis forma multiplex and not L. filius?

havardo commented 3 years ago

The text multiplex is written in lowercase and has no punctuation. We can therefore conclude that the text is neither an authorship nor a rank. Hence, the text must be an infraspecific name. We can then conclude that the preceding f. is forma and not filius.

This approach will not work if an authorship is accidentally written in lowercase, but then the name is noncompliant anyway.

dimus commented 3 years ago

We cannot be sure. For example, Ficus aspera Forster f. nota Blanco has Forster f. as an author. However, I think you are right that when such a pattern occurs, the probability that f. means forma is much higher than the probability that it means filius.

So it is better to parse f. as forma, with the same warning as before.

dimus commented 3 years ago

I created a new issue at https://github.com/gnames/gnparser/issues/154