gnames / gnparser

GNparser normalises scientific names and extracts their semantic elements.
MIT License

Parse nasty names with ambivalent specific epithets #53

Closed dimus closed 3 years ago

dimus commented 3 years ago

created by @dimus at https://gitlab.com/gogna/gnparser/-/issues/53

@diatomsRcool, @KatjaSchulz and @joelnitta found the following names:

Acrostichum nudum
Adiantum nudum
Africanthion nudum
Agathidium nudum
Aphaniosoma nudum
Aspidium nudum
Athyrium nudum
Bembidion satellites
Blechnum nudum
Bolivina prion
Boreophilia nomensis
Bottaria nudum
Erateina satellites
Gnathopleustes den
Ithomeis satellites
Lycopodium nudum
Navicula bacterium
Nephodia satellites
Nephrodium nudum
Paralvinella dela
Phelodon nomene
Polypodium nudum
Polystichum nudum
Psilotum nudum
Ruteloryctes bis
Selenops ab
Tortolena dela
Trachyphloeosoma nudum
Xestia cfuscum
Zodarion van

We need to double-check that these names are 'real' and whitelist the real ones in the rules O.o
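A minimal sketch of what such a whitelist check could look like, in Python for brevity. This is not how gnparser implements it: `AMBIGUOUS_EPITHETS`, `WHITELIST` and `classify` are hypothetical names, and the whitelist entries are just a few names copied from the list above for illustration, not verified here.

```python
# Hypothetical sketch, not gnparser code.
# Epithets from the list above that can be confused with annotations,
# author particles, or other non-epithet words.
AMBIGUOUS_EPITHETS = {
    "nudum", "satellites", "prion", "bacterium", "den", "dela",
    "bis", "ab", "van", "cfuscum", "nomensis", "nomene",
}

# Illustrative (genus, epithet) pairs only; each real entry would
# need to be checked against a taxonomic source before being added.
WHITELIST = {
    ("Navicula", "bacterium"),
    ("Bolivina", "prion"),
    ("Psilotum", "nudum"),
    ("Zodarion", "van"),
}


def classify(name: str) -> str:
    """Classify a two-word name string against the whitelist."""
    parts = name.split()
    if len(parts) != 2:
        return "other"        # this sketch only handles plain binomials
    genus, epithet = parts
    if epithet not in AMBIGUOUS_EPITHETS:
        return "regular"      # nothing ambiguous about the epithet
    if (genus, epithet) in WHITELIST:
        return "whitelisted"  # ambiguous word, but a known real name
    return "suspect"          # ambiguous word with no whitelist entry
```

Keying the whitelist on the genus and epithet together, rather than on the epithet alone, keeps a word like "nudum" suspicious everywhere except in the combinations that have been checked.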

dimus commented 3 years ago

created by @dimus at https://gitlab.com/gogna/gnparser/-/issues/43

mentioned in issue #86

dimus commented 3 years ago

created by @dimus at https://gitlab.com/gogna/gnparser/-/issues/44

changed the description

dimus commented 3 years ago

created by @dimus at https://gitlab.com/gogna/gnparser/-/issues/45

changed the description

dimus commented 3 years ago

created by @dimus at https://gitlab.com/gogna/gnparser/-/issues/46

Thanks for the kind words and more names for this ticket @joelnitta. I added them to the description of the issue.

dimus commented 3 years ago

created by @dimus at https://gitlab.com/gogna/gnparser/-/issues/47

changed the description

dimus commented 3 years ago

created by @joelnitta at https://gitlab.com/gogna/gnparser/-/issues/48

Thanks for the great program! This is a lifesaver for taxonomic workflows. These are some additions (all names of ferns) for the whitelist whenever that happens:

Acrostichum nudum, Adiantum nudum, Aspidium nudum, Athyrium nudum, Blechnum nudum, Lycopodium nudum, Nephrodium nudum, Polypodium nudum, Polystichum nudum, Psilotum nudum

dimus commented 3 years ago

created by @dimus at https://gitlab.com/gogna/gnparser/-/issues/49

Also from https://github.com/GlobalNamesArchitecture/gnparser/issues/331

Looks like "le" is used both as part of an author name and as part of a specific epithet. I also have a suspicion that names with "le" as the specific epithet really have an epithet separated by a space!

http://gni.globalnames.org/name_strings?commit=Search&page=2&search_term=sp%3Ale

We should probably build a dictionary of the cases where it is part of an author and the cases where it is part of a name, and normalize them accordingly... Big job.

KatjaSchulz commented 3 years ago

Hi Dima,

Here are a few more names for your whitelist. These are all from the current version of the Catalogue of Life (COL-2021-06-10):

Navicula bacterium (diatom)
Xestia cfuscum (moth)
Bolivina prion (foraminifer)
Bembidion satellites (beetle)
Erateina satellites (moth)
Ithomeis satellites (butterfly)
Nephodia satellites (moth)

Also, names don’t get parsed if the generic name is too short, but there are a few two-letter genera:

Do holotrichius (beetle)
Oo spinosum (arachnid)
Nu aakhu (annelid)

Maybe these could get whitelisted, too?

dimus commented 3 years ago

@KatjaSchulz thanks for more 'nasty' names, I am going to bump priority up for this issue

dimus commented 3 years ago

Oh, I thought I had all two-letter genera accounted for:

TwoLetterGenus <- ('Ca' / 'Ea' / 'Ge' / 'Ia' / 'Io' / 'Ix' / 'Lo' / 'Oa' / 'Ra' / 'Ty' / 'Ua' / 'Aa' / 'Ja' / 'Zu' / 'La' / 'Qu' / 'As' / 'Ba')
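For reference, a quick way to see which of the reported genera the rule above is missing, sketched in Python. The helper names `missing_genera` and `build_rule` are made up for this example; the genus lists are copied from the rule above and from @KatjaSchulz's comment.

```python
# Two-letter genera copied from the TwoLetterGenus rule quoted above.
KNOWN = [
    "Ca", "Ea", "Ge", "Ia", "Io", "Ix", "Lo", "Oa", "Ra",
    "Ty", "Ua", "Aa", "Ja", "Zu", "La", "Qu", "As", "Ba",
]

# Two-letter genera reported earlier in this thread.
REPORTED = ["Do", "Oo", "Nu"]


def missing_genera(known, reported):
    """Return the reported genera that the rule does not cover yet."""
    known_set = set(known)
    return [genus for genus in reported if genus not in known_set]


def build_rule(genera):
    """Render a list of genera as a PEG ordered-choice rule."""
    alternatives = " / ".join(f"'{genus}'" for genus in genera)
    return f"TwoLetterGenus <- ({alternatives})"


if __name__ == "__main__":
    missing = missing_genera(KNOWN, REPORTED)
    print(missing)
    print(build_rule(KNOWN + missing))
```

All three reported genera are absent from the current list, so the regenerated rule would end with `'Do' / 'Oo' / 'Nu'`.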