gnames / gnparser

GNparser normalises scientific names and extracts their semantic elements.
MIT License

Feedback from Bob Mesibov #245

Closed by dimus 1 year ago

dimus commented 1 year ago

(1) One problem is that gnparser adds quotes when I use the TSV output option. Originals in the Naturalis Mollusca list, followed by the gnparser output:

"""Glyptothauma"" cf ankasana" | """""""Glyptothauma"""" cf ankasana"""
"""Glyptothauma"" cf. ankasana" | """""""Glyptothauma"""" cf. ankasana"""
"""Glyptothauma"" cf. ankasana de Winter, 1996" | """""""Glyptothauma"""" cf. ankasana de Winter, 1996"""
"""Glyptothauma"" sp. 2" | """""""Glyptothauma"""" sp. 2"""
"Sepietta oweniana (D""Orbigny, 1839-1841)" | """Sepietta oweniana (D""""Orbigny, 1839-1841)"""
"Sepiola atlantica D""Orbigny, 1839-1842" | """Sepiola atlantica D""""Orbigny, 1839-1842"""
"""Triphora"" osclausum Rolán & Fernández-Garcés, 1995" | """""""Triphora"""" osclausum Rolán & Fernández-Garcés, 1995"""
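
The doubling in these pairs is consistent with a field that is already CSV-quoted being escaped a second time on output. The following is a minimal Python sketch of that effect using the standard `csv` module. It is not gnparser's Go code, and it assumes the raw, still-quoted cell text was handed to the parser as the name; `tsv_encode` and `tsv_decode` are hypothetical helper names.

```python
import csv
import io


def tsv_encode(field: str) -> str:
    """Encode one field as a TSV cell, doubling embedded quotes (CSV-style)."""
    buf = io.StringIO()
    csv.writer(buf, delimiter="\t", lineterminator="\n").writerow([field])
    return buf.getvalue().rstrip("\n")


def tsv_decode(cell: str) -> str:
    """Decode one quoted TSV cell back to its plain text."""
    return next(csv.reader([cell], delimiter="\t"))[0]


# The cell exactly as it appears in the source list (already quoted).
raw_cell = '"""Glyptothauma"" cf ankasana"'

# Encoding the raw cell again reproduces the doubled quotes seen in the output.
print(tsv_encode(raw_cell))

# Decoding first recovers the plain name, and re-encoding then round-trips.
plain_name = tsv_decode(raw_cell)
print(plain_name)
print(tsv_encode(plain_name) == raw_cell)
```

Under this assumption, decoding the input cells before parsing, or writing output without re-quoting, would avoid the extra layer of quotes.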

(2) Another issue is that "D'Orbigny" in the original is "D’Orbigny" in the gnparser output. Why change the UTF-8 byte 27 (apostrophe) to e2 80 99 (right single quotation mark)?
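
The byte values mentioned here can be checked directly. This is a small Python sketch, independent of gnparser, that prints the UTF-8 bytes of both characters and shows one way a downstream user could map the typographic quote back to the ASCII apostrophe. The helper names `utf8_hex` and `to_ascii_apostrophe` are hypothetical.

```python
APOSTROPHE = "'"               # U+0027, UTF-8 byte 27
RIGHT_SINGLE_QUOTE = "\u2019"  # U+2019, UTF-8 bytes e2 80 99


def utf8_hex(text: str) -> str:
    """Return the UTF-8 bytes of text as space-separated hex (Python 3.8+)."""
    return text.encode("utf-8").hex(" ")


def to_ascii_apostrophe(name: str) -> str:
    """Replace the right single quotation mark with the ASCII apostrophe."""
    return name.replace(RIGHT_SINGLE_QUOTE, APOSTROPHE)


print(utf8_hex(APOSTROPHE))          # 27
print(utf8_hex(RIGHT_SINGLE_QUOTE))  # e2 80 99
print(to_ascii_apostrophe("D\u2019Orbigny"))
```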

(3) regex says reject, gnparser says OK (regex_yes_gnparser_OK file)

Please see. A lot of these end with "cf/CF" or "ms/MS".

(4) regex says OK, gnparser rejects (regex_OK_gnparser_no file)

Please see. It looks like gnparser doesn't like "Genus (Subgenus)", which I would have thought OK, and worries about "Author in Author, Year". Note also that the Dutch-persons at Naturalis have used "Von dem Busch" rather than "von dem Busch".

regex_OK_gnparser_no.txt regex_yes_gnparser_OK.txt

dimus commented 1 year ago

Hm, Genus (Subgenus) should work according to these tests:

https://github.com/gnames/gnparser/blob/master/testdata/test_data.md#combination-of-two-uninomials


Name: Aaleniella (Danocythere)

Canonical: Aaleniella subgen. Danocythere


Name: Cordia (Adans.) Kuntze sect. Salimori

Canonical: Cordia sect. Salimori


Name: Calathus (Lindrothius) KURNAKOV 1961

Canonical: Calathus subgen. Lindrothius


Can you add examples that show your cases?

dimus commented 1 year ago

Can you please show examples for the worries about "Author in Author, Year"?

dimus commented 1 year ago

Looks like I need to add "dem" as an author word: Von dem Busch. I'll check whether "dem" ever occurs as a specific epithet.

Mesibov commented 1 year ago

@dimus, sorry, I wasn't paying attention to this issue. The "Genus (Subgenus)" and "Author in Author, Year" cases I was thinking of can be found in regex_OK_gnparser_no.txt (https://github.com/gnames/gnames/files/12587991/regex_OK_gnparser_no.txt). Both forms throw up a quality rating of 2.

Please also note that in "Eutrochatella babei (Arango y Molina, 1876)", the "y" is part of the author's surname, so the quality 2 indicator "Spanish 'y' is used instead of '&'" does not apply.

dimus commented 1 year ago

Thank you @Mesibov for the explanation. I do think that "y" should decrease the quality, because there are many other languages that people can use for the "and" word, and doing so will create a mess. So I decided to limit "and" words to "and" and "&". I personally would prefer "et" though :)

I am not sure what to do if "y" is part of the author name. I guess I need to add exceptions and hardcode such authors into gnparser.

Added https://github.com/gnames/gnparser/issues/251
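
As a rough illustration of the exception idea, here is a hypothetical Python sketch. It is not gnparser's implementation: the exception set, the function name `spanish_y_warning`, and the "Smith y Jones" test string are made up for the example, and only "Arango y Molina" comes from this thread.

```python
# Hypothetical exception list: surnames in which "y" is part of the name.
# Only "Arango y Molina" is taken from the thread; a real list would be longer.
SURNAMES_WITH_Y = {"Arango y Molina"}


def spanish_y_warning(authorship: str) -> bool:
    """Return True if a standalone "y" appears to join two authors.

    Known surnames containing "y" are removed first, so they do not
    trigger the warning.
    """
    text = authorship
    for surname in SURNAMES_WITH_Y:
        text = text.replace(surname, "")
    # Pad with spaces so a "y" at either end is still seen as a whole word.
    return " y " in f" {text} "


print(spanish_y_warning("Arango y Molina, 1876"))  # surname, no warning
print(spanish_y_warning("Smith y Jones, 1900"))    # made-up authors, warning
```

A plain string match like this would miss spelling variants, so a hardcoded list of this kind would probably need curation.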

dimus commented 1 year ago

In the case of "Genus (Subgenus)" and "Author in Author", the quality was decreased after a discussion with Paddy Patterson about these two issues. For botanical names "Author in Author" is actually valid, so I am on the fence about it. For "Genus (Subgenus)" I can double-check with ICZN folks.

dimus commented 1 year ago

I did try to address most of the problems in v1.7.5.