gnames / bhlindex

BHLindex is used by the Biodiversity Heritage Library to create its scientific names index.
MIT License

Name duplication #69

Open Teinostoma opened 3 months ago

Teinostoma commented 3 months ago

BHL is listing several names two or three times on the same page. For example, https://www.biodiversitylibrary.org/item/98172#page/97/mode/1up has Scapharca, Scapharca Gray, 1847, and Scapharca J. E. Gray, 1847; Anadara is listed both by itself and as Anadara J. E. Gray, 1847.
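
To illustrate the kind of grouping that would collapse these duplicates, here is a minimal sketch. It is not how BHLindex works internally; the function names `canonical_form` and `group_by_canonical` are hypothetical, and the authorship stripping is a crude heuristic (keep the first word plus any following all-lowercase words). A real name parser would be needed for cases such as lowercase author particles ("van", "de").

```python
from collections import defaultdict


def canonical_form(name_string):
    """Return the name without authorship, using a crude heuristic.

    Assumption: the first token is the genus (or uninomial), and any
    following all-lowercase alphabetic tokens (hyphens allowed) are epithets.
    Everything from the first other token onward is treated as authorship.
    """
    tokens = name_string.split()
    if not tokens:
        return ""
    kept = [tokens[0]]
    for token in tokens[1:]:
        bare = token.replace("-", "")
        if bare.isalpha() and token.islower():
            kept.append(token)
        else:
            break  # authorship, year, or punctuation starts here
    return " ".join(kept)


def group_by_canonical(name_strings):
    """Group name-strings that share the same canonical form."""
    groups = defaultdict(list)
    for name_string in name_strings:
        groups[canonical_form(name_string)].append(name_string)
    return dict(groups)


# The five name-strings reported for the example page.
page_names = [
    "Scapharca",
    "Scapharca Gray, 1847",
    "Scapharca J. E. Gray, 1847",
    "Anadara",
    "Anadara J. E. Gray, 1847",
]
grouped = group_by_canonical(page_names)
for canonical, variants in grouped.items():
    print(canonical, "<-", variants)
```

With grouping like this, the page would show two entries (Scapharca and Anadara) instead of five, and the authorship variants could still be kept as alternate spellings under each entry.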

There are also the usual issues: badly inadequate OCR; the challenge of distinguishing words that are homonyms of taxonomic names from actual scientific names (e.g., Florida and Alligator are geographic terms in the text, not taxa); failure to recognize most of the species names that actually are on the page; and reports of a species that isn't in the text in any form. This last problem seems to reflect an OCR error in which a word in the text is misread as a common specific epithet and the program somehow picks a genus to go with it. It might help somewhat to tell the program not to consider any taxon described later than the publication date of the work.
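
The date check suggested above could look roughly like the following sketch. Everything here is an assumption for illustration: the function names are hypothetical, the description year is read from the authorship in the name-string (in practice it would more likely come from the matched reference record), the publication year 1900 is a placeholder, and "Aus bus Smith, 1950" is a made-up name. Names with no year are kept, since there is nothing to compare.

```python
import re

# Matches four-digit years from 1700 through 2099.
YEAR_PATTERN = re.compile(r"\b(1[7-9]\d{2}|20\d{2})\b")


def earliest_year(name_string):
    """Return the earliest year found in the name-string, or None.

    The earliest year is used because a string such as
    "(Say, 1822) Dall, 1898" refers to a taxon first described in 1822.
    """
    years = [int(year) for year in YEAR_PATTERN.findall(name_string)]
    return min(years) if years else None


def plausible_for_publication(name_strings, publication_year):
    """Keep names that could have existed when the work was published."""
    kept = []
    for name_string in name_strings:
        year = earliest_year(name_string)
        if year is None or year <= publication_year:
            kept.append(name_string)
    return kept


# Placeholder data: the last name is invented to show a rejected candidate.
candidates = ["Scapharca Gray, 1847", "Anadara", "Aus bus Smith, 1950"]
filtered = plausible_for_publication(candidates, publication_year=1900)
print(filtered)
```

A filter like this would only remove candidates that are impossible for the work's date; it would not address the homonym problem (Florida, Alligator) or the missed names.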