gnames / bhlindex

BHLindex is used by Biodiversity Heritage Library to create their scientific names index
MIT License

Name occurrence verification needs #67

Open Teinostoma opened 1 year ago

Teinostoma commented 1 year ago

OCR often does very poorly on documents in BHL, and the list of names being searched for is very incomplete, at least when it comes to fossil mollusks. Authors also did not make this easy, often using idiosyncratic ways of abbreviating. As a result, both the false positive and false negative rates are very high in the documents that I am reading on BHL. A few ideas:

Is there a way to take the date of the publication into consideration? Names published after a publication was written will not be found in that publication (for example, the word lens will not be a reference to the genus Lens Simpson, 1900 in publications from the 1800s). This would help decrease false positives.

Is there a way to allow users to quickly indicate "here is a name missed by the system", "this is correct", "this name finding is spurious", etc.? It would require verification to protect against trolling or errors, but could be a useful way to improve the name finding.

Is there a way to take context into account to identify higher taxonomic levels? This is especially of value for homonyms. For example, being able to search for references that contain both Auricularia and Mollusca would avoid the huge number of hits for the fungus Auricularia.
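The date and context ideas above can be sketched as a simple filter. This is only an illustration under stated assumptions, not how BHLindex works: `NameRecord`, `plausible_records`, and `RECORDS` are hypothetical names, and the years given for Auricularia are placeholders rather than real publication dates.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NameRecord:
    name: str                           # genus or other canonical name
    year: int                           # year the name was published
    higher_taxa: tuple[str, ...] = ()   # higher taxa the name belongs to


def plausible_records(token, page_text, publication_year, records):
    """Return the records for `token` that pass the date and context checks."""
    text = page_text.casefold()
    # Date check: a name cannot occur in a work published before the name existed.
    dated = [
        r for r in records
        if r.name.casefold() == token.casefold() and r.year <= publication_year
    ]
    # Context check: prefer homonyms whose higher taxon is mentioned in the text.
    in_context = [
        r for r in dated
        if any(t.casefold() in text for t in r.higher_taxa)
    ]
    # Fall back to the date-filtered list when no context is available.
    return in_context or dated


RECORDS = [
    NameRecord("Lens", 1900),  # Lens Simpson, 1900 (the example above)
    # Placeholder years below, not real publication dates.
    NameRecord("Auricularia", 1800, ("Mollusca",)),
    NameRecord("Auricularia", 1800, ("Fungi",)),
]
```

With these records, the word "lens" in a work from 1850 yields no candidates, and a page that mentions Mollusca keeps only the mollusk homonym of Auricularia. A real implementation would need a more careful notion of "nearby" than a whole-page substring search.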

dimus commented 1 year ago

Thank you for your feedback @Teinostoma. Currently there is a hybrid approach to BHL names: some names were found by earlier methods that used an n-gram approach, and others by BHLindex. BHLindex creates very few false positives, but misses many names with OCR errors. The names found before BHLindex, using n-grams, include many names with OCR errors, but they also contain a lot of false positives. We are thinking about ways to fix this problem.
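The trade-off between the two approaches can be illustrated with a toy example. This sketch is not the n-gram method or BHLindex's implementation: it uses Python's standard `difflib` similarity as a stand-in for any fuzzy matcher, and `KNOWN_NAMES`, `exact_match`, `fuzzy_match`, and the 0.85 cutoff are assumptions chosen for illustration.

```python
import difflib

# A tiny stand-in for a names dictionary.
KNOWN_NAMES = ["Nucula hammeri aalensis", "Architectonica abbottii"]


def exact_match(candidate, known=KNOWN_NAMES):
    """Strict lookup: few false positives, but OCR errors are missed."""
    return candidate if candidate in known else None


def fuzzy_match(candidate, known=KNOWN_NAMES, cutoff=0.85):
    """Tolerant lookup: catches OCR errors; a lower cutoff admits more noise."""
    hits = difflib.get_close_matches(candidate, known, n=1, cutoff=cutoff)
    return hits[0] if hits else None
```

An OCR variant such as "Nucula hammen aalensis" fails the exact lookup but is recovered by the fuzzy one. Lowering `cutoff` recovers more damaged names at the cost of more spurious matches, which is the same tension described above.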

Do you know a good resource for fossil mollusk names? Currently, I think, there are 3 sources of mollusk names: WoRMS, PaleoBioDB, and, for old names, Sherborn's Index Animalium.

Context, usage of dates, figuring out how to reconcile abbreviated names: these are definitely ways to improve name-finding. I am actually writing a grant about exactly that.

Implementing user feedback for names would be out of scope for BHLindex, and more of a decision for the BHL folks (@mlichtenberg, @cajunjoel). @gdower also might be interested.

Teinostoma commented 1 year ago

I believe that WoRMS has all the names from MolluscaBase, so I don't think MolluscaBase would need separate attention.

Paleobiology Database doesn't have very thorough coverage of many mollusc faunas; most of the attention has gone to "what are things you can do with this data" rather than to supporting data generation and quality control (a common problem of large biodiversity databases).

Ruhoff (https://repository.si.edu/handle/10088/5331) adds a couple of decades beyond Sherborn, though it is not quite as thorough. Fossils were not included in the Zoological Register for a while, so it does not help with them for the first few decades.

-- Dr. David Campbell Associate Professor, Geology Department of Natural Sciences 110 S Main St, #7270 Gardner-Webb University Boiling Springs NC 28017

dimus commented 1 year ago

Does Ruhoff exist as curated digitized data? I tried to OCR it using Adobe Acrobat and found that the final result contains a fair number of errors: F_A_Ruhoff_Mollusca_1850_1870.txt (https://github.com/gnames/bhlindex/files/11763335/F_A_Ruhoff_Mollusca_1850_1870.txt)

If the OCR errors are corrected in the file (from the species epithet to the year), it would be fairly easy to convert it into a data source.

Teinostoma commented 1 year ago

I don't know of a curated version of Ruhoff; I mostly use my print copy, which doesn't help much with what you need.

dimus commented 1 year ago

@Teinostoma, I did my best to clean up the OCR for the names in Ruhoff. Would it be OK for you to look at the result and tell me what you think?

https://github.com/gnames/ds-ruhoff-mollusca/blob/master/data/07-fmt-names.csv

To avoid problems with UTF-8, it is better to use LibreOffice instead of Excel.

I only paid attention to the names themselves (1st and 2nd columns); the metadata after the names is not as clean. If/when the names are clean enough, I can add them to https://verifier.globalnames.org and use them in bhlindex.

I tried to reconcile them against other datasets, and it looks like about half of them are new to my data.
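As a rough picture of what such a reconciliation involves, here is a minimal sketch that reports which candidate names are absent from an existing set. It assumes exact matching after whitespace and case normalization only; the real reconciliation may well use fuzzier matching, and `normalize` and `new_names` are hypothetical helpers.

```python
def normalize(name):
    """Collapse whitespace and ignore case so trivial differences do not count."""
    return " ".join(name.split()).casefold()


def new_names(candidates, existing):
    """Return the candidates that are not present in `existing`, in input order."""
    known = {normalize(n) for n in existing}
    return [n for n in candidates if normalize(n) not in known]
```

For example, `new_names(["Nucula  hammeri", "Architectonica abbottii"], ["nucula hammeri"])` returns only `"Architectonica abbottii"`.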

Teinostoma commented 1 year ago

It looks like a good start. I noticed two corrections for the first page: in Nucula hammen aalensis, hammen is an error for hammeri, and Architectonica abbottii Gabb, 1861 is missing. But that's far better than the OCR.

dimus commented 1 year ago

Thank you @Teinostoma! I added a fix: https://github.com/gnames/ds-ruhoff-mollusca/commit/12994501255e1ceec2be9deb65dbfaa001291b1c

I currently have 3 levels of curation quality: "curated", when I am pretty sure that specialists have made a significant effort to scrutinize the data; "auto-curated", when cleaning is done mostly by scripts; and the rest, when curation is unknown.

For example, IRMNG is considered to be curated, GBIF is auto-curated, and ION is not curated.

Do you think it is good enough to apply "auto-curated" to the data? It would push its matching results above 'non-curated' names.
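The effect of a curation level on result order can be sketched as follows. This is an illustration only: the verifier's actual scoring is not described in this thread, so treating curation as the primary sort key is an assumption, and `Curation`, `SOURCE_CURATION`, `rank_matches`, and the numeric scores are hypothetical.

```python
from enum import IntEnum


class Curation(IntEnum):
    NOT_CURATED = 0    # curation unknown
    AUTO_CURATED = 1   # cleaned mostly by scripts
    CURATED = 2        # scrutinized by specialists


# Levels for IRMNG, GBIF, and ION follow the examples above;
# "Ruhoff" shows the level proposed for the new dataset.
SOURCE_CURATION = {
    "IRMNG": Curation.CURATED,
    "GBIF": Curation.AUTO_CURATED,
    "ION": Curation.NOT_CURATED,
    "Ruhoff": Curation.AUTO_CURATED,
}


def rank_matches(matches):
    """Sort (source, score) pairs by curation level, then by score, best first."""
    return sorted(
        matches,
        key=lambda m: (SOURCE_CURATION.get(m[0], Curation.NOT_CURATED), m[1]),
        reverse=True,
    )
```

Under this ordering, a Ruhoff match with the same score as an ION match is listed above it, but below an IRMNG match.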

Teinostoma commented 1 year ago

That seems the right level to me.

dimus commented 1 year ago

I attempted to detect more elusive typos. It looks like about 25% of the names in the publication are new to https://verifier.globalnames.org/

https://raw.githubusercontent.com/gnames/ds-ruhoff-mollusca/master/data/08-reconsile.csv