gsautter / goldengate-imagine

Automatically exported from code.google.com/p/goldengate-imagine

taxonomicName: not detected CheckList_article_21049 FFA6FFD6FD73FFC3874EFFD8FFE56E14 #547

Open myrmoteras opened 5 years ago

myrmoteras commented 5 years ago

CheckList_article_21049

I fixed all of them manually

In this article, many taxonomic names have not been properly detected.

And when extending the names and running "parse taxon name", the result is not complete.

gsautter commented 5 years ago

A CoL lookup for family "Reduviidae" gives a pretty good idea why we're getting this result: http://www.catalogueoflife.org/col/webservice?response=full&name=Reduviidae

Now if you go through the child genera (under <child_taxa>), you see that most of the missed genera are not listed there. This is a hard one: the article has plain binomials only (i.e., no subspecies or varieties whose most significant epithet would bear an explicit label, "subsp." or "var."), and no original descriptions or recombinations whose taxon status labels would provide any sure-fire hints. In brief, there is precious little to go on in terms of recovering taxon names not vetted by the catalogs.
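To illustrate the comparison described above, here is a minimal Python sketch that collects the genera listed under <child_taxa> in a lookup response and reports which candidate genera are missing. The sample XML is made-up placeholder data, and every element name other than <child_taxa> (`results`, `result`, `taxon`, `name`, `rank`) is an assumption about the response layout, not taken from this thread; `child_genera` and `uncataloged` are hypothetical helper names.

```python
import xml.etree.ElementTree as ET

# Made-up, abbreviated response. Only <child_taxa> is named in the thread;
# the other element names are assumptions about the response layout.
SAMPLE_RESPONSE = """
<results>
  <result>
    <name>Reduviidae</name>
    <rank>Family</rank>
    <child_taxa>
      <taxon><name>Aus</name><rank>Genus</rank></taxon>
      <taxon><name>Bus</name><rank>Genus</rank></taxon>
    </child_taxa>
  </result>
</results>
"""

def child_genera(xml_text):
    """Return the set of genus names listed under <child_taxa>."""
    root = ET.fromstring(xml_text)
    genera = set()
    for taxon in root.findall("./result/child_taxa/taxon"):
        rank = (taxon.findtext("rank") or "").strip().lower()
        name = (taxon.findtext("name") or "").strip()
        if rank == "genus" and name:
            genera.add(name)
    return genera

def uncataloged(candidates, xml_text):
    """Return the candidate genera that the catalog response does not list."""
    known = child_genera(xml_text)
    return sorted(g for g in candidates if g not in known)
```

Calling `uncataloged(["Aus", "Cus"], SAMPLE_RESPONSE)` on the placeholder data reports `Cus` as the only genus the response does not vouch for.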

Using the italics alone, maybe in combination with the presence of fairly regular authorities, might work in this case, but would likely incur an enormous number of false positives in many other articles, especially ones richer in data per taxon. I have to think about how we could attack this.
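A rough Python sketch of what "italics plus a regular authority" could look like as a heuristic. This is not how FAT works; the regular expression, the `italic_spans` input (character offsets of italic text) and the function name are all assumptions made for illustration, and the pattern only covers simple "Author, Year" authorities.

```python
import re

# Binomial followed by an authority with a four-digit year, e.g.
# "Aus bus Smith, 1900" or "Aus bus (Smith & Jones, 1900)".
# Hypothetical pattern, deliberately narrow.
BINOMIAL_WITH_AUTHORITY = re.compile(
    r"\b([A-Z][a-z]+) ([a-z]{2,})\s+"
    r"\(?([A-Z][A-Za-z.\-]+(?: (?:&|and|et) [A-Z][A-Za-z.\-]+)?),? (\d{4})\)?"
)

def candidate_names(text, italic_spans):
    """Return binomials that are fully italic and followed by an authority.

    italic_spans is a list of (start, end) character offsets marking
    italic text; how these are obtained is outside this sketch.
    """
    hits = []
    for match in BINOMIAL_WITH_AUTHORITY.finditer(text):
        start, end = match.start(1), match.end(2)
        # Require the whole binomial to sit inside one italic span.
        if any(s <= start and end <= e for s, e in italic_spans):
            hits.append(f"{match.group(1)} {match.group(2)}")
    return hits
```

The false-positive risk mentioned above shows up directly here: any italicized two-word phrase that happens to precede a capitalized word and a year would pass, which is why this alone is unlikely to be safe on articles with more running text.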

gsautter commented 5 years ago

I've been running FAT with the fixes from #541 on this checklist several times now, and it looks like it's working just fine. All the taxa come up, all the families as well, and also most of the subfamilies and tribes. Either CoL added a major update in the past couple of days, or it just works now.

Do you have more of these CheckList articles to run a few more tests with? And anyway, since this is a Pensoft journal, shouldn't we be harvesting it as TaxPub?

myrmoteras commented 5 years ago

Checklist is PDF only. No taxpub. You can get any of the checklist articles from the checklist journal site at Pensoft. It makes more sense though right now to use EJT articles with botanical content.

gsautter commented 5 years ago

OK, fair enough. However, as stated in #541, it looks as though FAT handles botanical names pretty well now.

The specific challenge with checklists (in general) is that they contain barely anything that helps FAT detect previously unknown names (unknown to CoL, IPNI, and GBIF, that is): no "new " status labels, barely any epithets of non-primary ranks, let alone labeled ones, etc. That means checklists require alternative approaches to verifying or excluding non-cataloged genera and species, e.g., style consistency (in a binomial, either both genus and species are in bold and/or italics, or neither is), as implemented in #541. Another idea I have yet to implement is to use the position of potential taxon names in a text block. Checklists barely ever mention any taxon names except as taxa proper (from what I've seen), i.e., there is no in-treatment discussion of differences to, or description of interactions with, other taxa. This puts the taxa in a unique position, and the frequently seen restriction to primary ranks from genera downward (i.e., genera and plain binomials) deprives FAT of all of its usual clues (e.g., a "subsp." epithet label inside a name or a "spec. nov." label right after it). All that remains is catalog lookups and font style, plus the aforementioned position within a text block. An interesting puzzle for sure ...
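The style consistency rule can be sketched in a few lines of Python. This is one reading of the rule as stated here, not the implementation from #541: a token counts as emphasized if it is bold or italic, and a binomial passes if genus and species agree. The `Token` type and the function names are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Token:
    """A word with its font style flags (hypothetical structure)."""
    text: str
    bold: bool = False
    italic: bool = False

def is_emphasized(token):
    """A token counts as emphasized if it is in bold and/or italics."""
    return token.bold or token.italic

def style_consistent(genus, species):
    """True if both parts of the binomial are emphasized, or neither is."""
    return is_emphasized(genus) == is_emphasized(species)
```

A candidate such as an italic genus followed by a plain-text word fails the check, while a fully italic or fully plain pair passes and still has to be judged by the other clues.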

I'm aware this is not one of our primary concerns right now, so I won't pursue it for the time being. However, in the long run checklists are a treasure trove of non-cataloged names and occurrence data alike, so we might want to keep this in mind for later ... there is a lot of data (and references to original descriptions) to be harvested from checklists.

gsautter commented 5 years ago

One more question: How common, if at all, are checklists outside zoology?

I see floras are kind of checklists as well (in their summarizing and subsuming nature), but far more detailed, and less restricted to "Aus bus" binomials (i.e., richer in the explicit taxon name clues listed in my previous post). But this question is more about potentially applicable zoology-specific filters, e.g., ones removing false positives like "Parana Forest" (with "Forest" read as the authority and "Parana" being a valid Hymenoptera genus whose parent family Braconidae is even mentioned explicitly in the article; see http://www.catalogueoflife.org/col/webservice?response=full&name=Parana) from the example checklist of this issue.
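As a sketch of what such a filter might look like in Python: it drops a genus-plus-authority candidate when the supposed authority starts with a habitat word and carries no year, leaning on the zoological habit of citing authorities with a year. The word list, the function names and the (genus, authority) pair format are all hypothetical, and nothing like this exists in FAT as far as this thread shows.

```python
# Hypothetical stop list of words that follow place names, not authors.
HABITAT_WORDS = frozenset({"forest", "river", "park", "reserve", "basin"})

def is_probable_place_name(authority):
    """True if the 'authority' looks like a habitat word without a year."""
    words = [w.strip("(),") for w in authority.split()]
    words = [w for w in words if w]
    has_year = any(w.isdigit() and len(w) == 4 for w in words)
    first = words[0].lower() if words else ""
    return first in HABITAT_WORDS and not has_year

def filter_candidates(candidates):
    """Keep (genus, authority) pairs that do not look like place names."""
    return [(g, a) for g, a in candidates if not is_probable_place_name(a)]
```

With this, `("Parana", "Forest")` is removed while a pair with an ordinary "Author, Year" authority is kept; an author who is actually named Forest survives only when cited with a year, which is the main weakness of the approach.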