BigBatch processing: issues to cover

myrmoteras commented 1 year ago

Batch processing of all the treatments should cover the following issues:

Accession numbers
SubSubSection types normalization
vernacularName
COL IDs on treatment taxon names and cited taxon names

flsimoes commented 1 year ago

Taxonomic status normalization (e.g. "sp. nov.", "sp. n." and "n. sp.")
Normalization of Animalia/Metazoa
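
As an illustration of the status normalization point, here is a minimal Python sketch. It assumes "sp. nov." is picked as the canonical label for the three variants named above; the thread does not fix a reference vocabulary, so the `CANONICAL_STATUS` table and the `normalize_status` helper are hypothetical.

```python
import re

# Hypothetical mapping: the three variants come from the comment above,
# the choice of "sp. nov." as the canonical form is an assumption.
CANONICAL_STATUS = {
    "sp. nov.": "sp. nov.",
    "sp. n.": "sp. nov.",
    "n. sp.": "sp. nov.",
}


def normalize_status(raw: str) -> str:
    """Trim, lower-case, collapse whitespace, then map known variants."""
    key = re.sub(r"\s+", " ", raw.strip().lower())
    # Values not in the table are returned cleaned but otherwise unchanged.
    return CANONICAL_STATUS.get(key, key)


for value in ["sp. n.", "N. SP.", "  sp.  nov. ", "comb. nov."]:
    print(repr(value), "->", repr(normalize_status(value)))
```

Unknown labels pass through in cleaned form, so a run over the corpus can list what still needs a vocabulary entry instead of guessing.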

gsautter commented 1 year ago

A few more points:

remove generic number annotations
normalize annotation types (see full type list for details)

gsautter commented 1 year ago

One more point that just came up in our discussion here in Oslo:

annotate individual keywords
add keywords to article stats

flsimoes commented 1 year ago

Species defined as "sp. 1", "sp. 2", "sp. 3", etc. should be "undetermined" instead.

myrmoteras commented 1 year ago

No, we should make them status=undet. "sp. 1" is a defined name that is often cited later, for example in a revision where the specimen then gets a taxonomic name.

myrmoteras commented 1 year ago

Clean up taxonomic status and come up with a reference vocabulary for at least the common terms, like sp. nov.: https://tb.plazi.org/GgServer/srsStats/stats?outputFields=tax.status&groupingFields=tax.status&format=HTML

myrmoteras commented 1 year ago

Find a solution to normalize types (typeStatus in material citation attributes) to enable, among other things, more efficient searches: https://tb.plazi.org/GgServer/srsStats/stats?outputFields=matCit.typeStatus&groupingFields=matCit.typeStatus&format=HTML

If we do the attributes, we should start using a reference vocabulary there.

gsautter commented 1 year ago

@myrmoteras that's a lot harder to filter by, and far more prone to incurring incorrect cross-article matches ... species="undefined" and species="undetermined" are way easier to catch and filter than a somewhat arbitrary scheme like "sp. 1, sp. 2, sp. 3" or "sp. A, sp. B, sp. C" or "sp. ", as the latter require pattern matching to a certain extent, which vastly complicates query processing. Plus, we had agreed on the "undefined", "unknown", "undetermined", "uncertain" scheme, and implemented respective mechanics throughout our systems ... it's only older data that we need to catch up to that level.
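
To make the filtering argument concrete, here is a small Python sketch contrasting the two approaches. The regular expression covers only the placeholder shapes mentioned in this comment ("sp.", "sp. 1", "sp. A") plus an assumed "spec." spelling; the names `PLACEHOLDER`, `FIXED_PLACEHOLDERS`, and both helper functions are hypothetical and do not reflect how the actual systems query.

```python
import re

# Assumed placeholder shapes: "sp." or "spec.", optionally followed by a
# number or a single letter. Real data may contain other variants.
PLACEHOLDER = re.compile(r"(?:sp|spec)\.(?:\s*(?:\d+|[A-Za-z]))?", re.IGNORECASE)

# The agreed scheme quoted above; an exact set lookup is enough to catch it.
FIXED_PLACEHOLDERS = {"undefined", "unknown", "undetermined", "uncertain"}


def is_pattern_placeholder(species: str) -> bool:
    """Pattern matching, needed for the 'sp. 1' / 'sp. A' style."""
    return PLACEHOLDER.fullmatch(species.strip()) is not None


def is_fixed_placeholder(species: str) -> bool:
    """Exact match, sufficient for the fixed vocabulary."""
    return species in FIXED_PLACEHOLDERS


for value in ["sp. 1", "sp. A", "sp. ", "corax", "undetermined"]:
    print(repr(value), is_pattern_placeholder(value), is_fixed_placeholder(value))
```

The exact lookup has no edge cases to tune, whereas the pattern has to be extended every time a new placeholder style shows up in older data.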

flsimoes commented 1 year ago

That's what I had understood as well...

myrmoteras commented 1 year ago

@flsimoes @gsautter can you point out a couple of examples, to make sure we are talking about the same thing?

In my view, in the taxonomicName attributes, Corvus sp. 1 should be attributed as follows:

genus = Corvus
species = sp. 1
status = undefined

In taxonomy, "sp. 1" has a connotation: there is a species that has no proper name. It is later referred to as Corvus sp. 1 in Treatment UUID=XXX in publication UUID=YYYY.

This also makes it possible to differentiate between sp. 1 and sp. 2 in a genus in the same publication.

flsimoes commented 1 year ago

Taxonomically speaking, I fully agree with your rationale.

We discussed the same point in this issue: https://github.com/plazi/conversion/issues/13#issuecomment-1072609723

gsautter commented 1 year ago

Well, the status attribute normally is for status labels like "spec. nov." or "comb. nov.", and only for that.

And while I do see the citation part, the implied connection between "Corvus sp. 1 Smith 1900" and "Corvus sp. 1 Jones 1986" just doesn't exist, generating a pattern of homonyms that is way harder to filter downstream than a well-defined single placeholder "undefined" ... Plus, if getting at the species in the citation context is the goal, there is always the verbatim annotation value "Corvus sp. 1", so attributes aren't the only way of accessing this.

gsautter commented 1 year ago

Another point: remove URL prefix from DOIs to simplify comparison.
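
A minimal Python sketch of what stripping the prefix could look like. The set of prefixes handled (`https://doi.org/`, `http://dx.doi.org/`, `doi:`) is an assumption about what occurs in the data, the function names are hypothetical, and the DOIs in the example are made up for illustration.

```python
import re

# Assumed prefixes; extend if the corpus contains other resolver forms.
_DOI_PREFIX = re.compile(r"^(?:https?://(?:dx\.)?doi\.org/|doi:\s*)", re.IGNORECASE)


def bare_doi(value: str) -> str:
    """Return the DOI without a resolver URL or 'doi:' prefix."""
    return _DOI_PREFIX.sub("", value.strip())


def same_doi(a: str, b: str) -> bool:
    """Compare two DOIs; DOI names are case-insensitive."""
    return bare_doi(a).lower() == bare_doi(b).lower()


# Made-up example values, not real records.
print(bare_doi("https://doi.org/10.1234/Example.5678"))
print(same_doi("http://dx.doi.org/10.1234/EXAMPLE.5678", "doi:10.1234/example.5678"))
```

With the prefix gone, comparison reduces to a case-insensitive string equality check.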

gsautter commented 1 year ago

Another point:

normalize taxonomicName status attribute to lower case, especially getting rid of all-caps (example: https://tb.plazi.org/GgServer/html/875F87E22D65FFC7D359FA429E1C79F5)

gsautter commented 1 year ago

Yet another point:

remove stale approvalRequiredFor_<XYZ> document attributes to free them up for use by user certification authority
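
As a rough sketch of that cleanup, assuming the document attributes are available as a plain name-to-value mapping (the actual document model is not shown in this thread, and the attribute names in the example are invented):

```python
STALE_PREFIX = "approvalRequiredFor_"


def drop_stale_approval_attributes(attributes: dict[str, str]) -> dict[str, str]:
    """Return a copy of the attributes without any approvalRequiredFor_* entries."""
    return {
        name: value
        for name, value in attributes.items()
        if not name.startswith(STALE_PREFIX)
    }


# Invented example attributes, for illustration only.
doc_attributes = {
    "docTitle": "Example article",
    "approvalRequiredFor_someUser": "true",
    "approvalRequiredFor_otherUser": "true",
}
print(drop_stale_approval_attributes(doc_attributes))
```

Returning a new mapping rather than deleting in place keeps the original around for a before/after check during the test runs.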

myrmoteras commented 1 year ago

@flsimoes @tcatapano please check whether we can go ahead to run the big batch. Please check by end of March 2023.

tcatapano commented 1 year ago

@myrmoteras: it's not clear to me what the task is. I haven't been involved in this issue, so I don't have any context.

gsautter commented 1 year ago

@tcatapano the whole effort is basically a full-corpus cleanup operation to get rid of erroneous annotation types, normalize values of certain attributes, run now-standard linking jobs and QC on the older portion of our data, etc. The sheer number of documents to process sort of justifies more thorough planning than a 200-document job would require.

Meaning to say: if you can think of anything we kind of did wrong in the past, but never got around to cleaning up the existing mess after resolving the issue for prospective documents, cleaning up the earlier documents is something to add to this list ...

flsimoes commented 1 year ago

@flsimoes @tcatapano please check whether we can go ahead to run the big batch. Please check by end of March 2023.

Yes, no further comments on my part.

tcatapano commented 1 year ago

I don't think it is worth delaying anything. Go ahead.

gsautter commented 10 months ago

All implemented now, just need to (a) figure out how to bundle up all the gizmos and (b) run a few tests afterwards.

myrmoteras commented 10 months ago

Thanks, Guido. Good luck, and tell us when you need us to check the output.

Donat

gsautter commented 10 months ago

Intermediate result after some 60 hours: about 12K IMFs processed, another 7K in the feed hopper, basically all IMFs uploaded before Jan 01, 2019 (first half of 2016 alone is 14K IMFs).

Looking good so far: the impact on day-to-day operations is minimal (according to POA). Only the export queues are quite full, as expected, and there is the current issue with Zenodo, which is why the updates are currently held back from going there (to be addressed at the sprint).

plazi / treatmentBank

BigBatch processing: issues to cover #73