plazi / ggxml2taxpub

Conversion of GoldenGATE XML to JATS/TaxPub at treatment level

include ids for taxon-name #68

Open myrmoteras opened 1 year ago

myrmoteras commented 1 year ago

annotation id

when CoL;

myrmoteras commented 1 year ago

The example we used: https://tb.plazi.org/GgServer/xml/B96E131F47554175FCDFF66E7647E2D5

flsimoes commented 1 year ago

image

tcatapano commented 1 year ago

TaxPub: use object-id

JATS 1.3: use named-content with @vocab-identifier and @vocab-term-identifier

Note the pending issue on adding vocab attributes to tp:taxon-name: https://github.com/plazi/TaxPub/issues/83
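A minimal sketch of what the JATS 1.3 option could look like when generated programmatically. The function name, the `vocab` label, and the vocabulary URI are assumptions made for illustration only; the thread names just the @vocab-identifier and @vocab-term-identifier attributes, and the ID is the CoL ID quoted later in this thread.

```python
import xml.etree.ElementTree as ET

# Assumption: a URI standing in for the CoL vocabulary; not taken from the thread.
COL_VOCAB_URI = "https://www.catalogueoflife.org/"


def named_content_for_taxon(name_text, col_id):
    """Build a JATS <named-content> element carrying a CoL ID (hypothetical helper)."""
    element = ET.Element(
        "named-content",
        {
            "content-type": "taxon-name",
            "vocab": "CoL",  # assumption: human-readable vocabulary label
            "vocab-identifier": COL_VOCAB_URI,
            "vocab-term-identifier": col_id,
        },
    )
    element.text = name_text
    return element


jats_xml = ET.tostring(
    named_content_for_taxon("B. abyssinica subsp. abyssinica", "BP8N3"),
    encoding="unicode",
)
print(jats_xml)
```

Whether the term identifier should be the bare ID or a full URI is left open here; the sketch uses the bare ID.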

flsimoes commented 1 year ago

@myrmoteras Here's a CoL ID example

What Guido explained to me is that these have been added to new uploads since January 2023 and are now being retroactively added to the backlog via the Big Batch.

image

flsimoes commented 1 year ago

And here's one with an ENA ID

image

Source:

article - https://tb.plazi.org/GgServer/summary/67368F227C03FFCAFB44FF8FFFEFAE3B

treatment - https://tb.plazi.org/GgServer/html/9B0FF75A7C00FFCEFBD3F8C4FC83AEC3

tcatapano commented 1 year ago

Using current TaxPub markup, <object-id> (https://taxpub.org/v1-0/taglibrary/index.html#p=elem-object-id) in <taxon-name> seems like the best option; so:

<tp:taxon-name>
<object-id content-type="taxon-name" object-id-type="col-id">BP8N3</object-id>
B. abyssinica subsp. abyssinica</tp:taxon-name>

tcatapano commented 1 year ago

Current sample does not contain CoL ids, so this issue will have to wait for development
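For reference, a rough sketch of how the proposed object-id markup could be produced in code. It only mirrors the example above; the helper name is hypothetical, and the namespace URI bound to the `tp` prefix is an assumption that should be checked against the TaxPub schema in use.

```python
import xml.etree.ElementTree as ET

# Assumption: namespace URI for the "tp" prefix; verify against the TaxPub DTD/schema.
TP_NS = "http://www.plazi.org/taxpub"
ET.register_namespace("tp", TP_NS)


def taxon_name_with_col_id(name_text, col_id):
    """Build <tp:taxon-name> with a leading <object-id> holding the CoL ID."""
    taxon_name = ET.Element(f"{{{TP_NS}}}taxon-name")
    object_id = ET.SubElement(
        taxon_name,
        "object-id",
        {"content-type": "taxon-name", "object-id-type": "col-id"},
    )
    object_id.text = col_id
    # The name string follows the <object-id> child, so it is stored as its tail.
    object_id.tail = name_text
    return taxon_name


taxpub_xml = ET.tostring(
    taxon_name_with_col_id("B. abyssinica subsp. abyssinica", "BP8N3"),
    encoding="unicode",
)
print(taxpub_xml)
```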

myrmoteras commented 1 year ago

@flsimoes @gsautter can you explain why the CoL IDs are missing in this example? Is adding CoL IDs something new that is being done as part of the batch? I seem to remember that CoL IDs only became stable sometime this year?

gsautter commented 1 year ago

> @flsimoes @gsautter can you explain why the CoL IDs are missing in this example? Is adding CoL IDs something new that is being done as part of the batch? I seem to remember that CoL IDs only became stable sometime this year?

Which example are you talking about?

It's true that we've only been linking the treatment taxa and cited taxa to CoL since the start of this year, mainly, as you correctly say, because the CoL name IDs have only been stable since late 2022. The linking of new articles is part of our standard batch processing now (since early October), and the old IMFs have been linked by the Big Batch. On top of that, whenever a treatment comes into SRS (new or as an update) and the taxon or a cited taxon is missing the CoL ID, there is a lookup and the ID gets added at that point and written back to the IMF via the link write-back mechanism. All these in-routes for the links are alternatives, and all lead to the same result.

The linking of treatments that come into SRS originally was sort of preempting the Big Batch, and also serves as a means of adding the links after the fact, as at the time of the original IMF import, CoL might not have an ID for a given name just yet (and how could it in case of a newly published original description or new combination) ... this is why the linking on the way into SRS will stay active.

flsimoes commented 1 year ago

I'm guessing Donat refers to Terry's "Current sample does not contain CoL ids" which, if memory serves, is from the list of papers I sent him during the last sprint, which means pre-Big Batch.

gsautter commented 1 year ago

> I'm guessing Donat refers to Terry's "Current sample does not contain CoL ids" which, if memory serves, is from the list of papers I sent him during the last sprint, which means pre-Big Batch.

Well, in that light, from more recent memory (Geneva in early November), the test set for articles is mainly focused on documents that don't contain treatments ... But since all the current linking mechanisms only ever go at treatment taxa and cited taxa (treatment citations and type taxa), it should be easy to extrapolate that non-treatment names don't get linked right now ... which explains why there are no CoL links in documents that don't have treatments in them.

We can change that policy, of course, but please keep in mind that the taxon names that currently don't get linked are also subject to far less scrutiny in QC, and have lower error severities as well, so linking might not be as reliable if we don't also change (increase) outside-treatment taxon name QC.

myrmoteras commented 1 year ago

only treatment taxa are linked with a COL-ID

all the rest is not linked.

gsautter commented 1 year ago

Taxon names in treatment citations are linked as well, as are type species ... the only thing strictly restricted to treatment taxa is the additional link to the ENA/NCBI taxonomic backbone.

The cited effort mainly pertains to the huge number of API lookups (to ChecklistBank), not to actual computations to be made ... the main objective of the current restriction to treatment taxa, treatment citations, and the likes of type species is to reduce the barrage on the ChecklistBank API.
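The linking policy described in this comment can be summarized in a small sketch. The role names and the `links_for` function are made up for illustration; only the rules themselves (CoL links for treatment taxa, treatment citations, and type taxa; the ENA/NCBI link for treatment taxa only; nothing for other names) come from the thread.

```python
from enum import Enum


class NameRole(Enum):
    """Hypothetical labels for where a taxon name occurs in a document."""
    TREATMENT_TAXON = "treatment-taxon"
    TREATMENT_CITATION = "treatment-citation"
    TYPE_TAXON = "type-taxon"
    OTHER = "other"  # any name outside the roles above


# Roles that currently get a CoL link, as described in the thread.
COL_LINKED_ROLES = {
    NameRole.TREATMENT_TAXON,
    NameRole.TREATMENT_CITATION,
    NameRole.TYPE_TAXON,
}


def links_for(role):
    """Return the set of link types a name with this role would receive."""
    links = set()
    if role in COL_LINKED_ROLES:
        links.add("col")
    # The additional ENA/NCBI link is restricted to treatment taxa.
    if role is NameRole.TREATMENT_TAXON:
        links.add("ena")
    return links


for role in NameRole:
    print(role.value, sorted(links_for(role)))
```

Extending linking to all names would amount to widening `COL_LINKED_ROLES`, at the cost of more ChecklistBank lookups.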

myrmoteras commented 10 months ago

@gsautter I don't understand the argument about lookups. I thought we have a local version of CLB and especially COL and thus would not have to use external lookups?

gsautter commented 10 months ago

> @gsautter I don't understand the argument about lookups. I thought we have a local version of CLB and especially COL and thus would not have to use external lookups?

We do have a local version of CoL, yes, built every year from the annual version ... however, CLB might well get ahead, so a miss in the local CoL needs following up with a CLB lookup.

The ENA taxon ID is yet another thing, and always requires a CLB lookup, as said mapping is subject to change to too high a degree to include it in CoL local, and it would also inflate the data structure, and needlessly so for all applications except for this specific type of lookup ... I'm always thinking about Jeremy's laptop in this sort of context: CoL local has to stay sufficiently slim to work on such machines ...
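The lookup order described above (local CoL first, ChecklistBank only on a miss, and ChecklistBank always for the ENA taxon ID) might be sketched as follows. All function names, the dictionary standing in for the local CoL, and the placeholder names and IDs are assumptions for illustration; the real ChecklistBank API is not modelled, and the remote lookups are passed in as plain callables.

```python
from typing import Callable, Dict, Optional


def resolve_col_id(
    name: str,
    local_col: Dict[str, str],
    clb_lookup: Callable[[str], Optional[str]],
) -> Optional[str]:
    """Return a CoL ID: try the local annual snapshot, then fall back to CLB."""
    col_id = local_col.get(name)
    if col_id is not None:
        return col_id
    # Miss in the local CoL: CLB may already be ahead of the annual version.
    return clb_lookup(name)


def resolve_ena_id(
    name: str,
    clb_ena_lookup: Callable[[str], Optional[str]],
) -> Optional[str]:
    """ENA taxon IDs are not kept locally, so this always goes to CLB."""
    return clb_ena_lookup(name)


# Usage with placeholder data and a stub in place of the remote service.
local_col = {"Aus bus": "LOCAL-ID-1"}
remote_calls = []


def fake_clb_lookup(name):
    remote_calls.append(name)
    return {"Cus dus": "REMOTE-ID-2"}.get(name)


print(resolve_col_id("Aus bus", local_col, fake_clb_lookup))
print(resolve_col_id("Cus dus", local_col, fake_clb_lookup))
print(remote_calls)
```

Passing the remote lookup in as a callable keeps the local-first logic testable without network access.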