Open myrmoteras opened 1 year ago
Hi Terry, To recap the status of the GGXML to JATS conversion:
The current version of the stylesheet is here:
https://github.com/plazi/ggxml2taxpub/blob/master/xslt/gg2tp_l1.xsl
And recent reports on validation errors in the results are here:
https://github.com/plazi/ggxml2taxpub/blob/master/errs/sample_500_errors_20240312_frq.txt
https://github.com/plazi/ggxml2taxpub/blob/master/errs/sample_500_errors_20240312_per-article.txt
https://github.com/plazi/ggxml2taxpub/blob/master/errs/sample_500_errors_20240312.txt
some cursory analysis of the errors shows them to be
Given that, I suggest moving forward and implementing a conversion pipeline using the current state of the transformation. Ideally the pipeline would include a post-transformation validation phase that passes valid instances through and routes invalid ones elsewhere for further analysis. It's extremely unlikely that the conversion will ever be close to 100% correct, so it's essential to introduce error handling into the workflow. Further, it's likely that the source GGXML will sometimes undergo revision, which may or may not require its own handling procedure.
Done, I just took the XSLT live, and with it the export ... as stated earlier, once the XSLT is a go, it's down to adding a config file entry and a server restart.
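To make the proposed post-transformation validation phase concrete, here is a minimal Python sketch of the routing step. It is not the exporter's actual code. The validator `looks_like_article` is a hypothetical stand-in that only checks well-formedness and the root element; a real pipeline would validate against the JATS/TaxPub schema instead. The document ids and sample XML are made up for illustration.

```python
import xml.etree.ElementTree as ET
from typing import Callable, Dict, List, Tuple


def looks_like_article(xml_text: str) -> bool:
    """Stand-in validator (assumption): well-formed XML with an <article> root.

    Real validation would check the result against the JATS/TaxPub schema.
    """
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return root.tag == "article"


def route(
    results: Dict[str, str], is_valid: Callable[[str], bool]
) -> Tuple[List[str], List[str]]:
    """Split transformation results into (valid ids, invalid ids)."""
    valid: List[str] = []
    invalid: List[str] = []
    for doc_id, xml_text in results.items():
        (valid if is_valid(xml_text) else invalid).append(doc_id)
    return valid, invalid


# Made-up transformation results keyed by a hypothetical document id.
sample = {
    "doc-1": "<article><front/></article>",
    "doc-2": "<article><front></article>",  # not well-formed
    "doc-3": "<treatment/>",                # wrong root element
}
passed, failed = route(sample, looks_like_article)
```

Keeping the validator as a parameter means the routing logic stays the same when the stand-in check is swapped for real schema validation.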
The newly added exporter is now ingesting the article collection and pushing out all articles that (a) don't have TaxPub originals (which are arguably preferable over the GG XML round-tripped version), (b) have no gatekeeper objections, and (c) transform into valid TaxPub. The (b) and (c) conditions/hindrances will show up in the data transit statistics, so we'll have a list of the documents that fail to convert (condition c), as well as ones that need a few more QC checks before we deem them ready to go (condition b).
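The three export conditions above, and how they could feed the transit statistics, can be sketched as follows. This is an illustration only: the `Article` fields, the outcome labels, and the function names are assumptions, not the exporter's real data model.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Article:
    # All field names are hypothetical, chosen to mirror conditions (a)-(c).
    doc_id: str
    has_taxpub_original: bool      # (a) a TaxPub original exists
    gatekeeper_objections: int     # (b) number of open QC objections
    valid_taxpub: bool             # (c) transformation result validates


def export_decision(a: Article) -> str:
    """Return an outcome label for one article, checking (a), (b), (c) in order."""
    if a.has_taxpub_original:
        return "skipped_has_original"
    if a.gatekeeper_objections > 0:
        return "blocked_by_gatekeeper"
    if not a.valid_taxpub:
        return "failed_validation"
    return "exported"


def transit_stats(articles: Iterable[Article]) -> Counter:
    """Count outcomes so blocked and failed documents can be listed later."""
    return Counter(export_decision(a) for a in articles)


# Made-up sample articles.
stats = transit_stats([
    Article("a1", False, 0, True),
    Article("a2", True, 0, True),
    Article("a3", False, 2, True),
    Article("a4", False, 0, False),
])
```

Recording one label per document, rather than a plain yes/no, is what makes it possible to separate the documents that fail to convert from the ones that only need more QC.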
However, isn't it the case that we decided not (or not only) to provide them with JATS TaxPub, but BIOC JSON -- or at least the JSON was the preference? The JATS would be supplementary and also added to BLR as an additional file format. The BIOC should probably be generated from the GGXML anyway, relying as it does on offsets. So in the big picture the missing piece is the GGXML to BIOC conversion, so should that not be the priority? While I'm aware of the basic BIOC structure (we talked about it in March), a specification of what exactly to include would be highly appreciated ... we just had the case that full GG XML with all its details was too much for WADM, and I don't want the BIOC export to run into the same issue ...
So, let me know which annotations and attributes to include, and maybe an example (e.g. based upon a treatment with a good deal of details, like https://tb.plazi.org/GgServer/html/03A8FF2FFD36FFE31FB2EB1EFDE6F7CB or https://tb.plazi.org/GgServer/html/C63D87EE5417FFFFFDCFFB34FAA0E778)
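As a starting point for that specification, here is a rough Python sketch of how an offset-based BIOC JSON passage could be built with a whitelist of annotation types. It follows the general BioC JSON layout (passages with `offset` and `text`, annotations with `infons` and `locations`) as I understand it. The whitelist contents, the `(start, end, type)` input format, and the sample sentence are assumptions for illustration, not an agreed specification.

```python
import json
from typing import Dict, List, Tuple

# Hypothetical whitelist: which annotation types make it into the export.
INCLUDED_TYPES = {"taxonomicName"}


def to_bioc_passage(
    text: str,
    annotations: List[Tuple[int, int, str]],
    passage_offset: int = 0,
) -> Dict:
    """Build one BioC-style passage from (start, end, type) character spans."""
    kept = []
    for i, (start, end, ann_type) in enumerate(annotations):
        if ann_type not in INCLUDED_TYPES:
            continue  # dropped to keep the export small
        kept.append({
            "id": str(i),
            "infons": {"type": ann_type},
            "text": text[start:end],
            # BioC locations are offsets into the whole document.
            "locations": [{"offset": passage_offset + start,
                           "length": end - start}],
        })
    return {"offset": passage_offset, "infons": {}, "text": text,
            "annotations": kept, "relations": []}


# Made-up sentence and spans; the second type is filtered out by the whitelist.
sentence = "Myrmoteras workers were collected in Borneo."
country_start = sentence.index("Borneo")
spans = [
    (0, len("Myrmoteras"), "taxonomicName"),
    (country_start, country_start + len("Borneo"), "collectingCountry"),
]
passage = to_bioc_passage(sentence, spans, passage_offset=100)
print(json.dumps(passage, indent=2))
```

Once the list of annotations and attributes is agreed, only `INCLUDED_TYPES` and the `infons` content would need to change.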
Best, Guido
Hi Donat, Sorry for the cryptic message. It says that Guido is waiting for a new version of the XSLT to produce the journal JATS that we are supposed to provide to SIBiLS, so that we can progress with the conversion pipeline. It was not about another meeting.
We did decide to provide a JATS version, and to also provide BIOC if we can produce it, which we thought was doable.
I will schedule a meeting next Monday 5pm to discuss how to coordinate the production of JATS XML for journals.
Is this ok?
Sure, even though it is possibly no longer required, as I just took the TaxPub article export live (see previous mail) ...
If we need to provide BIOC as well, please also refer to my previous mail: which annotations to include, and which attributes? We just had the GG XML overload in WADM, and I don't exactly expect BIOC to be less verbose, so using the two example treatments I linked in my previous mail to create an example BIOC with all the desired detail would be great.
Best,
issue
For this we need a minimal level of annotations. Which one is this?
goal
define a minimal level of JATS/TaxPub for articles in general, similar to what we have for treatment TaxPub for SIBiLS
solution
dependencies