plazi / ggxml2taxpub

Conversion of GoldenGATE XML to JATS/TaxPub at treatment level

minimal JATS/Taxpub for article #60

Open myrmoteras opened 9 months ago

myrmoteras commented 9 months ago

issue

  1. A large number of biodiversity publications, especially those with no treatments, could be processed with a minimal level of annotation so that they can be imported into biodiversityPMC.
  2. Copyrighted articles cannot be used in biodiversityPMC. However, as in Medline and PubMed, they could be imported by including only metadata, front matter, abstract, and bibliographic references.

For this we need a minimal level of annotation. What should it be?

goal

Define a minimal level of JATS/TaxPub for articles in general, similar to what we have for treatment TaxPub for SIBiLS.

solution

  1. add license in the metadata
  2. for abstract
  3. for articles with closed access, SIBiLS will convert the JATS to show only the parts that are shown in PubMed, that is, the abstract and bibliographic references
  4. ctd
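
A minimal sketch of the closed-access reduction in item 3, assuming the input is plain JATS with `<front>`, `<body>` and `<back>` as direct children of `<article>`. The function name `reduce_to_minimal_jats`, the sample document, and the `license-type="closed"` value are hypothetical; which elements SIBiLS actually keeps is still to be defined in this issue.

```python
import xml.etree.ElementTree as ET


def reduce_to_minimal_jats(jats_xml: str) -> str:
    """Keep front matter (metadata, abstract) and references; drop the full text."""
    article = ET.fromstring(jats_xml)
    # Remove <body>: the full text is not exposed for closed-access articles.
    for body in article.findall("body"):
        article.remove(body)
    # In <back>, keep only the reference list(s).
    back = article.find("back")
    if back is not None:
        for child in list(back):
            if child.tag != "ref-list":
                back.remove(child)
    return ET.tostring(article, encoding="unicode")


# Hypothetical sample article, for illustration only.
SAMPLE = """<article>
  <front><article-meta>
    <title-group><article-title>Example</article-title></title-group>
    <permissions><license license-type="closed"><license-p>All rights reserved</license-p></license></permissions>
    <abstract><p>Short summary.</p></abstract>
  </article-meta></front>
  <body><sec><p>Full text.</p></sec></body>
  <back><ack><p>Thanks.</p></ack><ref-list><ref id="B1"><mixed-citation>Ref 1</mixed-citation></ref></ref-list></back>
</article>"""

minimal = reduce_to_minimal_jats(SAMPLE)
```

The license stays in `<permissions>` inside the front matter, so a downstream consumer can still tell why the body is absent.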

dependencies

myrmoteras commented 7 months ago

  1. Add in the minimal JATS abstract

myrmoteras commented 1 week ago

Hi Terry, To recap the status of the GGXML to JATS conversion:

The current version of the stylesheet is here:

https://github.com/plazi/ggxml2taxpub/blob/master/xslt/gg2tp_l1.xsl

Recent reports on validation errors in the results are here:

https://github.com/plazi/ggxml2taxpub/blob/master/errs/sample_500_errors_20240312_frq.txt
https://github.com/plazi/ggxml2taxpub/blob/master/errs/sample_500_errors_20240312_per-article.txt
https://github.com/plazi/ggxml2taxpub/blob/master/errs/sample_500_errors_20240312.txt

some cursory analysis of the errors shows them to be

Given that, I suggest moving forward to implement a conversion pipeline using the current state of the transformation. Ideally the pipeline would include a post-transformation validation phase that passes valid instances through and routes invalid ones elsewhere for further analysis. It's extremely unlikely that the conversion will ever be close to 100% correct, so it's essential to introduce error handling into the workflow. Further, it's likely that the source GGXML will sometimes undergo revision, which might or might not require developing a procedure to handle it.

Done, just took the XSLT live, and with it the export ... as stated earlier, once the XSLT is a go, it's down to adding a config file entry and a server restart.
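
The post-transformation validation phase suggested above could be sketched roughly as follows. This is only an illustration: `placeholder_validator` and `route` are hypothetical names, and the placeholder merely checks well-formedness and the root element, whereas a real pipeline would validate against the JATS/TaxPub schema.

```python
import xml.etree.ElementTree as ET
from typing import Callable, Dict, List, Tuple


def placeholder_validator(xml_text: str) -> Tuple[bool, str]:
    """Stand-in check (assumption): well-formed XML with an <article> root."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError as err:
        return False, f"not well-formed: {err}"
    if root.tag != "article":
        return False, f"unexpected root element <{root.tag}>"
    return True, ""


def route(
    results: Dict[str, str],
    validate: Callable[[str], Tuple[bool, str]] = placeholder_validator,
) -> Tuple[List[str], Dict[str, str]]:
    """Split transformation results into passed IDs and failed IDs with reasons."""
    passed: List[str] = []
    failed: Dict[str, str] = {}
    for doc_id, xml_text in results.items():
        ok, reason = validate(xml_text)
        if ok:
            passed.append(doc_id)
        else:
            failed[doc_id] = reason  # kept for later error analysis
    return passed, failed


# Hypothetical transformation outputs keyed by document ID.
passed, failed = route({
    "doc-1": "<article><front/></article>",
    "doc-2": "<article><front></article>",  # mismatched tag
    "doc-3": "<treatment/>",                # wrong root element
})
```

Passing the validator in as a parameter keeps the routing logic unchanged when the placeholder is swapped for a schema-based check.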

The newly added exporter is now ingesting the article collection and pushing out all articles that (a) don't have TaxPub originals (which are arguably preferable over the GG XML round-tripped version), (b) have no gatekeeper objections, and (c) transform into valid TaxPub. The (b) and (c) conditions/hindrances will show up in the data transit statistics, so we'll have a list of the documents that fail to convert (condition c), as well as ones that need a few more QC checks before we deem them ready to go (condition b).
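
For illustration, conditions (a) to (c) and the resulting transit statistics might look like the sketch below. The `Article` fields, the outcome labels, and the sample records are hypothetical and do not reflect the actual exporter code.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Article:
    doc_id: str
    has_taxpub_original: bool         # condition (a)
    gatekeeper_objections: int        # condition (b)
    transforms_to_valid_taxpub: bool  # condition (c)


def export_decision(article: Article) -> str:
    """Return an outcome label; only 'exported' articles are pushed out."""
    if article.has_taxpub_original:
        return "skipped_taxpub_original"
    if article.gatekeeper_objections > 0:
        return "held_by_gatekeeper"
    if not article.transforms_to_valid_taxpub:
        return "failed_conversion"
    return "exported"


# Hypothetical records, one per outcome.
articles = [
    Article("A", has_taxpub_original=True, gatekeeper_objections=0, transforms_to_valid_taxpub=True),
    Article("B", has_taxpub_original=False, gatekeeper_objections=2, transforms_to_valid_taxpub=True),
    Article("C", has_taxpub_original=False, gatekeeper_objections=0, transforms_to_valid_taxpub=False),
    Article("D", has_taxpub_original=False, gatekeeper_objections=0, transforms_to_valid_taxpub=True),
]

# Tally outcomes so that (b) and (c) failures show up as separate counts.
stats = Counter(export_decision(a) for a in articles)
```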

However, isn't it the case that we decided not (or not only) to provide them with JATS TaxPub, but BIOC JSON -- or at least the JSON was the preference? The JATS would be supplementary and also added to BLR as an additional file format. The BIOC should probably be generated from the GGXML anyway, relying as it does on offsets. So in the big picture the missing piece is the GGXML to BIOC conversion, so should that not be the priority? While I'm aware of the basic BIOC structure (we talked about it in March), a specification of what exactly to include would be highly appreciated ... we just had the case that full GG XML with all its details was too much for WADM, and I don't want the BIOC export to run into the same issue ...

So, let me know which annotations and attributes to include, and maybe an example (e.g. based upon a treatment with a good deal of details, like https://tb.plazi.org/GgServer/html/03A8FF2FFD36FFE31FB2EB1EFDE6F7CB or https://tb.plazi.org/GgServer/html/C63D87EE5417FFFFFDCFFB34FAA0E778)
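
As a starting point for that discussion, here is a rough sketch of the general BioC JSON shape (collection, documents, passages, annotations with offset-based locations). The sentence, the annotation types `taxonomicName` and `location`, the document ID, and the helper `make_annotation` are placeholders; which annotations and attributes to include is exactly what still needs to be specified.

```python
import json


def make_annotation(ann_id, passage_text, passage_offset, surface, ann_type):
    """Build one annotation whose location is computed from the passage text."""
    start = passage_text.index(surface)  # first occurrence; ValueError if absent
    return {
        "id": ann_id,
        "infons": {"type": ann_type},
        "text": surface,
        # Offsets are counted from the start of the document, so the
        # passage offset is added to the position inside the passage.
        "locations": [{"offset": passage_offset + start, "length": len(surface)}],
    }


# Placeholder sentence with a dummy taxon name, not real data.
text = "Aus bus was found in Exampleland."
passage = {
    "offset": 0,
    "infons": {"type": "paragraph"},
    "text": text,
    "annotations": [
        make_annotation("T1", text, 0, "Aus bus", "taxonomicName"),
        make_annotation("T2", text, 0, "Exampleland", "location"),
    ],
}
collection = {
    "source": "example",
    "date": "",
    "key": "",
    "documents": [{"id": "example-doc", "passages": [passage]}],
}
bioc_json = json.dumps(collection, indent=2)
```

Keeping annotations down to a type, the surface text, and one location each is one way to avoid the verbosity problem mentioned for WADM; further attributes could be added to `infons` per annotation type once they are agreed.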

Best, Guido

myrmoteras commented 1 week ago

Hi Donat, Sorry for this cryptic message. It says that Guido is waiting for a new version of the XSLT to create the journal JATS that we are supposed to produce and provide to SIBiLS, so that we can progress with the conversion pipeline. It was not about another meeting.

We did decide to provide a JATS version, and to also provide BioC if we can produce it, which we thought was doable.

I will schedule a meeting next Monday at 5pm to discuss how to coordinate the production of JATS XML for journals.

  1. We need to progress on creating the journal JATS and exporting it to SIBiLS.
  2. We need to decide whether we want to export directly to SIBiLS or add the files to BLR, as we already do for the XML. In this context, we also need to discuss whether we want to add RDF versions. All of this requires thinking about adding new versions of the files to BLR when they are changed in TB.
  3. We need to create a non-email channel to communicate, and a regular meeting for this workflow.
  4. We need to write down explicit goals and next steps. Where?
  5. We need to decide who is involved: Terry, Guido, Felipe (QC and fixing errors, as we do already), Julien (import into SIBiLS), cc'ed Patrick, Dona
  6. We need to decide who is in charge of getting this workflow in place.

Is this ok?

Sure, even though possibly no longer required, as I just took the TaxPub article export live (see previous mail) ...

As for providing BIOC as well, please also refer to my previous mail ... which annotations to include, which attributes? We just had the GG XML overload in WADM, and I don't exactly expect BIOC to be less verbose, so using the two example treatments I linked in my previous mail to create an example BIOC with all the desired detail would be great.

Best,