plazi / ggxml2taxpub

Conversion of GoldenGATE XML to JATS/TaxPub at treatment level

analyze errors in non-taxonomic tp articles #58

Open tcatapano opened 1 year ago

tcatapano commented 1 year ago

see TaxPub files in directory level1/articles/non-tax

and the Oxygen validation error report at errors/non-tax_errors_20230611.txt

many of these are <sec> elements without a <title>, or character data not wrapped in a <p>, often in the article's preliminary material.

the files were generated by the xslt transform xslt/gg2jats-article_l1.xsl from the sources in /ggxml/articles/non-tax
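for quick triage outside Oxygen, the two error classes named above (a <sec> without a <title>, and text sitting directly inside a <sec> instead of in a <p>) can be spotted with a small script. this is only a rough sketch using Python's standard library, not the validation the report is based on; the function name `find_sec_problems` and the sample document are made up for illustration, and it only looks at <sec> elements.

```python
import xml.etree.ElementTree as ET


def find_sec_problems(xml_text):
    """Return (problem, path) tuples for <sec> elements in a JATS-like document.

    Hypothetical helper: it checks only the two patterns described in this
    issue and is no substitute for validating against the TaxPub DTD.
    """
    root = ET.fromstring(xml_text)
    problems = []

    def walk(elem, path):
        here = f"{path}/{elem.tag}"
        if elem.tag == "sec":
            # Pattern 1: no <title> child.
            if elem.find("title") is None:
                problems.append(("sec-without-title", here))
            # Pattern 2: text directly inside <sec>, either before the
            # first child (elem.text) or after any child (child.tail).
            stray = [elem.text] + [child.tail for child in elem]
            if any(t and t.strip() for t in stray):
                problems.append(("text-outside-p", here))
        for child in elem:
            walk(child, here)

    walk(root, "")
    return problems


# Made-up sample: the first <sec> is fine, the second shows both patterns.
SAMPLE = """<article><body>
<sec><title>Intro</title><p>ok</p></sec>
<sec>loose text<p>wrapped</p></sec>
</body></article>"""

for problem, path in find_sec_problems(SAMPLE):
    print(problem, path)
```

running it over the files in level1/articles/non-tax would need a loop over the directory and file reading, left out here.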

tcatapano commented 1 year ago

see frequency list of validation errors at [errs/non-tax_errors_20230611_frq.txt](https://github.com/plazi/ggxml2taxpub/blob/master/errs/non-tax_errors_20230611_frq.txt)

top 5:

tcatapano commented 1 year ago

helpfully per @gsautter regarding structure of source XML:

Every paragraph should be inside either a subSection or a treatment and subSubSection in normal GG-XML ... some with an intermediate caption, footnote, or bibRef ... long as those cases are covered, we should be OK.

one more element that can sit between a paragraph and its parent subSection or subSubSection: keyStep, marking a step of a taxonomic key that consists of two (rarely 3) keyLeads, each of which is its own paragraph ...

The general statement holds, though: apart from the MODS header, the whole document is segmented into paragraphs with no gaps in between, and above that there's an overlay of subSections and treatments, likewise without gaps between them. In between these two layers, there can be further structural elements like subSubSections, bibRefs, captions, footnotes, or keySteps (the latter possibly inside a subSubSection), each consisting of one or more whole paragraphs ... all the detail markup is inside the individual paragraphs.

The core idea of this rather verbose skeleton is that the gizmos that create the markup are as independent as possible: why should a tagger for geo-coordinates need to know what a materials citation or treatment is? The paragraphs are that one extremely generic element all gizmos use either (a) as the basis for aggregating larger units (taggers for subSections and treatments, for bibRefs, for treatment subSubSections, etc.) or (b) as fixed boundaries not to overshoot with the details they mark (taggers for taxon names, for citations of all sorts, for coordinates, etc.).
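to check whether a given source file follows the layering described above, something like the following might help. it is a rough sketch under assumptions: the element names are taken from the description, but the root element name `document` in the sample, the function name `uncovered_paragraphs`, and the choice to treat subSubSection, caption, footnote, bibRef and keyStep as freely nestable intermediates are guesses, not the GG-XML schema. the MODS header is left out of the sample.

```python
import xml.etree.ElementTree as ET

# Top layer: every paragraph should ultimately sit inside one of these.
CONTAINERS = {"subSection", "treatment"}
# Elements that may sit between a paragraph and its container
# (assumed here to nest in any order, which is looser than the real rule).
INTERMEDIATES = {"subSubSection", "caption", "footnote", "bibRef", "keyStep"}


def uncovered_paragraphs(xml_text):
    """Return paths of <paragraph> elements not covered by the layering."""
    root = ET.fromstring(xml_text)
    bad = []

    def walk(elem, ancestors):
        if elem.tag == "paragraph":
            # Skip upward over intermediate elements, then expect a container.
            chain = list(ancestors)
            while chain and chain[-1] in INTERMEDIATES:
                chain.pop()
            if not chain or chain[-1] not in CONTAINERS:
                bad.append("/".join(ancestors + [elem.tag]))
            return  # detail markup inside a paragraph is not inspected
        for child in elem:
            walk(child, ancestors + [elem.tag])

    walk(root, [])
    return bad


# Made-up sample: two covered cases and one orphan paragraph.
SAMPLE = """<document>
<subSection><paragraph>a</paragraph></subSection>
<treatment><subSubSection><keyStep>
<paragraph>lead 1</paragraph><paragraph>lead 2</paragraph>
</keyStep></subSubSection></treatment>
<paragraph>orphan</paragraph>
</document>"""

print(uncovered_paragraphs(SAMPLE))  # ['document/paragraph']
```

paragraphs reported by such a check would be the ones a transform written against the stated structure is likely to miss.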