plazi / arcadia-project

documentation of GGI XML terms #211

Open myrmoteras opened 1 year ago

myrmoteras commented 1 year ago

@tcatapano @punkish @slint @lnielsen

We need to document the GGI XML terms so that we know which ones to use and when, their parents and children, the syntax, the spelling, etc.

What is the best way to do so?

Examples are

Can you please advise?

tcatapano commented 1 year ago

An important issue to deal with is that GGI-XML is open-ended and not formally defined by a schema. One can generate a de facto schema from actually existing instances, but it will vary depending on what is in the set of source instances, and it will permit wide variability, distorted by outlier patterns. A better approach might be a posteriori rather than a priori: defining and documenting the patterns which consumers of GGI-XML rely on. These could probably be expressed in Schematron.
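To illustrate how a de facto schema depends on the source instances, here is a minimal Python sketch that counts parent/child element pairs across a set of documents. The function names and the two sample documents are made up for illustration (including the outlier element `para`); this is not an existing GGI tool.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def local_name(tag):
    # Strip a "{namespace}" prefix if present, e.g. "{uri}title" -> "title".
    return tag.rsplit("}", 1)[-1]


def parent_child_counts(xml_strings):
    """Count how often each (parent, child) element pair occurs across instances."""
    counts = Counter()
    for xml_text in xml_strings:
        root = ET.fromstring(xml_text)
        for parent in root.iter():
            for child in parent:
                counts[(local_name(parent.tag), local_name(child.tag))] += 1
    return counts


# Two made-up instances; the second contains an outlier element, <para>.
samples = [
    "<document><treatment><subSubSection><paragraph/></subSubSection></treatment></document>",
    "<document><treatment><subSubSection><paragraph/><para/></subSubSection></treatment></document>",
]

counts = parent_child_counts(samples)
for (parent, child), n in sorted(counts.items()):
    print(f"{parent} -> {child}: {n}")
```

A schema generated from the second instance alone would have to permit `para` inside `subSubSection`, which is the kind of distortion by outliers described above. The counts at least make such rare patterns visible.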

gsautter commented 1 year ago

> One can generate a de facto schema from actually existing instances, but it will vary depending on what is in the set of source instances

Well, that should be fairly stable, at least for the basic structural elements document, subSection, treatment, subSubSection, caption, footnote, bibRef, and paragraph, as well as table, tr, and td inside paragraph, and mods:* atop the document. In fact, the "big batch" we've been planning since last year is, among other things, intended to pretty much enforce the above nesting.

@tcatapano, you are right in that the remaining elements will fluctuate and vary, pretty much dependent upon the semantics we are (or any user is) modelling, and upon how these semantics are expressed in the source document. In a sense, the content we are marking dictates the structure of the markup, not the other way around, as we apply the markup a posteriori to publications created by third parties. As you say, Schematron should be a nice way of checking this basic structure, and the QC rules in fact do something very similar. The closed-world approach of an XML schema just doesn't lend itself well to content whose structure we don't control, nor to the rather frequent addition of new elements (and even more attributes) for marking new types of detail content.
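As a rough illustration of such an open-world check, the Python sketch below tests only a few nesting rules and ignores every element it has no rule for, so new elements never cause failures. The rule table is an assumption: only "table, tr, and td inside paragraph" comes from this thread, while tr inside table and td inside tr follow the usual HTML table model. It is not how the actual QC rules are implemented.

```python
import xml.etree.ElementTree as ET

# Hypothetical rules: element name -> an ancestor it must have.
# Elements without an entry are not checked at all (open world).
REQUIRED_ANCESTOR = {
    "table": "paragraph",
    "tr": "table",
    "td": "tr",
}


def check_ancestors(xml_text, rules=REQUIRED_ANCESTOR):
    """Return a list of messages for elements that lack their required ancestor."""
    problems = []

    def walk(element, ancestors):
        required = rules.get(element.tag)
        if required is not None and required not in ancestors:
            problems.append(f"<{element.tag}> has no <{required}> ancestor")
        for child in element:
            walk(child, ancestors + [element.tag])

    walk(ET.fromstring(xml_text), [])
    return problems


# Made-up documents; <newDetail> stands for any element the rules do not know.
good = (
    "<document><treatment><subSubSection><paragraph>"
    "<table><tr><td>x</td></tr></table>"
    "</paragraph><newDetail/></subSubSection></treatment></document>"
)
bad = "<document><treatment><table><tr><td>x</td></tr></table></treatment></document>"

print(check_ancestors(good))
print(check_ancestors(bad))
```

The unknown `newDetail` element passes silently, whereas a closed-world schema would reject it until it was added to the schema.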

tcatapano commented 1 year ago

Yes, it's not like there is completely wide variance in the major structural elements. The names are regular (it's not like there are elements named para and p meaning the same as paragraph, etc.). Also, tables generally follow the HTML table model, and the document metadata uses MODS, with a fairly regular profile. We could define and document a core element dictionary and a "skeleton" tagset with the usual patterns, and perhaps even provide some Schematron schemas. We should add this to the agenda for the sprint. I'd spend time on it.

gsautter commented 1 year ago

Even if we end up with some irregularities in the element naming, we can always use the server-side batches for a swift mass cleanup.

punkish commented 1 year ago

As someone who depends not just upon a stable set of tags, but also on knowing what each of those tags stands for, I would find it very desirable to have some kind of canonical data (tag) dictionary. I realize that the "schema" is open-ended, and new tags can and will be added, but at least knowing what tags exist "so far" (so far with respect to any point in time) would be tremendously helpful.

To some extent, https://test.zenodeo.org provides the available "tags" and their definitions and descriptions, as the columns in the Zenodeo db map exactly to the XML tag names.

@gsautter very helpfully provided me with an exhaustive list of all the tags and their attribs used so far, but that has a lot of historical tags with spelling errors or ones that were used only once or twice, and hence, are not very useful, at least to me. For me, also, the HTML tags are not of much use. I am interested only in the semantically meaningful (non-HTML) tags.
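A minimal Python sketch of the kind of filtering I mean: drop tags that occur only a handful of times and skip the HTML-style tags. The tag names, the counts, the threshold, and the (incomplete) HTML tag set are all made up for illustration; they are not taken from the actual list.

```python
from collections import Counter

# Assumed, incomplete set of HTML-style tags to skip.
HTML_TAGS = {"table", "tr", "td"}


def meaningful_tags(tag_counts, min_count=3, html_tags=HTML_TAGS):
    """Keep tags used at least min_count times that are not HTML-style tags."""
    return sorted(
        tag
        for tag, n in tag_counts.items()
        if n >= min_count and tag not in html_tags
    )


# Made-up usage counts, including two misspelled one-off tags.
tag_counts = Counter({
    "taxonomicName": 5400,
    "materialsCitation": 1200,
    "bibRef": 800,
    "td": 9000,
    "taxonomicNmae": 1,
    "quantitiy": 2,
})

print(meaningful_tags(tag_counts))
```

The threshold is a blunt instrument: a legitimate but rare tag would be dropped along with the misspellings, so the result would still need a manual review before it became a dictionary.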

Most of the meaningful tags are available in every XML at the very top in the document tag. Perhaps that could be firmed up as a way to go forward. That is, whenever a new tag is used, it is also added to the top as an attribute of the document tag thereby providing an officially "blessed" place to look for things that matter in any document.

Happy to help in this endeavor.

myrmoteras commented 1 year ago

This request was originally very pragmatic: first and foremost, we need a document we can refer to when we have a question about spelling or the available terms. We need this for our training, and internally, to avoid errors.

This goes along with the recommendation that we just published regarding the annotation of texts and which vocabulary to use, for example for subSubSection types: https://doi.org/10.3897/rio.8.e97374

This goes along with another request to add pull-down menus in GGI for a minimal set of terms: https://github.com/plazi/ggi/issues/293

This goes along with the big batch, where Guido will try to clean up renegade terms and come up with fewer alternative spellings.