plazi / ggxml2taxpub

Conversion of GoldenGATE XML to JATS/TaxPub at treatment level

Materials citation validation problem #43

Open gsautter opened 2 years ago

gsautter commented 2 years ago

@tcatapano looks as though there is a validation problem with marking materials citations in TaxPub treatments, but not including the details ... https://tb.plazi.org/GgServer/taxPubL1/038187E51607FFA6FF09E976FF5BF838 appears to be perfectly fine, but comes up as invalid anyway, and I think the reason is that tp:material-citation does not accept textual content.

A possible solution for the plain materials citations is to enclose the whole string in an extra named-content element with content-type set to dwc:verbatimLabel ... redundant, but it'd most likely fix this one.
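The wrapping idea could be sketched roughly as follows. This is only an illustration in Python's standard library, not the converter's actual code: the function name `wrap_verbatim` is made up, and the namespace URI bound to the `tp` prefix is an assumption that should be checked against the TaxPub DTD in use.

```python
import xml.etree.ElementTree as ET

# Assumption: namespace URI for the tp: prefix; verify against your TaxPub DTD.
TP_NS = "http://www.plazi.org/taxpub"
ET.register_namespace("tp", TP_NS)


def wrap_verbatim(citation):
    """Move the text of a plain (childless) citation into one named-content child."""
    if len(citation) > 0:
        raise ValueError("expected a plain citation without child elements")
    wrapper = ET.SubElement(
        citation, "named-content", {"content-type": "dwc:verbatimLabel"}
    )
    wrapper.text = citation.text
    citation.text = None  # no character data left directly inside the citation
    return citation


# Hypothetical sample citation, invented for illustration only.
citation = ET.Element("{%s}material-citation" % TP_NS)
citation.text = "Holotype, Brazil: Manaus."
wrap_verbatim(citation)
print(ET.tostring(citation, encoding="unicode"))
```

After the call, the citation holds exactly one child element and no direct text, which is what an element-only content model expects.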

A bigger problem will arise at higher detail levels, when we start including the detail annotations ... the details proper transform into named-content just fine, but the punctuation marks in between these details will cause the same validation problem described above, as will any interspersed plain text ... dropping all the plain text portions might solve the validation problem, but it would also rule out using the TaxPub treatments as training data for a materials citation tagger or parser, simply because the plain materials citation string could no longer be recovered.
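To make the trade-off concrete, here is a small sketch of what gets lost. The sample citation and its content-type values are invented for illustration, the element is left unprefixed to keep the snippet short, and the helper names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Invented sample: three details with punctuation in between.
SAMPLE = (
    "<material-citation>"
    '<named-content content-type="dwc:typeStatus">Holotype</named-content>'
    ', <named-content content-type="dwc:country">Brazil</named-content>'
    ': <named-content content-type="dwc:locality">Manaus</named-content>.'
    "</material-citation>"
)


def plain_string(citation):
    """Concatenate all text in document order, including interspersed text."""
    return "".join(citation.itertext())


def drop_plain_text(citation):
    """Remove all character data that sits directly inside the citation."""
    citation.text = None
    for child in citation:
        child.tail = None
    return citation


with_text = ET.fromstring(SAMPLE)
without_text = drop_plain_text(ET.fromstring(SAMPLE))
print(plain_string(with_text))
print(plain_string(without_text))
```

With the interspersed text in place, the original string comes back intact; once it is dropped, only the concatenated detail values remain and the punctuation and spacing are gone for good.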

Yet another problem arises from how to represent the implied details that result from resolving phrases like "same data as holotype" against preceding materials citations ... One way of representing this might be to only ever represent materials citations as a single named-content with type dwc:verbatimLabel, and to add a sibling below the same tp:material-citation to act as a container for the parsed and normalized details ... not sure if TaxPub can handle that approach in its current state, though ...

tcatapano commented 2 years ago

@gsautter: I'm surprised by this. In the TaxPub DTD you are using, is this not the content model of material-citation:

<!ELEMENT   tp:material-citation        (#PCDATA | named-content | tp:collecting-event  | object-id | tp:type-status | tp:material-location | tp:taxon-name | xref)* >

That model allows a mix of text and other elements or just text.

gsautter commented 2 years ago

Thanks ... seems like the server component really uses the wrong DTD (downloaded from Pensoft) ... the validation error I'm getting from that is this:

The content of element type "tp:material-citation" must match "(named-content|tp:collecting-event|object-id|tp:type-status|tp:material-location|tp:taxon-name|xref)"

Can you give me a link to the (obviously more permissive) version of the DTD you mention above?
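A quick way to see which citations would trip over the stricter model is to look for character data sitting directly inside the element. The sketch below is not a DTD validator (Python's standard library does not include one); it only mimics the one rule at issue here, and the function name and sample strings are made up.

```python
import xml.etree.ElementTree as ET


def has_direct_text(element):
    """True if non-whitespace text sits directly inside the element."""
    if element.text and element.text.strip():
        return True
    return any(child.tail and child.tail.strip() for child in element)


# Invented samples: a plain citation and the same string wrapped in named-content.
plain = ET.fromstring(
    "<material-citation>Holotype, Brazil: Manaus.</material-citation>"
)
wrapped = ET.fromstring(
    "<material-citation>"
    '<named-content content-type="dwc:verbatimLabel">Holotype, Brazil: Manaus.</named-content>'
    "</material-citation>"
)

print(has_direct_text(plain))    # text directly inside: needs the #PCDATA model
print(has_direct_text(wrapped))  # only a child element: fine under either model
```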

tcatapano commented 2 years ago

The "latest" release of TaxPub is here:

https://github.com/plazi/TaxPub/tree/v1.0-gamma

(point to https://github.com/plazi/TaxPub/blob/v1.0-gamma/tax-treatment-NS0-v1.dtd)

and the latest release candidate of what will be the next version is here:

https://github.com/plazi/TaxPub/tree/v1.0.0-rc2

(point to https://github.com/plazi/TaxPub/blob/v1.0.0-rc2/tax-treatment-NS0-v1.dtd)

Either should have the updated material-citation model.

gsautter commented 2 years ago

Thanks a lot, this is exactly what I've been looking for ... will the latest release always be under the same URL?

gsautter commented 2 years ago

While testing validation against the URL provided version of the TaxPub DTD (still thanks for the link), I encountered one error: JATS-mathmlsetup1.ent doesn't seem to exist in the repo folder you point me to, and none of its subfolders, either ... is this an oversight during prior-version cleanup, or a missing repo file?

tcatapano commented 2 years ago

@gsautter see: https://github.com/plazi/TaxPub/issues/53#issuecomment-1103665800.

That is, download the "official" JATS from https://ftp.ncbi.nih.gov/pub/jats/publishing/1.1/JATS-Publishing-1-1-MathML3-DTD.zip

and place these files from https://github.com/plazi/TaxPub/tree/v1.0-gamma

tax-treatment-NS0-v1.dtd
taxpubcustom-classes-NS0-v1.ent
taxpubcustom-elements-NS0-v1.ent
taxpubcustom-mixes-NS0-v1.ent
taxpubcustom-models-NS0-v1.ent
taxpubcustom-modules-NS0-v1.ent

alongside the downloaded JATS files and validate against tax-treatment-NS0-v1.dtd

Does that work?
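The manual steps above could be scripted along these lines. This is a sketch under assumptions: the JATS ZIP has already been downloaded, the TaxPub files come from a local checkout of the v1.0-gamma tag, and the archive's files are wanted directly in the target folder (if the ZIP unpacks into a subfolder, the TaxPub files would need to go there instead). The function name `assemble_dtd_dir` is made up.

```python
import shutil
import zipfile
from pathlib import Path

# The six TaxPub files listed above.
TAXPUB_FILES = [
    "tax-treatment-NS0-v1.dtd",
    "taxpubcustom-classes-NS0-v1.ent",
    "taxpubcustom-elements-NS0-v1.ent",
    "taxpubcustom-mixes-NS0-v1.ent",
    "taxpubcustom-models-NS0-v1.ent",
    "taxpubcustom-modules-NS0-v1.ent",
]


def assemble_dtd_dir(jats_zip, taxpub_checkout, target):
    """Unpack the JATS ZIP into target and copy the TaxPub files alongside."""
    target = Path(target)
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(jats_zip) as archive:
        archive.extractall(target)
    for name in TAXPUB_FILES:
        shutil.copy2(Path(taxpub_checkout) / name, target / name)
    # This is the DTD to validate against.
    return target / "tax-treatment-NS0-v1.dtd"
```

The returned path is the file to hand to whatever validator or entity resolver is in use.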

gsautter commented 2 years ago

It does, see also my comment in https://github.com/plazi/TaxPub/issues/53

However, this kind of surprise, and the need for the mentioned workaround, is lurking for each and every one attempting to use TaxPub straight from its home repo ... to add to the confusion, the TaxPub repo does contain a good bunch of the required .ent files, so having only a few of them missing is especially counterintuitive.

Another option would be to somewhere prominently state that the TaxPub repo is an extension to the JATS DTD, all of whose files are available at some URL that is linked to right in that very explanation (or maybe something like this is already in place and I simply missed it).

gsautter commented 2 years ago

The above approach (downloading JATS from https://ftp.ncbi.nih.gov/pub/jats/publishing/1.1/JATS-Publishing-1-1-MathML3-DTD.zip and adding the TaxPub specific files from the repo) doesn't work out of the box, either ... looks as though the NCBI provided JATS ZIP has a few MathML issues in itself, namely seeking JATS-mathmlsetup1.ent while the ZIP only contains JATS-mathml3-mathmlsetup1.ent ... Nothing a renamed copy of the latter couldn't fix, but it definitely is an issue for people who cannot simply switch their DTD entity resolver to verbose mode and track the errors step by step ...
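The renamed copy can be created with a few lines. This sketch assumes both files belong in the same folder as the rest of the unpacked JATS files; the function name `add_mathml_alias` is made up.

```python
import shutil
from pathlib import Path


def add_mathml_alias(dtd_dir):
    """Copy JATS-mathml3-mathmlsetup1.ent to the name the DTD actually asks for."""
    dtd_dir = Path(dtd_dir)
    source = dtd_dir / "JATS-mathml3-mathmlsetup1.ent"
    alias = dtd_dir / "JATS-mathmlsetup1.ent"
    if not alias.exists():
        shutil.copyfile(source, alias)
    return alias
```

The copy is skipped when a file with the expected name is already present, so rerunning it is harmless.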

Not our issue, but an upstream one that affects us, and anyone who tries to validate their TaxPub XML ... so maybe we should make the repo self-contained after all, if only to be able to provide the extra files under the required names.