metafacture / metafacture-core

Core package of the Metafacture tool suite for metadata processing.
https://metafacture.org
Apache License 2.0

XML Attributes and Element values #379

Closed TobiasNx closed 2 years ago

TobiasNx commented 3 years ago

As the example in https://github.com/metafacture/metafacture-core/issues/377#issue-936955906 also shows, XML attributes and elements are all reconstructed as subfields. There is no documentation on this. Also, when the result is encoded as XML again, the "new" structure is kept and the attributes remain subfields.

This specific handling of XML should be documented. Also: Is there any way to reconstruct the original structure correctly, or at least to build XML with attributes in Metafacture?

In:

<roleTerm authority="marcrelator" type="text">Author</roleTerm>

FLUX

infile
| open-file
| decode-xml
| handle-generic-xml
| encode-xml
| write(FLUX_DIR + "result.xml")
;

[Same if you use a morph with _elseNested]

Out:

<roleTerm>
    <authority>marcrelator</authority>
    <type>text</type>
    <value>Author</value>
</roleTerm>

blackwinter commented 2 years ago

Probably the same underlying issue as #336 (part 2). XML attributes are decoded into literals, so they can't be distinguished from actual elements downstream.

The XML decoder and encoder would have to agree on a way to preserve this information (similar to what JSON decoder and encoder do for array fields). Maybe @<name> (which would require escaping in the morph) or <name>@ or something like that (and ideally configurable).
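
The ambiguity and the marker idea can be illustrated with a minimal Python sketch. This is not Metafacture code: the helper `element_to_literals`, its `attribute_marker` and `value_name` parameters, and the choice of `@` as marker are assumptions made only for illustration.

```python
import xml.etree.ElementTree as ET


def element_to_literals(xml_text, attribute_marker="", value_name="value"):
    """Flatten one XML element into (entity name, [(literal name, value), ...]).

    Hypothetical helper: attribute names are prefixed with `attribute_marker`
    so that they stay distinguishable from child elements downstream.
    """
    element = ET.fromstring(xml_text)
    # Attributes become literals, optionally prefixed with the marker.
    literals = [(attribute_marker + name, value)
                for name, value in element.attrib.items()]
    # Child elements (assumed to be leaves here) become plain literals.
    for child in element:
        literals.append((child.tag, (child.text or "").strip()))
    # Text content of the element becomes the literal named `value_name`.
    text = (element.text or "").strip()
    if text:
        literals.append((value_name, text))
    return element.tag, literals


attr_xml = '<roleTerm authority="marcrelator" type="text">Author</roleTerm>'
elem_xml = ('<roleTerm><authority>marcrelator</authority>'
            '<type>text</type><value>Author</value></roleTerm>')

# Without a marker, both inputs flatten to the same literals.
print(element_to_literals(attr_xml) == element_to_literals(elem_xml))
# With a marker, attributes remain recognizable.
print(element_to_literals(attr_xml, attribute_marker="@"))
```

In this sketch the two inputs are indistinguishable once flattened without a marker, which is why an encoder downstream cannot know which literals were attributes.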

TobiasNx commented 2 years ago

The usual convention is that the text between the opening and closing tags of an XML element is transformed into a literal named value, and each attribute becomes a literal with the same name as the attribute. Attributes and value are combined in one entity.

The two problems here are: 1) There is no documentation of this transformation of values and attributes. This should be a quick fix.

2) The xml-encoder can't reconstruct the "old" structure. It should understand, at least optionally, that the literal called "value" is the element's text and that all other fields in the same entity are its attributes. @blackwinter isn't this the convention you are looking for? Catmandu does something similar but uses the field name content instead of value.

<roleTerm authority="marcrelator" type="text">Author</roleTerm>

->

roleTerm:
   authority: marcrelator
   type: text
   value: Author

Some transformation changes the value Author to Creator and the value of authority to greatVocab:

->


roleTerm:
   authority: greatVocab
   type: text
   value: Creator

The encoder then should be able to transform to:

<roleTerm authority="greatVocab" type="text">Creator</roleTerm>
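
The requested encoder behavior can be sketched in a few lines of Python. This is only an illustration of the convention, not the Metafacture encoder: the function `encode_entity` and its `value_name` parameter are hypothetical names.

```python
from xml.sax.saxutils import escape, quoteattr


def encode_entity(name, literals, value_name="value"):
    """Encode one entity as an XML element (hypothetical helper).

    The literal named `value_name` becomes the element text;
    every other literal becomes an attribute.
    """
    attributes = "".join(
        f" {key}={quoteattr(val)}" for key, val in literals if key != value_name
    )
    text = "".join(escape(val) for key, val in literals if key == value_name)
    return f"<{name}{attributes}>{text}</{name}>"


literals = [("authority", "greatVocab"), ("type", "text"), ("value", "Creator")]
print(encode_entity("roleTerm", literals))
# prints <roleTerm authority="greatVocab" type="text">Creator</roleTerm>
```

Passing `value_name="content"` would give the Catmandu-style naming mentioned above.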

blackwinter commented 2 years ago

The xml-encoder can't reconstruct the "old" structure. It should understand, at least optionally, that the literal called "value" is the element's text and that all other fields in the same entity are its attributes.

Yes, I guess this implicit attribute handling should be possible as well: Treat all literals as attributes, except those named value (ideally, the "value" literal name would be configurable).

But SimpleXmlEncoder already has the concept of an attribute marker (~), it's just that GenericXmlHandler doesn't emit it (and it's hard-coded).
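
For comparison, an explicit marker-based encoder could look like the following Python sketch. It only illustrates the idea of an attribute marker and does not reproduce the actual SimpleXmlEncoder implementation; `encode_with_marker`, `attribute_marker` and `value_tag` are names chosen for this example.

```python
from xml.sax.saxutils import escape, quoteattr


def encode_with_marker(name, literals, attribute_marker="~", value_tag="value"):
    """Encode one entity as XML using an explicit attribute marker (sketch).

    Literals whose name starts with `attribute_marker` become attributes,
    the literal named `value_tag` becomes the element text, and all other
    literals become child elements. An empty `value_tag` disables the
    text handling in this sketch.
    """
    attributes, children, text = [], [], []
    for key, val in literals:
        if attribute_marker and key.startswith(attribute_marker):
            attributes.append(f" {key[len(attribute_marker):]}={quoteattr(val)}")
        elif value_tag and key == value_tag:
            text.append(escape(val))
        else:
            children.append(f"<{key}>{escape(val)}</{key}>")
    return f"<{name}{''.join(attributes)}>{''.join(text)}{''.join(children)}</{name}>"


marked = [("~authority", "marcrelator"), ("~type", "text"), ("value", "Author")]
unmarked = [("authority", "marcrelator"), ("type", "text"), ("value", "Author")]

# Marked literals are written back as attributes.
print(encode_with_marker("roleTerm", marked))
# Unmarked literals with an empty value tag all end up as child elements.
print(encode_with_marker("roleTerm", unmarked, value_tag=""))
```

In this sketch the second call reproduces the nested output shown at the top of this issue, because nothing in the stream marks `authority` and `type` as attributes.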

blackwinter commented 2 years ago

This also mainly (only?) applies to streams produced by GenericXmlHandler. I'm not sure about the other XML handlers, or about non-XML input streams that are to be encoded as XML output.

dr0i commented 2 years ago

Reopened and assigned @TobiasNx for functional review.

TobiasNx commented 2 years ago

Also, there seems to be a severe (?) API break in the new handling of attributes and values when no option is set at all! The value tags are lost by default, and some other handling seems to be different now too:

https://github.com/TobiasNx/notWorkingFlux/commit/9fdffea8fdc4dc7a8bc23ec4d8843690d978d33e?branch=9fdffea8fdc4dc7a8bc23ec4d8843690d978d33e&diff=split

Shouldn't the default settings stay the same?

My initial request still stands: there needs to be documentation about how Metafacture handles XML and that it decodes attributes and element values as "fields".

Also, I did not find any documentation on the attributeMarker apart from the test cases. Am I missing something?

blackwinter commented 2 years ago

Shouldn't the default settings stay the same?

Yes indeed, this is a side effect of d6e68ff. @dr0i: Was this intentional? I certainly missed it :( (Initially SimpleXmlEncoder.DEFAULT_VALUE_TAG = "", now DefaultXmlPipe.DEFAULT_VALUE_TAG = "value")

TobiasNx commented 2 years ago

Pascal and I teamed up and fixed this:

https://github.com/metafacture/metafacture-core/pull/406

katauber commented 2 years ago

Closed with #406