fn:elements-to-maps: Observations

ChristianGruen commented 1 week ago

This is a placeholder for feedback on the recently added fn:elements-to-maps function.

Adopted from https://github.com/qt4cg/qtspecs/pull/529#issuecomment-1765060154 (and as also suggested by @dnovatchev), some rules still refer to JSON. I think we should refer to the XDM, XML or maps instead. Examples:

mapping XML to ~~JSON~~ a map
~~JSON~~ Map equivalent (13x) → adjust syntax
their ~~JSON~~ map equivalents
…etc

Issues that have not fully been discussed: https://github.com/qt4cg/qtspecs/pull/529#issuecomment-1765761565

https://github.com/qt4cg/qt4tests/issues/181: empty-plus shouldn't require that attributes exist.
https://github.com/qt4cg/qt4tests/issues/180: "list" incorrectly states that it doesn't apply where the INNER element has attributes.

…more to come.

michaelhkay commented 1 week ago

Adding my own notes from the discussion:

It would be useful to add a default-layout option.

With uniform=yes, I think we need to define what happens if none of the match predicates is applicable to every element - should probably fall back to "mixed" in this case.

We should try to explain the precedence rules more clearly, perhaps giving examples of how a particular layout gets chosen in particular circumstances.

There's scope for some general discussion of losslessness. What options should you choose if you want to minimise loss of information? What information gets lost unconditionally (e.g. unused namespaces)? When are comments and PIs retained? Which layouts retain ordering information?

Should discuss streamability.

Do we need to say anything more about special characters and escaping?

Should there perhaps be an option that causes a dynamic error in preference to dropping information when inappropriate layouts are chosen?

There's a need for an inverse function maps-to-elements: but that's a whole new work item.

michaelhkay commented 6 days ago

See also qt4tests issues 180 and 181.

I don't think the spec explains clearly how schema-aware layout selection and uniform layout selection interact. What happens if uniform=yes is selected and a schema is available for some elements to be converted (but not necessarily all?)

michaelhkay commented 5 days ago

A response to the comments from #529, which I considered carefully but didn't respond to:

Some more remarks, after having slept over the proposal:

We shouldn’t distinguish between text() and text()[normalize-space()]: If a text node exists, we should treat it as such and choose the mixed layout.

No, I disagree. A great deal of XML includes whitespace text nodes that are there purely for layout purposes. Including such whitespace in the result of JSON conversion would make it very messy. Significant whitespace text nodes typically exist only as siblings of non-whitespace text nodes.

From an untyped perspective, I believe that empty and simple could be merged (as well as empty-plus and simple-plus).

Possibly. But logically, there are two dimensions "has attributes = yes|no" and "has (simple) content = yes|no", and the optimum representations for the resulting 4 possibilities are all a bit different.
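The two dimensions can be tabulated in a short Python sketch. It is only an illustration: the helper name `basic_layout` is hypothetical, and the way the four combinations map onto the layout names empty, empty-plus, simple and simple-plus is inferred from this discussion rather than quoted from the spec.

```python
def basic_layout(has_attributes: bool, has_content: bool) -> str:
    """Pick a layout name from the two yes/no dimensions (hypothetical helper)."""
    base = "simple" if has_content else "empty"
    # The "-plus" variants are the ones that also carry attributes.
    return f"{base}-plus" if has_attributes else base


for attrs in (False, True):
    for content in (False, True):
        print(attrs, content, basic_layout(attrs, content))
```

Merging empty with simple (and empty-plus with simple-plus) would collapse the `has_content` dimension and leave only the attribute test.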

simple-plus: The #content key feels a bit lost. Maybe we can choose an empty string?

I think that in a query, $x?"#content" is a lot clearer to the reader than $x?"" We could of course make the choice of string an option.

If a user has chosen a layout, it feels unexpected if we resort to a fallback layout. Instead, we should rather raise an error or (better) ignore data that does not match the current layout. For the list layout, it could mean that we take the first element as reference and a) ignore subsequent elements with different element names, or b) treat all elements similar to the first one. Or, even better and easier…

Well, there are lots of options and I don't think any of them is intrinsically best. I do think it's a good idea to keep the whole function error-free, and I think that retaining data but bending the format is generally better than dropping data, though the design doesn't follow that principle everywhere.

We merge list and record and we combine duplicates. For lists, this would reflect the existing behavior; for records, this would reflect my suggestion in my earlier comment.

My main reservation here is that the semantics are very different: for lists, order matters, for records, it doesn't. The way it's currently defined, I agree, a record that contains all-duplicates ends up being handled very similarly to a list -- but not identically, and I think the differences are important.

The sequence and mixed patterns look fairly similar; maybe we can manage to merge them as well.

Having two different layouts gives the user a way to select whether they want whitespace text nodes treated as significant or not. That's an important distinction.

I wonder what we should do with comments and processing instructions. Unless we can think of specific use cases, I would tend to simply ignore them. We could also add an option to enforce their inclusion (similar as for fn:deep-equal).

I think that keeping them for mixed content and discarding them for everything else works reasonably well. We could add more options but it adds more complexity.

The fewer patterns, the better.

I'm not sure I agree. If you try various online converters you quickly find that where they deliver poor results, it's because they've made a wrong inference about the semantics of the data, and it's not difficult to refine the rules they are using to produce something significantly better.

michaelhkay commented 5 days ago

Draft PR available at #1596 - but I expect to do further work on it.

ChristianGruen commented 5 days ago

Thanks for spending time on this.

No, I disagree. A great deal of XML includes whitespace text nodes that are there purely for layout purposes. Including such whitespace in the result of JSON conversion would make it very messy. Significant whitespace text nodes typically exist only as siblings of non-whitespace text nodes.

Yes, I assume that the advantages outweigh the drawbacks. I had cases like this in mind…

elements-to-maps(parse-xml('<p><b>X</b> </p>')/*)

…which (if I see it correctly) returns { "p": { "b": "X" } }, whereas { "p": [{ "b": "X" }, " "] } would probably be what one would expect. But I assume that the general suggestion for mixed content will be to always choose the mixed layout.

I think that in a query, $x?"#content" is a lot clearer to the reader than $x?"" We could of course make the choice of string an option.

Perhaps #text or #value? The term “content” implies to me that it could contain a nested substructure.

Well, there are lots of options and I don't think any of them is intrinsically best. I do think it's a good idea to keep the whole function error-free, and I think that retaining data but bending the format is generally better than dropping data, though the design doesn't follow that principle everywhere.

I am not sure. I certainly agree for the automatic layout choice, but when I say I want to have X, I would be very surprised to get Y (and in most cases, it would be difficult to understand why). I think we should generally take everything seriously that a user requests.

ChristianGruen commented 5 days ago

PS: For mixed content, maybe we could consider xml:space='preserve'?

dariok commented 5 days ago

I gave the state of this function a try as it is currently available via the BaseX Fiddle (thanks, @ChristianGruen).

I used a bit of XML, an excerpt from a TEI file, which I used as a basis for working out how I’d serialise to JSON:

<TEI xmlns="http://www.tei-c.org/ns/1.0" xml:id="pb004027-1007">
   <teiHeader>
   </teiHeader>
   <facsimile/>
   <text>
      <body>
         <pb n="225"/>
         <head>
            <lb/><hi style="font-weight: bold;"><w>IV</w><pc>.</pc> <hi style="font-variant-caps: small-caps;"><w>La</w> <w>Violencia</w><pc>:</pc> <w>Materia</w> <w>prima</w> <w>de</w> <w>la</w> <w>seguridad</w><!-- Zur Illustration hier
noch das Beispiel für einen Kommentar in XML--></hi></hi></head>
      </body>
   </text>
</TEI>

I have to admit that at that point, I had not yet read the spec – but the result was somewhat surprising to me so that I initially suspected a bug:

{
  "Q{http://www.tei-c.org/ns/1.0}TEI": {
    "teiHeader": "",
    "text": {
      "body": {
        "head": {
          "hi": {
            "hi": [{
              "@style": "font-variant-caps: small-caps;"
            }, {
              "w": "La"
            }, {
              "w": "Violencia"
            }, {
              "pc": ":"
            }, {
              "w": "Materia"
            }, {
              "w": "prima"
            }, {
              "w": "de"
            }, {
              "w": "la"
            }, {
              "w": "seguridad"
            }, {
              "#comment": " Zur Illustration hier&#xA;noch das Beispiel für einen Kommentar in XML"
            }],
            "@style": "font-weight: bold;",
            "pc": ".",
            "w": "IV"
          },
          "lb": ""
        },
        "pb": {
          "@n": "225"
        }
      }
    },
    "facsimile": "",
    "@xml:id": "pb004027-1007"
  }
}

To me, the loss of document order as a default was very surprising. While the innermost hi element is returned as I’d expect, the other element’s content is mixed up, sometimes actually reversed. The most striking effect of this is that the attributes are actually returned after (some of) the content of an element.

While the order of elements may not always be important, a loss of order by default is likely not what users would expect. I think that in this case that is, if more than one element child is present, the result should always be an array so as to preserve document order.

dariok commented 5 days ago

PS: For mixed content, maybe we could consider xml:space='preserve'?

I’d second that. It is a clear indication that the creator of the XML expects the white space to be, well, preserved.

I know it would add another step and hence complexity to choosing the model with which to convert, but it would again be a big surprise, I believe, to some using the function.

(As a side note, given my above example: I have not included xml:space="preserve" and actually I was not surprised by the lack of white space in the result.)

michaelhkay commented 5 days ago

I tried the TEI example in a couple of online XML to JSON converters, and they both produce essentially the same output. The difference is that they are generating JSON directly, rather than generating a map which is then serialized, so they give the illusion of preserving order - but it's an illusion, because once you re-parse the JSON, you get an object in which the fields have no defined ordering.
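The point about order can be checked with any JSON parser. The following Python sketch is an illustration only; Python's own dictionaries happen to remember insertion order, but equality of parsed objects ignores it, which matches the view that member order is not part of the data.

```python
import json

# Two objects with the same members in a different textual order.
a = json.loads('{"w": "IV", "pc": "."}')
b = json.loads('{"pc": ".", "w": "IV"}')
objects_equal = a == b   # member order plays no part in object equality

# Two arrays with the same items in a different order.
x = json.loads('[{"w": "IV"}, {"pc": "."}]')
y = json.loads('[{"pc": "."}, {"w": "IV"}]')
arrays_equal = x == y    # array order is significant

print(objects_equal, arrays_equal)  # True False
```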

If you do want to preserve order with this example, you need to use "sequence" layout rather than "record" layout. You'll get that as the default if there are any duplicate names among the children, but you can request it manually if you want. This is what I get (edited to put the result through JSON serialization):

{ "Q{http:\/\/www.tei-c.org\/ns\/1.0}TEI":[
    { "@xml:id":"pb004027-1007" },
    { "teiHeader":"\n   " },
    { "facsimile":"" },
    { "text":{ "body":{
          "head": {
            "hi": {
              "hi": [
                { "@style":"font-variant-caps: small-caps;" },
                { "w":"La" },
                { "w":"Violencia" },
                { "pc":":" },
                { "w":"Materia" },
                { "w":"prima" },
                { "w":"de" },
                { "w":"la" },
                { "w":"seguridad" },
                { "#comment":" Zur Illustration hier\nnoch das Beispiel für einen Kommentar in XML" }
              ],
              "@style": "font-weight: bold;",
              "pc": ".",
              "w": "IV"
            },
            "lb": ""
          },
          "pb": { "@n":"225" }
        } } }
  ] }

michaelhkay commented 5 days ago

I've occasionally thought about having an order-retaining map implementation in Saxon. The effect would be that a JSON serialization of the map would give you the entries in the order in which they were added. Using such a map in the output of elements-to-maps would certainly have cosmetic benefits.

ChristianGruen commented 5 days ago

I've occasionally thought about having an order-retaining map

michaelhkay commented 5 days ago

I wonder if we should say that untypedAtomic values are output according to their lexical type - if it looks like a number, then output it as a number; but take account of uniform so if that's set, you only output an attribute as numeric if all attributes of the same name (on elements of the same name?) look numeric.
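A rough Python sketch of that idea follows. It is an assumption-laden illustration: the helper names are hypothetical, and using the lexical form of a JSON number as the test for "looks like a number" is a choice made here, not something the spec defines.

```python
import re

# Lexical form of a JSON number (assumed here as the "looks numeric" test).
JSON_NUMBER = re.compile(r"-?(0|[1-9][0-9]*)(\.[0-9]+)?([eE][+-]?[0-9]+)?")


def looks_numeric(value: str) -> bool:
    return JSON_NUMBER.fullmatch(value.strip()) is not None


def convert_uniform(values: list) -> list:
    """Output numbers only if every value sharing the name looks numeric."""
    if values and all(looks_numeric(v) for v in values):
        return [float(v) if any(c in v for c in ".eE") else int(v) for v in values]
    # At least one value is not numeric: keep all of them as strings.
    return list(values)


print(convert_uniform(["34", "7.5"]))      # [34, 7.5]
print(convert_uniform(["34", "unknown"]))  # ['34', 'unknown']
```

Without the uniform rule, each value would be tested on its own, so the same attribute could come out as a number in one element and as a string in the next.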

dnovatchev commented 5 days ago

It may be possible to achieve a more or less "round robin" complete-360-degrees transformation, if we define this function to return, in addition to the result maps, one special map containing the data needed to control how the reverse transformation - back from a sequence of maps to a sequence of elements - is to be produced.

Why not?

ChristianGruen commented 4 days ago

[USER1] User feedback:

I have no idea which layout is used for my XML data. A function would be helpful that does not return the transformed data, but the layouts used for the transformation.

We could…

offer an extra function,
add an option to trace layout information, or
(my favorite) add an option to include layouts in the output:

<p><a>A</a><b>B</b><c/></p> => elements-to-maps({ 'debug': true() })

{
  "p(record)": {
    "a(simple)": "A",
    "b(simple)": "B",
    "c(empty)": ""
  }
}

ChristianGruen commented 4 days ago

[USER2] More user feedback:

It’s confusing that the following function calls lead to completely different outputs:

elements-to-maps(
  <person>
    <name>Akila</name>
    <age>34</age>
  </person>
)

{"person":{"name":"Akila","age":"34"}}

elements-to-maps(
  <person>
    <name>Akila</name>
    <name>Jaha</name>
    <age>34</age>
  </person>
)

{"person":[{"name":"Akila"},{"name":"Jaha"},{"age":"34"}]}

Maybe we could change the rules for record from all-different(*!node-name()) to not(all-equal(*!node-name()))?
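To compare the two conditions on the child names of the two person examples, here is a small Python sketch. The helpers `all_different` and `all_equal` are stand-ins written for this illustration, not the XPath functions themselves.

```python
def all_different(names):
    """True if no name occurs more than once."""
    return len(set(names)) == len(names)


def all_equal(names):
    """True if every name is the same (also true for an empty list)."""
    return len(set(names)) <= 1


first = ["name", "age"]           # children of the first <person>
second = ["name", "name", "age"]  # children of the second <person>

for names in (first, second):
    print(names, all_different(names), not all_equal(names))
```

The current condition holds only for the first example, while the suggested one holds for both, so both inputs would be treated as records.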

ChristianGruen commented 4 days ago

With regard to types, I would propose to introduce a separate option:

elements-to-maps(
  <value>42</value>,
  { 'types': { 'value': 'number' } }
)

→ { "value": 42 }

I have a preference for strings, as we can prefix them with @. Next, the representation could be identical to the result, which I believe is more intuitive:

elements-to-maps(
  <value count='3'/>,
  { 'types': { '@count': 'number' } }
)

→ { "value": { "@count": 3 } }

dariok commented 4 days ago

@michaelhkay

If you do want to preserve order with this example, you need to use "sequence" layout rather than "record" layout. You'll get that as the default if there are any duplicate names among the children, but you can request it manually if you want.

The thing is: why mix up the document order by default? Would it really be problematic to have “sequence” as the default behaviour and retain “record” as ~the default~ an option?

If elements are in a specific order in the XML, I cannot really imagine any kind of processing that will fail or be problematic if that order is kept in the map or in JSON.

As hinted at in the second user reply quoted by @ChristianGruen in https://github.com/qt4cg/qtspecs/issues/1592#issuecomment-2493187896 above, I’m not the only one who finds the current behaviour puzzling.

michaelhkay commented 4 days ago

Would it really be problematic to have “sequence” as the default behaviour and retain “record” as an option?

Because if the names of the children are all distinct, then that usually suggests you're modelling an object and its properties, and the natural way of modelling an object and its properties in JSON is as a JSON object (=map). Moreover, that gives you the ability to access the properties by name.

Very often there's no semantic meaning in the order (there's no logical need to have the header, body, and footer of a table in that order), but there's a human expectation about readability. For example in the QT3 test suite we have:

<test-case name="elements-to-maps-200">
      <description> element node - implicit - empty</description>
      <created by="Michael Kay" on="2024-11-16"/>
      <test><![CDATA[
         elements-to-maps(parse-xml('<a/>')/a)
      ]]></test>
      <result>
         <assert-deep-eq>{"a":""}</assert-deep-eq>     
      </result>
   </test-case>

If you apply elements-to-maps to that single example, it will use record layout (and therefore lose the order of elements - which loses no information, but might spoil readability). In this case, however, the schema allows some of the child elements to be repeated, which means that if you process a larger sample of instances using uniform=true, or if you make the conversion schema-aware, then it will use sequence layout.

All the online XML-to-JSON tools I have tried generate an object/map for this case, and I think that's the right default. But I'm going to look again at whether there is some way of retaining a background order in a map which is used when serializing.

ChristianGruen commented 4 days ago

As hinted at in the second user reply quoted by @ChristianGruen in https://github.com/qt4cg/qtspecs/issues/1592#issuecomment-2493187896 above, I’m not the only one who finds the current behaviour puzzling.

My guess would be that in this case, the record layout would have been the best choice. But (…I agree) the user feedback I have got so far is that it’s difficult, if not impossible, to understand for users what the heuristics do as soon as they don’t deliver completely intuitive results.

Simply said, we have two types of data in XML that are to be handled completely different: structured data and mixed-content data. For mixed content, order is essential. For structured data, a compact representation is usually preferable, and order is often irrelevant.

Maybe the best default is indeed to always use the mixed layout (i.e., the layout that is closest to the XML representation), but to provide an option that enables the automatic layout choice?

ChristianGruen commented 4 days ago

All the online XML-to-JSON tools I have tried generate an object/map for this case, and I think that's the right default.

The classical order-preserving mapping for JSON is JsonML. The TEI example is returned as follows:

[ "TEI",
  { "id": "pb004027-1007" },
  [ "teiHeader" ],
  [ "facsimile" ],
  [ "text",
    [ "body",
      [ "pb", { "n": "225" } ],
      [ "head",
        [ "lb" ],
        [ "hi", { "style": "font-weight: bold;" },
          [ "w", "IV" ],
          [ "pc", "." ],
          [ "hi", { "style": "font-variant-caps: small-caps;" },
            [ "w", "La" ],
            [ "w", "Violencia" ],
            [ "pc", ":" ],
            [ "w", "Materia" ],
            [ "w", "prima" ],
            [ "w", "de" ],
            [ "w", "la" ],
            [ "w", "seguridad" ]
          ]
        ]
      ]
    ]
  ]
]
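For readers unfamiliar with the format, a JsonML-style conversion can be sketched in a few lines of Python. This is a simplified illustration, not a reference implementation: the function name `to_jsonml` is hypothetical, namespaces are not handled, comments are dropped by the parser, and whitespace-only text is skipped to mirror the listing above.

```python
import xml.etree.ElementTree as ET


def to_jsonml(elem):
    """Convert an element to a JsonML-style list: [name, {attributes}?, children...]."""
    node = [elem.tag]
    if elem.attrib:
        node.append(dict(elem.attrib))
    # Text before the first child element, unless it is whitespace only.
    if elem.text and elem.text.strip():
        node.append(elem.text)
    for child in elem:
        node.append(to_jsonml(child))
        # Text following the child element, unless it is whitespace only.
        if child.tail and child.tail.strip():
            node.append(child.tail)
    return node


xml = '<head><lb/><hi style="bold"><w>IV</w><pc>.</pc> <w>La</w></hi></head>'
result = to_jsonml(ET.fromstring(xml))
print(result)
```

Because every element becomes an array, the shape of the output does not depend on which names or attributes happen to occur in the input.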

michaelhkay commented 4 days ago

Maybe the best default is indeed to always use the mixed layout

I think this overlooks that when people have document-like content (like TEI) they are unlikely to want to convert it to JSON. The people who want conversion to JSON are generally dealing with the kind of structured data that JSON can handle well.

dariok commented 4 days ago

@michaelhkay I think this assumption is not accurate.

For the purpose of training AI models, I was asked to create a JSON representation of a corpus of TEI files that is actually 22GB of TEI/XML.

Also, on the TEI-L, there has very recently been a post that circled around the question of using JSON, at least in the user-facing components in a digital edition: https://lists.psu.edu/cgi-bin/wa?A2=TEI-L;b55719c2.2411&S=

Note that I personally do not think this is a good thing (in the course of the discussion, I expressly said that JSON is not designed for that and I also tried to make the AI folks understand this) – but still, it is a requirement that’s out there.

As regards the default: I still think that the default representation should be as close to the input as is possible within the confines of the different format. A developer can then elect to use further automation if they consider that better in their circumstances.

If the current behaviour is kept as the default, the very first note for this function should be that by default, document order may be lost and what you have to do when you want it to be retained.

michaelhkay commented 3 days ago

this assumption is not accurate

I'm sorry, but one use case does not prove that. Yes, we need to cater for a wide variety of use cases, and that's why the function provides capability to override the defaults. But the principle of least surprise suggests that we should do what most of the existing converters do by default, and follow patterns such as Goessner's: https://www.xml.com/pub/a/2006/05/31/converting-between-xml-and-json.html

ChristianGruen commented 3 days ago

the principle of least surprise suggests that we should do what most of the existing converters do by default

I believe there is an important difference: most converters don't use heuristics, so you know what you get.

The initial feedback I gathered so far is that the function works fine if the input is regular and uniform, but as soon as there are slight deviations, it can get wild. Here are some plain examples how a small change to the input results in fairly different output:

<xml>
  <info>X</info>
  <address>A</address><address>B</address>
</xml>
→ { "xml": ["A", "B"] }

<xml>
  <info>X</info>
  <address>A</address>
  <address>B</address>
</xml>
→ { "xml": [{ "info": "X" }, { "address": "A" }, { "address": "B" }] }

<xml id='id0'>
  <address>A</address>
  <address>B</address>
</xml>
→ { "xml": { "@id": "id0", "address": ["A", "B"] } }

One premise in the spec is:

The JSON should be consistent and stable: small changes in the input should not result in large changes in the output.

I think we should take this one more seriously. I would still prefer having a default that is as lossless as possible, and that ensures a stable structure when elements or attributes are added. This is especially important for large datasets, where it is easy to overlook different layouts in between.

Next, I assume that uniform=true will yield better results in most cases. It seems to be mainly a performance concern why it is disabled by default, which should not become a burden for users.

I would suggest…

providing an option to enable the automatic layout choice (which is false by default, using mixed everywhere), and
enabling uniform by default.

Having said that, the general feedback on the existence of the function was very positive.

michaelhkay commented 3 days ago

I believe there is an important difference: most converters don't use heuristics, so you know what you get.

I'm not sure what you mean by that. Empirically, I think most converters are based loosely on Goessner's rules, or something very similar. But most of them don't have any documentation, so you certainly DON'T know what you will get. Most of them do a good job with very simple XML; the main thing they get wrong is things like

<section>
   <head/>
   <para/>
   <para/>
   <table/>
   <para/>
</section>

I found that many of them lose document order in that situation, whereas our rules retain it.
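The difference can be shown with a deliberately simplified Python sketch of the record/sequence choice. It assumes leaf children without attributes and a hypothetical helper name, so it only models the "duplicate names force a sequence" rule described earlier in this thread, not the full set of layout rules.

```python
import xml.etree.ElementTree as ET


def children_to_map(elem):
    """Record layout if child names are all distinct, otherwise sequence layout."""
    pairs = [(child.tag, child.text or "") for child in elem]
    names = [name for name, _ in pairs]
    if len(set(names)) == len(names):
        return dict(pairs)                            # record: access by name
    return [{name: value} for name, value in pairs]   # sequence: keeps order


section = ET.fromstring("<section><head/><para/><para/><table/><para/></section>")
print(children_to_map(section))
# [{'head': ''}, {'para': ''}, {'para': ''}, {'table': ''}, {'para': ''}]
```

A converter that always builds an object would have to group the three para elements under one key, which loses the position of the table between them.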

Noticeably missing in Goessner's rules is any discussion of whitespace or namespaces - two of the toughest things to deal with in XML. Again, most of the online converters handle those very badly.

ChristianGruen commented 3 days ago

I'm not sure what you mean by that.

I meant to say that the structure of the current results of elements-to-maps depends completely on specific properties of the input data. By looking at the result of one input document, it will be hard to imagine what will be the resulting structure of another input document.

This contrasts with dialects such as JsonML, which have clearly documented rules that are always the same; i.e., the layout never changes, no matter what input is supplied. Admittedly, the resulting representation is not very accessible.

My experience with converters (and I confess it has been a while ago) was that most of them support only a fraction of the XML data model. Often, even attributes are ignored. This is obviously no option for us, but due to the simplicity, it is rather simple to understand and predict what they return.

dariok commented 3 days ago

this assumption is not accurate

I'm sorry, but one use case does not prove that.

Note that I deliberately used the term “inaccurate”. The question may well be how we define the “majority of cases” – but I’m quite sure that the TEI is not the smallest use case for XML and JSON out there.

But the principle of least surprise suggests that we should do what most of the existing converters do by default

Basically, there are some assumptions at work in the current default, namely

that people would prefer to access data in a certain way (with the selection based on the names of the elements or white space between them);
that document order does not matter in some cases (based on the names of elements and white space between them);
that input data are uniform insofar as the selection based on element names and white space always returns the same result.

Especially without clear documentation that small changes may yield very different results, these assumptions may lead to surprises in their own right, as shown by Christian’s examples where basically identical input data sets return different results. As he said: with large data sets, such a difference can easily happen (e.g. data added automatically vs. some that were edited manually).

On the other hand, by keeping the document order by default, we eliminate assumption number two and reduce the impact of assumption number three.

Additionally, there is the assumption here that people want what’s already out there. While that may of course be, why duplicate that which is already available and thus most likely already in use?

Also, that assumption cannot be based on the cases where the currently available tooling is used, since cases not covered by current practice would lead people either to write their own code to produce the desired output or to abandon the approach outright.

In conclusion, I still think that even when applying “least surprise”, a case can well be made for a change of the default behaviour (and, going along with that, a clear documentation what the different assumptions mean and what potential pitfalls there are).

michaelhkay commented 3 days ago

JsonML seems to have a rather different purpose from this function, but it's something we should look at. However, I'm having trouble finding a spec. The web page at jsonml.org skirts around the subject, but I can't find information on how it handles the thorny issues of whitespace and namespaces. (In the examples, whitespace in the XML is ignored completely - that can't be right, surely?)

ChristianGruen commented 3 days ago

JsonML seems to have a rather different purpose from this function, but it's something we should look at. However, I'm having trouble finding a spec.

Some more information can be found at http://www.jsonml.org/xml/. From our QT4CG point of view, it’s certainly sketchy.

(In the examples, whitespace in the XML is ignored completely - that can't be right, surely?)

When whitespace exists, JsonML treats it as ordinary text.

Maybe I have a different perspective on processing whitespace, due to our database focus: We advise users to strip irrelevant whitespace as early as possible (e.g. during parsing XML). None of the bidirectional JSON mappings that we support have special rules for whitespace text nodes (all except JsonML are JSON-centric, though).

Here is a survey of XML/JSON mappings that was often referenced when I spent more time on the topic (10 years ago?):

https://wiki.open311.org/JSON_and_XML_Conversion/

michaelhkay commented 3 days ago

We advise users to strip irrelevant whitespace as early as possible (e.g. during parsing XML)

Reminds me that we still don't have options on doc(), collection(), or parse-xml() to enable stripping of whitespace during parsing. XSLT of course has xsl:strip-space but it doesn't apply selectively to different documents.

michaelhkay commented 3 days ago

It occurs to me that one additional option we might consider is disable-layouts=(layout names). For example, if simple were disabled, you would have to use simple-plus even when there are no attributes; similarly disabling record would force use of sequence.
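How such an option might resolve to an effective layout is sketched below in Python. Everything here is hypothetical: the function name, the fallback table, and the decision to raise an error when no fallback is known are assumptions made for illustration; only the two pairings simple → simple-plus and record → sequence come from the comment above.

```python
# Fallbacks named in the comment above; any other pairing would be a guess.
FALLBACK = {"simple": "simple-plus", "record": "sequence"}


def effective_layout(preferred: str, disabled: set) -> str:
    """Follow the fallback chain until a layout that is not disabled is found."""
    layout = preferred
    while layout in disabled:
        if layout not in FALLBACK:
            # Assumption: this sketch gives up; a real design might fall back to "mixed".
            raise ValueError(f"no fallback known for disabled layout {layout!r}")
        layout = FALLBACK[layout]
    return layout


print(effective_layout("simple", {"simple"}))  # simple-plus
print(effective_layout("record", {"record"}))  # sequence
print(effective_layout("record", set()))       # record
```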

michaelhkay commented 3 days ago

why duplicate that which is already available and thus most likely already in use?

Because "what's already available" is NOT available in the context of XSLT and XQuery.
