Questions about event-driven API

aeb-dev commented 2 years ago

I have the following XML file (change the file extension from zip to xml; GitHub does not allow uploading XML files 😞):

Q1:

I have the following code:

final s = file.openRead().transform(Utf8Decoder()).toXmlEvents().toXmlNodes();

await for (final e in s) {

}

The resulting Stream yields only one item with nodeType XmlNodeType.ELEMENT, which is the map element. Shouldn't it also yield the other elements inside map?

Q2:

With the same file and following code:

final s = file.openRead().transform(Utf8Decoder()).toXmlEvents().normalizeEvents();

The resulting Stream yields a lot of empty XmlTextEvents. Is this normal?

Q3

This question is more like an opinion. With the same file and following code:

final s = file.openRead().transform(Utf8Decoder()).toXmlEvents().selectSubtreeEvents((value) => true);

The resulting Stream contains XmlTextEvent as well as XmlStartElementEvent and XmlEndElementEvent. The signature and the name of the function led me to expect nothing other than XmlStartElementEvent, since the predicate receives only XmlStartElementEvent and filters based on that. I can understand XmlEndElementEvent, assuming every start has an end, but XmlTextEvent feels off.

Q4

With the same file and the following code:

final s = file.openRead().transform(Utf8Decoder()).toXmlEvents().selectSubtreeEvents((value) => true).toXmlNodes();

Since selectSubtreeEvents gives XmlStartElementEvent and XmlEndElementEvent, I was expecting the resulting Stream to yield the XmlNodes that are enclosed by each start and end.

With all these questions, I feel like I am missing some piece of the puzzle. I have been at this for a while and still cannot grasp it. Feel free to ask for more information if some part is not clear. Sorry if this is too long.

renggli commented 2 years ago

Thanks a lot for the example file, this makes it really easy to reproduce the questions!

A1

toXmlNodes() converts the events to DOM nodes. If the converter encounters a start-event, it has to read until the corresponding end-event to be able to create a complete DOM node that contains all its children.

A2

The text nodes are not empty; they all contain one or more (possibly significant?) whitespace characters. You could filter them out if you are only interested in the nodes that contain actual text:

final stream = file.openRead()
    .transform(Utf8Decoder())
    .toXmlEvents()
    .normalizeEvents();
await for (final events in stream) {
  events
      .whereType<XmlTextEvent>()
      .where((event) => event.text.trim().isNotEmpty)
      .forEach(print);
}

A3

Sorry if the documentation is unclear (always happy to get pull requests that improve it). The idea with selectSubtreeEvents is that you get all the events below a node you are interested in, possibly to build DOM nodes with toXmlNodes(). If you are only interested in start and end events, you can easily filter everything else away.

A4

It does, but since your predicate matches the root element <map ... of your file, no other subtree is selected and you get the DOM of your whole root node. If you selected a repeated element deep within your document, you would get multiple subtrees, e.g.:

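final stream = file.openRead()
    .transform(Utf8Decoder())
    .toXmlEvents()
    // The predicate receives each XmlStartElementEvent.
    .selectSubtreeEvents((event) => event.name == 'tile')
    .toXmlNodes();
// Each item is a list of complete DOM nodes, one per selected subtree.
await for (final nodes in stream) {
  nodes.forEach(print);
}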

aeb-dev commented 2 years ago

> Thanks a lot for the example file, this makes it really easy to reproduce the questions!

Happy to hear that. I was afraid to be misunderstood :) Also, thank you very much for fast and detailed answers.

> A1: toXmlNodes() converts the events to DOM nodes. If the converter encounters a start-event, it has to read until the corresponding end-event to be able to create a complete DOM node that contains all its children.

I believe a streaming API should let me consume the XML event by event, from start to end, or vice versa. For example, with the following code:

final stream = file.openRead().transform(Utf8Decoder()).toXmlEvents().toXmlNodes();

Wouldn't this always produce the root element, then? The name toXmlNodes makes me feel like I will receive every node. I know that if I want to receive everything in a streaming way, I would get the nodes from leaf to root. But note that this approach lets you scan the whole document element by element in a single traversal. Would that make sense to you?

Also, the name could be toXmlElements since XmlNodeType has a lot of types.

> A2: The text nodes are not empty, they all contain one or more (possibly significant?) whitespaces. You could filter them, if you are only interested in the nodes that contain actual text:

What causes them? When I look at the file, I do not see anything that should produce them.

> A3: Sorry if the documentation is unclear (always happy to get pull requests that improve it). The idea with selectSubtreeEvents is that you get all the events below a node you are interested in, possibly to build DOM nodes with toXmlNodes(). If you are only interested in start and end events, you can easily filter everything else away.

After playing with it, it clicks. My expectation could be flawed as well; that is just my opinion.

> A4: It does, but since your predicate matches on the root element <map ... of your file, no other sub-tree is selected and you get the DOM of your whole root node. If you would select a repeated element deep within your document, you would get multiple subtrees, i.e.

After your answers, this makes sense within the implemented context, but I think it is closely related to Q1 and the way I expect the streaming to work.

renggli commented 2 years ago

> I believe streaming api should let me consume the xml, event by event, from start to end, or vice versa.

Events are a flat sequence of items. In your file this is an XML declaration event, a text event, a start element event, a text event, another start element event, etc.

Nodes are a forest of trees. In your file this is the XML declaration, a text node, an element node (with many other nodes as children), and another text node.
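To make the difference concrete, here is a minimal sketch that prints both views of the same input. The input string is a made-up stand-in for the attached file, and the helper names flatEvents and topLevelNodes are hypothetical; the stream calls are the same ones used elsewhere in this thread.

```dart
import 'package:xml/xml.dart';
import 'package:xml/xml_events.dart';

/// The flat event sequence of [input] (hypothetical helper).
List<XmlEvent> flatEvents(String input) => parseEvents(input).toList();

/// The top-level DOM nodes of [input]; children stay nested inside them
/// (hypothetical helper).
Future<List<XmlNode>> topLevelNodes(String input) => Stream.value(input)
    .toXmlEvents()
    .toXmlNodes()
    .expand((nodes) => nodes)
    .toList();

Future<void> main() async {
  // Made-up sample, not the file attached to this issue.
  const input = '<?xml version="1.0"?>\n<map>\n  <tile gid="1"/>\n</map>\n';

  // One line per event: start and end events are separate items.
  for (final event in flatEvents(input)) {
    print('event: ${event.runtimeType}');
  }

  // One line per top-level node: the whole <map> subtree is a single item.
  for (final node in await topLevelNodes(input)) {
    print('node:  ${node.nodeType}');
  }
}
```

The event list is longer than the node list because every descendant of map contributes its own events, while on the node side it is folded into the single map element.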

> Would not this always produce the root element then? The name toXmlNodes feels like I will receive every node. Now, I know that if I want to receive everything in a streaming way I should get them from leaf to root. But note that, this approach lets you do whole scan element by element in a single traverse. Would that make sense to you?

It does produce the 4 top-level nodes of your XML file: an XML declaration, a text node, an element node, and another text node. If you wanted to traverse into the descendants of the nodes, you could always do so:

await file.openRead()
    .transform(Utf8Decoder())
    .toXmlEvents()
    .toXmlNodes()
    .expand((nodes) => nodes)
    .expand((node) => node.descendants)
    .forEach(print);

> Also, the name could be toXmlElements since XmlNodeType has a lot of types.

That wouldn't make sense, because these are not necessarily just elements (see above).

> What causes them? When I look at the file I do not see anything that should produce that?

Newlines and indentation spaces between the tags.
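A small sketch that makes this whitespace visible: it collects the content of every text event and prints it JSON-escaped. The helper name textBetweenTags and the sample input are hypothetical, and it assumes the text property of XmlTextEvent used earlier in this thread is available in your version of the package.

```dart
import 'dart:convert';

import 'package:xml/xml_events.dart';

/// Returns the raw content of every text event in [input]
/// (hypothetical helper).
List<String> textBetweenTags(String input) => parseEvents(input)
    .whereType<XmlTextEvent>()
    .map((event) => event.text)
    .toList();

void main() {
  // Made-up sample: indented the way a typical XML file is.
  const input = '<map>\n  <tile gid="1"/>\n  <tile gid="2"/>\n</map>';

  // jsonEncode escapes the newlines so they show up in the output.
  for (final text in textBetweenTags(input)) {
    print(jsonEncode(text));
  }
}
```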

aeb-dev commented 2 years ago

I understand what you mean, but I could not figure out how to do the following in a single traversal:

Using the same example file, imagine that map is modeled by a class TmxMap, which has a field that represents its tiles, and so on. This map file could be very big, so it should not be loaded into memory as a whole.

You can check the models here if you like: https://github.com/aeb-dev/tmx_parser/blob/main/lib/src/tmx_map.dart The current version loads everything into memory; I want to change it to be event-driven. I think I can do it if I traverse the file multiple times, but how would I do it in a single traversal?

renggli commented 2 years ago

I see two ways to go about this:

1. Build some kind of state machine that reads over all the events and builds the objects on the fly. This is the traditional way SAX parsers are used, and you can find many tutorials and examples on the internet (1, 2).
2. Build around Dart streams, possibly with the support of something like async or RxDart. Out of the box, streams unfortunately make it quite hard to do the complicated, branching data processing you need here. Possibly this library could provide better support for such tasks (e.g. a version of selectSubtreeEvents that splits the stream into two streams), but I also don't want it to become a generic stream extension library.
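The first option could look roughly like the sketch below. It is only an illustration under stated assumptions: TmxMap, Layer, TmxMapBuilder and readMap are hypothetical, simplified names (the real tmx_parser models are richer), and the sample input is a made-up stand-in for the attached file. The only state kept between events is the layer currently being read.

```dart
import 'package:xml/xml_events.dart';

// Hypothetical, simplified models; the real tmx_parser classes are richer.
class Layer {
  Layer(this.name);
  final String name;
  final List<int> gids = [];
}

class TmxMap {
  TmxMap(this.width, this.height);
  final int width;
  final int height;
  final List<Layer> layers = [];
}

/// Consumes events one at a time and remembers only the layer being read.
class TmxMapBuilder {
  TmxMap? map;
  Layer? _currentLayer; // State: non-null while inside <layer>...</layer>.

  void add(XmlEvent event) {
    if (event is XmlStartElementEvent) {
      _onStart(event);
    } else if (event is XmlEndElementEvent && event.name == 'layer') {
      _currentLayer = null;
    }
  }

  void _onStart(XmlStartElementEvent event) {
    switch (event.name) {
      case 'map':
        map = TmxMap(_intAttr(event, 'width'), _intAttr(event, 'height'));
        break;
      case 'layer':
        final layer = Layer(_attr(event, 'name') ?? '');
        map?.layers.add(layer);
        // A self-closing <layer/> has no end event, so it never becomes
        // the current layer.
        _currentLayer = event.isSelfClosing ? null : layer;
        break;
      case 'tile':
        // Tiles outside of a layer are ignored in this sketch.
        _currentLayer?.gids.add(_intAttr(event, 'gid'));
        break;
    }
  }

  static String? _attr(XmlStartElementEvent event, String name) {
    for (final attribute in event.attributes) {
      if (attribute.name == name) return attribute.value;
    }
    return null;
  }

  // Missing or malformed numbers fall back to 0 in this sketch.
  static int _intAttr(XmlStartElementEvent event, String name) =>
      int.tryParse(_attr(event, name) ?? '') ?? 0;
}

/// Single pass over a stream of text chunks, for example the result of
/// file.openRead().transform(Utf8Decoder()).
Future<TmxMap?> readMap(Stream<String> chunks) async {
  final builder = TmxMapBuilder();
  await for (final events in chunks.toXmlEvents()) {
    events.forEach(builder.add);
  }
  return builder.map;
}

Future<void> main() async {
  const sample = '<map width="2" height="1"><layer name="ground"><data>'
      '<tile gid="1"/><tile gid="2"/></data></layer></map>';
  final map = await readMap(Stream.value(sample));
  print('${map?.width}x${map?.height}, layers: ${map?.layers.length}');
}
```

The builder still collects every gid, so memory grows with the number of tiles; if even that is too much, the same state machine could hand each finished layer to a callback instead of storing it.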

aeb-dev commented 2 years ago

Thanks for the explanations and tips, let's see how it goes.

renggli / dart-xml

Questions about event-drive api #150
