isaacs / sax-js

A sax style parser for JS
Other
1.09k stars 325 forks source link

attribute fires before opentag #126

Closed Pomax closed 8 years ago

Pomax commented 10 years ago

I have the following XML

<!-- Entry for Kanji: 亜 -->
<character>
    <literal>亜</literal>
    <codepoint>
        <cp_value cp_type="ucs">4e9c</cp_value>
        <cp_value cp_type="jis208">16-01</cp_value>
    </codepoint>
    <!-- some elements omitted -->
</character>

and the following code

var fs = require('fs'),
    sax = require('sax'),
    parser = sax.parser(false);

parser.onattribute = function (attr) {
  console.log(">>> ATTRIBUTE: " + container + "(name:" + attr.name + ", value:"+attr.value+")");
};

parser.onopentag = function (node) {
  container = node.name.toLowerCase();
  console.log(">>> OPENTAG: " + container);
};

try {
  var file_buf = fs.readFileSync('kanjidic2.xml');
  parser.write(file_buf.toString('utf8')).close();
} catch(ex) { console.error(ex); }

Looking at the log, I get the following output:

>>> OPENTAG: kanjidic2
>>> OPENTAG: header
>>> OPENTAG: file_version
>>> OPENTAG: database_version
>>> OPENTAG: date_of_creation
>>> OPENTAG: character
>>> OPENTAG: literal
>>> OPENTAG: codepoint
>>> ATTRIBUTE: codepoint (name:CP_TYPE, value:ucs)
>>> OPENTAG: cp_value
>>> ATTRIBUTE: cp_value (name:CP_TYPE, value:jis208)
>>> OPENTAG: cp_value
[...]

Note that the output suggests the parse ordering for elements with attributes is "signal attribute name/value before element opening tag". Given the nature of sax parsing, where the open tag is seen before any attributes on that tag in terms of parse order, this would appear to be a bug in the event ordering.

It looks like the S.OPEN_TAG case should generate an opentag event before moving on to S.ATTRIB, but adding in an openTag(parser) before moving on causes a problem in the openTag function when it tries to access deferred attributes.

rezen commented 8 years ago

I experienced this issue as well.

isaacs commented 8 years ago

This is behaving as expected. Here's a more minimal example:

var xml = '<tag attrib="value" attrib2="value2">content</tag>'
var parser = require('sax').parser(false)

parser.onopentag = function (t) {
  console.log('opentag', t)
}
parser.onattribute = function (a) {
  console.log('attrib', a)
}

parser.write(xml)
parser.end()

Expected output:

attrib { name: 'ATTRIB', value: 'value' }
attrib { name: 'ATTRIB2', value: 'value2' }
opentag { name: 'TAG',
  attributes: { ATTRIB: 'value', ATTRIB2: 'value2' },
  isSelfClosing: false }

This parser is designed to be interruptible and emit data as early as possible. When we get to the closing " of the value, we have all the information we need to emit the attribute event. However, we haven't yet hit the > indicating the end of the open tag, so we can't emit the opentag event yet.

Do you have some reason for needing to know when a tag's < has started, before any attributes or the opentag event?

Pomax commented 8 years ago

yes: data reconstruction. Expected behaviour to me as a user of XML parsers is:

And as end tags cannot contain any attributes, there is no need for 'start-of' and 'end-of' signals. Some of these steps are obviously omitted for for self-closing elements:

Now, skipping the "start of open tag" part makes data reconstruction really weird compared to expected SAX parsing: you need to know what tag the attributes you're seeing belong to, because you want to be able to say:

var nodestack = [];
on('start-of-open-tag', data => nodestack.add(new Node(data)))
on('atribute', data => nodestack.top.addAttribute(data))
on('end-tag', data => nodestack.top.finalize().pop())
...

Lacking that, you need to build an attribute map "somewhere" outside of where it needs to end up, and then slot it in later, with all the bugs that can introduce.

A probably very easy fix here is to introduce that "start-of-open-tag" event, so that parsing can take advantage of the SAX parser signaling information as it sees it (because that's the order it sees things: open tag means it can yell "I see this tag: ... ; not done with it yet, though", then it can see attributes and signal those, then it can see > and yell "okay that's the end of the opening tag, expected other data from here on out")

isaacs commented 8 years ago

So, yeah, probably there's a hole in the API for "have the tag name, but haven't seen attributes or > yet."

Patch welcome!

Pomax commented 8 years ago

Indeed it is! I won't be writing it though, as I'm working on too many things right now already, with negative free time. So I'll be hoping that as the person who knows the code best, you know where to add the necessary bits.

isaacs commented 8 years ago

Use the opentagstart event, available in v1.2.0.

isaacs commented 8 years ago

Use the opentagstart event, available in v1.2.0.

Pomax commented 8 years ago

hero++

(it looks like the README.md still needs an tiny update though, it does not use that event in the "Usage" section)