tafia / quick-xml

Rust high performance xml reader and writer
MIT License
1.18k stars 235 forks source link

Parsing malformed closing tags #776

Closed evilpie closed 2 months ago

evilpie commented 3 months ago

For Ruffle it would be useful if we could parse these two kinds of malformed XML:

<a></a/>

From https://github.com/ruffle-rs/ruffle/issues/15674. Note the extra /.

<a></a hello="world">

From https://github.com/ruffle-rs/ruffle/issues/16862. Note the attribute in the closing tag!

Excuse me if I missed a config option for this.

Mingun commented 3 months ago

I think, that should be possible today, because we do not check (yet. I plan to add optional validate method to events) content of the start / end events.

To be clear, both should be recognized as Event::End. You should disable end name validation because names would be a/ and a hello="world".

dralley commented 3 months ago

It's literally possible to construct such an event, however there's no way to get the attributes from it unless you parse them yourself. BytesEnd doesn't provide any mechanism for this.

Looking at the documentation for BytesEnd: https://docs.rs/quick-xml/latest/quick_xml/events/struct.BytesEnd.html

We do give an example of creating such an event (although I think this isn't a great idea because it is currently our only example - we should not have the only example provided be an instance of nonconformant XML).

But in this example the entire string, including the attributes, is returned when you call .name(). I'm guessing that probably is not what you want @evilpie ?

evilpie commented 3 months ago

We don't actually care about the attributes or anything else in the closing tag. I just found about the config option check_end_names. Setting this to false almost gives us the Flash behavior.

However Flash treats <a></a foo="bar"> and <a></a/> as just <a></a>, but it will throw for <a></b>, because the tags don't match ... The config option also allows the last as well, so we can't use it directly.

Mingun commented 3 months ago

I know only one example which we do not support right now:

</a foo=">">

will be parsed as

Event::End(BytesEnd::new("a foo=\""));
Event::Text(BytesText::from_escaped("\">"));

Do you need a mode that will parse this as

Event::End(BytesEnd::new("a foo=\">\""));

?

evilpie commented 2 months ago

Oh no. Flash does parse it ...

var xml = new XML('<a></a foo=">">')
trace(xml.toXMLString()) // <a/>

It does seem very unlikely to happen, so if that specific case would be very hard we might be able to skip it.

Mingun commented 2 months ago

No, that is simple to support. I think, we even may always parse this in such way, because anyway the tag name cannot contain neither " or ' and any compliant parser will throw an error anyway. I plan to add an optional validate abilities which will reject such XMLs in any case no matter as we would parse it without validation.

Mingun commented 2 months ago

With release 0.36.0 it is possible to parse such XMLs without surprises.