netvl / xml-rs

An XML library in Rust
MIT License
459 stars 110 forks source link

You've found a bug in xml-rs, caused by calls to push_pos() #227

Closed acuifex closed 1 year ago

acuifex commented 1 year ago

Hi!

I'm doing what the message told me. Note that I'm not a rust dev and i have no idea what I'm doing.

I'm messing around with this project: https://github.com/NeKzor/lp serde-xml-rs has a similar issue: https://github.com/RReverser/serde-xml-rs/issues/205

thread 'main' panicked at 'You've found a bug in xml-rs, caused by calls to push_pos() in states that don't end up emitting events.
            This case is ignored in release mode, and merely causes document positions to be out of sync.
            Please file a bug and include the XML document that triggers this assert.'

Package versions pulled from Cargo.lock: serde-xml-rs: 0.6.0 xml-rs: 0.8.15

url from where the xml was pulled xml file itself

NeKzor commented 1 year ago

This is an interesting bug which I did not encounter two years ago.

Here's a minimal reproducible example:

use std::io::BufReader;

use xml::{
    reader::{EventReader, XmlEvent},
    ParserConfig,
};

fn main() -> std::io::Result<()> {
    let data = r#"<root>
    <item><![CDATA[]]></item>
    <item><![CDATA[]]></item>
    <item><![CDATA[]]></item>
    <item><![CDATA[]]></item>
    <item><![CDATA[]]></item>
    <item><![CDATA[]]></item>
    <item><![CDATA[]]></item>
    <item><![CDATA[]]></item>
    <item><![CDATA[]]></item>
    <item><![CDATA[]]></item>
    <item><![CDATA[]]></item>
    <item><![CDATA[]]></item>
    <item><![CDATA[]]></item>
    <item><![CDATA[]]></item>
</root>"#;

    let config = ParserConfig::new().cdata_to_characters(true);

    let reader = BufReader::new(data.as_bytes());
    let parser = EventReader::new_with_config(reader, config);

    let mut depth = 0;
    for e in parser {
        match e {
            Ok(XmlEvent::StartElement { name, .. }) => {
                println!("{:spaces$}+{name}", "", spaces = depth * 2);
                depth += 1;
            }
            Ok(XmlEvent::EndElement { name }) => {
                depth -= 1;
                println!("{:spaces$}-{name}", "", spaces = depth * 2);
            }
            Err(e) => {
                eprintln!("Error: {e}");
                break;
            }
            _ => {}
        }
    }

    Ok(())
}

Turns out that the panic occurs after exactly 14 empty CDATA values with cdata_to_characters enabled. I really wonder why that is...

NeKzor commented 1 year ago

I bisected now all older versions where it still worked and found the commit that introduced the regression: https://github.com/netvl/xml-rs/commit/444a7c2535d362cc71d9461ffb36d3197fab3100

Not sure what this change does but the hard-coded 16 value there is very close to the magic number 14 which makes the event reader panic.

kornelski commented 1 year ago

This issue has always been happening, but instead of being caught, it was silently causing incorrect line numbers and ever-increasing memory usage.

The max depth for push_pos should be finite, because it's a stack of XML syntax constructs being parsed (like element has an attribute, attribute has an entity, and then the XML syntax doesn't go any more detailed than that).