tafia / quick-xml

Rust high performance xml reader and writer
MIT License
1.21k stars 236 forks source link

Selectively deserializing XML elements with serde #165

Closed andreivasiliu closed 4 years ago

andreivasiliu commented 5 years ago

Hi!

I want to parse a 3MB XML file with serde. I used serde-xml-rs, and it is painfully slow; I hacked my way through serde-xml-rs's code to make it work with quick-xml instead (the APIs are very similar after all), and that sped it up tremendously (from 1.5s to 0.1s).

But! I don't want to deserialize an entire 3MB XML into a giant struct (which has loads of small heap-allocated vecs inside it), when all I want is to scan for a specific element inside it, and deserialize just that one.

I thought of using quick-xml's events to reach the element I want, then read_to_end() to get the whole element as a big text block, and then use serde-xml-rs to parse the text block as xml; except this approach loses all namespacing/encoding info.

I also thought of implementing some sort of from_element(xmlreader, start_element), which would give a partial Deserializer object, which is my current favorite option.

Thoughts? Any better ways to do this?

tafia commented 5 years ago

I hacked my way through serde-xml-rs's code to make it work with quick-xml instead (the APIs are very similar after all), and that sped it up tremendously (from 1.5s to 0.1s).

Awesome!

As for your question I usually deserialize things manually. Usually it is not too complicated to do (I tend to use internal States to know in which "level" i'm in). It IS much more boilerplate than using an automatic deserializer. On the other hand it is usually much faster/uses less memory.

Doing automatic serde-like deserialization is very nice too! But getting all the use cases right is complex. If I were you I'd probably open a PR on serde-xml to add a feature to use quick-xml instead of xml-rs. Not sure if they'd want to merge it but you'll probably get some idea about improving your code from there.

Anyway, let me know what you finally decide!

andreivasiliu commented 5 years ago

Doesn't look like they'd agree: https://github.com/RReverser/serde-xml-rs/issues/61

I've asked again though: https://github.com/RReverser/serde-xml-rs/issues/108

andreivasiliu commented 5 years ago

Meanwhile, just flushing some ideas I had, you can ignore me...

My first thought was to make a read_to_end()-like method that returns an object that implements Deserializer. That meant putting serde code inside quick-xml though, which is not nice. Or maybe it is? Not sure.

My second thought was making read_to_end() return the string it read; looks like inside of it, it keeps clearing the given buffer, but with a flag that could be made optional. Then I could parse that buffer with something else (e.g. serde-xml-rs). Problem is that it's basically being parsed twice, and the buffer might become huge.

My third thought was to make use of what https://github.com/tafia/quick-xml/issues/161 is bringing, but that'd still mean parsing the text twice. Also, possibly seeking back in the reader, which might be hard with e.g. streams.

My fourth thought was to make read_to_end() return a special object that implements Read, which could be used as an input for anything (including the current xml-rs/serde-xml-rs); problem is, I don't think it would be a valid XML, since it'd be missing namespace information from outer elements.

My fifth thought was to make a read_to_end() method that returns a sub-XmlReader, which would borrow the reader, namespace info, etc, and whose .next() would stop at end event of its given element. Since serde-xml would (eventually) implement Deserializer for XmlReader, that'd make the sub-XmlReader deserializable too. Problem is, that's the only thing it'd ever be useful for, and I'm not sure it's worth maintaining just for that.

Also, I think some of the above suffer from forgetting about namespace information from outer elements, so I can ignore them outright.

tafia commented 5 years ago

I think the first idea is probably the best one, again if the purpose is to have something as competitive as possible.

I would probably have a Deserializer which mutably borrows the Reader. I am not a serde expert at all so I don't know all the details. We could definitely have lot of inspiration from serde-xml project.

andreivasiliu commented 5 years ago

So, speed at any cost? You are a man after my own heart ❤️

I'll wait some more for a reply from serde-xml-rs, I'm not in any hurry, although I'm definitely not dropping this.

tafia commented 5 years ago

Haha, no, not at any cost.

My position is more that we shouldn't take shortcuts when we can avoid it. We should aim for code that is as efficient as possible, while being reasonably simple to reason about.

tafia commented 5 years ago

Just to let you know, I've just started implement a Deserializer. I am not serde expert so things can go wrong. Having serde-xml should help a lot!

andreivasiliu commented 5 years ago

Here's the generic Deserializer I made long ago, compatible with both quick-xml and xml-rs: https://github.com/andreivasiliu/serde-xml-rs/commits/quick-xml

Looks like it still works, just needed a couple tweaks to work with quick-xml 0.16. I haven't touched it since then so I don't remember much about it, but I think it was pretty complete, and was just missing the serializer part. It's probably a bit behind compared to the main serde-xml-rs repo.

Feel free to use parts (or the whole) of it if you want!

elibenporat commented 4 years ago

I would love to use this/see this happen. I'm working on a baseball data scraper which parses a LOT of XML files. I've built it all using serde-xml-rs, so would be really happy if I could just swap out xml-rs for quick-xml without losing serde support and get a free performance boost.

tafia commented 4 years ago

I am working on a port of serde-xml into quick-xml. I have a first working version of deserializer in serde branch (for now under the serialize feature). I would love having some tests if you don't mind. Thanks

elibenporat commented 4 years ago

First, thanks for taking this on, it is much appreciated. I assume you want us to use quick_xml::de::from_str()?

I'm having trouble getting anything to deserialize, I get an Err (Start,) for all the various XML files I tried replacing serde_xml_rs::from_str() to quick_xml::de::from_str().

For this link: http://gd2.mlb.com/components/game/rok/year_2008/month_07/day_10/gid_2008_07_10_dblrok_ddorok_2/linescore.xml

I had the following structs:

struct LineScoreData {
    game_pk: u32,
    game_type: char,
    venue: String,
    venue_w_chan_loc: String,
    venue_id: u32,
    time: String,
    time_zone: String,
    ampm: String,
    home_team_id: u32,
    home_team_city: String,
    home_team_name: String,
    home_league_id: u32,
    away_team_id: u32,
    away_team_city: String,
    away_team_name: String,
    away_league_id: u32,
    #[serde(rename="linescore", skip_serializing)]
    innings: Vec<LineScore>,
}
struct LineScore {
    #[serde(rename="away_inning_runs")]
    away_runs: u32,
    #[serde(rename="home_inning_runs", deserialize_with = "empty_string_is_none")]
    //needs to be an Option, since home team doesn't always bat.
    home_runs: Option<u32>,
    // Keeping the inning as a string, since we'll need it to construct URLs later
    inning: String,
}
tafia commented 4 years ago

Thank you for the quicjk response! I have made necessary modifications to parse it. Could you give it another try? Also, do you allow me to use this xml and add it to the test suite?

elibenporat commented 4 years ago

Works! The xml file is owned by Major League Baseball, but should be fine to use. You can of course use any of my code. https://github.com/jottabyte/rust-baseball has a few more examples.

I'll start testing the others later today and feed back any errors I bump into.

tafia commented 4 years ago

Awesome!

elibenporat commented 4 years ago

For this XML file: http://gd2.mlb.com/components/game/mlb/year_2008/month_05/day_10/gid_2008_05_10_cinmlb_nynmlb_1/players.xml

I used the following structs:

#[derive(Deserialize, Serialize, Debug)]
struct Game {
    #[serde(rename="team")]
    teams: Vec<Team>,
    umpires: Umpires
}

#[derive(Deserialize, Serialize, Debug)]
struct Team {
    #[serde(rename="type")]
    home_away: HomeAway,
    id: String,
    name: String,
    #[serde(rename="player")]
    players: Vec<Player>,
    #[serde(rename="coach")]
    coaches: Vec<Coach>,
}

#[derive(Deserialize, Serialize, Debug)]
enum HomeAway {
    #[serde(rename="home")]
    Home,
    #[serde(rename="away")]
    Away,
}

#[derive(Deserialize, Serialize, Debug, Clone)]
struct Player {
    id: u32,
    #[serde(rename="first")]
    name_first: String,
    #[serde(rename="last")]
    name_last: String,
    game_position: Option<String>,
    bat_order: Option<u8>,
    position: String,
}

#[derive(Deserialize, Serialize, Debug)]
struct Coach {
    position: String,
    #[serde(rename="first")]
    name_first: String,
    #[serde(rename="last")]
    name_last: String,
    id: u32,
}

This gave an error of "invalid type: string \"away\", expected enum HomeAway".

tafia commented 4 years ago

Fixed! Thanks again for the tests! Again I've added this file as part of the test suite. I may remove it before merging if there is any issue. Also would you mind continue the thread on the PR itself (#172) ?