RoloEdits opened 12 months ago
Thanks for your research!
> The next function it's calling does not:
Yes, this is because this function is implemented using a macro which expands into relatively long code. I have never profiled the library yet, so maybe all your suggestions are correct, but because I want to rewrite the parser I think it is premature to optimize inlining now. Currently the reading and parsing parts are very coupled, which I think is not good for readability or performance. Indirectly, this theory is confirmed by the fact that while rewriting the code in a new manner, I later discovered that the same approach is used in maybe_xml, which is now often 1.5-2x faster than quick_xml.
Thanks for the dataset. I would be very grateful if you also shared the code you used to do the comparisons and draw the graphs, so we can get back to this topic after the rewrite.
About calamine performance: if it uses serde-based parsing, then it is significantly slower than working with raw events. I think this is because reading events in a loop is better for the caches than reading one event, processing it, reading another one, processing it, and so on. I have an idea to improve serde performance by buffering events first.
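That buffering idea can be sketched with a toy event type. This is illustrative only, assuming a simplified stand-in for an XML event stream; it is not quick_xml's actual `Event` API:

```rust
// Toy illustration of "buffer events first": `Event` is a stand-in type,
// not quick_xml's real Event.
enum Event {
    Start(String),
    Text(String),
    End(String),
}

// Interleaved style: read one event, process it, read the next, ...
fn process_interleaved<I: Iterator<Item = Event>>(events: I) -> usize {
    let mut text_len = 0;
    for ev in events {
        if let Event::Text(t) = ev {
            text_len += t.len();
        }
    }
    text_len
}

// Buffered style: pull all events into a Vec first, then process them in a
// tight loop, which keeps the processing code hot in the instruction cache.
fn process_buffered<I: Iterator<Item = Event>>(events: I) -> usize {
    let buffered: Vec<Event> = events.collect();
    buffered
        .iter()
        .filter_map(|ev| match ev {
            Event::Text(t) => Some(t.len()),
            _ => None,
        })
        .sum()
}

fn main() {
    let make = || {
        vec![
            Event::Start("a".into()),
            Event::Text("hello".into()),
            Event::End("a".into()),
        ]
        .into_iter()
    };
    // Both styles compute the same result; only the memory-access pattern
    // differs.
    assert_eq!(process_interleaved(make()), 5);
    assert_eq!(process_buffered(make()), 5);
}
```

The suggestion in the comment is essentially that a serde deserializer built on the buffered style could avoid bouncing between reader and deserializer code on every event.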
As a conclusion: I definitely want to look at and improve performance and create the best XML library in Rust; it is just the next step after fixing some bugs which are coupled with the current implementation and which I hope will be fixed by the rewrite.
Inlining is also a bit of a tricky form of optimization in general. If you bloat the code size too much, larger programs will get slower rather than faster. Debugging and profiling also get more difficult, etc.
Certainly worth it in many cases, though. I agree w/ Mingun about taking a deeper look once the rewrites are done.
The code I changed was:

```rust
// reader/buffered_reader.rs
#[inline(always)]
$($async)? fn read_bytes_until $(<$lf>)? (
    &mut self,
    byte: u8,
    buf: &'b mut Vec<u8>,
    position: &mut usize,
) -> Result<Option<&'b [u8]>> {...}

#[inline(always)]
$($async)? fn read_element $(<$lf>)? (
    &mut self,
    buf: &'b mut Vec<u8>,
    position: &mut usize,
) -> Result<Option<&'b [u8]>> {...}

#[inline(always)]
pub fn read_event_into<'b>(&mut self, buf: &'b mut Vec<u8>) -> Result<Event<'b>> {...}

#[inline(always)]
pub fn read_to_end_into(&mut self, end: QName, buf: &mut Vec<u8>) -> Result<Span> {...}

// reader/mod.rs
#[inline(always)]
pub fn buffer_position(&self) -> usize {...}

#[inline(always)]
fn read_event_impl<'i, B>(&mut self, mut buf: B) -> Result<Event<'i>>
where
    R: XmlSource<'i, B>,
{...}

#[inline(always)]
fn read_until_open<'i, B>(&mut self, buf: B) -> Result<std::result::Result<Event<'i>, B>>
where
    R: XmlSource<'i, B>,
{...}

#[inline(always)]
fn read_until_close<'i, B>(&mut self, buf: B) -> Result<Event<'i>>
where
    R: XmlSource<'i, B>,
{...}

// reader/state.rs
#[inline(always)]
pub fn emit_text<'b>(&mut self, bytes: &'b [u8]) -> Result<Event<'b>> {...}

#[inline(always)]
pub fn emit_bang<'b>(&mut self, bang_type: BangType, buf: &'b [u8]) -> Result<Event<'b>> {...}

#[inline(always)]
pub fn emit_end<'b>(&mut self, buf: &'b [u8]) -> Result<Event<'b>> {...}

#[inline(always)]
pub fn emit_question_mark<'b>(&mut self, buf: &'b [u8]) -> Result<Event<'b>> {...}

#[inline(always)]
pub fn emit_start<'b>(&mut self, content: &'b [u8]) -> Result<Event<'b>> {...}

#[inline(always)]
pub fn close_expanded_empty(&mut self) -> Result<Event<'static>> {...}
```
This should be all of it. It was a pretty quick and dirty patch job, but the main idea was to create an end-to-end link with `memchr`'s own `#[inline]` hints, as well as just going through the function calls in the flamegraph and making them `#[inline(always)]`.
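The non-transitivity these attributes work around can be shown with a toy call chain. The function names below are hypothetical stand-ins, not quick_xml's actual code:

```rust
// Toy chain illustrating that `#[inline]` hints are not transitive:
// annotating `read_event` says nothing about `read_until`, so each
// function on the hot path needs its own attribute (or LTO) for
// cross-crate inlining to survive. Names are hypothetical.
#[inline(always)]
fn read_event(data: &[u8]) -> usize {
    // Even though this caller is `#[inline(always)]`, the compiler may
    // keep `read_until` as an out-of-line call unless it, too, carries
    // an inline hint.
    read_until(data, b'>')
}

#[inline(always)] // without this, the chain breaks here
fn read_until(data: &[u8], byte: u8) -> usize {
    data.iter().position(|&b| b == byte).unwrap_or(data.len())
}

fn main() {
    // Position of `>` in `<tag>` is 4.
    assert_eq!(read_event(b"<tag>"), 4);
}
```

This is why the patch had to touch every level from `read_event_into` down to the `memchr` calls rather than just the public entry points.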
For the plots, I just used a `.ipynb` notebook with Python:
```python
import plotly.express as px
import polars as pl

df = pl.read_csv('CSV')

# requires that the `master` time is in the first row
baseline = df.item(column='time', row=0)
df = df.with_columns(
    percentage_over_master=((pl.col('time') / baseline - 1) * -1)
)

fig = px.bar(
    x=df['implementation'],
    y=df['time'],
    template='plotly_dark',
    text=df['time'],
    height=800,
    labels={'y': 'seconds', 'x': 'implementation', 'color': 'implementation'},
    color=df['implementation'],
    title='quick_xml `#[inline]` Comparison: Runtime',
)
fig.update_traces(textposition='outside', texttemplate='%{text}s')
fig.show()

fig = px.bar(
    x=df['implementation'],
    y=df['percentage_over_master'],
    template='plotly_dark',
    height=800,
    labels={'y': 'percentage', 'x': 'implementation', 'color': 'implementation'},
    color=df['implementation'],
    text_auto='.4p',
    title='quick_xml `#[inline]` Comparison: Percentage Over Master',
)
fig.update_traces(textposition='outside')
fig.update_yaxes(showticklabels=False)
fig.show()
```
The csv was shaped like this:

```csv
implementation,time
`master`,5.941
`master` + `lto`,4.623
`#[inline]`,5.236
`#[inline(always)]`,4.232
`#[inline(always)]` + `lto`,4.148
```
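As a sanity check, the `percentage_over_master` values follow directly from these numbers via the notebook's formula, `(time / baseline - 1) * -1`. This stdlib-only snippet (not part of the original notebook) recomputes them:

```python
# Recompute "percentage over master" from the CSV numbers above,
# using the same formula as the notebook: (time / baseline - 1) * -1.
times = {
    '`master`': 5.941,
    '`master` + `lto`': 4.623,
    '`#[inline]`': 5.236,
    '`#[inline(always)]`': 4.232,
    '`#[inline(always)]` + `lto`': 4.148,
}
baseline = times['`master`']
for name, t in times.items():
    pct = (t / baseline - 1) * -1
    print(f'{name}: {pct:.2%}')
```

For example, `#[inline(always)]` at 4.232s works out to roughly 28.8% over `master`, and `master` + `lto` alone to roughly 22.2%.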
I used `hyperfine` for the runtime comparisons.
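A hyperfine run for this kind of comparison might look like the following; the binary names and input file are placeholders, not the actual commands used here:

```sh
# Hypothetical hyperfine invocation; paths and names are placeholders.
hyperfine --warmup 2 \
    --export-csv results.csv \
    './target/release/parse-master large.xml' \
    './target/release/parse-inline-always large.xml'
```

`--export-csv` produces machine-readable timings, which is convenient for feeding a notebook like the one above.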
When looking at the code, I noticed that there was a break in `#[inline]` propagation on functions. Given that `#[inline]` is non-transitive, cross-crate inline optimization can very easily be disrupted if the chain breaks at the wrong depth.

When running tests with this setup, I noticed that although this function has `#[inline]`:

The next function it's calling does not:
In trying to see the effects of the break in the `#[inline]` propagation, I ran a few tests. I first went through and added `#[inline]` to the functions so that there is an end-to-end `#[inline]` chain, and then later went through and changed it to `#[inline(always)]` to see what the differences would be.

I ran with `lto=true` as well, to see if there is any benefit to `#[inline]` or if it just ultimately matches what manual `#[inline]` decoration brings; the question being whether, rather than adding the attribute, the docs could just give a heads-up that LTO would benefit this crate to some degree.

The xml file was a 1.3GB file from the worksheet of this dataset.
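For reference, the `lto=true` runs correspond to a one-line change in the consuming binary's release profile. This is a generic sketch of that Cargo.toml setting, not a change to quick_xml itself:

```toml
# In the consuming binary's Cargo.toml. Link-time optimization lets the
# linker inline across crate boundaries even without explicit #[inline] hints.
[profile.release]
lto = true    # `lto = "thin"` is a cheaper variant with most of the benefit
```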
Benchmarks

Percentage over `master`:

Seeing that the changes even benefited `lto`, I think this is a worthwhile addition.

Profile

Current `master`:

`#[inline]`:

`#[inline(always)]`:

`#[inline(always)]` + `lto`:

One thing I did notice was a `memcpy` instruction that was 25% of the samples:

I'm not sure if this is part of the program or if it's just an artifact of the profiling.