RoloEdits opened 12 months ago
Thanks for your research!
> The next function it's calling does not:
Yes, this is because this function is implemented using a macro which expands into relatively long code. I have never profiled the library yet, so maybe all your suggestions are correct, but because I want to rewrite the parser I think it is premature to optimize inlining now. Currently the reading and parsing parts are very coupled, which I think is not good for readability or performance. Indirectly, this theory is confirmed by the fact that while rewriting the code in a new manner, I later discovered that the same approach is used in maybe_xml, which is now often 1.5-2x faster than quick_xml.
Thanks for the dataset. I would be very grateful if you also shared the code you used to do the comparisons and draw the graphs, so we can get back to this topic after the rewrite.
About calamine performance: if it uses serde-based parsing, then it is significantly slower than working with raw events. I think this is because reading events in a loop is better for the caches than reading one event, processing it, reading another one, processing it, and so on. I have an idea to improve serde performance by buffering events first.
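That buffering idea can be sketched with a toy event type. This is illustrative only, assuming a simplified stand-in for an XML event stream; it is not quick_xml's actual `Event` API:

```rust
// Toy illustration of "buffer events first": `Event` is a stand-in type,
// not quick_xml's real Event.
enum Event {
    Start(String),
    Text(String),
    End(String),
}

// Interleaved style: read one event, process it, read the next, ...
fn process_interleaved<I: Iterator<Item = Event>>(events: I) -> usize {
    let mut text_len = 0;
    for ev in events {
        if let Event::Text(t) = ev {
            text_len += t.len();
        }
    }
    text_len
}

// Buffered style: pull all events into a Vec first, then process them in a
// tight loop, which keeps the processing code hot in the instruction cache.
fn process_buffered<I: Iterator<Item = Event>>(events: I) -> usize {
    let buffered: Vec<Event> = events.collect();
    buffered
        .iter()
        .filter_map(|ev| match ev {
            Event::Text(t) => Some(t.len()),
            _ => None,
        })
        .sum()
}

fn main() {
    let make = || {
        vec![
            Event::Start("a".into()),
            Event::Text("hello".into()),
            Event::End("a".into()),
        ]
        .into_iter()
    };
    // Both styles compute the same result; only the memory-access pattern
    // differs.
    assert_eq!(process_interleaved(make()), 5);
    assert_eq!(process_buffered(make()), 5);
}
```

The suggestion in the comment is essentially that a serde deserializer built on the buffered style could avoid bouncing between reader and deserializer code on every event.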
As a conclusion: I definitely want to look at and improve performance and create the best XML library in Rust; it is just the next step after fixing some bugs which are coupled with the current implementation and which I hope will be fixed by the rewrite.
Inlining is also a bit of a tricky form of optimization in general. If you bloat the code size too much, larger programs will get slower rather than faster. Debugging and profiling also get more difficult, etc.
Certainly worth it in many cases, though. I agree w/ Mingun about taking a deeper look once the rewrites are done.
The code I changed was:

```rust
// reader/buffered_reader.rs
#[inline(always)]
$($async)? fn read_bytes_until $(<$lf>)? (
    &mut self,
    byte: u8,
    buf: &'b mut Vec<u8>,
    position: &mut usize,
) -> Result<Option<&'b [u8]>> {...}

#[inline(always)]
$($async)? fn read_element $(<$lf>)? (
    &mut self,
    buf: &'b mut Vec<u8>,
    position: &mut usize,
) -> Result<Option<&'b [u8]>> {...}

#[inline(always)]
pub fn read_event_into<'b>(&mut self, buf: &'b mut Vec<u8>) -> Result<Event<'b>> {...}

#[inline(always)]
pub fn read_to_end_into(&mut self, end: QName, buf: &mut Vec<u8>) -> Result<Span> {...}

// reader/mod.rs
#[inline(always)]
pub fn buffer_position(&self) -> usize {...}

#[inline(always)]
fn read_event_impl<'i, B>(&mut self, mut buf: B) -> Result<Event<'i>>
where
    R: XmlSource<'i, B>,
{...}

#[inline(always)]
fn read_until_open<'i, B>(&mut self, buf: B) -> Result<std::result::Result<Event<'i>, B>>
where
    R: XmlSource<'i, B>,
{...}

#[inline(always)]
fn read_until_close<'i, B>(&mut self, buf: B) -> Result<Event<'i>>
where
    R: XmlSource<'i, B>,
{...}

// reader/state.rs
#[inline(always)]
pub fn emit_text<'b>(&mut self, bytes: &'b [u8]) -> Result<Event<'b>> {...}

#[inline(always)]
pub fn emit_bang<'b>(&mut self, bang_type: BangType, buf: &'b [u8]) -> Result<Event<'b>> {...}

#[inline(always)]
pub fn emit_end<'b>(&mut self, buf: &'b [u8]) -> Result<Event<'b>> {...}

#[inline(always)]
pub fn emit_question_mark<'b>(&mut self, buf: &'b [u8]) -> Result<Event<'b>> {...}

#[inline(always)]
pub fn emit_start<'b>(&mut self, content: &'b [u8]) -> Result<Event<'b>> {...}

#[inline(always)]
pub fn close_expanded_empty(&mut self) -> Result<Event<'static>> {...}
```
This should be all of it. It was a pretty quick and dirty patch job, but the main idea was to create an end-to-end link with `memchr`'s own `#[inline]` hints, as well as just going through the function calls in the flamegraph and making them `#[inline(always)]`.
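The non-transitivity these attributes work around can be shown with a toy call chain. The function names below are hypothetical stand-ins, not quick_xml's actual code:

```rust
// Toy chain illustrating that `#[inline]` hints are not transitive:
// annotating `read_event` says nothing about `read_until`, so each
// function on the hot path needs its own attribute (or LTO) for
// cross-crate inlining to survive. Names are hypothetical.
#[inline(always)]
fn read_event(data: &[u8]) -> usize {
    // Even though this caller is `#[inline(always)]`, the compiler may
    // keep `read_until` as an out-of-line call unless it, too, carries
    // an inline hint.
    read_until(data, b'>')
}

#[inline(always)] // without this, the chain breaks here
fn read_until(data: &[u8], byte: u8) -> usize {
    data.iter().position(|&b| b == byte).unwrap_or(data.len())
}

fn main() {
    // Position of `>` in `<tag>` is 4.
    assert_eq!(read_event(b"<tag>"), 4);
}
```

This is why the patch had to touch every level from `read_event_into` down to the `memchr` calls rather than just the public entry points.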
For the plots, I just used a `.ipynb` notebook with Python:
```python
import plotly.express as px
import polars as pl

df = pl.read_csv('CSV')

# requires that the `master` time is in the first row
baseline = df.item(column='time', row=0)
df = df.with_columns(
    percentage_over_master=((pl.col('time') / baseline - 1) * -1)
)

fig = px.bar(
    x=df['implementation'],
    y=df['time'],
    template='plotly_dark',
    text=df['time'],
    height=800,
    labels={'y': 'seconds', 'x': 'implementation', 'color': 'implementation'},
    color=df['implementation'],
    title='quick_xml `#[inline]` Comparison: Runtime',
)
fig.update_traces(textposition='outside', texttemplate='%{text}s')
fig.show()

fig = px.bar(
    x=df['implementation'],
    y=df['percentage_over_master'],
    template='plotly_dark',
    height=800,
    labels={'y': 'percentage', 'x': 'implementation', 'color': 'implementation'},
    color=df['implementation'],
    text_auto='.4p',
    title='quick_xml `#[inline]` Comparison: Percentage Over Master',
)
fig.update_traces(textposition='outside')
fig.update_yaxes(showticklabels=False)
fig.show()
```
The csv was shaped like this:

```csv
implementation,time
`master`,5.941
`master` + `lto`,4.623
`#[inline]`,5.236
`#[inline(always)]`,4.232
`#[inline(always)]` + `lto`,4.148
```
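As a sanity check, the `percentage_over_master` values follow directly from these numbers via the notebook's formula, `(time / baseline - 1) * -1`. This stdlib-only snippet (not part of the original notebook) recomputes them:

```python
# Recompute "percentage over master" from the CSV numbers above,
# using the same formula as the notebook: (time / baseline - 1) * -1.
times = {
    '`master`': 5.941,
    '`master` + `lto`': 4.623,
    '`#[inline]`': 5.236,
    '`#[inline(always)]`': 4.232,
    '`#[inline(always)]` + `lto`': 4.148,
}
baseline = times['`master`']
for name, t in times.items():
    pct = (t / baseline - 1) * -1
    print(f'{name}: {pct:.2%}')
```

For example, `#[inline(always)]` at 4.232s works out to roughly 28.8% over `master`, and `master` + `lto` alone to roughly 22.2%.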
I used `hyperfine` for the runtime comparisons.
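A hyperfine run for this kind of comparison might look like the following; the binary names and input file are placeholders, not the actual commands used here:

```sh
# Hypothetical hyperfine invocation; paths and names are placeholders.
hyperfine --warmup 2 \
    --export-csv results.csv \
    './target/release/parse-master large.xml' \
    './target/release/parse-inline-always large.xml'
```

`--export-csv` produces machine-readable timings, which is convenient for feeding a notebook like the one above.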
When looking at the code, I noticed that there was a break in `#[inline]` propagation on functions. Given that `#[inline]` is non-transitive, cross-crate inline optimization can very easily be disrupted if the chain breaks at the wrong depth.

When running tests with this setup, I noticed that although this function has `#[inline]`:

The next function it's calling does not:
In trying to see the effects of the break in the `#[inline]` propagation, I ran a few tests. I first went through and added `#[inline]` to the functions so that there is an end-to-end `#[inline]` chain, and then later went through and changed it to `#[inline(always)]` to see what the differences would be.

I ran with `lto=true` as well, to see if there is any benefit to `#[inline]` or if it just ultimately matches what manual `#[inline]` decoration brings; the question being whether, rather than adding the attribute, the docs could just give a heads-up that LTO would benefit this crate to some degree.

The xml file was a 1.3GB file from the worksheet of this dataset.
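For reference, the `lto=true` runs correspond to a one-line change in the consuming binary's release profile. This is a generic sketch of that Cargo.toml setting, not a change to quick_xml itself:

```toml
# In the consuming binary's Cargo.toml. Link-time optimization lets the
# linker inline across crate boundaries even without explicit #[inline] hints.
[profile.release]
lto = true    # `lto = "thin"` is a cheaper variant with most of the benefit
```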
Benchmarks

Percentage over `master`:

Seeing that the changes even benefited `lto`, I think this is a worthwhile addition.

Profile

Current `master`:

`#[inline]`:

`#[inline(always)]`:

`#[inline(always)]` + `lto`:

One thing I did notice was a `memcpy` instruction that was 25% of the samples:

I'm not sure if this is part of the program or if it's just an artifact of the profiling.