Closed saethlin closed 1 year ago
Cc @Vanille-N
That's quite an obscure bug. It took two+ hours, but I finally have a lead. It's not confirmed yet, but many things point towards this being the problem.
TL;DR: I believe this is an unfortunate interaction between `RangeMap`'s `iter_mut_all` and `UniMap`'s `PartialEq::eq`.
Problem 1: I already knew that `PartialEq` testing on `UniMap` was not optimal, because maps that are equivalent are not always `PartialEq` when using the `#[derive()]`d impl. Indeed `[Some(0), None, None, None] != [Some(0)]`, although they are equivalent as maps.
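A minimal sketch of that pitfall (using a hypothetical `UniMapLike` type, not Miri's actual `UniMap`): a map backed by a `Vec<Option<T>>` with a derived `PartialEq` treats trailing `None` slots as significant, so two maps holding the same associations can still compare unequal.

```rust
// Hypothetical simplified stand-in for a Vec<Option<T>>-backed map.
// The derived PartialEq compares the backing vectors element by element,
// so trailing empty slots make "equivalent" maps compare unequal.
#[derive(Debug, PartialEq)]
struct UniMapLike {
    slots: Vec<Option<u32>>,
}

fn main() {
    let a = UniMapLike { slots: vec![Some(0), None, None, None] };
    let b = UniMapLike { slots: vec![Some(0)] };
    // Same key -> value associations, yet the derived impl says "not equal".
    assert_ne!(a, b);
}
```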
Problem 2: `RangeMap`'s `iter_mut_all` does not merge identical adjacent cells; only `iter_mut` does.
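The merge behavior in question can be sketched like this (a hypothetical simplified range map, not Miri's actual `RangeMap`): a linear pass that coalesces adjacent cells whose ranges touch and whose data compare equal, which is what `iter_mut` does and `iter_mut_all` did not.

```rust
use std::ops::Range;

// Coalesce adjacent cells that touch and hold equal data.
// Sketch only: Miri's real RangeMap does this inside iter_mut.
fn merge_adjacent<T: PartialEq>(v: &mut Vec<(Range<u64>, T)>) {
    let mut i = 0;
    while i + 1 < v.len() {
        if v[i].0.end == v[i + 1].0.start && v[i].1 == v[i + 1].1 {
            // Absorb the next cell by extending the current range.
            let next = v.remove(i + 1);
            v[i].0.end = next.0.end;
        } else {
            i += 1;
        }
    }
}

fn main() {
    let mut v = vec![(0..1, 'a'), (1..2, 'a'), (2..3, 'b')];
    merge_adjacent(&mut v);
    assert_eq!(v, vec![(0..2, 'a'), (2..3, 'b')]);
}
```

Without such a pass, a byte-by-byte workload leaves one 1-byte cell per byte touched, even after their data become identical.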
Clues: […] (Ctrl+C'd) was split into 20000 different 1-byte ranges, despite all but the last few of them being visibly identical post-GC. This seems to result in adjacent ranges being unequal pre-GC and thus not merged, and even if they become equal post-GC the merge still doesn't happen, because `iter_mut_all` doesn't merge.
Note: it is very likely that `serde1` and `serde2` from the benches suffer from the same issue, and that in general most code that does a single pass over a large number of bytes with a lot of per-byte work also exhibits this bad behavior.
Yeah that seems to be it.
Replacing the implementation of `RangeMap::iter_mut_all` from

```rust
self.v.iter_mut().map(|elem| (elem.range.clone(), &mut elem.data))
```

to

```rust
let start = self.v.first().map(|e| e.range.start).unwrap_or(0);
let end = self.v.last().map(|e| e.range.end).unwrap_or(0);
let len = end - start;
self.iter_mut(Size::from_bytes(start), Size::from_bytes(len))
```
makes the TB version terminate in 50s. I'll find a way to avoid doing too much useless work: `iter_mut_all` doesn't need a binary search to find the first element, and it never has to split ranges.
It's also not ideal that 10% of the execution time is spent in `<[A] as core::slice::cmp::SlicePartialEq<B>>::equal`...
That is typical of pathological cases for Stacked Borrows. I remember considering tricks to implement the comparison as a comparison of `&[u64]`, because currently LLVM doesn't do that on its own: either it can't figure out that there is no padding in the `T` you are comparing, or there actually is padding. But once I got the GC working, it shrank the size of the stacks so much that I decided making the comparisons faster wasn't worth pursuing.
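For reference, the chunked-comparison trick mentioned above can be sketched safely like this (hypothetical helper, not what Miri does): compare two byte slices 8 bytes at a time via `u64::from_ne_bytes`. This only makes sense when the underlying data genuinely has no padding bytes, which is the obstacle described for a general `T`.

```rust
// Compare two byte slices in 8-byte chunks, then compare the tail.
// Safe sketch of the &[u64] comparison idea; valid for &[u8] because
// bytes have no padding, NOT valid for arbitrary T with padding.
fn eq_u64_chunks(a: &[u8], b: &[u8]) -> bool {
    if a.len() != b.len() {
        return false;
    }
    let ca = a.chunks_exact(8);
    let cb = b.chunks_exact(8);
    // The remainders are the final < 8 bytes of each slice.
    let (ra, rb) = (ca.remainder(), cb.remainder());
    for (x, y) in ca.zip(cb) {
        let x = u64::from_ne_bytes(x.try_into().unwrap());
        let y = u64::from_ne_bytes(y.try_into().unwrap());
        if x != y {
            return false;
        }
    }
    ra == rb
}

fn main() {
    assert!(eq_u64_chunks(b"hello world!!", b"hello world!!"));
    assert!(!eq_u64_chunks(b"hello world!!", b"hello world!?"));
    assert!(!eq_u64_chunks(b"abc", b"abcd"));
}
```

In practice one would hope LLVM vectorizes the plain slice comparison; the manual version is only interesting when it provably cannot.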
I think there should be some clever way to avoid trying to merge adjacent elements when the comparison is unlikely to succeed. Or maybe we could be sloppy and only attempt merging once every few visits.
We already stop comparing after N successful merges (I forget which value I used for N; I remember tweaking it a bit). Maybe we should also stop after N failed merges, or do something probabilistic to make sure 'late' chunks also get the chance to merge again.
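The budget idea could look like this (hypothetical names and constants, building on the simplified coalescing pass rather than Miri's real code): bail out of the merge pass once either N merges have succeeded or N comparisons have failed, so a long run of unmergeable cells can't keep paying for expensive equality checks.

```rust
use std::ops::Range;

// Coalesce adjacent equal cells, but stop after max_success successful
// merges or max_failure failed comparisons. Sketch of the proposed
// budgeting, not Miri's actual implementation.
fn merge_with_budget<T: PartialEq>(
    v: &mut Vec<(Range<u64>, T)>,
    max_success: usize,
    max_failure: usize,
) {
    let (mut ok, mut fail) = (0, 0);
    let mut i = 0;
    while i + 1 < v.len() && ok < max_success && fail < max_failure {
        if v[i].0.end == v[i + 1].0.start && v[i].1 == v[i + 1].1 {
            let next = v.remove(i + 1);
            v[i].0.end = next.0.end;
            ok += 1;
        } else {
            i += 1;
            fail += 1;
        }
    }
}

fn main() {
    let mut v = vec![(0..1, 1u32), (1..2, 1), (2..3, 2), (3..4, 2)];
    merge_with_budget(&mut v, 10, 10);
    assert_eq!(v, vec![(0..2, 1), (2..4, 2)]);
}
```

A fixed failure budget has the 'late chunks never merge' problem the comment mentions, which is what the probabilistic variant would address.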
This should be fixed by https://github.com/rust-lang/miri/pull/2865?
Yup, looks fixed to me.
This test: https://github.com/hyperium/http/blob/3a2f3e01c01e60b894d1f5b30554013d990010c6/src/header/name.rs#L1685-L1705
runs in ~40 seconds on my machine with Stacked Borrows, and for > 10 minutes with Tree Borrows. The fraction of runtime in `keep_only_needed` seems to grow over time without bound. The offending loop is here: https://github.com/rust-lang/miri/blob/e602174a78c9b32253527ce87be1542d9c7bc1d9/src/borrow_tracker/tree_borrows/tree.rs#L533-L535
`perf top` identifies: