Closed: dfyz closed this issue 4 years ago
cc @rust-lang/compiler @rustbot ping icebreakers-llvm
The release team is considering making a point release for Rust 1.41 (we briefly discussed it in last week's meeting), and I'd love for this to be included in it if we can get a PR up soon.
Hey LLVM ICE-breakers! This bug has been identified as a good "LLVM ICE-breaking candidate". In case it's useful, here are some instructions for tackling these sorts of bugs. Maybe take a look? Thanks! <3
cc @comex @DutchGhost @hanna-kruppe @hdhoang @heyrutvik @JOE1994 @jryans @mmilenko @nagisa @nikic @Noah-Kennedy @SiavoshZarrasvand @spastorino @vertexclique @vgxbj
Running it with cargo run --release using the stable 1.41.0 release prints something like this (tested on macOS 10.15 and Ubuntu 19.10):
I cannot reproduce this on playground. The program works fine there on 1.41.0 in release mode.
EDIT: Ah, you already said that. Also the program is fine in Miri, so this is likely not UB but a miscompilation.
Just to add a data point, I can reproduce this on Linux with the latest nightly:
[andrew@krusty rust-69225]$ rustc --version
rustc 1.43.0-nightly (5e7af4669 2020-02-16)
[andrew@krusty rust-69225]$ cat main.rs
fn do_test(x: usize) {
    let arr = vec![vec![0u8; 3]];
    let mut z = Vec::new();
    for arr_ref in arr {
        for y in 0..x {
            for _ in 0..1 {
                z.extend(std::iter::repeat(0).take(x));
                let a = y * x;
                let b = (y + 1) * x - 1;
                let slice = &arr_ref[a..b];
                eprintln!("{} {} {} {}", a, b, arr_ref.len(), slice.len());
                eprintln!("{:?}", slice[1 << 24]);
            }
        }
    }
}

fn main() {
    do_test(1);
    do_test(2);
}
[andrew@krusty rust-69225]$ rustc -C opt-level=3 main.rs
[andrew@krusty rust-69225]$ ./main
0 0 3 18446744073709551615
zsh: segmentation fault (core dumped) ./main
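(Side note, my own illustration rather than part of the report: the bogus slice length 18446744073709551615 printed above is exactly usize::MAX on a 64-bit target, i.e., what an unchecked 0 - 1 produces when the subtraction is allowed to wrap.)

```rust
fn main() {
    // The miscompiled program reports a slice length of
    // 18446744073709551615, which is what subtracting 1 from 0 yields
    // when the operation wraps instead of being checked.
    let wrapped = 0usize.wrapping_sub(1);
    assert_eq!(wrapped, usize::MAX);
    println!("{}", wrapped);
}
```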
I was able to reproduce the above with the exact same output with Rust 1.41 stable. Rust 1.40 stable does not exhibit the problem:
$ ./main
0 0 3 0
thread 'main' panicked at 'index out of bounds: the len is 0 but the index is 16777216', main.rs:13:35
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace.
I think this is all consistent with @dfyz's report, except this at least confirms that it isn't macOS specific.
cargo-bisect-rustc says the regression first happened in the 2019-12-12 nightly, specifically in this commit. This seems suspicious to me, given that 1.40.0, which does not exhibit the problem, was released after this date.
This is expected. 1.40.0 was released on 2019-12-19 based on what was then the beta branch, which was branched from master six weeks earlier (around the time of the 1.39.0 release). See https://doc.rust-lang.org/book/appendix-07-nightly-rust.html for more about release channels.
If I had to guess, I'd say https://github.com/rust-lang/rust/pull/67015 is the likely culprit. This fixed 3 codegen issues, so it touches codegen-critical code. Cc @osa1 @oli-obk @wesleywiser
Thanks for the ping. I'm currently making a build to investigate. It's possible that this is a bug introduced with #67015. Another possibility is that #67015 just revealed an existing bug.
I managed to reproduce the segfault using rustc nightly on Linux, but not with my build generated with this config.toml:
Using rustc nightly, I checked the MIR before and after ConstProp, and the MIRs are identical. So if this is caused by ConstProp, then it's because of a difference in the generated code of a library, not of this program.
Regression in 033662dfbca088937b9cdfd3d9584015b5e375b2
@rustbot modify labels: -E-needs-bisection
@osa1 the debug-assertions = true setting is probably to blame. When I try to compile the program (with a vanilla nightly compiler) with -C debug-assertions=y, it panics instead of segfaulting.
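(A minimal illustration of why this flag matters, mine rather than from the thread: -C debug-assertions also controls integer overflow checks by default, so arithmetic that wraps silently in release builds panics instead when the flag is on. The checked/wrapping methods make either behavior explicit regardless of build mode.)

```rust
fn main() {
    let x: u8 = 250;
    // In a build with debug assertions enabled, `x + 10` panics with
    // "attempt to add with overflow"; without them it wraps silently.
    // checked_add/wrapping_add spell out each behavior explicitly:
    assert_eq!(x.checked_add(10), None);
    assert_eq!(x.wrapping_add(10), 4);
    println!("overflow handled explicitly");
}
```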
I think I solved it! Reverting a983e0590a43ed8b0f60417828efd4e79b51f494 fixes the issue. It seemed like the culprit to me all day, but I couldn't test it at work :) Can someone advise me on the best way to make a PR for this issue? I think the best way would be to add a test case that must fail, but this seems very platform-specific, so maybe that is not a good idea after all? Any ideas here? Thanks!
(This was already bisected by the OP)
I managed to reproduce this with a local build with debug-assertions turned off (thanks for that @hellow554). Some of the PRs in the rollup cause conflicts when reverted because we have since rustfmt-ed everything, but I believe this issue is because of #67174.
edit: it seems we found this at the same time @shahn :)
@lqd yes, this is the issue that includes the commit that I referenced above. That is the culprit.
To add another data point, the problem disappears when codegen-units is set to 3 or less (assuming the release profile, with incremental=false). In other words, I'm able to reproduce when codegen-units is 4 or greater.
The call to panic_bounds_check disappears after LLVM's Jump Threading pass. I can reproduce the issue with opt from LLVM 9, but not from LLVM 10.
So, I checked out and built a stage 1 rustc (./x.py build -i --stage 1), rebuilt libstd (./x.py build -i --stage 1 --keep-stage 0 src/libstd) without #67174, and recompiled the segfaulting program with four codegen units (rustc +stage1 -C opt-level=3 -C codegen-units=4 main.rs). As expected, this made the segfault go away. If I apply #67174 back again, the segfault returns.
This means I now have two compilers which differ only in the standard library they use. Let's call these compilers GOOD (no segfault) and BAD (segfault).
I then noticed that the 4 unoptimized *.ll files generated by GOOD (-C no-prepopulate-passes) are pretty much the same as the ones generated by BAD (the only difference I saw was the different random IDs in function names), but the optimized *.ll files (no -C no-prepopulate-passes) are wildly different. I'm not a compiler expert in any way, but since the program being compiled is exactly the same in both cases, both compilers proper are exactly the same, and the only difference is in the precompiled standard library, I think LTO might be involved.
Indeed, if I pass any of -Z thinlto=no, -C lto=no, -C lto=yes, or -C lto=thin (at this point I'm really not sure, but I guess all forms of -C lto are different from ThinLTO, which is used by default) to BAD, the segfault once again disappears.
Does it look likely that LTO might be at fault here?
I've read the test case. I've read the commit that is being reverted. I still haven't got the foggiest idea of what's going on. What broke?
What broke?
I don't believe at this point anyone can say for sure what broke exactly, but my tentative analysis is this (please take the following with a grain of salt, I'm not on either the Rust team or the LLVM team, all I can do is tinker with the compiler and stare at the LLVM IR):

1. #67174 removed an overflow check in Layout::repeat() from the standard library, which eventually led to this memory unsafety. Mathematically, using unchecked addition here should be perfectly safe -- the comment in this function (and also in Layout::pad_to_align()) explains why;
2. the test program uses Vec, which implicitly uses Vec::reserve_internal(), which in turn calls Layout::repeat();
3. Layout::repeat() is marked as #[inline], and apparently the only relevant difference between GOOD and BAD is whether this function is inlined into do_test() or not. E.g., restoring the overflow checks prohibits inlining and fixes the problem; removing the #[inline] attribute causes the same effect; disabling LTO disables inlining for library functions and again fixes the problem.

If this is true (again, I'm not 100% sure about any of the above), this means some rogue LLVM pass or a combination of passes misoptimizes the IR after inlining. That's what I'm currently trying to investigate, but sadly it's not easy (at least to a non-LLVM-ICE-breaker like me) because the IR diffs between GOOD and BAD are moderately large. It does look like the bad version omits panic_bounds_check, but I'm still not exactly sure why.
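(To make the pattern concrete, here is my own simplified stand-in for the kind of size arithmetic Layout::repeat() performs; it is not the actual libcore source. The real code justifies skipping an overflow check via an invariant (layout sizes are capped), which is what gave LLVM the extra no-overflow fact; the checked variants are shown here.)

```rust
// Hypothetical simplification of Layout::repeat()-style size math:
// pad the element size to its alignment, then multiply by the count.
// The real libcore code argues the padding addition cannot overflow
// and uses an unchecked operation; checked forms are used here.
fn repeat_alloc_size(size: usize, align: usize, n: usize) -> Option<usize> {
    // Padding needed to round `size` up to a multiple of `align`
    // (align is assumed to be a power of two).
    let pad = size.wrapping_neg() & (align - 1);
    let padded = size.checked_add(pad)?;
    padded.checked_mul(n)
}

fn main() {
    // 3 bytes padded to 4-byte alignment, repeated 5 times -> 20 bytes.
    assert_eq!(repeat_alloc_size(3, 4, 5), Some(20));
    // Overflow is reported instead of silently wrapping.
    assert_eq!(repeat_alloc_size(usize::MAX, 1, 2), None);
    println!("ok");
}
```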
Also, inspired by @tmiasko's comment, I tried to compile rustc with different LLVM versions, and it seems like LLVM 10rc1 fixes the problem (the latest faulty LLVM version I tried is 9.0.1). It doesn't look like the Jump Threading pass is to blame, though, because I don't see any relevant commits between 9.0.1 and 10rc1.
If reverting the Layout::repeat() change only hides a symptom of one specific occurrence of an unrelated bug, is reverting really the right thing to do?
If reverting the Layout::repeat() change only hides a symptom of one specific occurrence of an unrelated bug, is reverting really the right thing to do?
I think it may be OK if:
If these hold then I think I'd revert the change, ship a minor release to unblock users (things worked fine without the change even though the bug was still there), and then focus on the actual bug.
In another compiler I remember actually doing this. We reverted a change in a release branch, but not on master (which is not a good practice; it caused problems later on), and shipped a new minor release. Then we fixed the actual bug.
In any case, as long as the bug will be prioritized and fixed, and the commit that makes the bug much easier to trigger is not a bug fix itself, I don't see any problems with reverting it for now.
So a question is: is this bug really easy to trigger? So far we've had one report, with a somewhat involved test case where trying to minimize further (such as unrolling a seemingly trivial for _ in 0..1 loop) fails to reproduce.
I don't think the bug is easy to trigger, it looks like I just got particularly unlucky.
Anyway, I really appreciate the trouble @shahn went through to revert the Layout::new() change, but IMO reverting it is not the right thing to do in this case. My reasoning (in addition to what @SimonSapin said):

- the unchecked addition in Layout::repeat() allows LLVM to inline Vec::reserve() in release builds. It might give a nice performance boost in some cases (though this should be measured, of course);
- another place in libcore/alloc.rs (Layout::pad_to_align()) uses the same pattern of unchecked addition with exactly the same comment explaining what makes it possible. Restoring the overflow checks in Layout::repeat() but not in Layout::pad_to_align() seems really weird to me.

Maybe locally throwing in a release-included defensive assertion of the invariant as a pre-condition, so that it panics with specific details to hunt down this specific edge case, or some debugger-fu? I would bet it's an unchecked calculation getting through under certain conditions.
Then, when it's tracked down (somewhere in LLVM, as I just learned, ty @dfyz), a regression test case would be awesome so that it doesn't happen again. 🙏
LLVM IR reproducer: https://gist.github.com/comex/881074b1bcc545e299e65527c719eef4
Run opt bconfused.ll -scalar-evolution -loop-idiom -scalar-evolution -indvars -S -O3 -o - | grep xprint. If the inside of the parentheses is i64 -1, the buggy optimization happened. If it's not... it might not have, but it's hard to be sure.
It seems to come from LLVM incorrectly adding nuw to add nuw i64 %x, -1 as part of the Induction Variable Simplification pass. x is the argument to the function, and nuw means "no unsigned wrap", so this effectively asserts that the argument is 0, at a point in the function where it's not guaranteed to be.
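(To spell that out with my own illustration: add i64 %x, -1 is, in unsigned terms, x + (2^64 - 1), and that addition avoids unsigned wrap-around only when x == 0, so a blanket nuw flag amounts to asserting the argument is always zero.)

```rust
fn main() {
    // `add i64 %x, -1` viewed as unsigned arithmetic is x + (2^64 - 1).
    // overflowing_add reports whether the unsigned addition wrapped;
    // it only stays in range (no wrap) for x == 0.
    for x in [0u64, 1, 2] {
        let (result, wrapped) = x.overflowing_add(u64::MAX);
        println!("x = {}: result = {}, wrapped = {}", x, result, wrapped);
    }
}
```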
Bisecting (edit: from LLVM 9 to LLVM 10, which @tmiasko said wasn't affected) produces this commit:
commit 58e8c793d0e43150a6452e971a32d7407a8a7401
Author: Tim Northover <tnorthover@apple.com>
Date: Mon Sep 30 07:46:52 2019 +0000
Revert "[SCEV] add no wrap flag for SCEVAddExpr."
This reverts r366419 because the analysis performed is within the context of
the loop and it's only valid to add wrapping flags to "global" expressions if
they're always correct.
llvm-svn: 373184
Looks promising, as r366419 (the commit that the above commit reverts) is included in the LLVM 9.0 branch Rust uses.
T-compiler triage: P-medium, based on following summary of situation:
pnkfelix: It seems like the remaining work items for #69225 are 1. fix LLVM (either by cherry-picking their 58e8c793d0e43150a6452e971a32d7407a8a7401 or by upgrading to LLVM 10) and then 2. readd PR #67174. pnkfelix: however neither of these strike me as high-priority items. pnkfelix: At least, this LLVM bug does not seem any better or worse than other LLVM codegen bugs. Which I guess is what @simulacrum just said.
Update: upgrade to LLVM 10 is being attempted in PR #67759
Update 2: It may be unwise to blindly cherry-pick their revert commit, since we presumably cherry-picked the original for some reason, and thus the revert could have unintended downstream effects. At the very least, we should not attempt it without understanding the consequences (and, given the effort to upgrade to LLVM 10, we probably should not attempt to cherry pick the revert at all, as it would be largely wasted effort...)
Was the original commit cherry-picked? At least from @comex's comment that is not clear to me ("is included in the LLVM 9.0 branch Rust uses" could also mean it's just part of LLVM 9.0).
The commit in question is a very local and small change that adds a single parameter to a function call and literally says it is safe [in this case] to add SCEV::FlagNSW (and judging from the code, the new parameter can also be SCEV::FlagNUW), so I think it's highly likely this is exactly what causes the misoptimization. I can confirm that removing this parameter (i.e., changing (void)getAddRecExpr(getAddExpr(StartVal, Accum, Flags), Accum, L, Flags); to (void)getAddRecExpr(getAddExpr(StartVal, Accum), Accum, L, Flags);) fixes the problem.
Moreover, this problematic commit was not cherry-picked. It's just bad luck — looks like the revert happened after 9.0.0 was created, so upstream 9.0.0 still has the offending parameter. The revert wasn't backported to 9.0.1 either, for some reason. 10.0.0-rc1 and later versions have the revert.
Here's a comment that explains why it is not, in fact, safe to add nsw or nuw here. It's probably a good idea to talk to an LLVM developer about this, but I think cherry-picking the revert will fix this issue and won't have any unintended effects at all, since it is so small and self-contained.
P.S. Huge kudos to @comex for root-causing this. Fantastic job.
FWIW I can confirm that https://github.com/llvm/llvm-project/commit/58e8c793d0e43150a6452e971a32d7407a8a7401 is safe to cherry-pick, it's a conservative change. See also https://lists.llvm.org/pipermail/llvm-dev/2019-September/135195.html if you're interested in more context regarding what the issue with SCEV nowrap flags is.
I think I just found a way to reproduce the issue even after reverting #67174. Here's a slightly longer, but still safe program that reliably segfaults on Windows, Linux and macOS using the latest nightly with #67174 reverted:
fn do_test(x: usize) {
    let mut arr = vec![vec![0u8; 3]];
    let mut z = vec![0];
    for arr_ref in arr.iter_mut() {
        for y in 0..x {
            for _ in 0..1 {
                z.reserve_exact(x);
                let iterator = std::iter::repeat(0).take(x);
                let mut cnt = 0;
                iterator.for_each(|_| {
                    z[0] = 0;
                    cnt += 1;
                });
                let a = y * x;
                let b = (y + 1) * x - 1;
                let slice = &mut arr_ref[a..b];
                slice[1 << 24] += 1;
            }
        }
    }
}

fn main() {
    do_test(1);
    do_test(2);
}
Windows:
PS> rustup run nightly rustc --version
rustc 1.43.0-nightly (6d0e58bff 2020-02-23)
PS> rustup run nightly cargo run --release
Finished release [optimized] target(s) in 0.01s
Running `target\release\rust-segfault.exe`
error: process didn't exit successfully: `target\release\rust-segfault.exe` (exit code: 0xc0000005, STATUS_ACCESS_VIOLATION)
Linux:
$ rustup run nightly rustc --version
rustc 1.43.0-nightly (6d0e58bff 2020-02-23)
$ rustup run nightly cargo run --release
Finished release [optimized] target(s) in 1.13s
Running `target/release/rust-segfault`
Segmentation fault (core dumped)
macOS:
λ rustup run nightly rustc --version
rustc 1.43.0-nightly (6d0e58bff 2020-02-23)
λ rustup run nightly cargo run --release
Finished release [optimized] target(s) in 0.01s
Running `target/release/rust-segfault`
[1] 24331 segmentation fault rustup run nightly cargo run --release
This program doesn't depend on the number of codegen units, so it segfaults in the Playground, too (on stable, beta, and nightly). I also reproduced this by compiling rustc from master (with #67174 reverted) linked against LLVM 9.
The underlying LLVM bug is still the same, so upgrading to LLVM 10 or cherry-picking the LLVM fix makes the segfault go away.
I really wish I understood better what's going on. It does look like the bounds checks are elided because of the extra nuw, which comes from incorrectly cached SCEV values (just like in the C program from the thread @nikic linked to). But by the time the bad optimization happens, I can barely recognize my simple program through the layers of LLVM basic blocks; worse, any seemingly no-op change in the source code (e.g., removing the cnt variable) leads to very different-looking LLVM IR and makes the problem disappear.
My impression is that 1.41.1 has just been finalized in #69359 (bad timing on my part), so there is not much that can be done at this point. Is it at least a good idea to update the comment in Layout::repeat() with a more detailed explanation of the LLVM issue? If so, I can send a PR.
My impression is that 1.41.1 has just been finalized in #69359 (bad timing on my part), so there is not much that can be done at this point.
If the patch we included in 1.41.1 doesn't actually fix the problem, we should reconsider whether we want to backport the new fix and rebuild the release. There was consensus in the release team meeting not to backport the LLVM fix, but I personally think this new PoC could warrant another discussion on the topic.
cc @Mark-Simulacrum @rust-lang/release
@dfyz we'll try to get another build of 1.41.1 with the LLVM fix backported, while we wait for consensus on actually shipping that.
FWIW, for me the new reproducer works as expected (index out of bounds) on stable 1.38.0 and earlier, but segfaults on 1.39.0 and later. There's not a whole lot of difference in LLVM between 1.38 and 1.39 (https://github.com/rust-lang/llvm-project/compare/71fe7ec06b85f612fc0e4eb4134c7a7d0f23fac5...8adf9bdccfefb8d03f0e8db3b012fb41da1580a4), but it could be any sort of Rust-generated difference along the way too.
the new reproducer works as expected (index out of bounds) on stable 1.38.0
I (accidentally) found out that setting -C codegen-units=1 on 1.38.0 reproduces the segfault. 1.37.0 seems safe to me (no combination of options I tried produces a segfault).
Disregard that, 1.37.0 uses LLVM 8.
Curiously, the LLVM IR diff between 1.37.0 and 1.38.0 (with -C codegen-units=1) is just one line:

- %71 = icmp eq {}* %70, null
+ %71 = icmp ule {}* %70, null

(where %70 is derived from the result of <core::slice::IterMut<T> as core::iter::traits::iterator::Iterator>::next())

This alone is enough to trick LLVM into adding the dreaded nuw to add nuw i64 %x, -1.
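(My own note on why that one-line rewrite is legal in isolation: comparing an unsigned value, or a pointer, against null/zero with "unsigned less-or-equal" is logically the same predicate as equality, since nothing unsigned can be below zero. The rewrite is therefore sound on its own; it merely steers later analyses down a different path.)

```rust
fn main() {
    // For unsigned x, `x <= 0` holds exactly when `x == 0`, which is
    // why `icmp eq %p, null` and `icmp ule %p, null` compute the same
    // result, yet can still lead the optimizer to different IR.
    for x in [0usize, 1, usize::MAX] {
        assert_eq!(x == 0, x <= 0);
    }
    println!("eq and ule against zero agree");
}
```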
1.37.0 seems safe to me (no combination of options I tried produces a segfault).
That's using LLVM 8, so the blamed SCEV change shouldn't exist at all.
That's using LLVM 8
My bad, sorry for the confusion (I was so happy to reduce it to a one-line diff I didn't even bother checking the LLVM version).
We prepared new 1.41.1 artifacts with the LLVM fix cherry-picked in it. You can test them locally with:
RUSTUP_DIST_SERVER=https://dev-static.rust-lang.org rustup update stable
ping in https://github.com/rust-lang/rust/issues/69225#issuecomment-586941455
[triagebot] The issue was successfully resolved without any involvement from the pinged compiler team. Bretty good.
1.41.1 is out, I guess it's time to finally close this issue.
I have a small program (a simplification of a test function from a larger project) that slices a small array and tries to access an out-of-bounds element of the slice. Running it with cargo run --release using the stable 1.41.0 release prints something like this (tested on macOS 10.15 and Ubuntu 19.10):

It looks like the resulting slice somehow has length 2**64 - 1, so the bounds checking is omitted, which predictably results in a segfault. On 1.39.0 and 1.40.0 the very same program prints what I would expect:

The problem goes away if I do any of the following:

- remove one of the do_test(...); calls in main();
- remove the for _ in 0..1 { loop;
- replace the for y in 0..x { loop with for y in 0..1 {;
- remove the z.extend(std::iter::repeat(0).take(x)); line, or replace it with z.extend(std::iter::repeat(0).take(1));
- replace the for arr_ref in arr { loop with let arr_ref = &arr[0];
- compile with RUSTFLAGS="-C opt-level=2";
- compile with RUSTFLAGS="-C codegen-units=1".

My best guess is that -C opt-level=3 enables a problematic optimization pass in LLVM, which results in miscompilation. This is corroborated by the fact that the MIR (--emit mir) and the LLVM IR before optimizations (--emit llvm-ir -C no-prepopulate-passes) are the same for both -C opt-level=2 and -C opt-level=3.

Some additional info that might be helpful:

- the problem doesn't reproduce on the Playground (which presumably uses codegen-units = 1);
- the problem is specific to the 1.41.0 release (no idea what makes it different);
- cargo-bisect-rustc says the regression first happened in the 2019-12-12 nightly, specifically in this commit. This seems suspicious to me, given that 1.40.0, which does not exhibit the problem, was released after this date.

I'm attaching the program inline in case the GitHub repo doesn't work (if you want to compile it without Cargo, use rustc -C opt-level=3 main.rs):