CodeGen crash with "Queue is empty!" on lshr i256 instruction

chfast commented 9 years ago


Bugzilla Link	22304
Resolution	FIXED
Resolved on	Nov 11, 2015 17:33
Version	trunk
OS	All
Attachments	CodeGen test
CC	@atrick,@chfast,@qcolombet,@rotateright

Extended Description

Assertion failed: !Queue.empty() && "Queue is empty!", file llvm\lib\CodeGen\SelectionDAG\ScheduleDAGRRList.cpp, line 1701 Stack dump:

Program arguments: h:\Sources\libs\build\llvm-3.5.0-x64\Debug\bin\llc.exe bugpoint-reduced-simplified.ll
Running pass 'Function Pass Manager' on module 'bugpoint-reduced-simplified.ll'.
Running pass 'X86 DAG->DAG Instruction Selection' on function '@lshr_codegen_bug_queue_is_empty'

It happens on Windows and Linux, LLVM 3.5.0. I will check current trunk.

I used bugpoint to reduce the test. I'm pretty sure the point of failure is lshr i256 instruction because I reimplemented my algorithm without it and it worked.

rotateright commented 9 years ago

The fix here probably also resolves the question I had a while back in: bug 19594, comment 3

atrick commented 9 years ago

Author: Andrew Trick atrick@apple.com Date: Thu Mar 26 20:44:13 2015

Fix a bug in SelectionDAG scheduling backtracking code: llvm/llvm-project#22678 .

It can happen (by line CurSU->isPending = true; // This SU is not in
AvailableQueue right now.) that a SUnit is mark as available but is
not in the AvailableQueue. For SUnit being selected for scheduling
both conditions must be met.

This patch mainly defensively protects from invalid removing a node
from a queue. Sometimes nodes are marked isAvailable but are not in
the queue because they have been defered due to some hazard.

Patch by Pawel Bylica!

git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@233351 91177308-0d34-0410-b5e6-96231b3b80d8

chfast commented 9 years ago

Thanks for suggestions.

The following additional check fixes the problem: http://reviews.llvm.org/D8556

atrick commented 9 years ago

When the initial DAG is printed you see all the obvious dependencies between instructions. However, instructions that read/write EFLAGS don't have explicit dependencies. As the scheduler chooses instructions it tries to sort out the EFLAGS reader/writers. At that time, it may add new edges. To see the new edges that are added, you'll need to trace the scheduler, I think -debug-only=pre-RA-sched will do it.

qcolombet commented 9 years ago

Created attachment 14087 [details] CodeGen test 4

Another test, the smallest so far: has only 16 SUnits.

I will try to check that in debugger, but I don't know what the code does. Can anyone point me to some explanation that is this scheduling about?

This scheduling translates the DAG into a sequential order to be able to generate the MachineBasicBlock.

chfast commented 9 years ago

CodeGen test 4 Another test, the smallest so far: has only 16 SUnits.

I will try to check that in debugger, but I don't know what the code does. Can anyone point me to some explanation that is this scheduling about?

chfast commented 9 years ago

You were right, Andrew. Without backtracking tests compile. I also ran regression tests to check if nothing was broken. Everything seems to be fine. I will try to investigate the backtracking itself more.

atrick commented 9 years ago

I've seen problems like this before with larger-than legal types. The SelectionDAG scheduler has always had a very hard time with instructions that write EFLAGS and have multiple users. This comes up when operations on large types are split into smaller operations with carry-flag, like ADC.

The SelectionDAG scheduler has logic for cloning instructions to back itself out of an unschedulable case. See CopyAndMoveSuccessors. But before cloning, it tries to backtrack and add dependence edges in a way that the EFLAGS conflict won't occur. See DEBUG output "ARTIFICIAL edge from SU..."

The scheduler should check WillCreateCycle to avoid creating an unschedulable graph, but it may be that the book-keeping for marking dependence edges as scheduled is missing a case somewhere.

Forcing it to call into CopyAndMoveSuccessrs might be a way to brute force out of this situation. It would be interesting to know if that solves it.

Otherwise you need to debug the trace from -debug-only=pre-RA-sched and keep track of which dependence edges have been added and what nodes should be "ready".

I hope that helps a little.

qcolombet commented 9 years ago

I had a quick look at the produced DAG and my guess is that the scheduler failed to schedule what touch the EFLAGS. Indeed, a lot of instructions are writing to this with a lot of glue operators.

Andy knows this code much better than I do, so he will have a better diagnostic.

chfast commented 9 years ago

Both isel and sched dags look like acyclic. I haven't found any edge going down.

qcolombet commented 9 years ago

How can I check if it's cyclic-free? Can that be seen on isel graph?

You should be able to use -view-sched-dags (or something like that).

chfast commented 9 years ago

How can I check if it's cyclic-free? Can that be seen on isel graph?

qcolombet commented 9 years ago

CC'ed Andy since he is the scheduler expert.

I haven't looked at the problem myself yet, but usually when we fail to schedule the dag, it is because we have to many constraints and particular ones that cannot be satisfied.

Pavel, are your DAGs cycles-free?

chfast commented 9 years ago

CodeGen test 3 Yet another test. Very similar to test 2, but produces different assert failure.

chfast commented 9 years ago

CodeGen test 2 I've found a different but smaller test. This time it's around ctlz.i256.

chfast commented 9 years ago

I attached a debug log. Quentin, can you check if there is anything suspicious in the log?

chfast commented 9 years ago

pre-RA-sched log and assert callstack

chfast commented 9 years ago

To get specific logs you can use -debug-only=pre-RA-sched

rotateright commented 9 years ago

I've spent some time in the debugger to trace down the problem. The failure happens in "register scheduling" what is black magic to me. The problem might be register related because if one more instruction from test is eliminated llc does the right work.

We know that IR is valid. Can we get any next lower level of abstraction that I could check instruction by instruction?

Ah - you've already made some progress. Great!

The machine level is also black magic to me. :)

So yes, it looks like legalization has already happened successfully, and we're dying very near the end of compilation while optimizing the machine instructions.

-debug shows: *** Unscheduling [65]: SU(24): 0x7f8ae4803240: i64,i32 = ADC64ri8 0x7f8ae4055350, 0x7f8ae48054e0, 0x7f8ae48039b0:1 [ORD=2] [ID=24]

I don't have much experience here other than to look closely at the -debug output from llc and then step through in the debugger. If you have questions, you may try to contact Quentin Colombet on IRC or the mailing lists. He's listed as the code owner of the register allocators.

chfast commented 9 years ago

Hi Sanjay,

I've spent some time in the debugger to trace down the problem. The failure happens in "register scheduling" what is black magic to me. The problem might be register related because if one more instruction from test is eliminated llc does the right work.

We know that IR is valid. Can we get any next lower level of abstraction that I could check instruction by instruction?

I can also try to trace the type legalization if you think the problem is there.

rotateright commented 9 years ago

I would like to fix some bugs, but I'm not sure what the best workflow for that would be. I started by sending some tests to phabricator.

Hi Paweł -

Thanks for reducing the test case. I haven't look at the details of IR legalization, but if you want to fix the bug, I think you would look at LegalizeIntegerTypes.cpp: http://reviews.llvm.org/diffusion/L/browse/llvm/trunk/lib/CodeGen/SelectionDAG/LegalizeIntegerTypes.cpp

chfast commented 9 years ago

I reduced the test manually

chfast commented 9 years ago

CodeGen test

chfast commented 9 years ago

I use a lot of i256 arithmetic in my library and it is not so bad. I needed to implement mul & div myself and avoid shift left but in general it works.

I would like to fix some bugs, but I'm not sure what the best workflow for that would be. I started by sending some tests to phabricator.

rotateright commented 9 years ago

FWIW, I also asked about this on the dev list recently: http://lists.cs.uiuc.edu/pipermail/llvmdev/2015-January/080875.html

rotateright commented 9 years ago

Bug 21184 is another case of a larger-than-legal shift going wrong.

Both of these bugs appear to be manifestations of a bigger problem: LLVM doesn't know how to deal with jumbo types, but the documentation doesn't admit it. This is discussed in bug 19797.

chfast commented 9 years ago

Confirmed with LLVM 3.7 SVN r226904.

chfast commented 9 years ago

assigned to @chfast

llvm / llvm-project

CodeGen crash with "Queue is empty!" on lshr i256 instruction #22678

Extended Description