Closed chfast closed 9 years ago
The fix here probably also resolves the question I had a while back in: bug 19594, comment 3
Author: Andrew Trick atrick@apple.com Date: Thu Mar 26 20:44:13 2015
Fix a bug in SelectionDAG scheduling backtracking code: llvm/llvm-project#22678 .
It can happen (by line CurSU->isPending = true; // This SU is not in
AvailableQueue right now.) that a SUnit is marked as available but is
not in the AvailableQueue. For a SUnit to be selected for scheduling,
both conditions must be met.
This patch mainly defensively protects against invalidly removing a node
from a queue. Sometimes nodes are marked isAvailable but are not in
the queue because they have been deferred due to some hazard.
Patch by Pawel Bylica!
git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@233351 91177308-0d34-0410-b5e6-96231b3b80d8
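The guard described in the commit message can be illustrated with a toy model. This is only a sketch of the idea, not the code from the patch: ToyUnit, ToyAvailableQueue, defer and removeIfQueued are hypothetical names, and the real SUnit and AvailableQueue types in LLVM are considerably more involved.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical, simplified stand-in for a scheduling unit (not LLVM's SUnit).
struct ToyUnit {
  int id = 0;
  bool isAvailable = false; // all dependencies satisfied
  bool isPending = false;   // deferred (e.g. by a hazard), so not in the queue
};

// Hypothetical queue of units that can be picked right now.
struct ToyAvailableQueue {
  std::vector<ToyUnit *> units;

  void push(ToyUnit *u) {
    assert(u && "unit must not be null");
    u->isAvailable = true;
    u->isPending = false;
    units.push_back(u);
  }

  // Defer a unit: it stays marked available but leaves the queue.
  void defer(ToyUnit *u) {
    assert(u && "unit must not be null");
    u->isAvailable = true;
    u->isPending = true;
    units.erase(std::remove(units.begin(), units.end(), u), units.end());
  }

  // Defensive removal: touch the queue only if the unit is available,
  // not pending, and actually present. Returns true if it was removed.
  bool removeIfQueued(ToyUnit *u) {
    if (!u || !u->isAvailable || u->isPending)
      return false;
    auto it = std::find(units.begin(), units.end(), u);
    if (it == units.end())
      return false;
    units.erase(it);
    return true;
  }
};
```

In this model a deferred unit has isAvailable set but is absent from the queue, which is exactly the state in which an unguarded removal would go wrong.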
Thanks for suggestions.
The following additional check fixes the problem: http://reviews.llvm.org/D8556
When the initial DAG is printed you see all the obvious dependencies between instructions. However, instructions that read/write EFLAGS don't have explicit dependencies. As the scheduler chooses instructions it tries to sort out the EFLAGS readers/writers. At that time, it may add new edges. To see the new edges that are added, you'll need to trace the scheduler; I think -debug-only=pre-RA-sched will do it.
Created attachment 14087: CodeGen test 4
Another test, the smallest so far: has only 16 SUnits.
I will try to check that in the debugger, but I don't know what the code does. Can anyone point me to some explanation of what this scheduling is about?
This scheduling translates the DAG into a sequential order to be able to generate the MachineBasicBlock.
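As a rough illustration of "DAG to sequential order", here is a generic topological sort (Kahn's algorithm). It is a sketch only: the function name sequentialOrder and the integer node ids are assumptions, and it ignores everything that makes the real scheduler hard (register pressure, physical registers such as EFLAGS, backtracking). The scheduler in ScheduleDAGRRList.cpp also works bottom-up, which this top-down sketch does not try to mimic.

```cpp
#include <cassert>
#include <cstddef>
#include <queue>
#include <utility>
#include <vector>

// Computes one order of nodes 0..n-1 in which every edge (pred, succ)
// has pred placed before succ. Returns false if the graph has a cycle,
// in which case no such order exists.
bool sequentialOrder(int n, const std::vector<std::pair<int, int>> &edges,
                     std::vector<int> &order) {
  assert(n >= 0);
  std::vector<std::vector<int>> succs(n);
  std::vector<int> numPredsLeft(n, 0);
  for (const auto &e : edges) {
    succs[e.first].push_back(e.second);
    ++numPredsLeft[e.second];
  }

  // Nodes with no unscheduled predecessors are "ready".
  std::queue<int> ready;
  for (int i = 0; i < n; ++i)
    if (numPredsLeft[i] == 0)
      ready.push(i);

  order.clear();
  while (!ready.empty()) {
    int node = ready.front();
    ready.pop();
    order.push_back(node);
    // Scheduling a node may make its successors ready.
    for (int s : succs[node])
      if (--numPredsLeft[s] == 0)
        ready.push(s);
  }
  // If some node never became ready, the ready queue ran empty too early.
  return order.size() == static_cast<std::size_t>(n);
}
```

The failure case of this sketch, the ready queue running empty before all nodes are placed, is loosely analogous to the "Queue is empty!" assertion reported in this bug.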
You were right, Andrew. Without backtracking, the tests compile. I also ran the regression tests to check that nothing was broken. Everything seems to be fine. I will try to investigate the backtracking itself more.
I've seen problems like this before with larger-than-legal types. The SelectionDAG scheduler has always had a very hard time with instructions that write EFLAGS and have multiple users. This comes up when operations on large types are split into smaller operations with a carry flag, like ADC.
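To make the carry chain concrete, here is a sketch of a 256-bit addition split into four 64-bit limbs. The U256 alias and the function names are assumptions made for illustration; this is not what the legalizer emits, only the same arithmetic written by hand. Each limb after the first depends on the carry out of the previous one, which on x86-64 is the value that ADC reads from EFLAGS.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical 256-bit integer as four 64-bit limbs, limb 0 least significant.
using U256 = std::array<std::uint64_t, 4>;

U256 add256(const U256 &a, const U256 &b) {
  U256 result{};
  std::uint64_t carry = 0;
  for (std::size_t i = 0; i < 4; ++i) {
    std::uint64_t partial = a[i] + b[i];
    std::uint64_t carryFromSum = partial < a[i]; // a[i] + b[i] wrapped around
    result[i] = partial + carry;
    std::uint64_t carryFromCarry = result[i] < partial; // adding carry wrapped
    carry = carryFromSum | carryFromCarry; // at most one of the two can be set
  }
  return result; // the carry out of the top limb is discarded (wraps mod 2^256)
}

// Usage example: the carry ripples from limb 0 through limb 1 into limb 2.
inline void add256Example() {
  U256 a = {UINT64_MAX, UINT64_MAX, 0, 0};
  U256 b = {1, 0, 0, 0};
  U256 expected = {0, 0, 1, 0};
  assert(add256(a, b) == expected);
}
```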
The SelectionDAG scheduler has logic for cloning instructions to back itself out of an unschedulable case. See CopyAndMoveSuccessors. But before cloning, it tries to backtrack and add dependence edges in a way that the EFLAGS conflict won't occur. See DEBUG output "ARTIFICIAL edge from SU..."
The scheduler should check WillCreateCycle to avoid creating an unschedulable graph, but it may be that the book-keeping for marking dependence edges as scheduled is missing a case somewhere.
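The idea behind such a check can be sketched with a plain reachability test: adding an edge from -> to closes a cycle exactly when `to` can already reach `from`. The Graph alias and the function names below are assumptions; this is not LLVM's WillCreateCycle implementation, which works on the scheduler's own DAG data structures.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical adjacency list: succs[u] holds the nodes that must come after u.
using Graph = std::vector<std::vector<int>>;

// Iterative depth-first search: can `target` be reached from `start`?
bool isReachable(const Graph &succs, int start, int target) {
  assert(start >= 0 && static_cast<std::size_t>(start) < succs.size());
  std::vector<bool> visited(succs.size(), false);
  std::vector<int> stack{start};
  while (!stack.empty()) {
    int node = stack.back();
    stack.pop_back();
    if (node == target)
      return true;
    if (visited[node])
      continue;
    visited[node] = true;
    for (int s : succs[node])
      stack.push_back(s);
  }
  return false;
}

// Adding from -> to creates a cycle iff `to` already reaches `from`.
bool wouldCreateCycle(const Graph &succs, int from, int to) {
  return isReachable(succs, to, from);
}

// Adds the edge only when the graph stays acyclic; returns whether it was added.
bool addEdgeIfAcyclic(Graph &succs, int from, int to) {
  if (wouldCreateCycle(succs, from, to))
    return false;
  succs[from].push_back(to);
  return true;
}
```

A check like this is only as good as the edge set it sees, which matches the concern above: if the book-keeping misses an edge somewhere, the test can pass and the graph can still become unschedulable.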
Forcing it to call into CopyAndMoveSuccessors might be a way to brute force out of this situation. It would be interesting to know if that solves it.
Otherwise you need to debug the trace from -debug-only=pre-RA-sched and keep track of which dependence edges have been added and what nodes should be "ready".
I hope that helps a little.
I had a quick look at the produced DAG and my guess is that the scheduler failed to schedule the instructions that touch EFLAGS. Indeed, a lot of instructions write to it, with a lot of glue operators.
Andy knows this code much better than I do, so he will have a better diagnostic.
Both the isel and sched DAGs look acyclic. I haven't found any edge going down.
How can I check if it's cycle-free? Can that be seen on the isel graph?
You should be able to use -view-sched-dags (or something like that).
CC'ed Andy since he is the scheduler expert.
I haven't looked at the problem myself yet, but usually when we fail to schedule the DAG, it is because we have too many constraints, and in particular ones that cannot be satisfied.
Pavel, are your DAGs cycle-free?
CodeGen test 3 Yet another test. Very similar to test 2, but produces different assert failure.
CodeGen test 2 I've found a different but smaller test. This time it's around ctlz.i256.
I attached a debug log. Quentin, can you check if there is anything suspicious in the log?
To get specific logs you can use -debug-only=pre-RA-sched
Ah - you've already made some progress. Great!
The machine level is also black magic to me. :)
So yes, it looks like legalization has already happened successfully, and we're dying very near the end of compilation while optimizing the machine instructions.
-debug shows: *** Unscheduling [65]: SU(24): 0x7f8ae4803240: i64,i32 = ADC64ri8 0x7f8ae4055350, 0x7f8ae48054e0, 0x7f8ae48039b0:1 [ORD=2] [ID=24]
I don't have much experience here other than to look closely at the -debug output from llc and then step through in the debugger. If you have questions, you may try to contact Quentin Colombet on IRC or the mailing lists. He's listed as the code owner of the register allocators.
Hi Sanjay,
I've spent some time in the debugger tracing down the problem. The failure happens in "register scheduling", which is black magic to me. The problem might be register related, because if one more instruction is eliminated from the test, llc works correctly.
We know that IR is valid. Can we get any next lower level of abstraction that I could check instruction by instruction?
I can also try to trace the type legalization if you think the problem is there.
Hi Paweł -
Thanks for reducing the test case. I haven't looked at the details of IR legalization, but if you want to fix the bug, I think you would look at LegalizeIntegerTypes.cpp: http://reviews.llvm.org/diffusion/L/browse/llvm/trunk/lib/CodeGen/SelectionDAG/LegalizeIntegerTypes.cpp
I reduced the test manually
I use a lot of i256 arithmetic in my library and it is not so bad. I needed to implement mul & div myself and avoid shift left but in general it works.
I would like to fix some bugs, but I'm not sure what the best workflow for that would be. I started by sending some tests to phabricator.
FWIW, I also asked about this on the dev list recently: http://lists.cs.uiuc.edu/pipermail/llvmdev/2015-January/080875.html
Bug 21184 is another case of a larger-than-legal shift going wrong.
Both of these bugs appear to be manifestations of a bigger problem: LLVM doesn't know how to deal with jumbo types, but the documentation doesn't admit it. This is discussed in bug 19797.
Confirmed with LLVM 3.7 SVN r226904.
assigned to @chfast
Extended Description
Assertion failed: !Queue.empty() && "Queue is empty!", file llvm\lib\CodeGen\SelectionDAG\ScheduleDAGRRList.cpp, line 1701 Stack dump:
It happens on Windows and Linux, LLVM 3.5.0. I will check current trunk.
I used bugpoint to reduce the test. I'm pretty sure the point of failure is the lshr i256 instruction, because I reimplemented my algorithm without it and it worked.
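For reference, one way a 256-bit logical shift right can be expressed with only 64-bit operations is sketched below. This is not the code from the reduced test or from the library mentioned in this report; U256 and lshr256 are hypothetical names. Unlike LLVM IR, where a shift amount equal to or larger than the bit width yields a poison value, this sketch defines such shifts to return zero.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Hypothetical 256-bit integer as four 64-bit limbs, limb 0 least significant.
using U256 = std::array<std::uint64_t, 4>;

U256 lshr256(const U256 &value, unsigned shift) {
  U256 result{}; // zero-initialized, so vacated high limbs stay zero
  if (shift >= 256)
    return result; // assumption of this sketch: oversized shifts give zero

  const unsigned limbShift = shift / 64; // whole limbs to drop
  const unsigned bitShift = shift % 64;  // remaining shift inside a limb

  for (unsigned i = 0; i + limbShift < 4; ++i) {
    std::uint64_t low = value[i + limbShift] >> bitShift;
    std::uint64_t high = 0;
    // Pull in the bits that cross over from the next higher limb.
    // Guard bitShift != 0: shifting a 64-bit value by 64 is undefined in C++.
    if (bitShift != 0 && i + limbShift + 1 < 4)
      high = value[i + limbShift + 1] << (64 - bitShift);
    result[i] = low | high;
  }
  return result;
}

// Usage example: shifting the top bit down by 255 leaves the value 1.
inline void lshr256Example() {
  U256 top = {0, 0, 0, std::uint64_t{1} << 63};
  U256 expected = {1, 0, 0, 0};
  assert(lshr256(top, 255) == expected);
}
```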