Open RKSimon opened 3 years ago
Okay, we can certainly keep it open.
A few ideas
-Call InferInstructionFlags and VerifyInstructionFlags before GenerateVariants and ExpandHwModeBasedTypes and don't call GenerateVariants or ExpandHwModeBasedTypes for GenInstrInfo.inc. This assumes that GenInstrInfo.inc doesn't care about the expanded patterns and only cares about the mayLoad/mayStore/etc flags and that those flags don't depend on the expansion.
-Replace the std::map in InfoByHwMode by a "small" data structure of some sort. It might be possible to just maintain a sorted SmallVector
I spent a couple of days fooling around with the TableGen lexer trying to speed it up. No joy there. I could knock off 1–2% of the time, which was probably just noise.
I think we can close this ticket.
I don't agree - until we've identified the next things to work on to reduce tblgen runtimes, we shouldn't close this.
I spent a couple of days fooling around with the TableGen lexer trying to speed it up. No joy there. I could knock off 1–2% of the time, which was probably just noise.
I think we can close this ticket.
If it helps, I just looked into the fairly nebulous "Analyze DAG patterns" and found that the call to CodeGenDAGPatterns::GenerateVariants takes about 50% of the -gen-instr-info time. I don't know how easy that would be to optimize.
Has anyone any ideas on what to do next on this?
This is a very general ticket, so if we have more specific improvements in mind, I'm happy for this to be resolved and those improvements to be handled in their own tickets.
Thanks - I can confirm my build times have definitely improved!
Hang on, are those two sets of timings comparing apples with apples? If so, nice!
I'm going to spend some time in the next few weeks looking at TableGen parsing to see if there is anything that can be done relatively easily to speed it up. I'm not optimistic, but it's worth a good look.
Certainly. As before, just using -time-phases in a release build:
[12/17] Building RISCVGenGlobalISel.inc... ===-------------------------------------------------------------------------=== TableGen Phase Timing ===-------------------------------------------------------------------------=== Total Execution Time: 5.4829 seconds (5.7914 wall clock)
---User Time--- --System Time-- --User+System-- ---Wall Time--- --- Name --- 4.5548 ( 86.4%) 0.1697 ( 81.0%) 4.7245 ( 86.2%) 4.8007 ( 82.9%) Backend overall 0.7185 ( 13.6%) 0.0398 ( 19.0%) 0.7583 ( 13.8%) 0.9906 ( 17.1%) Parse, build records 0.0001 ( 0.0%) 0.0000 ( 0.0%) 0.0002 ( 0.0%) 0.0002 ( 0.0%) Write output 5.2734 (100.0%) 0.2095 (100.0%) 5.4829 (100.0%) 5.7914 (100.0%) Total
[11/46] Building RISCVGenInstrInfo.inc... ===-------------------------------------------------------------------------=== TableGen Phase Timing ===-------------------------------------------------------------------------=== Total Execution Time: 5.3306 seconds (5.7432 wall clock)
---User Time--- --System Time-- --User+System-- ---Wall Time--- --- Name --- 4.0939 ( 79.8%) 0.1364 ( 68.3%) 4.2302 ( 79.4%) 4.3308 ( 75.4%) Analyze DAG patterns 0.6909 ( 13.5%) 0.0633 ( 31.7%) 0.7542 ( 14.1%) 1.0660 ( 18.6%) Parse, build records 0.1299 ( 2.5%) 0.0000 ( 0.0%) 0.1299 ( 2.4%) 0.1300 ( 2.3%) Emit helper methods 0.1100 ( 2.1%) 0.0000 ( 0.0%) 0.1100 ( 2.1%) 0.1101 ( 1.9%) Emit map table 0.0437 ( 0.9%) 0.0000 ( 0.0%) 0.0437 ( 0.8%) 0.0437 ( 0.8%) Emit InstrDesc records 0.0258 ( 0.5%) 0.0000 ( 0.0%) 0.0258 ( 0.5%) 0.0258 ( 0.4%) Emit operand info 0.0117 ( 0.2%) 0.0000 ( 0.0%) 0.0117 ( 0.2%) 0.0117 ( 0.2%) Emit operand type mappings 0.0104 ( 0.2%) 0.0000 ( 0.0%) 0.0104 ( 0.2%) 0.0104 ( 0.2%) Emit instruction names 0.0035 ( 0.1%) 0.0000 ( 0.0%) 0.0035 ( 0.1%) 0.0035 ( 0.1%) Emit logical operand size mappings 0.0035 ( 0.1%) 0.0000 ( 0.0%) 0.0035 ( 0.1%) 0.0035 ( 0.1%) Emit uses/defs 0.0034 ( 0.1%) 0.0000 ( 0.0%) 0.0034 ( 0.1%) 0.0034 ( 0.1%) Emit logical operand type mappings 0.0033 ( 0.1%) 0.0000 ( 0.0%) 0.0033 ( 0.1%) 0.0033 ( 0.1%) Emit operand name mappings 0.0009 ( 0.0%) 0.0000 ( 0.0%) 0.0009 ( 0.0%) 0.0009 ( 0.0%) Write output 0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0000 ( 0.0%) 0.0000 ( 0.0%) Emit initialization routine 5.1310 (100.0%) 0.1997 (100.0%) 5.3306 (100.0%) 5.7432 (100.0%) Total
[13/16] Building RISCVGenDAGISel.inc... ===-------------------------------------------------------------------------=== TableGen Phase Timing ===-------------------------------------------------------------------------=== Total Execution Time: 6.6774 seconds (7.0148 wall clock)
---User Time--- --System Time-- --User+System-- ---Wall Time--- --- Name --- 3.6783 ( 56.4%) 0.1099 ( 71.8%) 3.7882 ( 56.7%) 3.8855 ( 55.4%) Parse patterns 1.2059 ( 18.5%) 0.0000 ( 0.0%) 1.2059 ( 18.1%) 1.2068 ( 17.2%) Convert to matchers 0.7158 ( 11.0%) 0.0365 ( 23.9%) 0.7524 ( 11.3%) 0.9905 ( 14.1%) Parse, build records 0.4000 ( 6.1%) 0.0000 ( 0.0%) 0.4000 ( 6.0%) 0.4006 ( 5.7%) Sort patterns 0.2748 ( 4.2%) 0.0000 ( 0.0%) 0.2748 ( 4.1%) 0.2752 ( 3.9%) Optimize matchers 0.2464 ( 3.8%) 0.0066 ( 4.3%) 0.2530 ( 3.8%) 0.2532 ( 3.6%) Emit matcher table 0.0031 ( 0.0%) 0.0000 ( 0.0%) 0.0031 ( 0.0%) 0.0031 ( 0.0%) Write output 6.5244 (100.0%) 0.1530 (100.0%) 6.6774 (100.0%) 7.0148 (100.0%) Total
One thing that may be worth mentioning now is that the various MC emitters (RISCVGenAsmWriter.inc, RISCVGenMCCodeEmitter.inc, RISCVGenDisassemblerTables.inc, RISCVGenAsmMatcher.inc) take 1.2 - 1.4 seconds each. Now that the other emitters have been reduced well, these are taking up a larger percentage of the time. It's 70-80% "Parse, build records" time for those.
Simon, are you able to tell how much things have improved? My debug build on my Linux server is currently about 29 seconds to generate RISCVGenInstrInfo.inc, but I think MSVC time might be higher?
Yeah, MSVC tends to be slower for debug builds.
I'm away from my main dev machine for the holiday weekend so haven't had chance to compare full builds yet.
Fraser - any chance you can repeat those timing reports against trunk latest please?
Simon, are you able to tell how much things have improved? My debug build on my Linux server is currently about 29 seconds to generate RISCVGenInstrInfo.inc, but I think MSVC time might be higher?
Summary of changes made so far -Expand variants before expanding HwMode. Since expand variants is quadratic in the number of patterns, it is better to do it before we double the patterns by expanding hw mode. -Patterns are stored in PatternToMatch now using ListInit * making them cheap to compare in GenerateVariants where we used to have a std::vector compare. -Removed BitVector based predicate filtering in GenerateVariants. This removed a bunch of heap overhead for storing a nxn matrix of bits where n is the number of patterns.
Minor things: -Replaced std::set or DenseSet with SmallSet or MachineValueTypeSet which should cut down on some heap allocation or hashing.
Patches to try to improve things https://reviews.llvm.org/D100691 https://reviews.llvm.org/D100695
I'm going to see if I can switch the predicates in PatternToMatch back to a simple to compare type like ListInit* which is what it was prior to the introduction of HwMode in 2017. That might allow us to remove the large std::vector of BitVector from GenerateVariants.
Or they are less helpful on RISCV than they were on X86.
There were two patches in 2018 that were intended to improve the performance of GenerateVariants https://reviews.llvm.org/D51035 and https://reviews.llvm.org/D51839 on debug builds on MSVC. They might have slowed down Release builds or non-MSVC builds. Need to do some more investigation.
GenerateVariants() is a phase of building a CodeGenDAGPatterns instance. Instances a built by DAGIselEmitter, FastIselEmitter, GlobalIselEmitter. InstrDocsEmitter, InstrInfoEmitter.
GenerateVariants() has one double-nested loop and one triple-nested loop.The first loop calls GenerateVariantsOf(), which is pretty hairy.
I don't know enough about variants to get a feeling for how this might be improved. I can spend some time studying it. Do you have a sense of where it could be improved?
I think a lot of time is spent in GenerateVariants. Do we need to do that for InstrInfo?
Yep, those are usually the three culprits. I should add more instrumentation to GlobalISel and InstrInfo so we can see which backend phases take the time. Then we can look at the longest phases and see if there is anything we can do.
In DagISel, the pattern parsing might be able to be improved.
Adding in some data I've generated from the RISC-V backend and intrinsics. The other tablegen targets take less than two second a piece.
[8/42] Building RISCVGenGlobalISel.inc... ===-------------------------------------------------------------------------=== TableGen Phase Timing ===-------------------------------------------------------------------------=== Total Execution Time: 15.9415 seconds (17.0387 wall clock)
---User Time--- --System Time-- --User+System-- ---Wall Time--- --- Name --- 14.8536 ( 95.4%) 0.3255 ( 86.4%) 15.1791 ( 95.2%) 15.4707 ( 90.8%) Backend overall 0.7109 ( 4.6%) 0.0510 ( 13.5%) 0.7619 ( 4.8%) 1.5641 ( 9.2%) Parse, build records 0.0005 ( 0.0%) 0.0000 ( 0.0%) 0.0005 ( 0.0%) 0.0040 ( 0.0%) Write output 15.5649 (100.0%) 0.3766 (100.0%) 15.9415 (100.0%) 17.0387 (100.0%) Total
[9/42] Building RISCVGenInstrInfo.inc... ===-------------------------------------------------------------------------=== TableGen Phase Timing ===-------------------------------------------------------------------------=== Total Execution Time: 16.0652 seconds (17.1984 wall clock)
---User Time--- --System Time-- --User+System-- ---Wall Time--- --- Name --- 14.4727 ( 91.9%) 0.2624 ( 81.2%) 14.7351 ( 91.7%) 14.9221 ( 86.8%) Analyze DAG patterns 0.7123 ( 4.5%) 0.0441 ( 13.7%) 0.7564 ( 4.7%) 1.7012 ( 9.9%) Parse, build records 0.1840 ( 1.2%) 0.0166 ( 5.1%) 0.2006 ( 1.2%) 0.2011 ( 1.2%) Emit helper methods 15.7420 (100.0%) 0.3232 (100.0%) 16.0652 (100.0%) 17.1984 (100.0%) Total
[2/5] Building RISCVGenDAGISel.inc... ===-------------------------------------------------------------------------=== TableGen Phase Timing ===-------------------------------------------------------------------------=== Total Execution Time: 18.6654 seconds (18.8064 wall clock)
---User Time--- --System Time-- --User+System-- ---Wall Time--- --- Name --- 13.9062 ( 76.2%) 0.3322 ( 81.5%) 14.2384 ( 76.3%) 14.2747 ( 75.9%) Parse patterns 2.2140 ( 12.1%) 0.0000 ( 0.0%) 2.2141 ( 11.9%) 2.2195 ( 11.8%) Convert to matchers 0.6775 ( 3.7%) 0.0520 ( 12.8%) 0.7295 ( 3.9%) 0.7349 ( 3.9%) Parse, build records 0.5580 ( 3.1%) 0.0000 ( 0.0%) 0.5580 ( 3.0%) 0.5593 ( 3.0%) Sort patterns 0.5381 ( 2.9%) 0.0000 ( 0.0%) 0.5381 ( 2.9%) 0.5394 ( 2.9%) Optimize matchers 0.3553 ( 1.9%) 0.0199 ( 4.9%) 0.3753 ( 2.0%) 0.3762 ( 2.0%) Emit matcher table 0.0085 ( 0.0%) 0.0034 ( 0.8%) 0.0120 ( 0.1%) 0.1024 ( 0.5%) Write output 18.2577 (100.0%) 0.4076 (100.0%) 18.6654 (100.0%) 18.8064 (100.0%) Total
No surprises there, I suppose.
Do you have some idea which backends are chewing up the most time?
No, I haven't done any profiling and I have no performance enhancements on my to-do list. Don't forget the llvm-tblgen -time-phases option, which can give you a quick idea where the time is being spent.
Thanks for the report Simon. Paul: I'm adding you to the CC on the off-chance you've done any profiling recently of tblgen or had any planned performance improvements in the works (just to avoid duplicated effort).
Extended Description
The RISC-V tblgen builds have been getting slower and slower, and now appears to be the slowest of the big three (AMDGPU, X86 and RISC-V), especially if I end up having to use debug builds.....
I realise slower builds is almost inevitable as RISC-V develops and adds features, but it'd be great if there was some low hanging cleanup that could be done to get this under control - whether its improvements to tblgen itself or the RISC-V td files get more efficient.