riscvarchive / riscv-gcc

GNU General Public License v2.0

Why is there no nop in the assembly code when the compiler uses -O2? #145

Open LiHao217 opened 5 years ago

LiHao217 commented 5 years ago

Hello, I'd like to ask you a question about gcc. When I compile a program with -O0, the generated assembly code contains nop instructions, but there are none when I use -O2. I found that -O2 enables -fschedule-insns and -fschedule-insns2 for instruction scheduling. I used -fsched-verbose to print the scheduling information for this program and found places like this:

(Screenshot of the -fsched-verbose scheduling output; image not preserved.)

Instructions 2 and 4 need to wait one cycle. I think a nop should be added to the assembly code here, because if my processor is single-issue, I would expect an error without this nop.

When compiling with -O0, instructions are not scheduled, so there are nops. Are the generated nops deleted after the instructions are scheduled? Since there is also a wait cycle here, I expected a nop here too.

I can't get scheduling information when compiling with -O0, so I don't know how the nops at -O0 are generated.

Could you explain what is going on here? Thank you very much for your answer. @jim-wilson

jim-wilson commented 5 years ago

Most hardware implementations have interlocks. If an instruction needs data that isn't available yet, then the pipeline stalls until the data is available. Or if this is an out-of-order execution pipeline, that one instruction stalls until the data is available, and other instructions may continue executing.

The nops emitted by gcc at -O0 are mostly a historical artifact of the implementation. These nops are not emitted at -O1 or higher. The nops may be emitted if we have a source line that no code needs to be emitted for. We emit a nop so that we have some place to attach the line number info with -g. The nops may also be emitted if we have a node in the CFG that has no code in it. The nop is emitted so that simplifying the CFG won't cause that basic block to accidentally disappear. This stems from the design decision that the code emitted at -O0 should be simple, quick and easy to generate, and easy to debug, and that adding -g should never change the code emitted by gcc.

When optimizing, we are willing to use more time and memory to get a better answer, so there are better ways to solve these problems, and hence gcc normally has no need to emit nops when optimizing.

If a target does not have hardware interlocks, then it is the job of the target backend to add any nops that may be necessary for correct operation on the target. No RISC-V target currently supported by gcc requires this.

There is no instruction scheduling at -O0.

LiHao217 commented 5 years ago

I don't think my machine has hardware interlocks, so I need to add some nops for it. Where should I insert them? When gcc performs instruction scheduling and I find an idle cycle, do I add the corresponding nop to the RTL? Will the RTL then produce that nop when it is converted to assembly code? I found that in gcc the final function traverses all the RTL to produce the assembly code, and final calls get_insns () to get each RTL instruction, but I cannot find where the instructions are stored. Where is the final RTL code saved? Is my understanding correct? My English is a bit poor; I hope you can understand me. @jim-wilson

jim-wilson commented 5 years ago

Machines without interlocks, pipeline forwarding, or other tricks to avoid hazards are not very common anymore, unless maybe this is your own design, and your first CPU design.

If only a few instructions lack interlocks, like multiply/divide, then the easy solution is to add nops to those instruction patterns in the gcc/config/riscv/riscv.md file. You will also need to change the instruction length to be correct, counting the nops you added.

If a lot of instructions lack interlocks, or if you want better optimization, then you probably need to add a machine dependent optimization pass to go through the entire instruction stream, look at each instruction, and figure out where to emit nops, and how many nops in each location. This can get pretty complicated, and requires a fair amount of gcc internals knowledge to implement. You can find an example in the mips port, as some old MIPS parts lack hardware interlocks for some instructions. In gcc/config/mips/mips.c, the mips_avoid_hazard function computes how many nops to emit after an instruction. This is ultimately called from the mips_machine_reorg2 machine dependent optimization pass. This uses the hazard define_attr in mips.md to specify which instructions have hazards that require nops, and what kind of hazard they generate. See also mips_adjust_insn_length, which is used to compute the instruction length plus the hazard avoiding nops. There may also be other things scattered around the MIPS port required to make this work.

There are also a few other ports that do this. You could try looking for other ports with "reorg" functions in the port .c file that can add nops, if you want other ideas on how to do this.

yxj1 commented 5 years ago

I could write code that inserts a nop modeled on this one, but I don't know how to find the source code that generates this instruction.

(Screenshot not preserved.)

@jim-wilson

jim-wilson commented 5 years ago

If you want to do anything in this area, you will have to spend some time learning gcc internals, or hire someone who already knows how to do this kind of work. I gave pointers to code in the MIPS port that implements what you need.

This will emit a nop: emit_insn (gen_nop ()); but of course the hard part is writing the optimization pass that finds where to emit the nops you need.

Jim

yxj1 commented 5 years ago

Thank you very much for your help; I think I should be able to find a solution. @jim-wilson