openrisc / or1k_marocchino

OpenRISC processor IP core based on Tomasulo algorithm
Other
29 stars 11 forks source link

FPU exceptions set incorrect EPCR #5

Closed stffrdhrn closed 5 years ago

stffrdhrn commented 5 years ago

According to the spec 6.3 Exception Processing. For FPU exceptions the EPCR should be set to Address of next not executed instruction.

This is the same as Tick Timer, External Interrupt and Syscall.

It seems currently FPU exceptions are being set to Address of instruction that caused exception. This is causing tests to fail.

Test: https://github.com/openrisc/binutils-gdb/blob/orfpx64a32/sim/testsuite/sim/or1k/fpu64a32.S

fusesoc run --target marocchino_tb --tool icarus mor1kx-generic --elf-load ./fpu64a32.S.x --trace_enable --trace_to_screen

S 00000d38: 24000000 l.rfe                                          flag: 1
// Instruction that caused the failure.
S 00002370: c84c0013 lf.div.d  r2,r12,r0                            flag: 1
S 00000d00: 9c21ff80 l.addi  r1,r1,0xff80    r1         = 000033bc  flag: 1
S 00000d04: 9c21fffc l.addi  r1,r1,0xfffc    r1         = 000033b8  flag: 1
S 00000d08: d4011000 l.sw    0x0000(r1),r2   [000033b8] = 00000001  flag: 1
S 00000d0c: 9c21fffc l.addi  r1,r1,0xfffc    r1         = 000033b4  flag: 1
S 00000d10: d4011800 l.sw    0x0000(r1),r3   [000033b4] = 00002054  flag: 1
// EPCR was loaded with 0x2370
S 00000d14: b4400020 l.mfspr r2,r0,0x0020    r2         = 00002370  flag: 1
S 00000d18: 18601500 l.movhi r3,0x1500       r3         = 15000000  flag: 1
S 00000d1c: a8630000 l.ori   r3,r3,0x0000    r3         = 15000000  flag: 1
S 00000d20: d7e21ffc l.sw    0xfffc(r2),r3   [0000236c] = 15000000  flag: 1
S 00000d24: 84610000 l.lwz   r3,0x0000(r1)   r3         = 00002054  flag: 1
S 00000d28: 9c210004 l.addi  r1,r1,0x0004    r1         = 000033b8  flag: 1
S 00000d2c: 84410000 l.lwz   r2,0x0000(r1)   r2         = 00000001  flag: 1
S 00000d30: 9c210004 l.addi  r1,r1,0x0004    r1         = 000033bc  flag: 1
S 00000d34: 9c210080 l.addi  r1,r1,0x0080    r1         = 0000343c  flag: 1
S 00000d38: 24000000 l.rfe                                          flag: 1
S 00002370: c84c0013 lf.div.d  r2,r12,r0                            flag: 1
S 00000d00: 9c21ff80 l.addi  r1,r1,0xff80    r1         = 000033bc  flag: 1
S 00000d04: 9c21fffc l.addi  r1,r1,0xfffc    r1         = 000033b8  flag: 1
S 00000d08: d4011000 l.sw    0x0000(r1),r2   [000033b8] = 00000001  flag: 1
bandvig commented 5 years ago

@stffrdhrn

Yes, you are right. And the behavior is similar to one for processing of TT/PIC interrupts. All of them don't follow arch. manual. We discussed TT/PIC interrupts processing here: https://github.com/openrisc/mor1kx/issues/74#issuecomment-462209653

Actually the FP-behavior also doesn't meet IEEE-standard completely. IEEE-standard expects that FP result should be "returned" (written into RF for our case) even so FP instruction generates exception. Moreover there are two approaches how to compute the returned value for overflow/underflow cases.

Of course my implementation follows simpler variant (it computes +/- inf for overflow and +/- 0 for underflow). But if FP related exceptions are enabled, FPU doesn’t “return” value because any exception discards exactly the instruction caused exception and write back is blocked. On the one hand the approach helps to simplify design as any instruction’s exception and any TT/PIC interrupt processed in the same manner. On the other hand it doesn’t follow arch. manual and leads to side effects. In particular I had to modify TestFloat in the following manners:

Fairly speaking, I’m not sure what would be better. To implement behavior as required by arh. manual and IEEE-standard. Or to leave it as is and re-write at least arh. manual. The 1-st way looks more accurate but it is not so trivial because brings to pipe design exclusive approach for handling FP exceptions.

stffrdhrn commented 5 years ago

I wanted to write a longer reply on this but didn't get the time.

I agree the spec can be updated. But I read over it. Is there anything saying that upon fp exception that the results should NOT be committed?

I think what you are saying is that if we allow the fp to commit it's results before jumping to the exception handler this will follow ieee spec and follow proper update of epcr. I don't see that as a bug change to the openrisc spec, more just a clarification.

On Wed, Mar 20, 2019, 12:09 AM Andrey Bacherov notifications@github.com wrote:

@stffrdhrn https://github.com/stffrdhrn

Yes, you are right. And the behavior is similar to one for processing of TT/PIC interrupts. All of them don't follow arch. manual. We discussed TT/PIC interrupts processing here: openrisc/mor1kx#74 (comment) https://github.com/openrisc/mor1kx/issues/74#issuecomment-462209653

Actually the FP-behavior also doesn't meet IEEE-standard completely. IEEE-standard expects that FP result should be "returned" (written into RF for our case) even so FP instruction generates exception. Moreover there are two approaches how to compute the returned value for overflow/underflow cases.

-

For simpler case overflow should return signed infinity while underflow should return signed zero.

For complex approach FPU should return tricky re-scaled value.

Of course my implementation follows simpler variant (it computes +/- inf for overflow and +/- 0 for underflow). But if FP related exceptions are enabled, FPU doesn’t “return” value because any exception discards exactly the instruction caused exception and write back is blocked. On the one hand the approach helps to simplify design as any instruction’s exception and any TT/PIC interrupt processed in the same manner. On the other hand it doesn’t follow arch. manual and leads to side effects. In particular I had to modify TestFloat in the following manners:

-

I have to run each TestFloat’s verification case twice: one time to collect flags with enabled FP exceptions and another one to get FP result with disabled FP exceptions

TestFloat expects tricky re-scaled values for overflow/underflow cases. I had to re-define such behavior. Unfortunately, I didn’t mark such cases in TestFloat sources accurate enough. At least I see there is defined BLOCK_UNDERFLOW_FLAG in softfloat.h that blocks some of such cases.

Additionally, it looks like there is difference TestFloat from IEEE-standard in places how to generate NaNs. And there is workaround with global variable checkNaNs = FALSE (in testLoops.c).

Fairly speaking, I’m not sure what would be better. To implement behavior as required by arh. manual and IEEE-standard. Or to leave it as is and re-write at least arh. manual. The 1-st way looks more accurate but it is not so trivial because brings to pipe design exclusive approach for handling FP exceptions.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/openrisc/or1k_marocchino/issues/5#issuecomment-474414315, or mute the thread https://github.com/notifications/unsubscribe-auth/ACSvn3asYP8cV2z3gXlLvVPJYIm6mEp7ks5vYP3HgaJpZM4b6w1n .

bandvig commented 5 years ago

@stffrdhrn

Stafford, I’m going to try to make FP exceptions processing be closer to arh. manual and IEEE-standard. I hope it will take several days including TestFloat upgrade. I’ll create fp_exceptions branch as it will be done. Are you still focused on writing FP test cases? Perhaps, it would be better to postpone them for a while. Is it OK for you?

stffrdhrn commented 5 years ago

This is ok with me. I was mainly trying to analyze some simple FP tests written in C, to see if I can reproduce any of the issues you were seeing with testfloat in order to see where the issues where. I haven't made much progress. I don't expect the EPCR issue will affect what I am working on.

bandvig commented 5 years ago

@stffrdhrn Stafford, I’ve created fp_exceptions branch with modified behavior of FP exceptions. Now they should meet arch. manual and be closer to IEEE-standard. I verified it with appropriately modified version of TestFloat (now you can find it in https://github.com/bandvig/or1k_testfloat/tree/fp_exceptions) . If you would want to use it, pay attention to notes in Readme.md. I also successfully run other soft from my OR1K verification pool (U-boot, Whetstone compiled for Single and Double precision, CoreMark, Dhrystone, Linux). Source code also passed Verilator’s check for lint warnings.

stffrdhrn commented 5 years ago

On Sun, Mar 24, 2019 at 06:41:46AM -0700, Andrey Bacherov wrote:

@stffrdhrn Stafford, I’ve created fp_exceptions branch with modified behavior of FP exceptions. Now they should meet arch. manual and be closer to IEEE-standard. I verified it with appropriately modified version of TestFloat (now you can find it in https://github.com/bandvig/or1k_testfloat/tree/fp_exceptions) . If you would want to use it, pay attention to notes in Readme.md. I also successfully run other soft from my OR1K verification pool (U-boot, Whetstone compiled for Single and Double precision, CoreMark, Dhrystone, Linux). Source code also passed Verilator’s check for lint warnings.

OK, let me have a look.

Is there any change to the OpenRISC spec needed? Is it just as I suggested before? or is it a bigger change?

-Stafford

-- You are receiving this because you were mentioned. Reply to this email directly or view it on GitHub: https://github.com/openrisc/or1k_marocchino/issues/5#issuecomment-475960539

bandvig commented 5 years ago

@stffrdhrn

OK, let me have a look. Is there any change to the OpenRISC spec needed? Is it just as I suggested before? or is it a bigger change?

Not sure that I understand all of your questions clearly. Let me just give more details.

About TestFloat tricks I made and described above:

Potential bonus-track:

stffrdhrn commented 5 years ago

Hello,

This does answer my question. I think we need an update to the openrisc spec clarifying that results of fpu operations that cause expeptions sould be written back before the expeption is taken.

Also, I made one pull request to fix an issue with the rX + 2 logic. It's only for the higher registers. Also, for optimisation purposes we could do like is one in gcc and do register renaming instead of adjusting the store address.

For example.

r0-r15 store to index 0-15 r16-r30 even store to index 16-23 r17-r31 odd store to index 24-31.

16 18 20 22 24 26 28 30 16 17 18 19 20 21 22 23

17 19 21 23 25 27 29 31 24 25 26 27 28 29 30 31

I think, We just need to change the fetch rf decode stage (where I made the patch) to rename from instruction address to internal rf address. Then you can use the previous ram model.

Also, you mentioned you were seeing failures with the gcc9 toolchain. Can you clarify how to reproduce and what your saw? I have been testing with a few custom built c programs and assembly and things are looking ok. I must not be hitting certain edge cases.

On Mon, Mar 25, 2019, 5:21 PM Andrey Bacherov notifications@github.com wrote:

@stffrdhrn https://github.com/stffrdhrn

OK, let me have a look. Is there any change to the OpenRISC spec needed? Is it just as I suggested before? or is it a bigger change?

Not sure that I understand all of your questions clearly. Let me just give more details.

  • The fix addresses the issue you reported initially. Now if an FP exception occurs address of next not executed instruction is stored in EPCR. At the same time even so an FP instruction causes exception its result is stored in GPRs (that corresponds to IEEE-standard if I understand the standard correctly). Your tests shouldn't hangs up anymore. No need any change for OpenRISC spec because FP exceptions processing meets OpenRISC spec now.

About TestFloat tricks I made and described above:

-

Thanks to that TestFloat needn't run each instruction twice anymore. That eliminates first trick modification I made to run TestFloat successfully. The https://github.com/bandvig/or1k_testfloat/tree/fp_exceptions TestFloat version runs each FP-instruction one time to collect result an flags while FP exception handler is used just to check that FP exceptions could be processed some how. Just a counter implemented in FP exception handler. The counter value is printed at exit of TestFloat.

It doesn't fix issues related to generating NaNs or underflow flags. These SoftFloat-vs-FPU differences still exist. However I believe the issues fixing could be postponed while more important task is to complete support doubles in GCC9. And don't forget to set MAROCCHINO's parameter OPTION_ORFPX64A32_ABI to "GCC9".

Potential bonus-track:

  • Current implementation of TT/PIC-interrupts processing doesn't follow OpenRISC spec but follow CAPPUCINO's approach. Thanks to modifications made for FP-exceptions we potentially could make next step and implement TT/PIC-interrupts processing in according with OpenRISC spec.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/openrisc/or1k_marocchino/issues/5#issuecomment-476096505, or mute the thread https://github.com/notifications/unsubscribe-auth/ACSvnxE5AdYxhozLqlGx0o4YJtIincoHks5vaIbsgaJpZM4b6w1n .

bandvig commented 5 years ago

Hello, This does answer my question. I think we need an update to the openrisc spec clarifying that results of fpu operations that cause expeptions sould be written back before the expeption is taken.

Ah, understood. Also ORFPX64A32 description should be added. However, I couldn’t imagine yet how to correctly describe addresses of operands and destination with taking into account “remapping” for GPRs 16 … 31.

Also, you mentioned you were seeing failures with the gcc9 toolchain. Can you clarify how to reproduce and what your saw? I have been testing with a few custom built c programs and assembly and things are looking ok. I must not be hitting certain edge cases.

Yes, Double precision version of Whetstone hanged up if being compiled by GCC9 with -O2 option. But it was before your pull request #6 fpu: Properly support the +2 reg logic. With the modifications all work, so I merged your PR. However, I see that CI reports timeouts in or1k-tests again. It’s strange because first commit in fp_exceptions branch passed all tests and your modifications should not affects tests absolutely.

stffrdhrn commented 5 years ago

Let me see if I can make a patch for remapping.

It's good news that the whetstone issues are fixed. It means we are almost done. I'll look into the timeouts. None of the tests should use the reg +2 logic so it's strange.

stffrdhrn commented 5 years ago

The test is passing now. The Icarus source code failed to download. It's fixed now.

bandvig commented 5 years ago

Let me see if I can make a patch for remapping.

Wait a minute, Stafford. Which patch you are going to make? I meant that you PR #6 fixes issue with GCC9 ABI support in hardware, so I'm going to merge it into master branch.

The only thing I'm not sure is how it could be described in OpenRISC spec correctly. Something like this (for lf.add.d): ORFPX64A32-Implementation: {(rD+1+(rD>15))[31:0],rD[31:0]} <- {(rA+1+(rA>15))[31:0],rA[31:0]} + {(rB+1+(rB>15))[31:0],rB[31:0]} ? It looks really odd.

Or are you going to try to change GGC9 internals in a way which could restore back ability to use just {rX+2} in hardware?

stffrdhrn commented 5 years ago

Let me see if I can make a patch for remapping.

Wait a minute, Stafford. Which patch you are going to make? I meant that you PR #6 https://github.com/openrisc/or1k_marocchino/pull/6 fixes issue with GCC9 ABI support in hardware, so I'm going to merge it into master branch.

Yes, I agree what is there now is good and I have no plan to change the gcc9 abi.

But before you mentioned

Second. Initially, I implemented GPRs as set of four RAM-blocks to provide
{rX,rX+1} access to 64-bit operands for ORFPX64A32 extension. However, to
meet
GCC9 ABI ({rX,rX+2} access for 64-bit operands) I had to replace RAM-blocks
by
set of Flip-Flop based register. MAROCCHINO had become much larger and only
75
MHz CPU clock could be achieved now. The included 100 MHz Coremark results
were
measured on previous MAROCCHINO version (with RAM-based GPRs).

I was thinking, using a register renaming technique we can support both the gcc9 abi and still use a 4ram block solution running at 100mhz.

I just wanted to make a poc patch to show how it can be achieved. An extra improvement on the current core.

The only thing I'm not sure is how it could be described in OpenRISC spec

correctly. Something like this (for lf.add.d): ORFPX64A32-Implementation: {(rD + 1 +(rD > 15))[31:0],rD[31:0]} <- {(rA + 1 +(rA > 15))[31:0],rA[31:0]} + {(rB + 1 +(rB > 15))[31:0],rD[31:0]} ? It looks really odd.

I think we will need to show some examples. And split the above into multiple lines. Also we should explain the "why".

Or are you going to try to change GGC9 internals in a way which could restore back ability to use just {rX+2} in hardware?

No, plans.

bandvig commented 5 years ago

But before you mentioned Second. Initially, I implemented GPRs as set of four RAM-blocks to provide {rX,rX+1} access to 64-bit operands for ORFPX64A32 extension. However, to meet GCC9 ABI ({rX,rX+2} access for 64-bit operands) I had to replace RAM-blocks by set of Flip-Flop based register. MAROCCHINO had become much larger and only 75 MHz CPU clock could be achieved now. The included 100 MHz Coremark results were measured on previous MAROCCHINO version (with RAM-based GPRs).

Oh! Stafford! 4-RAM blocks RF design was restored many days ago :). And 100 MHz is achievable again. In fact we started or1k_marocchino repo with 4-RAM blocks RF design. You just missed this.

https://github.com/openrisc/or1k_marocchino/blob/93a5a34c41ccc8abec138421f4fd3e70077159dc/rtl/verilog/or1k_marocchino_rf.v#L531 https://github.com/openrisc/or1k_marocchino/blob/93a5a34c41ccc8abec138421f4fd3e70077159dc/rtl/verilog/or1k_marocchino_rf.v#L581 https://github.com/openrisc/or1k_marocchino/blob/93a5a34c41ccc8abec138421f4fd3e70077159dc/rtl/verilog/or1k_marocchino_rf.v#L631 https://github.com/openrisc/or1k_marocchino/blob/93a5a34c41ccc8abec138421f4fd3e70077159dc/rtl/verilog/or1k_marocchino_rf.v#L681

It was done by cost of one extra cpu_clock for write-back D2 (most significant word of double result).

I was thinking, using a register renaming technique we can support both the gcc9 abi and still use a 4ram block solution running at 100mhz. I just wanted to make a poc patch to show how it can be achieved. An extra improvement on the current core.

If you want to propose an alternative solution, you are welcome of course.