Open jeanthom opened 4 years ago
https://github.com/jeanthom/gram/commit/ce72afb0bc3871d6cb126ef5fc005208f5e95d2f improves #1 by a bit.
Is this a UARTBridge reliability issue?
There seems to be a pattern:
Looks like a PHY error. That "desynchronization" is reproducible with the simulation testbench. What bothers me is that I can get good memtests on real hardware (though not every time), and I can't figure out what the root cause is D:
I ran some tests on a complete SoC (minimal Minerva system in soc.py). I get similar behavior to what I had with UARTBridge: https://gist.github.com/jeanthom/85c00ffc5402df95fcf4967ea806fe49
~The glitches in the gist above are related to the "-retime" option. Without "-retime" I still get memtest failure, but without the odd bitflips.~ Glitches also appear when the retime option isn't enabled.
TODO:
Fixing #38 seems to improve the situation; however, doing so forces me to use "-retime", which might introduce bugs.
Synthesis without retiming isn't that much better.
I noticed that on a normal memtest we get these values for rdly:
Firmware launched...
DRAM init... done
Auto calibrating... done
Auto calibration profile:p0 rdly:00000002 p1 rdly:00000002
DRAM test...
done
When the test fails, we get different values for rdly:
Firmware launched...
DRAM init... done
Auto calibrating... done
Auto calibration profile:p0 rdly:00000003 p1 rdly:00000004
DRAM test...
fail : *(0x1000000C) = DEFF00FF
fail : *(0x1000001C) = DEFF00FF
fail : *(0x1000002C) = DEFF00FF
fail : *(0x1000003C) = DEFF00FF
fail : *(0x1000004C) = DEFF00FF
fail : *(0x1000005C) = DEFF00FF
fail : *(0x1000006C) = DEFF00FF
fail : *(0x1000007C) = DEFF00FF
fail : *(0x1000008C) = DEFF00EF
fail : *(0x1000009C) = DEFF00FF
fail : *(0x100000AC) = DEFF00EF
Test canceled (more than 10 errors)
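To make the pattern in this log easier to see, here is a minimal Python sketch that counts failing addresses by their offset inside a 16-byte group. The 16-byte group size (four 32-bit words) and the helper name `failing_offsets` are assumptions made for illustration; they are not part of the gram firmware.

```python
import re
from collections import Counter

# Matches memtest lines such as: fail : *(0x1000000C) = DEFF00FF
FAIL_RE = re.compile(r"fail : \*\(0x([0-9A-Fa-f]+)\) = ([0-9A-Fa-f]+)")


def failing_offsets(log: str, group_bytes: int = 16) -> Counter:
    """Count failing addresses by byte offset inside a group.

    group_bytes=16 is an assumption (four 32-bit words per group).
    """
    offsets = Counter()
    for match in FAIL_RE.finditer(log):
        address = int(match.group(1), 16)
        offsets[address % group_bytes] += 1
    return offsets


# Failure lines copied from the log above.
LOG = """\
fail : *(0x1000000C) = DEFF00FF
fail : *(0x1000001C) = DEFF00FF
fail : *(0x1000002C) = DEFF00FF
fail : *(0x1000003C) = DEFF00FF
fail : *(0x1000004C) = DEFF00FF
fail : *(0x1000005C) = DEFF00FF
fail : *(0x1000006C) = DEFF00FF
fail : *(0x1000007C) = DEFF00FF
fail : *(0x1000008C) = DEFF00EF
fail : *(0x1000009C) = DEFF00FF
fail : *(0x100000AC) = DEFF00EF
"""

if __name__ == "__main__":
    counts = failing_offsets(LOG)
    print({hex(offset): n for offset, n in counts.items()})
```

On this log, every reported failure lands at offset 0xC, i.e. on the last 32-bit word of each 16-byte group.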
Actually, we can also get errors with rdly=2 (the normal value for the ECPIX5):
Firmware launched...
DRAM init... done
Auto calibrating... done
Auto calibration profile:p0 rdly:00000002 p1 rdly:00000002
DRAM test...
fail : *(0x10000000) = DEAF000C
fail : *(0x10000004) = DEAF0000
fail : *(0x10000008) = DEAF0004
fail : *(0x1000000C) = DEAF0008
fail : *(0x10000010) = DEAF001C
fail : *(0x10000014) = DEAF0010
fail : *(0x10000018) = DEAF0014
fail : *(0x1000001C) = DEAF0018
fail : *(0x10000020) = DEAF002C
fail : *(0x10000024) = DEAF0020
fail : *(0x10000028) = DEAF0024
Test canceled (more than 10 errors)
done
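The values in this second failure look like the expected words, but rotated inside each group of four. The sketch below measures that rotation. It assumes (this is a guess from the log, not something confirmed by the memtest source) that the low 16 bits of the test pattern encode the low 16 bits of the address; `word_shifts` is a hypothetical helper name.

```python
import re

# Matches memtest lines such as: fail : *(0x10000004) = DEAF0000
FAIL_RE = re.compile(r"fail : \*\(0x([0-9A-Fa-f]+)\) = ([0-9A-Fa-f]+)")


def word_shifts(log, group_bytes=16, word_bytes=4):
    """Return, per failure, the word displacement of the value that was read.

    Assumption: the low 16 bits of the pattern equal the low 16 bits of
    the address. The result is (read word index - expected word index)
    modulo the number of words per group, or None when the value read
    does not belong to the same group as the address.
    """
    words_per_group = group_bytes // word_bytes
    shifts = []
    for match in FAIL_RE.finditer(log):
        expected_low = int(match.group(1), 16) & 0xFFFF
        read_low = int(match.group(2), 16) & 0xFFFF
        if expected_low // group_bytes != read_low // group_bytes:
            shifts.append(None)
            continue
        shift = ((read_low - expected_low) // word_bytes) % words_per_group
        shifts.append(shift)
    return shifts


# Failure lines copied from the log above.
LOG = """\
fail : *(0x10000000) = DEAF000C
fail : *(0x10000004) = DEAF0000
fail : *(0x10000008) = DEAF0004
fail : *(0x1000000C) = DEAF0008
fail : *(0x10000010) = DEAF001C
fail : *(0x10000014) = DEAF0010
fail : *(0x10000018) = DEAF0014
fail : *(0x1000001C) = DEAF0018
fail : *(0x10000020) = DEAF002C
fail : *(0x10000024) = DEAF0020
fail : *(0x10000028) = DEAF0024
"""

if __name__ == "__main__":
    print(set(word_shifts(LOG)))
```

Under that assumption every failure has the same shift of 3 modulo 4: each read returns the pattern of the preceding word, wrapping around within the 16-byte group.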
Here's where it gets a bit funky: I do all my testing on two 85F ECPIX-5 dev boards. One is R01, the other is R02, but the RAM routing hasn't really changed between the two revisions so I don't expect different behaviour between the two.
On the R02, I can't get a single test to pass, and it always fails like this:
fail : *(0x10000000) = DEAF000C
fail : *(0x10000004) = DEAF0000
fail : *(0x10000008) = DEAF0004
fail : *(0x1000000C) = DEAF0008
On the R01, I can get it to work 50-60% of the time, but when it fails, it fails like this:
fail : *(0x1000000C) = DEFF00FF
fail : *(0x1000001C) = DEFF00FF
fail : *(0x1000002C) = DEFF00FF
fail : *(0x1000003C) = DEFF00FF
In a sane memtest:
Readclksel: 0 1 2 3 4 5 6 7
Burstdet: 0 1 1 1 0 0 1 1
In a buggy memtest:
Readclksel: 0 1 2 3 4 5 6 7
Burstdet: 0 0 0 0 0 1 1 1
Looks like we are one clock cycle desynchronized... Why?
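One way to quantify the difference between the two scans is to compare the first readclksel setting at which burstdet is asserted. This is only an illustrative sketch: `first_detect` and `scan_offset` are hypothetical helpers, not the calibration routine used by the firmware, and the sketch does not say how many readclksel steps make up one clock cycle.

```python
def first_detect(burstdet):
    """Return the first readclksel index where burstdet is set, else None."""
    for readclksel, detected in enumerate(burstdet):
        if detected:
            return readclksel
    return None


def scan_offset(reference, candidate):
    """Difference, in readclksel steps, between the first detections."""
    ref_index = first_detect(reference)
    cand_index = first_detect(candidate)
    if ref_index is None or cand_index is None:
        return None
    return cand_index - ref_index


# Burstdet values for readclksel 0..7, copied from the two scans above.
SANE = [0, 1, 1, 1, 0, 0, 1, 1]
BUGGY = [0, 0, 0, 0, 0, 1, 1, 1]

if __name__ == "__main__":
    print(first_detect(SANE), first_detect(BUGGY), scan_offset(SANE, BUGGY))
```

For the scans above, the first detection moves from readclksel 1 in the sane run to readclksel 5 in the buggy run, a difference of 4 steps.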
Taking a look at both p0 and p1 rdly:
Sane:
Rdly p0: 01110011
Rdly p1: 01110000
Non-functional: (results from p1 are garbage)
Rdly p0: 00000111
Rdly p1: 00000111
We are currently running into a reliability issue with the memtests: