jeanthom / gram

DDR3 controller for nMigen (WIP)

Reliability issues when doing memtests #39

Open jeanthom opened 4 years ago

jeanthom commented 4 years ago

We are currently running into reliability issues with the memtests:

  1. The memtest fails whenever we introduce a delay between the write and the read (this looks like a refresh issue, which #32 would back up)
  2. The memtest fails when we read/write too much data. Is this the same as 1, or an address slicer issue (or similar)?
  3. Even when we don't fall into 1 or 2, we sometimes struggle to get a successful memtest
jeanthom commented 4 years ago

https://github.com/jeanthom/gram/commit/ce72afb0bc3871d6cb126ef5fc005208f5e95d2f improves item 1 above a bit.

jeanthom commented 4 years ago

Is this a UARTBridge reliability issue?

jeanthom commented 4 years ago

There seems to be a pattern: [screenshot from 2020-07-24 19-31-01]

jeanthom commented 4 years ago

Looks like a PHY error. That "desynchronization" is reproducible with the simulation testbench. What bothers me is that I can get good memtests on real hardware (though not every time), and I can't figure out the root cause of this D:

jeanthom commented 4 years ago

I ran some tests on a complete SoC (minimal Minerva system in soc.py). I get similar behavior to what I had with UARTBridge: https://gist.github.com/jeanthom/85c00ffc5402df95fcf4967ea806fe49

jeanthom commented 4 years ago

~The glitches in the gist above are related to the "-retime" option. Without "-retime" I still get memtest failure, but without the odd bitflips.~ Glitches also appear when the retime option isn't enabled.

jeanthom commented 4 years ago

TODO:

jeanthom commented 4 years ago

Fixing #38 seems to improve the situation; however, doing so forces me to use "-retime", which might introduce bugs.

jeanthom commented 4 years ago

Synthesis without retiming isn't that much better.

I noticed that on a normal memtest we get these values for rdly:

Firmware launched...
DRAM init... done
Auto calibrating... done
Auto calibration profile:p0 rdly:00000002 p1 rdly:00000002
DRAM test... 
done

When the test fails, we get different values for rdly:

Firmware launched...
DRAM init... done
Auto calibrating... done
Auto calibration profile:p0 rdly:00000003 p1 rdly:00000004
DRAM test... 
fail : *(0x1000000C) = DEFF00FF
fail : *(0x1000001C) = DEFF00FF
fail : *(0x1000002C) = DEFF00FF
fail : *(0x1000003C) = DEFF00FF
fail : *(0x1000004C) = DEFF00FF
fail : *(0x1000005C) = DEFF00FF
fail : *(0x1000006C) = DEFF00FF
fail : *(0x1000007C) = DEFF00FF
fail : *(0x1000008C) = DEFF00EF
fail : *(0x1000009C) = DEFF00FF
fail : *(0x100000AC) = DEFF00EF
Test canceled (more than 10 errors)
jeanthom commented 4 years ago

Actually we can also get errors with rdly=2 (the normal value for the ECPIX5):

Firmware launched...
DRAM init... done
Auto calibrating... done
Auto calibration profile:p0 rdly:00000002 p1 rdly:00000002
DRAM test... 
fail : *(0x10000000) = DEAF000C
fail : *(0x10000004) = DEAF0000
fail : *(0x10000008) = DEAF0004
fail : *(0x1000000C) = DEAF0008
fail : *(0x10000010) = DEAF001C
fail : *(0x10000014) = DEAF0010
fail : *(0x10000018) = DEAF0014
fail : *(0x1000001C) = DEAF0018
fail : *(0x10000020) = DEAF002C
fail : *(0x10000024) = DEAF0020
fail : *(0x10000028) = DEAF0024
Test canceled (more than 10 errors)
done
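
One way to read the rdly=2 log above: assuming the memtest writes `0xDEAF0000 | offset` at each address (an assumption; the test pattern itself is not shown in this thread), each word seems to come back holding the value meant for a neighbouring word. Here is a minimal Python sketch that parses the `fail` lines and measures that shift within each 16-byte group. `parse_fail_lines` and `word_shift` are hypothetical helper names, and the 16-byte group size is also an assumption.

```python
import re

# Matches lines such as: fail : *(0x10000000) = DEAF000C
FAIL_RE = re.compile(r"fail : \*\(0x([0-9A-Fa-f]+)\) = ([0-9A-Fa-f]+)")

BASE = 0x10000000  # first address seen in the log

LOG = """\
fail : *(0x10000000) = DEAF000C
fail : *(0x10000004) = DEAF0000
fail : *(0x10000008) = DEAF0004
fail : *(0x1000000C) = DEAF0008
fail : *(0x10000010) = DEAF001C
fail : *(0x10000014) = DEAF0010
fail : *(0x10000018) = DEAF0014
fail : *(0x1000001C) = DEAF0018
"""

def parse_fail_lines(text):
    """Return (address, value) pairs from memtest 'fail' lines."""
    return [(int(a, 16), int(v, 16)) for a, v in FAIL_RE.findall(text)]

def word_shift(addr, value, group_bytes=16):
    """Shift, in 32-bit words modulo the group size, between the word that
    was read and the offset encoded in the low 16 bits of the value
    (assumed pattern: 0xDEAF0000 | offset)."""
    words = group_bytes // 4
    read_word = ((addr - BASE) // 4) % words
    data_word = ((value & 0xFFFF) // 4) % words
    return (data_word - read_word) % words

pairs = parse_fail_lines(LOG)
shifts = [word_shift(a, v) for a, v in pairs]
# True if every value belongs to the same 16-byte group as the address read
same_group = all((a - BASE) // 16 == (v & 0xFFFF) // 16 for a, v in pairs)
print(shifts, same_group)
```

Under those assumptions every line gives the same shift of 3 (that is, -1 modulo 4) and the data never leaves its 16-byte group, which would be consistent with the words being rotated by one position inside each group rather than being corrupted.
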
jeanthom commented 4 years ago

Here's where it gets a bit funky: I do all my testing on two 85F ECPIX-5 dev boards. One is R01, the other is R02, but the RAM routing hasn't really changed between the two revisions so I don't expect different behaviour between the two.

On the R02, I can't get a single test to pass, and it always fails like this:

fail : *(0x10000000) = DEAF000C
fail : *(0x10000004) = DEAF0000
fail : *(0x10000008) = DEAF0004
fail : *(0x1000000C) = DEAF0008

On the R01, I can get it to work 50-60% of the time, but when it fails, it fails like this:

fail : *(0x1000000C) = DEFF00FF
fail : *(0x1000001C) = DEFF00FF
fail : *(0x1000002C) = DEFF00FF
fail : *(0x1000003C) = DEFF00FF
jeanthom commented 4 years ago

In a sane memtest:

Readclksel: 0 1 2 3 4 5 6 7
Burstdet:   0 1 1 1 0 0 1 1

In a buggy memtest:

Readclksel: 0 1 2 3 4 5 6 7
Burstdet:   0 0 0 0 0 1 1 1
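
To compare the two scans above, it can help to list the contiguous readclksel ranges where burstdet is set. This is only a sketch for reading the tables, not part of the gram firmware; `burst_windows` is a hypothetical helper name.

```python
def burst_windows(burstdet):
    """Return inclusive (start, end) readclksel ranges where burstdet == 1."""
    windows = []
    start = None
    for i, bit in enumerate(burstdet):
        if bit and start is None:
            start = i                       # a window opens here
        elif not bit and start is not None:
            windows.append((start, i - 1))  # the window closed on the previous index
            start = None
    if start is not None:                   # window still open at the end of the scan
        windows.append((start, len(burstdet) - 1))
    return windows

sane = [0, 1, 1, 1, 0, 0, 1, 1]
buggy = [0, 0, 0, 0, 0, 1, 1, 1]

print(burst_windows(sane))   # [(1, 3), (6, 7)]
print(burst_windows(buggy))  # [(5, 7)]
```

In the sane run the first window opens at readclksel 1; in the buggy run the only window opens at readclksel 5.
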
jeanthom commented 4 years ago

Looks like we are desynchronized by one clock cycle... Why?

jeanthom commented 4 years ago

Taking a look at both p0 and p1 rdly:

Sane:

Rdly
p0: 01110011
Rdly
p1: 01110000

Non-functional: (results from p1 are garbage)

Rdly
p0: 00000111
Rdly
p1: 00000111