m-labs / artiq

A leading-edge control system for quantum information experiments
https://m-labs.hk/artiq
GNU Lesser General Public License v3.0

Sayma MiSoC memory test failed #908

Closed jbqubit closed 6 years ago

jbqubit commented 6 years ago

Building .bit from source using

commit 440e19b8f9c8ebfce80402a519796cee7fdd6b06

I see...

$ flterm /dev/ttyUSB2

 __  __ _ ____         ____ 
|  \/  (_) ___|  ___  / ___|
| |\/| | \___ \ / _ \| |    
| |  | | |___) | (_) | |___ 
|_|  |_|_|____/ \___/ \____|

MiSoC Bootloader
Copyright (c) 2017 M-Labs Limited

Bootloader CRC passed
Initializing SDRAM...
Write leveling: 32 59 47 69 59 69 40 48 done
Read delays: 7:00-116 6:00-138 5:27-172 4:38-54 3:57-76 2:67-83 1:98-116 0:109-125 done
SDRAM initialized
Memory test failed (384482/1114624 words incorrect)
Halting.

 __  __ _ ____         ____ 
|  \/  (_) ___|  ___  / ___|
| |\/| | \___ \ / _ \| |    
| |  | | |___) | (_) | |___ 
|_|  |_|_|____/ \___/ \____|

MiSoC Bootloader
Copyright (c) 2017 M-Labs Limited

Bootloader CRC passed
Initializing SDRAM...
Write leveling: 34 56 45 65 55 67 40 47 done
Read delays: 7:00-112 6:01-127 5:15-33 4:26-44 3:65-84 2:78-94 1:90-123 0:103-123 done
SDRAM initialized
Memory test failed (412138/1114624 words incorrect)
Halting.

 __  __ _ ____         ____ 
|  \/  (_) ___|  ___  / ___|
| |\/| | \___ \ / _ \| |    
| |  | | |___) | (_) | |___ 
|_|  |_|_|____/ \___/ \____|

MiSoC Bootloader
Copyright (c) 2017 M-Labs Limited

Bootloader CRC passed
Initializing SDRAM...
Write leveling: 33 55 45 64 55 72 40 47 done
Read delays: 7:00-115 6:00-136 5:10-26 4:36-53 3:61-79 2:76-97 1:88-108 0:96-113 done
SDRAM initialized
Memory test failed (499333/1114624 words incorrect)
Halting.

 __  __ _ ____         ____ 
|  \/  (_) ___|  ___  / ___|
| |\/| | \___ \ / _ \| |    
| |  | | |___) | (_) | |___ 
|_|  |_|_|____/ \___/ \____|

MiSoC Bootloader
Copyright (c) 2017 M-Labs Limited

Bootloader CRC passed
Initializing SDRAM...
Write leveling: 34 57 48 69 59 73 41 50 done
Read delays: 7:00-117 6:00-134 5:18-40 4:31-47 3:70-86 2:76-96 1:103-120 0:93-109 done
SDRAM initialized
Memory test failed (473026/1114624 words incorrect)
Halting.
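
For a quick sense of scale, the four results above can be turned into failure fractions. The sketch below only parses the "Memory test failed" lines quoted in this post; the helper name `failure_fraction` is illustrative and not part of MiSoC. On these logs it gives roughly a third to a bit under half of the tested words.

```python
import re

# Memory-test result lines copied from the four boot logs above.
LOG_LINES = [
    "Memory test failed (384482/1114624 words incorrect)",
    "Memory test failed (412138/1114624 words incorrect)",
    "Memory test failed (499333/1114624 words incorrect)",
    "Memory test failed (473026/1114624 words incorrect)",
]

PATTERN = re.compile(r"Memory test failed \((\d+)/(\d+) words incorrect\)")


def failure_fraction(line):
    """Return (bad, total, bad/total) for one bootloader result line."""
    match = PATTERN.search(line)
    if match is None:
        raise ValueError("not a memory test failure line: %r" % line)
    bad, total = int(match.group(1)), int(match.group(2))
    return bad, total, bad / total


if __name__ == "__main__":
    for line in LOG_LINES:
        bad, total, fraction = failure_fraction(line)
        print("{}/{} words incorrect ({:.1%})".format(bad, total, fraction))
```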

whitequark commented 6 years ago

Possibly caused by 7429ee4fb63316b05da07407d6802670ebdb80fd?

cjbe commented 6 years ago

The typical valid read region seems to be ~170 LSB on my board, so I don't think that commit (increasing the initial step from 8 LSB to 16 LSB) caused this.

This also looks different from the problem that commit solved for me, where the size of the read window was always the size of the initial step. Here the gaps vary from 16 to 20.
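
The window sizes discussed here can be read straight off a "Read delays" line. The following is a minimal sketch, assuming each entry has the form module:first_delay-last_delay (an interpretation of the bootloader output, not something stated in the thread); the function names are illustrative only.

```python
import re

# One "Read delays" line copied from the first boot log in this issue.
LINE = ("Read delays: 7:00-116 6:00-138 5:27-172 4:38-54 3:57-76 "
        "2:67-83 1:98-116 0:109-125 done")

# Assumed entry format: <module>:<first delay>-<last delay>
ENTRY = re.compile(r"(\d+):(\d+)-(\d+)")


def read_windows(line):
    """Map each module index to its (start, end) delay taps."""
    return {int(m): (int(lo), int(hi)) for m, lo, hi in ENTRY.findall(line)}


def window_widths(line):
    """Map each module index to the width of its read window in taps."""
    return {m: hi - lo for m, (lo, hi) in read_windows(line).items()}


if __name__ == "__main__":
    for module, width in sorted(window_widths(LINE).items(), reverse=True):
        print("module {}: width {}".format(module, width))
```

On that line, modules 4 to 0 come out with widths between 16 and 19 taps, while modules 7 to 5 report much wider windows.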

sbourdeauducq commented 6 years ago

What Vivado version? We use 2017.4.

jbqubit commented 6 years ago

I'm using 2016.2. Will upgrade and try again.

marmeladapk commented 6 years ago

I'm using 2017.4 and I also got this issue, though with a build from 25.01. I'm currently building against 0edc34a, will update when it finishes.

marmeladapk commented 6 years ago

MiSoC Bootloader
Copyright (c) 2017 M-Labs Limited

Bootloader CRC passed
Initializing SDRAM...
Write leveling: 58 89 78 88 56 85 47 48 done
Read delays: 7:00-19 6:07-23 5:53-74 4:60-76 3:111-132 2:117-133 1:125-141 0:133-151 done
SDRAM initialized
Memory test failed (522120/1114624 words incorrect)
Halting.

sbourdeauducq commented 6 years ago

Here is everything I built from the current master (with RTM bridge, RTIO and other things disabled to save compilation and RTM yak-shaving time): http://dl.free.fr/lAFdh3oQV

With those binaries, I verified that SDRAM works fine on both Florent's board and Sayma-1. Can you try those binaries on your boards?

@marmeladapk You can use the Ethernet TX clock phase adjustment script I posted in the RGMII issue on those binaries.

@marmeladapk If the problem persists, can you use Sayma-2 that I shipped to you to debug Ethernet, since I didn't have SDRAM problems on that one?

enjoy-digital commented 6 years ago

Thanks @sbourdeauducq. I'll look at that. The read leveling procedure is probably still not robust enough.

marmeladapk commented 6 years ago

@sbourdeauducq I loaded it to check if memory tests are passed:

Bootloader CRC passed
Initializing SDRAM...
Write leveling: 95 130 118 133 96 125 87 88 done
Read delays: 7:37-249 6:61-267 5:110-316 4:131-336 3:167-369 2:173-377 1:195-40e
SDRAM initialized
Memory test passed

Booting from flash...
Starting firmware.
[     0.000005s]  INFO(runtime): ARTIQ runtime starting...
[     0.003864s]  INFO(runtime): software version 4.0.dev+516.g0edc34a9
[     0.010126s]  INFO(runtime): gateware version 4.0.dev+516.g0edc34a9.dirty
[     0.016910s]  INFO(runtime): log level set to INFO by default
[     0.022630s]  INFO(runtime): UART log level set to INFO by default
[     0.028790s]  INFO(runtime): press 'e' to erase startup and idle kernels...
[     1.028006s]  INFO(runtime): continuing boot
[     1.030975s]  WARN(runtime): using default MAC address 02-00-00-00-76-01; ct
[     1.039568s]  INFO(runtime): using default IP address 192.168.1.60
[     1.054501s]  INFO(runtime::session): accepting network sessions
[     1.059438s]  INFO(runtime::session): running startup kernel
[     1.064959s]  INFO(runtime::session): no startup kernel found
[     1.070665s]  INFO(runtime::session): no connection, starting idle kernel
[     1.077527s]  INFO(runtime::session): no idle kernel found
[     1.084122s]  INFO(runtime::mgmt): management interface active
[     6.274350s]  WARN(runtime): ethernet mac: rx preamble errors: 2
[     7.357698s]  WARN(runtime): ethernet mac: rx preamble errors: 3
[    19.820658s]  WARN(runtime): ethernet mac: rx preamble errors: 4
[    20.128752s]  WARN(runtime): ethernet mac: rx preamble errors: 5
[    20.888642s]  WARN(runtime): ethernet mac: rx preamble errors: 6

So it works. I'll try the script you mentioned later.

sbourdeauducq commented 6 years ago

Good. You do however seem to get a large number of Ethernet RX corrupted packets (preamble errors). Is the PHY correctly set in RGMII mode? Does this happen for every packet? You can change the RX phase as well by using this script command instead: set_property CLKOUT0_PHASE <phase> [get_cells crg_ethrx_mmcm]

marmeladapk commented 6 years ago

@sbourdeauducq Should I change it in xdc in artiq/artiq_sayma/gateware/top.xdc and rebuild? Will latest artiq pass memory tests?

sbourdeauducq commented 6 years ago

No. Please follow the instructions in my comment: https://github.com/m-labs/artiq/issues/854#issuecomment-360497764 - you just save the script as edit_pll.tcl and run the mentioned vivado command. There is no bitstream rebuilding and it is a rather quick process. Nothing in the design other than the PLL phase will be changed; the routes etc. will be exactly as before, so yes, the memory test should be unaffected.

sbourdeauducq commented 6 years ago

I also see the problem with the default build (including SAWG) on ARTIQ 4c22d64ee438d8b65ba728829794698191719181, migen e554f072905ceeb27c9c179c8c7b785acd1676bc, misoc cb8e314c7515eade46f5bcde4e48903d7ec92490

Initializing SDRAM...
Write leveling: 43 66 49 68 35 56 34 25 done
Read delays: 7:00-121 6:00-141 5:39-55 4:44-60 3:67-85 2:76-92 1:105-121 0:113-129 done
SDRAM initialized
Memory test failed (356593/1114624 words incorrect)

When disabling the SAWG (--without-sawg), the system boots correctly. @enjoy-digital Can you move forward with JESD SC1 by disabling SAWG (which you want to do anyway to reduce compilation time)? I cannot reproduce the "no output on UART" bug.

enjoy-digital commented 6 years ago

@sbourdeauducq: yes, I'll continue on Monday.

sbourdeauducq commented 6 years ago

@hartytp Are you looking into this? This definitely worked when I did the SAWG test and posted the scope screenshot. So it should be possible to isolate what code change exactly caused this problem, maybe with the help of tools like git-bisect. But I suspect this is due to the non-determinism of Vivado compilation, or to plain Vivado bugs. In the first case, this is normally solvable by adding appropriate timing constraints. In the second case, considering how Xilinx technical support has been degrading for the past years, the first option is basically to apply somewhat random non-functional changes to the code, as Xilinx engineers certainly do, and hope that ça tombe en marche (roughly, "it happens to start working"), or try various Vivado synthesis options. (Xilinx's answer to the bug invasion is pretty much the usual) @whitequark's addition of RTM loading gateware is a good suspect for the triggering of this kind of Vivado misbehavior.

hartytp commented 6 years ago

@hartytp Are you looking into this?

I wasn't planning to, no. In general, I'm trying to prioritize things like the HMC830 on Sayma, which seem to be (at least in part) hardware issues. In contrast, the mem test thing is just firmware/gateware, isn't it? As such, it seemed like the standard yak shaving required to get a new board up and running, and not something particular to Sayma. So, I figured that you guys were probably best placed to look into it.

I have a busy week lined up this week, but I might have some time to look into it.

Side note: we've had Sayma for quite a while now, but the ARTIQ tool chain still feels quite hacked and fragile. It would be great to get to the point where Artiq flash can do the RTM as well, the package includes the correct version of JESD204B, etc.

hartytp commented 6 years ago

Anyway to be clear, in case I do find time to look into this, your plan is basically to dig through the git history, building various versions of Sayma gateware/firmware with SAWG (at a few hours per build) until we find the point where it stopped working? IIRC, that's a bit complicated by the fact that the tools to build Sayma have changed a bit over time, so it's not always the same instructions to build/flash it, and by the fact that the package doesn't include the right version of JESD204B (also misoc/migen?), so one needs to track the history of several projects to make sure that each build uses the correct version of each. Doesn't sound like fun.
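
To put a number on the cost being discussed, here is a minimal sketch of the binary search that a tool like git-bisect performs. The commit list, the `is_bad` predicate and the hours-per-build figure are all hypothetical; in practice `is_bad` would mean "build this revision with SAWG, flash it, and check whether the memory test fails".

```python
def first_bad(commits, is_bad):
    """Return (index of the first bad commit, number of builds tested).

    Assumes commits[0] is known good, commits[-1] is known bad, and
    that every commit after the first bad one is also bad.
    """
    good, bad = 0, len(commits) - 1
    builds = 0
    while bad - good > 1:
        mid = (good + bad) // 2
        builds += 1  # each probe costs one full build and flash
        if is_bad(commits[mid]):
            bad = mid
        else:
            good = mid
    return bad, builds


if __name__ == "__main__":
    # Hypothetical history: 64 commits, the regression lands at index 41.
    commits = list(range(64))
    index, builds = first_bad(commits, lambda c: c >= 41)
    HOURS_PER_BUILD = 3  # assumption; the thread only says "a few hours"
    print("first bad commit index:", index)
    print("builds needed:", builds, "(about", builds * HOURS_PER_BUILD, "hours)")
```

The search only needs on the order of log2(N) builds for N commits, but it relies on each revision giving a repeatable pass/fail result. If the failure depends on run-to-run variation in the Vivado output, as suspected earlier in the thread, a single probe per revision could point at the wrong commit.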

sbourdeauducq commented 6 years ago

Doesn't sound like fun.

Yep, standard fare. Anyway, the first thing I'd try is removing the RTM loading gateware. Another thing that makes the SDRAM work is removing a lot of peripherals using the patch I posted elsewhere, so there would not be such versioning issues. Just the long Vivado compilation times.

hartytp commented 6 years ago

Well, as I said, this seems like standard yak shaving for getting a board up and running, rather than a particular hardware/design issue with Sayma. So, do you mind taking a look at it first? It's likely to be quicker for you since you've probably kept a closer eye on the changes that have been made to ARTIQ over the past weeks.

sbourdeauducq commented 6 years ago

That's what I thought - the patch below works around the problem.

diff --git a/artiq/gateware/targets/sayma_amc.py b/artiq/gateware/targets/sayma_amc.py
index c45f8d37a..f6c5b95f6 100755
--- a/artiq/gateware/targets/sayma_amc.py
+++ b/artiq/gateware/targets/sayma_amc.py
@@ -160,9 +160,9 @@ class Standalone(MiniSoC, AMPSoC):
         ]

         # RTM bitstream upload
-        rtm_fpga_cfg = platform.request("rtm_fpga_cfg")
-        self.submodules.rtm_fpga_cfg = SlaveFPGA(rtm_fpga_cfg)
-        self.csr_devices.append("rtm_fpga_cfg")
+        #rtm_fpga_cfg = platform.request("rtm_fpga_cfg")
+        #self.submodules.rtm_fpga_cfg = SlaveFPGA(rtm_fpga_cfg)
+        #self.csr_devices.append("rtm_fpga_cfg")

         # AMC/RTM serwb
         serwb_pll = serwb.phy.SERWBPLL(125e6, 625e6, vco_div=2)

@whitequark What about using GPIO and bit-banging instead? Hopefully the Vivado trash will behave then.

marmeladapk commented 6 years ago

@sbourdeauducq With latest commit (2d4a134) when I compile python3 -m artiq.gateware.targets.sayma_amc --without-sawg I still get memory test failed.

sbourdeauducq commented 6 years ago

Still? It always worked for me when using --without-sawg. Anyway, for Ethernet debugging you can use the binaries.

sbourdeauducq commented 6 years ago

And with sawg?

marmeladapk commented 6 years ago

@sbourdeauducq

Anyway for Ethernet debugging you can use the binaries.

I wanted to insert probes.

And with sawg?

Will check now.

sbourdeauducq commented 6 years ago

I wanted to insert probes.

For debugging RX? The Ethernet core actually doesn't need SDRAM and will continue to receive frames even with a dead CPU system. Or you can try with a super-minimal design that will also have the advantage of reducing the compilation time. Some people don't like me saying that, but Sayma (and Ultrascale) is a trash fire, and the only way to make any progress is to cling to whatever still works...

marmeladapk commented 6 years ago

@sbourdeauducq I still get this error with SAWG.

sbourdeauducq commented 6 years ago

Those binaries are from ARTIQ 4.0.dev+521.g4c22d64e with the RTM loading gateware commented out. I tested that SDRAM works on the board when flashing them (and then booting from flash). http://dl.free.fr/mffzh7lVw

whitequark commented 6 years ago

@sbourdeauducq Since this failure shows that the bug is clearly not caused by my gateware, I'm not going to waste time rewriting it in some other way.

sbourdeauducq commented 6 years ago

Whatever the path of least resistance is. If GPIO bit-banging works around the problem, it's easier than tracking down the original problem which is likely a Vivado bug or something equally obscure.

marmeladapk commented 6 years ago

@sbourdeauducq With your binaries I still get memory test error.

hartytp commented 6 years ago

@sbourdeauducq Was going to have a look at synchronisation. Built the binaries with the latest ARTIQ, but still see

 __  __ _ ____         ____
|  \/  (_) ___|  ___  / ___|
| |\/| | \___ \ / _ \| |
| |  | | |___) | (_) | |___
|_|  |_|_|____/ \___/ \____|

MiSoC Bootloader
Copyright (c) 2017 M-Labs Limited

Bootloader CRC passed
Initializing SDRAM...
Write leveling: 25 62 57 74 52 61 14 26 done
Read delays: 7:00-113 6:00-112 5:20-39 4:27-43 3:59-76 2:46-62 1:86-103 0:96-112 done
SDRAM initialized
Memory test failed (571059/1114624 words incorrect)
Halting.
Traceback (most recent call last):

Edit: that's not using the --without-sawg argument.

# packages in environment at /home/ion/anaconda3/envs/artiq-sayma:
#
aiohttp 2.3.9 py35_0
alabaster 0.7.10 py35h6fb19ab_0
artiq-dev 4.0.dev py_540+git2adba3ed m-labs/label/dev
async-timeout 2.0.0 py35h12a94dc_0
asyncserial 0.1 py_13+git340e430 m-labs/label/main
babel 2.5.3 py35_0
binutils-or1k-linux 2.27 5 m-labs/label/main
bscan-spi-bitstreams 0.10.0 2 m-labs/label/main
ca-certificates 2017.08.26 h1d4fec5_0
certifi 2018.1.18 py35_0
cffi 1.11.4 py35h9745a5d_0
chardet 3.0.4 py35hb6e9ddf_1
colorama 0.3.9 py35h81e2b6c_0
coverage 4.4.2 py35h8fc71f1_0
dbus 1.12.2 hc3f9b76_1
docutils 0.14 py35hd11081d_0
expat 2.2.5 he0dffb1_0
fontconfig 2.12.4 h88586e7_1
freetype 2.8 hab7d2ae_1
glib 2.53.6 h5d9569c_2
gst-plugins-base 1.12.4 h33fb286_0
gstreamer 1.12.4 hb53b477_0
h5py 2.7.1 py35h8d53cdc_0
hdf5 1.10.1 h9caa474_1
icu 58.2 h9c2bf20_1
imagesize 0.7.1 py35hf008fae_0
intel-openmp 2018.0.0 hc7b2577_8
jinja2 2.10 py35h480ab6d_0
jpeg 9b h024ee3a_2
levenshtein 0.12.0 py35_1 m-labs/label/main
libedit 3.1 heed3624_0
libffi 3.2.1 hd88cf55_4
libgcc-ng 7.2.0 h7cc24e2_2
libgfortran-ng 7.2.0 h9f7466a_2
libgit2 0.24.1 7 m-labs/label/main
libpng 1.6.34 hb9fc6fc_0
libssh2 1.7.0.git 5 m-labs/label/main
libstdcxx-ng 7.2.0 h7a57d05_2
libusb 1.0.20 0 m-labs/label/main
libxcb 1.12 hcd93eb1_4
libxml2 2.9.7 h26e45fe_0
lit 0.4.1 py_9 m-labs/label/main
llvm-or1k 4.0.1 23 m-labs/label/main
llvmlite-artiq 0.20.0 py35_1 m-labs/label/main
markupsafe 1.0 py35h4f4fcf6_1
microscope 1.1 py_1 m-labs/label/main
migen 0.7 py35_2+git40721b2 m-labs/label/dev
misoc 0.9 py35_3+git684b519a m-labs/label/dev
mkl 2018.0.1 h19d6760_4
msgpack-python 0.5.1 py35h6bb024c_0
multidict 3.3.2 py35he92878e_0
ncurses 6.0 h9df7e31_2
numpy 1.14.0 py35h3dfced4_1
openocd 0.10.0 4 m-labs/label/main
openssl 1.0.2n hb7f436b_0
outputcheck 0.4.2 py_7 m-labs/label/main
pcre 8.41 hc27e229_1
pip 9.0.1 py35h7e7da9d_4
prettytable 0.7.2 py35_1 conda-forge/label/main
pycparser 2.18 py35h61b3040_1
pygit2 0.24.0 py35_4 m-labs/label/main
pygments 2.2.0 py35h0f41973_0
pyqt 5.6.0 py35h0e41ada_5
pyqtgraph 0.10.0 py35_0
pyserial 3.4 py35h84edd1e_0
python 3.5.4 h417fded_24
python-dateutil 2.6.1 py35h90d5b31_1
pythonparser 1.1 py_8 m-labs/label/main
pytz 2017.3 py35hb13c558_0
qt 5.6.2 h974d657_12
quamash 0.5.5 py_4 m-labs/label/main
readline 7.0 ha6073c6_4
regex 2015.11.22 py35_1 m-labs/label/main
rust-core-or1k 1.23.0 19 m-labs/label/main
rustc 1.23.0 18 m-labs/label/main
scipy 1.0.0 py35hcbbe4a2_0
setuptools 33.1.1 py35_0 conda-forge/label/main
sip 4.18.1 py35h9eaea60_2
six 1.11.0 py35h423b573_1
snowballstemmer 1.2.1 py35h5435977_0
sphinx 1.4.8 py35_0
sphinx-argparse 0.1.13 py_4 m-labs/label/main
sphinx_rtd_theme 0.2.4 py35_0
sphinxcontrib-wavedrom 1.1.0 py_1 m-labs/label/main
sphinxcontrib-wavedrom 1.1.0
sqlite 3.22.0 h1bed415_0
system 5.8 2
tk 8.6.7 hc745277_3
wheel 0.30.0 py35hd3883cf_1
xz 5.2.3 h55aa19d_2
yarl 0.14.2 py35h31c3c03_0
zlib 1.2.11 ha838bed_2

hartytp commented 6 years ago

No memtest issues without the SAWG, but then I can't test synchronization. @sbourdeauducq thoughts?

 __  __ _ ____         ____
|  \/  (_) ___|  ___  / ___|
| |\/| | \___ \ / _ \| |
| |  | | |___) | (_) | |___
|_|  |_|_|____/ \___/ \____|

MiSoC Bootloader
Copyright (c) 2017 M-Labs Limited

Bootloader CRC passed
Initializing SDRAM...
Write leveling: 42 81 77 89 71 78 35 45 done
Read delays: 7:00-164 6:04-175 5:49-216 4:61-229 3:98-244 2:97-250 1:122-273 0:125-278 done
SDRAM initialized
Memory test passed

Booting from flash...
Starting firmware.
[     0.000005s]  INFO(runtime): ARTIQ runtime starting...
[     0.003866s]  INFO(runtime): software version 4.0.dev+540.g2adba3ed
[     0.010131s]  INFO(runtime): gateware version 4.0.dev+540.g2adba3ed
[     0.016392s]  INFO(runtime): log level set to INFO by default
[     0.022115s]  INFO(runtime): UART log level set to INFO by default
[     0.028266s]  INFO(board_artiq::serwb): waiting for AMC/RTM serwb bridge to be ready...
[     0.707875s]  WARN(board_artiq::serwb): AMC/RTM serwb bridge initialization failed, retrying.
[     1.386725s]  WARN(board_artiq::serwb): AMC/RTM serwb bridge initialization failed, retrying.
[     2.065574s]  WARN(board_artiq::serwb): AMC/RTM serwb bridge initialization failed, retrying.

hartytp commented 6 years ago

I do see the usual splodges of red in the terminal during the build, like:

CRITICAL WARNING: [Timing 38-322] The clock arriving at pin ISERDESE3_47/CLK must have the same master clock as the clock arriving at pin ISERDESE3_47/CLKDIV, and the latter can only be phase shifted by 0/90/180/270 degrees. Any auto-derived clock on pin ISERDESE3_47/INTERNAL_DIVCLK will be created with 0 phase. [/home/ion/scratch/artiq/artiq_sayma/standalone/gateware/top.xdc:870]

I assume that none of this is an issue. How do you normally process Vivado's output to separate interesting warnings from the usual noise?
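
One low-tech way to start separating signal from noise is to tally the messages by severity and message ID, so that a new or rare ID stands out against the ones that appear in every build. The sketch below assumes each message starts with a severity keyword followed by a bracketed ID, as in the line quoted above; the `tally` helper is illustrative and not part of any Vivado or ARTIQ tooling.

```python
import collections
import re

# Matches lines such as:
#   CRITICAL WARNING: [Timing 38-322] The clock arriving at pin ...
MESSAGE = re.compile(r"^(INFO|WARNING|CRITICAL WARNING|ERROR): \[([^\]]+)\]")

# The critical warning quoted in this comment, used as sample input.
SAMPLE = [
    "CRITICAL WARNING: [Timing 38-322] The clock arriving at pin "
    "ISERDESE3_47/CLK must have the same master clock as the clock "
    "arriving at pin ISERDESE3_47/CLKDIV, and the latter can only be "
    "phase shifted by 0/90/180/270 degrees. "
    "[/home/ion/scratch/artiq/artiq_sayma/standalone/gateware/top.xdc:870]",
]


def tally(lines):
    """Count messages per (severity, message ID) pair."""
    counts = collections.Counter()
    for line in lines:
        match = MESSAGE.match(line)
        if match:
            counts[match.groups()] += 1
    return counts


if __name__ == "__main__":
    for (severity, msg_id), n in tally(SAMPLE).most_common():
        print("{:6d}  {:16s}  {}".format(n, severity, msg_id))
```

Passing an open log file instead of `SAMPLE` works the same way, since `tally` only iterates over lines. Comparing the tallies from a working and a failing build would show whether any message ID is unique to the failing one, although it cannot say whether a given warning is harmless.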

sbourdeauducq commented 6 years ago

I assume that none of this is an issue. How do you normally process Vivado's output to separate interesting warnings from the usual noise?

I don't have a solution for this nor the SDRAM problem. Among its many flaws, Vivado regularly spits out loads of spurious warnings, including when compiling Xilinx's own cores. The problem is not specific to Vivado; ISE and Altera/Intel Quartus also suffer from it.

Without SAWG, you still have JESD204, but only generating ramps.

hartytp commented 6 years ago

Are the software-programmed delays the same for the same board when the SDRAM works and when it does not?

See my posts above.

With SAWG:

Without SAWG:

So, no, not the same.

Without SAWG, you still have JESD204, but only generating ramps.

Okay, will look at that next.

hartytp commented 6 years ago

Rebuilt the gateware with SAWG and rebooted a few times looking at the read delays...

Write leveling: 9 48 42 52 32 38 7 7 done
Read delays: 7:00-71 6:00-78 5:01-32 4:05-22 3:31-47 2:31-51 1:48-64 0:59-80 done

Write leveling: 9 48 40 51 35 36 2 7 done
Read delays: 7:00-70 6:00-84 5:04-20 4:11-43 3:29-45 2:35-51 1:47-67 0:64-83 done

Write leveling: 10 46 43 50 37 44 5 8 done
Read delays: 7:00-73 6:00-76 5:04-20 4:06-23 3:24-41 2:33-51 1:40-56 0:50-67 done

Write leveling: 10 50 40 51 36 39 7 7 done
Read delays: 7:00-52 6:00-66 5:04-20 4:04-20 3:30-48 2:33-52 1:43-60 0:44-60 done

The previous time I built it (posted above) I got:

Write leveling: 25 62 57 74 52 61 14 26 done
Read delays: 7:00-113 6:00-112 5:20-39 4:27-43 3:59-76 2:46-62 1:86-103 0:96-112 done

Not sure if that's interesting data.

sbourdeauducq commented 6 years ago

The "write leveling" figures are also delays, and we don't know if it's the reads or writes that are failing.

hartytp commented 6 years ago

Updated.

The "write leveling" figures are also delays, and we don't know if it's the reads or writes that are failing.

Worth adding some more diagnostic info?

hartytp commented 6 years ago

And, for comparison, rebuilt without SAWG and saw(g):

Write leveling: 40 80 73 86 66 75 31 41 done
Read delays: 7:00-150 6:02-168 5:45-210 4:57-225 3:91-237 2:86-242 1:112-262 0:119-270 done

Write leveling: 40 78 72 86 64 73 33 41 done
Read delays: 7:00-152 6:00-170 5:39-202 4:53-222 3:87-234 2:90-242 1:114-260 0:115-270 done

Write leveling: 38 78 72 86 67 75 33 40 done
Read delays: 7:00-155 6:02-167 5:39-206 4:58-224 3:92-233 2:88-239 1:115-261 0:118-270 done

Write leveling: 40 83 78 88 72 78 37 43 done
Read delays: 7:00-161 6:02-177 5:48-219 4:58-232 3:98-247 2:95-252 1:119-276 0:131-281 done

The previous time I built it (posted above) I got:

Write leveling: 42 81 77 89 71 78 35 45 done
Read delays: 7:00-164 6:04-175 5:49-216 4:61-229 3:98-244 2:97-250 1:122-273 0:125-278

Not sure if that's interesting data.

hartytp commented 6 years ago

So, does seem to be a reasonable difference between the two.
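
To quantify that difference, the write-leveling values from the four reboots of each fresh build posted above can be averaged per position and subtracted. This is only a sketch over the numbers in this thread; which physical byte lane each position corresponds to is not stated here, so positions are simply numbered left to right.

```python
# Write-leveling values from the four reboots of each build, copied from above.
WITH_SAWG = [
    [9, 48, 42, 52, 32, 38, 7, 7],
    [9, 48, 40, 51, 35, 36, 2, 7],
    [10, 46, 43, 50, 37, 44, 5, 8],
    [10, 50, 40, 51, 36, 39, 7, 7],
]
WITHOUT_SAWG = [
    [40, 80, 73, 86, 66, 75, 31, 41],
    [40, 78, 72, 86, 64, 73, 33, 41],
    [38, 78, 72, 86, 67, 75, 33, 40],
    [40, 83, 78, 88, 72, 78, 37, 43],
]


def column_means(rows):
    """Average each position across all reboots."""
    return [sum(col) / len(col) for col in zip(*rows)]


def offsets(a, b):
    """Per-position difference of the mean values, b minus a."""
    return [y - x for x, y in zip(column_means(a), column_means(b))]


if __name__ == "__main__":
    for position, delta in enumerate(offsets(WITH_SAWG, WITHOUT_SAWG)):
        print("position {}: {:+.2f} taps".format(position, delta))
```

On these eight runs every position shifts in the same direction, by roughly 28 to 36 taps, so the two builds differ by a fairly uniform offset rather than by scatter in a few positions.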

jbqubit commented 6 years ago

It's challenging to build a memory controller using Xilinx tools. Xilinx's own DDR3 core has a long list of bugs (and fixes). https://www.xilinx.com/support/answers/69036.html

One route that may resolve this issue is to let Xilinx handle these sorts of subtleties. @sbourdeauducq Have you tried using the Xilinx DDR3 IP? https://www.xilinx.com/products/intellectual-property/ddr3.html

jbqubit commented 6 years ago

Please describe what the MiSoC tests do and how to interpret the test results.

@sbourdeauducq said in this thread:

Sayma (and Ultrascale) is a trash fire, and the only way to make any progress is to cling to whatever still works...

Whatever the path of least resistance is. If GPIO bit-banging works around the problem, it's easier than tracking down the original problem which is likely a Vivado bug or something equally obscure.

Among its many flaws, Vivado regularly spits out loads of spurious warnings, including when compiling Xilinx's own cores. The problem is not specific to Vivado; ISE and Altera/Intel Quartus also suffer from it.

But I suspect this is due to the non-determinism of Vivado compilation, or to plain Vivado bugs. In the first case, this is normally solvable by adding appropriate timing constraints. In the second case, considering how Xilinx technical support has been degrading for the past years, the first option is basically to apply somewhat random non-functional changes to the code, as Xilinx engineers certainly do, and hope that ça tombe en marche (roughly, "it happens to start working"), or try various Vivado synthesis options. (Xilinx's answer to the bug invasion is pretty much the usual)

ACK that complex systems involve bugs. Xilinx undoubtedly has its share. However, your approach of "jiggling the box" in the hope that a functional .bit emerges is not productive in this case nor a good long term strategy.

In the case of non-determinism, Xilinx explicitly requests users to bring to their attention non-deterministic behavior of their tool chain. They also give a lot of advice on how to diagnose the situation. Have you tried finding a minimal, reproducible test case and engaging with Xilinx?

If the .bit files are suspected to differ, a starting point is to compare run-to-run hashes of the .bit files generated by Vivado.

Have you tried turning off write-leveling?

Have you tried reducing the DDR3 clock speed?

Have you considered other variables outside of .bit itself that could cause MiSoC test variation?
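
The run-to-run hash comparison suggested above could look like the sketch below, which groups files by SHA-256 digest. The function names and the example paths are hypothetical, and nothing here is specific to Vivado.

```python
import hashlib
import sys


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of the file at `path`."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in chunks so large bitstreams are not loaded at once.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def group_by_hash(paths):
    """Map each digest to the list of files that share it."""
    groups = {}
    for path in paths:
        groups.setdefault(sha256_of(path), []).append(path)
    return groups


if __name__ == "__main__":
    # Usage (paths are placeholders):
    #   python compare_bits.py run1/top.bit run2/top.bit
    for digest, files in group_by_hash(sys.argv[1:]).items():
        print(digest, *files)
```

Identical digests show that two runs produced byte-identical output. Differing digests are weaker evidence: a .bit file may embed build metadata such as a timestamp in its header, so a mismatch alone does not prove that the placed-and-routed logic differs.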

sbourdeauducq commented 6 years ago

Have you tried turning off write-leveling?

Yes. I even went one step further and turned off the DDR3 completely: problem solved!

enjoy-digital commented 6 years ago

I need to have a closer look at that. I'll try to look at that this evening.

hartytp commented 6 years ago

The read/write delays should be the same for both gateware versions, right (w/wo SAWG)? As a temporary hack, would it make sense to just hard code the numbers to the values found without the SAWG?

sbourdeauducq commented 6 years ago

The read/write delays should be the same for both gateware versions, right (w/wo SAWG)?

In theory, yes.

As a temporary hack, would it make sense to just hard code the numbers to the values found without the SAWG?

Can try, but it's a shot in the dark - the root cause is not necessarily a failure of the timing calibration program (I think it's even unlikely to be the problem).

hartytp commented 6 years ago

Okay.

hartytp commented 6 years ago

@sbourdeauducq what's the plan for this? I'm itching to try Sayma with the SAWG and AFAICT this is currently the only thing stopping me.

sbourdeauducq commented 6 years ago

I don't know what is going on. Did you try those binaries on your board? They work on mine: https://github.com/m-labs/artiq/issues/908#issuecomment-364132304 Maybe try with the other DDR bank (the 32-bit one) or try using fewer bits on the main one (simply removing DQ/DQS/DM pins in Migen does it).

sbourdeauducq commented 6 years ago

It could be related to this: https://github.com/m-labs/misoc/issues/75