Closed jbqubit closed 6 years ago
Possibly caused by 7429ee4fb63316b05da07407d6802670ebdb80fd?
The typical valid read region seems to be ~170 LSB on my board, so I don't think that commit (increasing the initial step from 8 LSB to 16 LSB) caused this.
This also looks different from the problem that commit solved for me, where the size of the read window was always the size of the initial step. Here the gaps vary from 16 to 20.
What Vivado version? We use 2017.4.
I'm using 2016.2. Will upgrade and try again.
I'm using 2017.4 and I also got this issue, though with a build from 25.01. I'm currently building against 0edc34a and will update when it finishes.
MiSoC Bootloader
Copyright (c) 2017 M-Labs Limited
Bootloader CRC passed
Initializing SDRAM...
Write leveling: 58 89 78 88 56 85 47 48 done
Read delays: 7:00-19 6:07-23 5:53-74 4:60-76 3:111-132 2:117-133 1:125-141 0:133-151 done
SDRAM initialized
Memory test failed (522120/1114624 words incorrect)
Halting.
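For anyone comparing logs, the read windows discussed above can be pulled out of the `Read delays:` line with a few lines of Python. This is only a sketch: it assumes each `M:first-last` entry means module M, first valid delay tap, last valid delay tap, which is how the lines in this thread appear to read but is not confirmed here. The function name `parse_read_delays` is made up for illustration.

```python
import re

def parse_read_delays(line):
    """Map module index -> (first_tap, last_tap, width) for one 'Read delays:' line."""
    windows = {}
    # Assumed entry format: "<module>:<first>-<last>"
    for module, first, last in re.findall(r"(\d+):(\d+)-(\d+)", line):
        windows[int(module)] = (int(first), int(last), int(last) - int(first))
    return windows

# Log lines copied from this thread (one failing boot, one passing boot).
failing = ("Read delays: 7:00-19 6:07-23 5:53-74 4:60-76 "
           "3:111-132 2:117-133 1:125-141 0:133-151 done")
passing = ("Read delays: 7:00-164 6:04-175 5:49-216 4:61-229 "
           "3:98-244 2:97-250 1:122-273 0:125-278 done")

failing_widths = [w for _, _, w in parse_read_delays(failing).values()]
passing_widths = [w for _, _, w in parse_read_delays(passing).values()]
print("failing boot widths:", min(failing_widths), "to", max(failing_widths))
print("passing boot widths:", min(passing_widths), "to", max(passing_widths))
```

A truncated log line would be parsed up to the point where it was cut off, so check that the line ends in `done` before trusting the numbers.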
Here is everything I built from the current master (with RTM bridge, RTIO and other things disabled to save compilation and RTM yak-shaving time): http://dl.free.fr/lAFdh3oQV With those binaries, I verified that SDRAM works fine on both Florent's board and Sayma-1. Can you try those binaries on your boards? @marmeladapk You can use the Ethernet TX clock phase adjustment script I posted in the RGMII issue on those binaries. @marmeladapk If the problem persists, can you use the Sayma-2 that I shipped to you to debug Ethernet, since I didn't have SDRAM problems on that one?
thanks @sbourdeauducq. I'll look at that. The read leveling procedure is probably still not robust enough.
@sbourdeauducq I loaded it to check if memory tests are passed:
Bootloader CRC passed
Initializing SDRAM...
Write leveling: 95 130 118 133 96 125 87 88 done
Read delays: 7:37-249 6:61-267 5:110-316 4:131-336 3:167-369 2:173-377 1:195-40e
SDRAM initialized
Memory test passed
Booting from flash...
Starting firmware.
[ 0.000005s] INFO(runtime): ARTIQ runtime starting...
[ 0.003864s] INFO(runtime): software version 4.0.dev+516.g0edc34a9
[ 0.010126s] INFO(runtime): gateware version 4.0.dev+516.g0edc34a9.dirty
[ 0.016910s] INFO(runtime): log level set to INFO by default
[ 0.022630s] INFO(runtime): UART log level set to INFO by default
[ 0.028790s] INFO(runtime): press 'e' to erase startup and idle kernels...
[ 1.028006s] INFO(runtime): continuing boot
[ 1.030975s] WARN(runtime): using default MAC address 02-00-00-00-76-01; ct
[ 1.039568s] INFO(runtime): using default IP address 192.168.1.60
[ 1.054501s] INFO(runtime::session): accepting network sessions
[ 1.059438s] INFO(runtime::session): running startup kernel
[ 1.064959s] INFO(runtime::session): no startup kernel found
[ 1.070665s] INFO(runtime::session): no connection, starting idle kernel
[ 1.077527s] INFO(runtime::session): no idle kernel found
[ 1.084122s] INFO(runtime::mgmt): management interface active
[ 6.274350s] WARN(runtime): ethernet mac: rx preamble errors: 2
[ 7.357698s] WARN(runtime): ethernet mac: rx preamble errors: 3
[ 19.820658s] WARN(runtime): ethernet mac: rx preamble errors: 4
[ 20.128752s] WARN(runtime): ethernet mac: rx preamble errors: 5
[ 20.888642s] WARN(runtime): ethernet mac: rx preamble errors: 6
So it works. I'll try the script you mentioned later.
Good. You do however seem to get a large number of Ethernet RX corrupted packets (preamble errors). Is the PHY correctly set in RGMII mode? Does this happen for every packet? You can change the RX phase as well by using this script command instead: set_property CLKOUT0_PHASE <phase> [get_cells crg_ethrx_mmcm]
@sbourdeauducq Should I change it in xdc in artiq/artiq_sayma/gateware/top.xdc and rebuild? Will latest artiq pass memory tests?
No. Please follow the instructions in my comment: https://github.com/m-labs/artiq/issues/854#issuecomment-360497764 - you just save the script as edit_pll.tcl and run the mentioned vivado command.
There is no bitstream rebuilding and it is a rather quick process. Nothing in the design other than the PLL phase will be changed, the routes etc. will be exactly as before, so yes memory test should be unaffected.
I also see the problem with the default build (including SAWG) on ARTIQ 4c22d64ee438d8b65ba728829794698191719181, migen e554f072905ceeb27c9c179c8c7b785acd1676bc, misoc cb8e314c7515eade46f5bcde4e48903d7ec92490
Initializing SDRAM...
Write leveling: 43 66 49 68 35 56 34 25 done
Read delays: 7:00-121 6:00-141 5:39-55 4:44-60 3:67-85 2:76-92 1:105-121 0:113-129 done
SDRAM initialized
Memory test failed (356593/1114624 words incorrect)
When disabling the SAWG (--without-sawg), the system boots correctly.
@enjoy-digital Can you move forward with JESD SC1 by disabling SAWG (which you want to do anyway to reduce compilation time)? I cannot reproduce the "no output on UART" bug.
@sbourdeauducq: yes i'll continue on Monday.
@hartytp Are you looking into this?
This definitely worked when I did the SAWG test and posted the scope screenshot. So it should be possible to isolate what code change exactly caused this problem, maybe with the help of tools like git-bisect.
But I suspect this is due to the non-determinism of Vivado compilation, or to plain Vivado bugs. In the first case, this is normally solvable by adding appropriate timing constraints. In the second case, considering how Xilinx technical support has been degrading over the past years, the first option is basically to apply somewhat random non-functional changes to the code, as Xilinx engineers certainly do, and hope that it happens to start working ("ça tombe en marche"), or try various Vivado synthesis options. (Xilinx's answer to the bug invasion is pretty much the usual)
@whitequark's addition of RTM loading gateware is a good suspect for the triggering of this kind of Vivado misbehavior.
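If someone does go the git-bisect route suggested above, the pass/fail decision can be automated from the captured UART log. The sketch below is an assumption-laden illustration, not an existing ARTIQ tool: `classify_boot_log` is a made-up name, and building, flashing and capturing the log are left out entirely. The only externally defined part is the exit-code convention of `git bisect run` (0 = good, 125 = skip, other codes from 1 to 127 = bad).

```python
GOOD, BAD, SKIP = 0, 1, 125  # exit codes that `git bisect run` interprets

def classify_boot_log(text):
    """Turn a captured bootloader log into a `git bisect run` exit code."""
    if "Memory test passed" in text:
        return GOOD
    if "Memory test failed" in text:
        return BAD
    # Build broke, flashing failed, or the board never reached the test:
    # ask git bisect to skip this commit rather than blame it.
    return SKIP

# Sample fragments in the format of the logs in this thread.
passed_log = "SDRAM initialized\nMemory test passed\nBooting from flash...\n"
failed_log = ("SDRAM initialized\n"
              "Memory test failed (522120/1114624 words incorrect)\nHalting.\n")
print(classify_boot_log(passed_log), classify_boot_log(failed_log),
      classify_boot_log(""))
```

A wrapper script would pass the returned value to `sys.exit()`. Since boots of the same bitstream may not all behave the same way, it is probably worth classifying several boots per commit before calling it good.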
> @hartytp Are you looking into this?
I wasn't planning to, no. In general, I'm trying to prioritize things like the HMC830 on Sayma, which seem to be (at least in part) hardware issues. In contrast, the mem test thing is just firmware/gateware, isn't it? As such, it seemed like the standard yak shaving required to get a new board up and running, and not something particular to Sayma. So, I figured that you guys were probably best placed to look into it.
I have a busy week lined up this week, but I might have some time to look into it.
Side note: we've had Sayma for quite a while now, but the ARTIQ tool chain still feels quite hacked and fragile. It would be great to get to the point where Artiq flash can do the RTM as well, the package includes the correct version of JESD204B, etc.
Anyway to be clear, in case I do find time to look into this, your plan is basically to dig through the git history, building various versions of Sayma gateware/firmware with SAWG (at a few hours per build) until we find the point where it stopped working? IIRC, that's a bit complicated by the fact that the tools to build Sayma have changed a bit over time, so it's not always the same instructions to build/flash it, and by the fact that the package doesn't include the right version of JESD204B (also misoc/migen?), so one needs to track the history of several projects to make sure that each build uses the correct version of each. Doesn't sound like fun.
> Doesn't sound like fun.
Yep, standard fare. Anyway, the first thing I'd try is removing the RTM loading gateware. Another thing that makes the SDRAM work is removing a lot of peripherals using the patch I posted elsewhere, so there would not be such versioning issues. Just the long Vivado compilation times.
Well, as I said, this seems like standard yak shaving for getting a board up and running, rather than a particular hardware/design issue with Sayma. So, do you mind taking a look at it first? It's likely to be quicker for you since you've probably kept a closer eye on the changes that have been made to ARTIQ over the past weeks.
That's what I thought - the patch below works around the problem.
diff --git a/artiq/gateware/targets/sayma_amc.py b/artiq/gateware/targets/sayma_amc.py
index c45f8d37a..f6c5b95f6 100755
--- a/artiq/gateware/targets/sayma_amc.py
+++ b/artiq/gateware/targets/sayma_amc.py
@@ -160,9 +160,9 @@ class Standalone(MiniSoC, AMPSoC):
         ]
 
         # RTM bitstream upload
-        rtm_fpga_cfg = platform.request("rtm_fpga_cfg")
-        self.submodules.rtm_fpga_cfg = SlaveFPGA(rtm_fpga_cfg)
-        self.csr_devices.append("rtm_fpga_cfg")
+        #rtm_fpga_cfg = platform.request("rtm_fpga_cfg")
+        #self.submodules.rtm_fpga_cfg = SlaveFPGA(rtm_fpga_cfg)
+        #self.csr_devices.append("rtm_fpga_cfg")
 
         # AMC/RTM serwb
         serwb_pll = serwb.phy.SERWBPLL(125e6, 625e6, vco_div=2)
@whitequark What about using GPIO and bit-banging instead? Hopefully the Vivado trash will behave then.
@sbourdeauducq With the latest commit (2d4a134), when I compile with python3 -m artiq.gateware.targets.sayma_amc --without-sawg I still get memory test failed.
Still? it always worked for me when using --without-sawg. Anyway for Ethernet debugging you can use the binaries.
And with sawg?
@sbourdeauducq
> Anyway for Ethernet debugging you can use the binaries.
I wanted to insert probes.
> And with sawg?
Will check now.
> I wanted to insert probes.
For debugging RX? The Ethernet core actually doesn't need SDRAM and will continue to receive frames even with a dead CPU system. Or you can try with a super-minimal design that will also have the advantage of reducing the compilation time. Some people don't like me saying that, but Sayma (and Ultrascale) is a trash fire, and the only way to make any progress is to cling to whatever still works...
@sbourdeauducq I still get this error with SAWG.
Those binaries are from ARTIQ 4.0.dev+521.g4c22d64e with the RTM loading gateware commented out. I tested that SDRAM works on the board when flashing them (and then booting from flash). http://dl.free.fr/mffzh7lVw
@sbourdeauducq Since, based on this failure, the bug is clearly not caused by my gateware, I'm not going to waste time rewriting it in some other way.
Whatever the path of least resistance is. If GPIO bit-banging works around the problem, it's easier than tracking down the original problem which is likely a Vivado bug or something equally obscure.
@sbourdeauducq With your binaries I still get memory test error.
@sbourdeauducq Was going to have a look at synchronisation. Built the binaries with the latest ARTIQ, but still see
 __  __ _ ____         ____
|  \/  (_) ___|  ___  / ___|
| |\/| | \___ \ / _ \| |
| |  | | |___) | (_) | |___
|_|  |_|_|____/ \___/ \____|
MiSoC Bootloader
Copyright (c) 2017 M-Labs Limited
Bootloader CRC passed
Initializing SDRAM...
Write leveling: 25 62 57 74 52 61 14 26 done
Read delays: 7:00-113 6:00-112 5:20-39 4:27-43 3:59-76 2:46-62 1:86-103 0:96-112 done
SDRAM initialized
Memory test failed (571059/1114624 words incorrect)
Halting.
Traceback (most recent call last):
Edit: that's not using the --without-sawg argument.
# packages in environment at /home/ion/anaconda3/envs/artiq-sayma:
#
aiohttp 2.3.9 py35_0
alabaster 0.7.10 py35h6fb19ab_0
artiq-dev 4.0.dev py_540+git2adba3ed m-labs/label/dev
async-timeout 2.0.0 py35h12a94dc_0
asyncserial 0.1 py_13+git340e430 m-labs/label/main
babel 2.5.3 py35_0
binutils-or1k-linux 2.27 5 m-labs/label/main
bscan-spi-bitstreams 0.10.0 2 m-labs/label/main
ca-certificates 2017.08.26 h1d4fec5_0
certifi 2018.1.18 py35_0
cffi 1.11.4 py35h9745a5d_0
chardet 3.0.4 py35hb6e9ddf_1
colorama 0.3.9 py35h81e2b6c_0
coverage 4.4.2 py35h8fc71f1_0
dbus 1.12.2 hc3f9b76_1
docutils 0.14 py35hd11081d_0
expat 2.2.5 he0dffb1_0
fontconfig 2.12.4 h88586e7_1
freetype 2.8 hab7d2ae_1
glib 2.53.6 h5d9569c_2
gst-plugins-base 1.12.4 h33fb286_0
gstreamer 1.12.4 hb53b477_0
h5py 2.7.1 py35h8d53cdc_0
hdf5 1.10.1 h9caa474_1
icu 58.2 h9c2bf20_1
imagesize 0.7.1 py35hf008fae_0
intel-openmp 2018.0.0 hc7b2577_8
jinja2 2.10 py35h480ab6d_0
jpeg 9b h024ee3a_2
levenshtein 0.12.0 py35_1 m-labs/label/main
libedit 3.1 heed3624_0
libffi 3.2.1 hd88cf55_4
libgcc-ng 7.2.0 h7cc24e2_2
libgfortran-ng 7.2.0 h9f7466a_2
libgit2 0.24.1 7 m-labs/label/main
libpng 1.6.34 hb9fc6fc_0
libssh2 1.7.0.git 5 m-labs/label/main
libstdcxx-ng 7.2.0 h7a57d05_2
libusb 1.0.20 0 m-labs/label/main
libxcb 1.12 hcd93eb1_4
libxml2 2.9.7 h26e45fe_0
lit 0.4.1 py_9 m-labs/label/main
llvm-or1k 4.0.1 23 m-labs/label/main
llvmlite-artiq 0.20.0 py35_1 m-labs/label/main
markupsafe 1.0 py35h4f4fcf6_1
microscope 1.1 py_1 m-labs/label/main
migen 0.7 py35_2+git40721b2 m-labs/label/dev
misoc 0.9 py35_3+git684b519a m-labs/label/dev
mkl 2018.0.1 h19d6760_4
msgpack-python 0.5.1 py35h6bb024c_0
multidict 3.3.2 py35he92878e_0
ncurses 6.0 h9df7e31_2
numpy 1.14.0 py35h3dfced4_1
openocd 0.10.0 4 m-labs/label/main
openssl 1.0.2n hb7f436b_0
outputcheck 0.4.2 py_7 m-labs/label/main
pcre 8.41 hc27e229_1
pip 9.0.1 py35h7e7da9d_4
prettytable 0.7.2 py35_1 conda-forge/label/main
pycparser 2.18 py35h61b3040_1
pygit2 0.24.0 py35_4 m-labs/label/main
pygments 2.2.0 py35h0f41973_0
pyqt 5.6.0 py35h0e41ada_5
pyqtgraph 0.10.0 py35_0
pyserial 3.4 py35h84edd1e_0
python 3.5.4 h417fded_24
python-dateutil 2.6.1 py35h90d5b31_1
pythonparser 1.1 py_8 m-labs/label/main
pytz 2017.3 py35hb13c558_0
qt 5.6.2 h974d657_12
quamash 0.5.5 py_4 m-labs/label/main
readline 7.0 ha6073c6_4
regex 2015.11.22 py35_1 m-labs/label/main
rust-core-or1k 1.23.0 19 m-labs/label/main
rustc 1.23.0 18 m-labs/label/main
scipy 1.0.0 py35hcbbe4a2_0
setuptools 33.1.1 py35_0 conda-forge/label/main
sip 4.18.1 py35h9eaea60_2
six 1.11.0 py35h423b573_1
snowballstemmer 1.2.1 py35h5435977_0
sphinx 1.4.8 py35_0
sphinx-argparse 0.1.13 py_4 m-labs/label/main
sphinx_rtd_theme 0.2.4 py35_0
sphinxcontrib-wavedrom 1.1.0 py_1 m-labs/label/main
sphinxcontrib-wavedrom 1.1.0
system 5.8 2
tk 8.6.7 hc745277_3
wheel 0.30.0 py35hd3883cf_1
xz 5.2.3 h55aa19d_2
yarl 0.14.2 py35h31c3c03_0
zlib 1.2.11 ha838bed_2
No memtest issues without the SAWG, but then I can't test synchronization. @sbourdeauducq thoughts?
 __  __ _ ____         ____
|  \/  (_) ___|  ___  / ___|
| |\/| | \___ \ / _ \| |
| |  | | |___) | (_) | |___
|_|  |_|_|____/ \___/ \____|
MiSoC Bootloader
Copyright (c) 2017 M-Labs Limited
Bootloader CRC passed
Initializing SDRAM...
Write leveling: 42 81 77 89 71 78 35 45 done
Read delays: 7:00-164 6:04-175 5:49-216 4:61-229 3:98-244 2:97-250 1:122-273 0:125-278 done
SDRAM initialized
Memory test passed
Booting from flash...
Starting firmware.
[ 0.000005s] INFO(runtime): ARTIQ runtime starting...
[ 0.003866s] INFO(runtime): software version 4.0.dev+540.g2adba3ed
[ 0.010131s] INFO(runtime): gateware version 4.0.dev+540.g2adba3ed
[ 0.016392s] INFO(runtime): log level set to INFO by default
[ 0.022115s] INFO(runtime): UART log level set to INFO by default
[ 0.028266s] INFO(board_artiq::serwb): waiting for AMC/RTM serwb bridge to be ready...
[ 0.707875s] WARN(board_artiq::serwb): AMC/RTM serwb bridge initialization failed, retrying.
[ 1.386725s] WARN(board_artiq::serwb): AMC/RTM serwb bridge initialization failed, retrying.
[ 2.065574s] WARN(board_artiq::serwb): AMC/RTM serwb bridge initialization failed, retrying.
I do see the usual splodges of red in the terminal during the build, like:
CRITICAL WARNING: [Timing 38-322] The clock arriving at pin ISERDESE3_47/CLK must have the same master clock as the clock arriving at pin ISERDESE3_47/CLKDIV, and the latter can only be phase shifted by 0/90/180/270 degrees. Any auto-derived clock on pin ISERDESE3_47/INTERNAL_DIVCLK will be created with 0 phase. [/home/ion/scratch/artiq/artiq_sayma/standalone/gateware/top.xdc:870]
I assume that none of this is an issue. How do you normally process Vivado's output to separate interesting warnings from the usual noise?
> I assume that none of this is an issue. How do you normally process Vivado's output to separate interesting warnings from the usual noise?
I don't have a solution for this nor the SDRAM problem. Among its many flaws, Vivado regularly spits out loads of spurious warnings, including when compiling Xilinx's own cores. The problem is not specific to Vivado; ISE and Altera/Intel Quartus also suffer from it.
Without SAWG, you still have JESD204, but only generating ramps.
Are the software-programmed delays the same for the same board when the SDRAM works and when it does not?
See my posts above.
With SAWG:
Without SAWG:
So, no, not the same.
> Without SAWG, you still have JESD204, but only generating ramps.
Okay, will look at that next.
Rebuilt the gateware with SAWG and rebooted a few times looking at the read delays...
Write leveling: 9 48 42 52 32 38 7 7 done
Read delays: 7:00-71 6:00-78 5:01-32 4:05-22 3:31-47 2:31-51 1:48-64 0:59-80 done
Write leveling: 9 48 40 51 35 36 2 7 done
Read delays: 7:00-70 6:00-84 5:04-20 4:11-43 3:29-45 2:35-51 1:47-67 0:64-83 done
Write leveling: 10 46 43 50 37 44 5 8 done
Read delays: 7:00-73 6:00-76 5:04-20 4:06-23 3:24-41 2:33-51 1:40-56 0:50-67 done
Write leveling: 10 50 40 51 36 39 7 7 done
Read delays: 7:00-52 6:00-66 5:04-20 4:04-20 3:30-48 2:33-52 1:43-60 0:44-60 done
The previous time I built it (posted above) I got:
Write leveling: 25 62 57 74 52 61 14 26 done
Read delays: 7:00-113 6:00-112 5:20-39 4:27-43 3:59-76 2:46-62 1:86-103 0:96-112 done
Not sure if that's interesting data.
The "write leveling" figures are also delays, and we don't know if it's the reads or writes that are failing.
Updated.
> The "write leveling" figures are also delays, and we don't know if it's the reads or writes that are failing.
Worth adding some more diagnostic info?
And, for comparison, rebuilt without SAWG and saw(g):
Write leveling: 40 80 73 86 66 75 31 41 done
Read delays: 7:00-150 6:02-168 5:45-210 4:57-225 3:91-237 2:86-242 1:112-262 0:119-270 done
Write leveling: 40 78 72 86 64 73 33 41 done
Read delays: 7:00-152 6:00-170 5:39-202 4:53-222 3:87-234 2:90-242 1:114-260 0:115-270 done
Write leveling: 38 78 72 86 67 75 33 40 done
Read delays: 7:00-155 6:02-167 5:39-206 4:58-224 3:92-233 2:88-239 1:115-261 0:118-270 done
Write leveling: 40 83 78 88 72 78 37 43 done
Read delays: 7:00-161 6:02-177 5:48-219 4:58-232 3:98-247 2:95-252 1:119-276 0:131-281 done
The previous time I built it (posted above) I got:
Write leveling: 42 81 77 89 71 78 35 45 done
Read delays: 7:00-164 6:04-175 5:49-216 4:61-229 3:98-244 2:97-250 1:122-273 0:125-278
Not sure if that's interesting data.
So, does seem to be a reasonable difference between the two.
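To put a number on that difference, the write-leveling lines from the reboots above can be averaged per position. This is a rough sketch: `parse_write_leveling` and `column_means` are illustrative helper names, and which position in the line corresponds to which DDR3 byte lane is not something established in this thread.

```python
def parse_write_leveling(line):
    """Return the per-position delays from one 'Write leveling:' log line."""
    return [int(tok) for tok in line.split() if tok.isdigit()]

def column_means(rows):
    """Average each position over several boots."""
    return [sum(col) / len(col) for col in zip(*rows)]

# Lines copied from the reboots listed in this comment.
with_sawg = [parse_write_leveling(l) for l in (
    "Write leveling: 9 48 42 52 32 38 7 7 done",
    "Write leveling: 9 48 40 51 35 36 2 7 done",
    "Write leveling: 10 46 43 50 37 44 5 8 done",
    "Write leveling: 10 50 40 51 36 39 7 7 done",
)]
without_sawg = [parse_write_leveling(l) for l in (
    "Write leveling: 40 80 73 86 66 75 31 41 done",
    "Write leveling: 40 78 72 86 64 73 33 41 done",
    "Write leveling: 38 78 72 86 67 75 33 40 done",
    "Write leveling: 40 83 78 88 72 78 37 43 done",
)]

offsets = [b - a for a, b in zip(column_means(with_sawg),
                                 column_means(without_sawg))]
print(offsets)
```

On these eight lines every position moves by roughly 28 to 36 taps between the two builds. Whether that near-uniform shift points at the calibration or at something upstream of it is not something this calculation can tell.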
It's challenging to build a memory controller using Xilinx tools. Xilinx's own DDR3 core has a long list of bugs (and fixes). https://www.xilinx.com/support/answers/69036.html
One route that may resolve this issue is to let Xilinx handle this sort of subtlety. @sbourdeauducq Have you tried using the Xilinx DDR3 IP? https://www.xilinx.com/products/intellectual-property/ddr3.html
Please describe what the MiSoC tests do and how to interpret the test results.
@sbourdeauducq said in this thread:
> Sayma (and Ultrascale) is a trash fire, and the only way to make any progress is to cling to whatever still works...
> Whatever the path of least resistance is. If GPIO bit-banging works around the problem, it's easier than tracking down the original problem which is likely a Vivado bug or something equally obscure.
> Among its many flaws, Vivado regularly spits out loads of spurious warnings, including when compiling Xilinx's own cores. The problem is not specific to Vivado; ISE and Altera/Intel Quartus also suffer from it.
> But I suspect this is due to the non-determinism of Vivado compilation, or to plain Vivado bugs. In the first case, this is normally solvable by adding appropriate timing constraints. In the second case, considering how Xilinx technical support has been degrading over the past years, the first option is basically to apply somewhat random non-functional changes to the code, as Xilinx engineers certainly do, and hope that it happens to start working ("ça tombe en marche"), or try various Vivado synthesis options. (Xilinx's answer to the bug invasion is pretty much the usual)
ACK that complex systems involve bugs. Xilinx undoubtedly has its share. However, your approach of "jiggling the box" in the hope that a functional .bit emerges is neither productive in this case nor a good long-term strategy.
In the case of non-determinism, Xilinx explicitly asks users to bring non-deterministic behavior of its tool chain to its attention. It also gives a lot of advice on how to diagnose the situation. Have you tried finding a minimal, reproducible test case and engaging with Xilinx?
If the .bit files are suspected to differ, a starting point is to compare run-to-run hashes of the .bit files generated by Vivado.
Have you tried turning off write-leveling?
Have you tried reducing the DDR3 clock speed?
Have you considered other variables outside of .bit itself that could cause MiSoC test variation?
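The run-to-run hash comparison suggested above could be sketched as follows. This is a generic file-hashing example, not part of ARTIQ or Vivado; `sha256_of` and `group_by_hash` are made-up names, and the temporary files only stand in for real bitstreams. One caveat, stated as an assumption: if the .bit header embeds a build date and time, whole-file hashes may differ even when the configuration payload is identical, in which case only the payload should be hashed.

```python
import hashlib
import tempfile
from pathlib import Path

def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large bitstreams need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def group_by_hash(paths):
    """Map digest -> list of files with that digest."""
    groups = {}
    for p in paths:
        groups.setdefault(sha256_of(p), []).append(str(p))
    return groups

# Demo with stand-in files; real use would pass the top.bit of each run.
with tempfile.TemporaryDirectory() as d:
    run1, run2, run3 = (Path(d, name) for name in
                        ("run1.bit", "run2.bit", "run3.bit"))
    run1.write_bytes(b"identical payload")
    run2.write_bytes(b"identical payload")
    run3.write_bytes(b"different payload")
    groups = group_by_hash([run1, run2, run3])

print(len(groups), "distinct bitstreams out of 3 runs")
```

More than one group for the same source tree and tool version would be the kind of evidence worth attaching to a report about non-deterministic builds.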
> Have you tried turning off write-leveling?
Yes. I even went one step further and turned off the DDR3 completely: problem solved!
I need to have a closer look at that. I'll try to look at that this evening.
The read/write delays should be the same for both gateware versions, right (w/wo SAWG)? As a temporary hack, would it make sense to just hard code the numbers to the values found without the SAWG?
> The read/write delays should be the same for both gateware versions, right (w/wo SAWG)?
In theory, yes.
> As a temporary hack, would it make sense to just hard code the numbers to the values found without the SAWG?
Can try, but it's a shot in the dark - the root cause is not necessarily a failure of the timing calibration program (I think it's even unlikely to be the problem).
Okay.
@sbourdeauducq what's the plan for this? I'm itching to try Sayma with the SAWG and AFAICT this is currently the only thing stopping me.
I don't know what is going on. Did you try those binaries on your board? They work on mine: https://github.com/m-labs/artiq/issues/908#issuecomment-364132304 Maybe try with the other DDR bank (the 32-bit one) or try using fewer bits on the main one (simply removing DQ/DQS/DM pins in Migen does it).
It could be related to this: https://github.com/m-labs/misoc/issues/75
Building .bit from source using
I see...