Closed jbqubit closed 6 years ago
@hartytp, @sbourdeauducq: I still need to have a look. You can do what @sbourdeauducq is suggesting. I'm also wondering whether we have a problem with write_leveling. I remember that during my tests, commenting out write_leveling made things work; you can give that a try until I have a look: https://github.com/m-labs/artiq/blob/7429ee4fb63316b05da07407d6802670ebdb80fd/artiq/firmware/libboard/sdram.rs#L248
I don't see how it can work without write leveling (unless only one or a few chips are used), there is definitely skew that needs to be compensated for.
We do seem to have a problem with IOSERDES clocking though, which produces the warning and the error in the timing report, and is a plausible explanation for those symptoms. I don't understand why, though: CLK is sys4x and CLKDIV is sys, which are generated by the same PLL and without phase offset. Or am I missing something?
@sbourdeauducq: I think Vivado generates this warning because we are applying a constraint on clocks at the output of the PLL. Vivado then no longer propagates constraints from the input of the PLL to the outputs (i.e. the fact that sys4x and sys are generated from the same PLL) and generates this warning.
Did you try those binaries on your board? They work on mine: #908 (comment)
Not yet, but happy to do that next week. Remind me what the difference between them and the ones I'm generating from the latest ARTIQ commit is?
Still, I'd rather not get into a situation where I'm reliant on asking for magic binaries to be able to do anything.
Maybe try with the other DDR bank (the 32-bit one) or try using fewer bits on the main one (simply removing DQ/DQS/DM pins in Migen does it).
Do you have a patch to do that? I'm not up to speed on that part of the code.
In any case, we should try to get both banks working fully before moving on to Sayma v2.0 in case we decide that hardware changes are needed.
> We do seem to have a problem with IOSERDES clocking though, which produces the warning and the error in the timing report, and is a plausible explanation for those symptoms. I don't understand why, though: CLK is sys4x and CLKDIV is sys, which are generated by the same PLL and without phase offset. Or am I missing something?
Yes, fixing the timing warnings/errors sounds like a good starting point.
Confirmed that this is the source of the warning. Thanks. https://github.com/m-labs/misoc/commit/d1489edfb8ebba57320fae9157c44aa0e1c1b783 There is a similar issue with serwb.
@hartytp: if you want to use the 32-bit SDRAM, use `sdram="ddram_32"` here: https://github.com/m-labs/misoc/blob/master/misoc/targets/sayma_amc.py#L119
@enjoy-digital thanks, I'll try that next week. I'd still like to see a proper fix of this at some point though.
Can we now program the RTM FPGA from the AMC via artiq_flash?
No, and this is unrelated to the SDRAM. Edit: Okay, I see why you're asking - it's because of the commit I reverted. It only seems to be "fixing" the problem on the HKG board, so it doesn't make much sense to keep that core disabled.
> No, and this is unrelated to the SDRAM.
Sorry, this isn't the right place to ask about this.
> Edit: Okay, I see why you're asking - it's because of the commit I reverted. It only seems to be "fixing" the problem on the HKG board, so it doesn't make much sense to keep that core disabled.
Right. Anyway, I'd lost track of how far you'd got on RTM programming, and wasn't sure how much that core did.
For what it's worth, SDRAM is now working on the HKG board with SAWG, using the current ARTIQ master (41adbef9) which includes the RTM loading core and the clock constraint fix in MiSoC. It used not to work with recent gateware when the SAWG and the loading core were both present. But since this is so random, we cannot conclude yet that the clock constraint fix solved it.
And, on the other hand, serwb is now broken. Again this doesn't make any sense, since Vivado itself derives the constraints I removed:
Clock                                               Waveform(ns)     Period(ns)  Frequency(MHz)
clk50                                               {0.000 10.000}   20.000      50.000
standalone_pll_clk200                               {0.000 2.500}    5.000       200.000
standalone_pll_eth_txclk                            {2.000 6.000}    8.000       125.000
standalone_pll_fb                                   {0.000 10.000}   20.000      50.000
standalone_pll_sys                                  {0.000 4.000}    8.000       125.000
serwb_pll_pll_fb                                    {0.000 8.000}    16.000      62.500
serwb_pll_pll_serwb_serdes_20x_clk                  {0.000 1.600}    3.200       312.500
serwb_pll_pll_serwb_serdes_20x_clk_INTERNAL_DIVCLK  {0.000 6.400}    12.800      78.125
serwb_pll_pll_serwb_serdes_5x_clk                   {0.000 6.400}    12.800      78.125
serwb_pll_pll_serwb_serdes_clk                      {0.000 32.000}   64.000      15.625
standalone_pll_sys4x                                {0.000 1.000}    2.000       500.000
standalone_pll_sys4x_INTERNAL_DIVCLK                {0.000 4.000}    8.000       125.000
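The clock list above can be cross-checked mechanically. The following is a minimal sketch, assuming each row has the form `name {rise fall} period frequency` with times in ns and frequency in MHz; the function names are hypothetical and only a few rows are reproduced. It checks that each period agrees with its frequency, and that sys4x runs at four times sys with the same rising edge at 0 ns.

```python
import re

# A few rows copied from the clock list above:
# name {rise_ns fall_ns} period_ns frequency_MHz
REPORT = """\
standalone_pll_sys {0.000 4.000} 8.000 125.000
standalone_pll_sys4x {0.000 1.000} 2.000 500.000
serwb_pll_pll_serwb_serdes_20x_clk {0.000 1.600} 3.200 312.500
serwb_pll_pll_serwb_serdes_5x_clk {0.000 6.400} 12.800 78.125
serwb_pll_pll_serwb_serdes_clk {0.000 32.000} 64.000 15.625
"""

ROW = re.compile(r"^(\S+)\s+\{(\S+)\s+(\S+)\}\s+(\S+)\s+(\S+)$")


def parse_clocks(text):
    """Return {name: (rise_ns, fall_ns, period_ns, freq_mhz)}."""
    clocks = {}
    for line in text.splitlines():
        m = ROW.match(line.strip())
        if m:
            clocks[m.group(1)] = tuple(float(x) for x in m.group(2, 3, 4, 5))
    return clocks


def period_matches_frequency(period_ns, freq_mhz, tol=1e-3):
    """A period in ns multiplied by a frequency in MHz must give 1000."""
    return abs(period_ns * freq_mhz - 1000.0) < tol


clocks = parse_clocks(REPORT)
sys_clk = clocks["standalone_pll_sys"]
sys4x = clocks["standalone_pll_sys4x"]
ratio = sys_clk[2] / sys4x[2]  # period ratio sys / sys4x

for name, (rise, fall, period, freq) in clocks.items():
    print(name, "consistent" if period_matches_frequency(period, freq) else "MISMATCH")
print("sys/sys4x period ratio:", ratio)
```

This only confirms that the derived clocks are numerically what one would expect; it says nothing about how Vivado relates them for skew analysis.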
> And, on the other hand, serwb is now broken.
....but it works with SAWG disabled. Looks like a very similar issue to the one crippling the SDRAM.
Two things:
`standalone_pll_sys4x`. There are no timing violations other than TPWS.
Timing report without SAWG
And the pulse width check errors are on Max Skew of OSERDESE3/CLK wrt OSERDESE3/CLKDIV.
Thanks for digging into this. If you have a go at fixing the IOSERDES clocking, I'm happy to see if this resolves the issues on my board.
Done. Everything seems to be working now, both with and without SAWG. Please test on your boards. While reviewing clocking, I found several issues with serwb (see IRC log) that will have to be addressed as well.
Thanks! Will test on Monday.
@enjoy-digital Not sure what your current priorities are, but it would really help me if you could prioritise fixing the serwb issues that @sbourdeauducq highlighted this week.
> Everything seems to be working now,
Only on Florent's board. On Sayma-3, with the same binaries (and SAWG disabled), the SDRAM is still broken.
:( any idea why?
@sbourdeauducq did you see anything else suspicious in the timing report? Particularly, anything connected with the SDRAM?
hmmm....
The critical warnings are now gone from the compilation, but I still see:
__ __ _ ____ ____
| \/ (_) ___| ___ / ___|
| |\/| | \___ \ / _ \| |
| | | | |___) | (_) | |___
|_| |_|_|____/ \___/ \____|
MiSoC Bootloader
Copyright (c) 2017 M-Labs Limited
Bootloader CRC passed
Initializing SDRAM...
Write leveling: 28 65 62 74 51 59 20 25 done
Read delays: 7:00-123 6:00-129 5:16-32 4:29-45 3:64-80 2:60-77 1:93-111 0:96-112 done
SDRAM initialized
Memory test failed (487010/1114624 words incorrect)
Halting.
edit: that's with SAWG.
Any suggestions for next moves?
Edit: happy to try your binaries if you send them to me.
@hartytp: can you try that? https://github.com/m-labs/artiq/issues/908#issuecomment-366426623 (I just want to know if the issue could be related to write leveling)
I don't see anything suspicious in the log or timing report.
@enjoy-digital Will report back when that builds. Without the write leveling, do you expect the SDRAM to work at all (except by chance)?
@sbourdeauducq Any other ideas?
@hartytp: BTW, for the test I'm asking for, you just need to rebuild and flash the bootloader. I already had a case where disabling write_leveling made things work. If this test makes things work, then we can suspect something in the write leveling sequence. If not, then we cannot conclude anything.
Without SAWG:
MiSoC Bootloader
Copyright (c) 2017 M-Labs Limited
Bootloader CRC passed
Initializing SDRAM...
Write leveling: 85 119 104 117 82 107 66 79 done
Read delays: 7:25-223 6:44-256 5:104-281 4:104-122 3:150-166 2:173-357 1:185-201 0:198-381 done
SDRAM initialized
Memory test failed (341717/1114624 words incorrect)
Halting.
With SAWG, but I only flashed gateware:
MiSoC Bootloader
Copyright (c) 2017 M-Labs Limited
Bootloader CRC passed
Initializing SDRAM...
Write leveling: 83 121 107 121 87 114 78 76 done
Read delays: 7:32-234 6:53-251 5:105-309 4:119-329 3:167-350 2:175-365 1:201-389 0:200-382 done
SDRAM initialized
Memory test passed
Booting from flash...
Starting firmware.
[ 0.000005s] INFO(runtime): ARTIQ runtime starting...
[ 0.003861s] INFO(runtime): software version 4.0.dev+521.g4c22d64e
[ 0.010122s] INFO(runtime): gateware version 4.0.dev+516.g0edc34a9
[ 0.016386s] INFO(runtime): log level set to INFO by default
[ 0.022106s] INFO(runtime): UART log level set to INFO by default
[ 0.028265s] INFO(runtime): press 'e' to erase startup and idle kernels...
[ 1.028006s] INFO(runtime): continuing boot
[ 1.030970s] WARN(runtime): using default MAC address 02-00-00-00-76-01; consider changing it
[ 1.039559s] INFO(runtime): using default IP address 192.168.1.60
[ 1.045805s] ERROR(runtime::rtio_mgt): unrecognized startup_clock configuration entry, using internal RTIO clock
[ 1.057326s] INFO(runtime::mgmt): management interface active
[ 1.070505s] INFO(runtime::session): accepting network sessions
[ 1.084606s] INFO(runtime::session): running startup kernel
[ 1.089081s] INFO(runtime::session): no startup kernel found
[ 1.094792s] INFO(runtime::session): no connection, starting idle kernel
[ 1.101661s] INFO(runtime::session): no idle kernel found
panic at /home/sb/.cargo/git/checkouts/smoltcp-ebf9e93b1271bd34/181083f/src/socket/mod.rs:115: internal error: entered unreachable code
backtrace for software version 4.0.dev+521.g4c22d64e:
0x40022f3c
0x4003dbf4
0x4003dac8
0x4003200c
0x4002007c
0x40022a80
halting.
use `artiq_coreconfig write -s panic_reset 1` to restart instead
When I used artiq_flash:
MiSoC Bootloader
Copyright (c) 2017 M-Labs Limited
Bootloader CRC passed
Initializing SDRAM...
Write leveling: 81 114 104 112 81 112 77 78 done
Read delays: 7:27-44 6:52-247 5:105-300 4:113-129 3:155-174 2:153-170 1:178-194 0:190-207 done
SDRAM initialized
Memory test failed (386724/1114624 words incorrect)
Halting.
@enjoy-digital:
__ __ _ ____ ____
| \/ (_) ___| ___ / ___|
| |\/| | \___ \ / _ \| |
| | | | |___) | (_) | |___
|_| |_|_|____/ \___/ \____|
MiSoC Bootloader
Copyright (c) 2017 M-Labs Limited
Bootloader CRC passed
Initializing SDRAM...
Write leveling: 46 85 81 94 70 79 35 42 done
Read delays: 7:00-168 6:07-180 5:51-224 4:63-239 3:100-258 2:100-266 1:128-286 0:137-295 done
SDRAM initialized
Memory test passed
Booting from flash...
Starting firmware.
[ 0.000004s] INFO(runtime): ARTIQ runtime starting...
[ 0.003865s] INFO(runtime): software version 4.0.dev+575.g0f454965
[ 0.010131s] INFO(runtime): gateware version 4.0.dev+575.g0f454965
[ 0.016391s] INFO(runtime): log level set to INFO by default
[ 0.022112s] INFO(runtime): UART log level set to INFO by default
[ 0.028266s] INFO(board_artiq::serwb): waiting for AMC/RTM serwb bridge to be ready...
[ 14.436589s] INFO(board_artiq::serwb): done.
[ 14.439721s] INFO(board_artiq::serwb): RTM gateware version 4.0.dev+575.g0f454965
[ 14.447188s] INFO(runtime): press 'e' to erase startup and idle kernels...
[ 15.447005s] INFO(runtime): continuing boot
That's with SAWG, and with write leveling commented out. Seems to freeze there.
Hmmm... Normally, I've run `artiq_flash ... start` and then promptly run the openocd script to load the RTM gateware. That used to work fine. Now, I seem to have to run the openocd script after misoc boots or it prints `[ 0.028266s] INFO(board_artiq::serwb): waiting for AMC/RTM serwb bridge to be ready...` once and hangs there. After running the openocd script, it gets to "continuing to boot" and then crashes. Well, sometimes I get a partial line on the UART, like:
[ 0.022112s] INFO(runtime): UART log level set to INFO by default
[ 0.028266s] INFO(board_artiq::serwb): waiting for AMC/RTM serwb bridge to be ready...
[ 153.578137s] INFO(board_artiq::serwb): done.
[ 153.581281s] INFO(board_artiq::serwb): RTM gateware version 4.0.dev+575.g0f454965
[ 153.588754s] INFO(runtime): press 'e' to erase startup and idle kernels...
[ 154.588017s] INFO(runtime): continuing boot
[154.
@hartytp : not sure that's without write_leveling since there is the write_leveling prompt.
@enjoy-digital apologies, forgot to save changes. Hmmm...that was rebuilding with exactly the same code as this morning, but gave different results (no memory errors, but now freezes later on during boot).
| \/ (_) ___| ___ / ___|
| |\/| | \___ \ / _ \| |
| | | | |___) | (_) | |___
|_| |_|_|____/ \___/ \____|
MiSoC Bootloader
Copyright (c) 2017 M-Labs Limited
Bootloader CRC passed
Initializing SDRAM...
Read delays: 7:00-166 6:08-180 5:53-213 4:64-238 3:101-246 2:99-268 1:125-283 0:133-293 done
SDRAM initialized
Memory test passed
Booting from flash...
Starting firmware.
[ 0.000005s] INFO(runtime): ARTIQ runtime starting...
[ 0.003866s] INFO(runtime): software version 4.0.dev+575.g0f454965
[ 0.010132s] INFO(runtime): gateware version 4.0.dev+575.g0f454965
[ 0.016392s] INFO(runtime): log level set to INFO by default
[ 0.022113s] INFO(runtime): UART log level set to INFO by default
[ 0.028267s] INFO(board_artiq::serwb): waiting for AMC/RTM serwb bridge to be ready...
[ 0.766440s] WARN(board_artiq::serwb): AMC/RTM serwb bridge initialization failed, retrying.
[ 1.570117s] WARN(board_artiq::serwb): AMC/RTM serwb bridge initialization failed, retrying.
[ 2.267972s] WARN(board_artiq::serwb): AMC/RTM serwb bridge initialization failed, retrying.
[ 3.037240s] WARN(board_artiq::serwb): AMC/RTM serwb bridge initialization failed, retrying.
[ 4.060218s] WARN(board_artiq::serwb): AMC/RTM serwb bridge initialization failed, retrying.
[ 4.827861s] INFO(board_artiq::serwb): done.
[ 4.830987s] INFO(board_artiq::serwb): RTM gateware version 4.0.dev+575.g0f454965
[ 4.838460s] INFO(runtime): press 'e' to erase startup and idle kernels...
[ 5.838006s] INFO(runtime): continuing boot
[ 5.
So write leveling makes no difference at the moment, since it crashes somewhere else.
1.8V bug?
Not sure if it's connected, but when I was playing around with the HMC830 I found that Sayma was prone to crashing. IIRC, I was checking to see if some part of the HMC830 startup process needed some time (e.g. after the power on reset) so I was adding delays followed by register dumps at various points in the initialization sequence. Some things I did would cause it to crash (although I didn't have time to track down a minimum reproducible example). Often it would crash mid-way through printing something to the UART.
1.8V bug?
Sadly, all power LEDs on Sayma AMC + RTM look green and happy. Specifically, the 1V8 LED on Sayma AMC is on (I've never seen it go off while the board is powered).
It does still respond to artiq_flash ... start
And how much noise do you have on the 1.8V rail? IME, I had more unexplained crashes before I added the capacitor on the 1.8V rail. Though unreliable SDRAM could cause those as well.
hmmm... just ran `artiq_flash ... start` a few times and saw:
| \/ (_) ___| ___ / ___|
| |\/| | \___ \ / _ \| |
| | | | |___) | (_) | |___
|_| |_|_|____/ \___/ \____|
MiSoC Bootloader
Copyright (c) 2017 M-Labs Limited
Bootloader CRC passed
Initializing SDRAM...
Read delays: 7:00-164 6:04-177 5:48-64 4:60-237 3:96-257 2:101-263 1:123-284 0:133-290 done
SDRAM initialized
Memory test failed (2116/1114624 words incorrect)
Halting.
__ __ _ ____ ____
| \/ (_) ___| ___ / ___|
| |\/| | \___ \ / _ \| |
| | | | |___) | (_) | |___
|_| |_|_|____/ \___/ \____|
MiSoC Bootloader
Copyright (c) 2017 M-Labs Limited
Bootloader CRC passed
Initializing SDRAM...
Read delays: 7:00-162 6:05-177 5:47-220 4:59-231 3:98-256 2:98-263 1:116-132 0:136-292 done
SDRAM initialized
Memory test failed (15589/1114624 words incorrect)
Halting.
__ __ _ ____ ____
| \/ (_) ___| ___ / ___|
| |\/| | \___ \ / _ \| |
| | | | |___) | (_) | |___
|_| |_|_|____/ \___/ \____|
MiSoC Bootloader
Copyright (c) 2017 M-Labs Limited
Bootloader CRC passed
Initializing SDRAM...
Read delays: 7:00-143 6:08-183 5:50-217 4:62-236 3:65-81 2:101-263 1:127-289 0:136-295 done
SDRAM initialized
Memory test failed (384490/1114624 words incorrect)
Halting.
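To compare runs like the three above, the failure counts can be turned into fractions. A minimal sketch, assuming the bootloader line always has the form `Memory test failed (a/b words incorrect)`; the helper name is hypothetical:

```python
import re


def error_fraction(line):
    """Return failed/total for a 'Memory test failed (a/b words incorrect)' line, else None."""
    m = re.search(r"Memory test failed \((\d+)/(\d+) words incorrect\)", line)
    if m is None:
        return None
    return int(m.group(1)) / int(m.group(2))


# The three consecutive runs pasted above.
runs = [
    "Memory test failed (2116/1114624 words incorrect)",
    "Memory test failed (15589/1114624 words incorrect)",
    "Memory test failed (384490/1114624 words incorrect)",
]
for run in runs:
    print(f"{error_fraction(run):.2%}")
```

The same binaries give error rates that differ by more than two orders of magnitude from one boot to the next.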
> And how much noise do you have on the 1.8V rail? IME, I had more unexplained crashes before I added the capacitor on the 1.8V rail. Though unreliable SDRAM could cause those as well.
haven't stuck a scope on it, but a DVM shows 1.80V on it. I don't think that's the problem here.
But, I do agree that having issues like the 1V8 bug on Sayma does add one more variable to the equation and makes it harder to track down other bugs. @gkasprow @marmeladapk now this shows up on your board as well, please can you try to fix this, as it's been causing issues for months now!
hmmmm... running `artiq_flash ... start` a few times, I'm getting mainly memory errors, but sometimes I'm getting to "continuing to boot" before it crashes.
...sigh...
Well, let me know if there is anything else you can think of for me to try.
@enjoy-digital I commented out write leveling and I got warnings during compilation that this function is never used. However write leveling still shows up in boot messages (and memory check fails).
/*if !write_level(logger, &mut delay, &mut high_skew) {
return false
}*/
The boot messages are displayed by the `read_bitslip` and `read_delays` functions. Memory check likely fails because write leveling exists for a reason.
@enjoy-digital did you have a chance to fix the timing issues with serwb that @sbourdeauducq mentioned? I know that these are unlikely to be the cause of the SDRAM issues, but it's probably a good idea to mop up all the known issues as a starting point to fixing this.
@sbourdeauducq any ideas about how to move forward with this issue? It seems unlikely that this could be related to any known hardware bug with Sayma.
I'd like to see if it's an issue with the read leveling algorithm or something else, but I'm not able to reproduce the issue on the HK boards. Can someone who is able to reproduce the issue apply this patch: https://hastebin.com/akoketutut.swift, rebuild the bootloader (use --no-compile-gateware), re-flash the bootloader (artiq_flash ... bootloader) and post the results? Thanks.
@hartytp: we see in your last 3 failing captures that there are still delay intervals that are too small (<20).
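The "too small" check can be applied to any pasted `Read delays:` line. A minimal sketch, assuming each entry has the form `module:start-end` in delay steps and using the threshold of 20 mentioned above; the function names are hypothetical and this is not the firmware's own logic:

```python
import re


def parse_read_delays(line):
    """Parse 'Read delays: 7:00-164 6:04-177 ... done' into {module: (start, end)}."""
    return {
        int(mod): (int(start), int(end))
        for mod, start, end in re.findall(r"(\d+):(\d+)-(\d+)", line)
    }


def narrow_windows(line, min_width=20):
    """Return {module: width} for windows narrower than min_width delay steps."""
    windows = parse_read_delays(line)
    return {
        mod: end - start
        for mod, (start, end) in windows.items()
        if end - start < min_width
    }


# One failing and one passing capture from earlier in the thread.
failing = "Read delays: 7:00-164 6:04-177 5:48-64 4:60-237 3:96-257 2:101-263 1:123-284 0:133-290 done"
passing = "Read delays: 7:00-166 6:08-180 5:53-213 4:64-238 3:101-246 2:99-268 1:125-283 0:133-293 done"

print(narrow_windows(failing))  # {5: 16}
print(narrow_windows(passing))  # {}
```

In the failing capture only module 5 reports a 16-step window, while every module in the passing capture is at least 145 steps wide.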
@enjoy-digital
MiSoC Bootloader
Copyright (c) 2017 M-Labs Limited
Bootloader CRC passed
Initializing SDRAM...
Write leveling: 86 122 110 126 92 121 81 85 done
Read delays:
Module 0:
...................................XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXX......................................................................
................................................................................
................................................................................
................................
Module 1:
......................................................X...XXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXX......................................................
................................................................................
................................................................................
................................
Module 2:
................................................................................
...........................XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX...
................................................................................
................................................................................
................................
Module 3:
................................................................................
..........................................XX.XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXX.......................................................................
................................................................................
................................
Module 4:
................................................................................
................................................................................
....XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX.....................................
................................................................................
................................
Module 5:
................................................................................
................................................................................
...........XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX..............................
................................................................................
................................
Module 6:
................................................................................
................................................................................
...................................X.XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX.
................................................................................
................................
Module 7:
................................................................................
................................................................................
................................................X.XXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
X.X.............................................................................
................................
done
SDRAM initialized
Memory test failed (1113922/1114624 words incorrect)
Halting.
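The per-module maps above can be reduced to their widest contiguous window. A minimal sketch, assuming `X` marks a delay step where the read test passed and `.` one where it failed (the thread does not say which is which), and assuming the rows of one module are simply concatenated. The function names and the shortened example map are made up for illustration, and this is not the firmware's algorithm:

```python
def longest_pass_window(rows, pass_char="X"):
    """Return (start, end_exclusive) of the longest run of pass_char, or None."""
    taps = "".join(row.strip() for row in rows)
    best = None
    start = None
    # The trailing NUL is a sentinel that closes a run reaching the last tap.
    for i, ch in enumerate(taps + "\0"):
        if ch == pass_char and start is None:
            start = i
        elif ch != pass_char and start is not None:
            if best is None or i - start > best[1] - best[0]:
                best = (start, i)
            start = None
    return best


def window_centre(window):
    """Return the middle tap of a (start, end_exclusive) window."""
    start, end = window
    return (start + end - 1) // 2


# Shortened, invented map with an isolated pass before the main window,
# similar in shape to the "X...XXXX" edge seen for Module 1 above.
example = ["....X...XXXX", "XXXX...."]
window = longest_pass_window(example)
print(window, window_centre(window))  # (8, 16) 11
```

Taking the longest run rather than the first passing step means an isolated `X` at the edge of the window, as in Modules 1, 3, 6 and 7, would not be mistaken for the start of the valid range.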
> @hartytp: we see in your last 3 failing captures that there are still delay intervals that are too small (<20).
Do you want me to try increasing the initial SDRAM delay from 16 to 24? (Here https://github.com/m-labs/artiq/blob/f060d6e1b337b75f65590337063f1a7d00109a23/artiq/firmware/libboard/sdram.rs#L210 IIRC).
Building .bit from source using
I see...