Closed hartytp closed 6 years ago
Rebuilt without the `jump(0)` and flashed that as a startup kernel.
~No crash:~ I also see that kernel crash.
__ __ _ ____ ____
| \/ (_) ___| ___ / ___|
| |\/| | \___ \ / _ \| |
| | | | |___) | (_) | |___
|_| |_|_|____/ \___/ \____|
MiSoC Bootloader
Copyright (c) 2017-2018 M-Labs Limited
Bootloader CRC passed
Gateware ident 4.0.dev+1148.gf385add8.dirty
Initializing SDRAM...
DQS initial delay: 97 taps
Write leveling scan:
Module 3:
0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000001111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
Module 2:
0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111110100000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
Module 1:
0000000000000000000000000000000000000000000000000000000000000000000000000000000000000010101111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111001000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
Module 0:
0000000000000000000000000000000000000000000000000000000000000000000000000010011111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111100010000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
DQS initial delay: 97 taps
Write leveling: 75 87 103 102 done
Read leveling scan:
Module 3:
00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000010011111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111011101000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
Module 2:
00000000000000000000000000000000000000000000000000000000000000000000000000000011111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111101001000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
Module 1:
00000000000000000000000000000000000000000000000000000000000000000010111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111100100100000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
Module 0:
00000000000000000000000000000000000000000000000000000000000001111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111010000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
Read leveling: 170+-74 161+-84 148+-80 140+-78 done
SDRAM initialized
Memory test passed
Booting from flash...
Starting firmware.
[ 0.000007s] INFO(runtime): ARTIQ runtime starting...
[ 0.003882s] INFO(runtime): software version 4.0.dev+1148.gf385add8
[ 0.010235s] INFO(runtime): gateware version 4.0.dev+1148.gf385add8.dirty
[ 0.017122s] INFO(runtime): log level set to INFO by default
[ 0.022833s] INFO(runtime): UART log level set to INFO by default
[ 0.028978s] INFO(board_artiq::serwb): waiting for AMC/RTM serwb bridge to be ready...
[ 0.063705s] INFO(board_artiq::serwb): done.
[ 0.066736s] INFO(board_artiq::serwb): RTM to AMC Link test
[ 1.548778s] INFO(board_artiq::serwb): 0 errors
[ 1.552082s] INFO(board_artiq::serwb): AMC to RTM Link test
[ 3.034130s] INFO(board_artiq::serwb): 0 errors
[ 3.037428s] INFO(board_artiq::serwb): Wishbone test...
[ 4.969266s] INFO(board_artiq::serwb): 0 errors
[ 4.972572s] INFO(board_artiq::serwb): AMC serwb settings:
[ 4.978120s] INFO(board_artiq::serwb): bitslip: 25
[ 4.983157s] INFO(board_artiq::serwb): ready: 1
[ 4.987933s] INFO(board_artiq::serwb): error: 0
[ 4.992708s] INFO(board_artiq::serwb): RTM serwb settings:
[ 4.998273s] INFO(board_artiq::serwb): bitslip: 20
[ 5.003310s] INFO(board_artiq::serwb): ready: 1
[ 5.008086s] INFO(board_artiq::serwb): error: 0
[ 5.013157s] INFO(board_artiq::serwb): RTM gateware version 4.0.dev+1148.gf385add8.dirty
[ 5.021022s] INFO(runtime): press 'e' to erase startup and idle kernels...
[ 6.021006s] INFO(runtime): continuing boot
[ 6.024066s] INFO(board_artiq::hmc830_7043::hmc830): HMC830 found
[ 6.030134s] INFO(board_artiq::hmc830_7043::hmc830): loading HMC830 configuration...
[ 6.038181s] INFO(board_artiq::hmc830_7043::hmc830): ...done
[ 6.043837s] INFO(board_artiq::hmc830_7043::hmc830): setting HMC830 dividers...
[ 6.051299s] INFO(board_artiq::hmc830_7043::hmc830): ...done
[ 6.057122s] INFO(board_artiq::hmc830_7043::hmc830): waiting for HMC830 lock...
[ 6.064540s] INFO(board_artiq::hmc830_7043::hmc830): ...locked
[ 6.070582s] INFO(board_artiq::hmc830_7043::hmc7043): enabling hmc7043
[ 6.077457s] INFO(board_artiq::hmc830_7043::hmc7043): HMC7043 found
[ 6.083536s] INFO(board_artiq::hmc830_7043::hmc7043): loading configuration...
[ 6.092816s] INFO(board_artiq::hmc830_7043::hmc7043): ...done
[ 6.097871s] INFO(board_artiq::ad9154): AD9154-0 found
[ 6.113009s] INFO(board_artiq::ad9154): AD9154-0 initializing...
[ 6.125898s] INFO(board_artiq::ad9154): ...done
[ 6.199783s] INFO(board_artiq::ad9154): AD9154-0 running PRBS test...
[ 7.206158s] INFO(board_artiq::ad9154): ...passed
[ 7.209817s] INFO(board_artiq::ad9154): AD9154-0 running STPL test...
[ 7.216606s] INFO(board_artiq::ad9154): c0 errors: 0
[ 7.221810s] INFO(board_artiq::ad9154): c1 errors: 0
[ 7.227020s] INFO(board_artiq::ad9154): c2 errors: 0
[ 7.232230s] INFO(board_artiq::ad9154): c3 errors: 0
[ 7.237153s] INFO(board_artiq::ad9154): ...passed
[ 7.242102s] INFO(board_artiq::ad9154): AD9154-0 SYSREF scan:
[ 7.522189s] INFO(board_artiq::ad9154): phase: 26, sync error: 481
[ 8.177445s] INFO(board_artiq::ad9154): phase: 90, sync error: 482
[ 8.182583s] INFO(board_artiq::ad9154): phase min: Some(26), phase max: Some(89)
[ 8.190225s] INFO(board_artiq::ad9154): AD9154-0 setting SYSREF phase to 88
[ 8.207610s] INFO(board_artiq::ad9154): AD9154-0 initializing...
[ 8.220463s] INFO(board_artiq::ad9154): ...done
[ 8.294670s] INFO(board_artiq::ad9154): AD9154-1 found
[ 8.308868s] INFO(board_artiq::ad9154): AD9154-1 initializing...
[ 8.321826s] INFO(board_artiq::ad9154): ...done
[ 8.395699s] INFO(board_artiq::ad9154): AD9154-1 running PRBS test...
[ 9.402066s] INFO(board_artiq::ad9154): ...passed
[ 9.405724s] INFO(board_artiq::ad9154): AD9154-1 running STPL test...
[ 9.412509s] INFO(board_artiq::ad9154): c0 errors: 0
[ 9.417718s] INFO(board_artiq::ad9154): c1 errors: 0
[ 9.422928s] INFO(board_artiq::ad9154): c2 errors: 0
[ 9.428138s] INFO(board_artiq::ad9154): c3 errors: 0
[ 9.433060s] INFO(board_artiq::ad9154): ...passed
[ 9.438009s] INFO(board_artiq::ad9154): AD9154-1 SYSREF scan:
[ 9.707938s] INFO(board_artiq::ad9154): phase: 25, sync error: 481
[ 10.363189s] INFO(board_artiq::ad9154): phase: 89, sync error: 482
[ 10.368326s] INFO(board_artiq::ad9154): phase min: Some(25), phase max: Some(88)
[ 10.375967s] INFO(board_artiq::ad9154): AD9154-1 setting SYSREF phase to 88
[ 10.393355s] INFO(board_artiq::ad9154): AD9154-1 initializing...
[ 10.406314s] INFO(board_artiq::ad9154): ...done
[ 10.480177s] INFO(board_artiq::hmc542): card 0 channel 0 set to 4 dB
[ 10.487427s] INFO(board_artiq::hmc542): card 0 channel 1 set to 4 dB
[ 10.494666s] INFO(board_artiq::hmc542): card 1 channel 0 set to 4 dB
[ 10.501906s] INFO(board_artiq::hmc542): card 1 channel 1 set to 4 dB
[ 10.509146s] INFO(board_artiq::hmc542): card 2 channel 0 set to 4 dB
[ 10.516386s] INFO(board_artiq::hmc542): card 2 channel 1 set to 4 dB
[ 10.523625s] INFO(board_artiq::hmc542): card 3 channel 0 set to 4 dB
[ 10.530865s] INFO(board_artiq::hmc542): card 3 channel 1 set to 4 dB
[ 10.538137s] WARN(runtime): using default MAC address 02-00-00-00-00-01; consider changing it
[ 10.545546s] INFO(runtime): using default IP address 192.168.1.50
[ 10.553297s] INFO(runtime::mgmt): management interface active
[ 10.566496s] INFO(runtime::session): accepting network sessions
[ 10.580699s] INFO(runtime::session): running startup kernel
[ 10.609958s] INFO(runtime::kern_hwreq): resetting RTIO
[ 10.615796s] INFO(kernel): 10000000.000000
[ 10.622048s] INFO(kernel): 10141414.141414
[ 10.628290s] INFO(kernel): 10282828.282828
[ 10.634519s] INFO(kernel): 10424242.424242
[ 10.640768s] INFO(kernel): 10565656.565656
[ 10.647010s] INFO(kernel): 10707070.707070
[ 10.653258s] INFO(kernel): 10848484.848484
[ 10.659508s] INFO(kernel): 10989898.989898
[ 10.665755s] INFO(kernel): 11131313.131313
[ 10.672004s] INFO(kernel): 11272727.272727
[ 10.678253s] INFO(kernel): 11414141.414141
[ 10.684495s] INFO(kernel): 11555555.555555
[ 10.690749s] INFO(kernel): 11696969.696969
[ 10.696997s] INFO(kernel): 11838383.838383
[ 10.703349s] INFO(kernel): 11979797.979797
[ 10.709604s] INFO(kernel): 12121212.121212
[ 10.715959s] INFO(kernel): 12262626.262626
[ 10.722203s] INFO(kernel): 12404040.404040
[ 10.728553s] INFO(kernel): 12545454.545454
[ 10.734901s] INFO(kernel): 12686868.686868
[ 10.741249s] INFO(kernel): 12828282.828282
[ 10.747599s] INFO(kernel): 12969696.969696
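When running longer reboot loops, it may help to check the UART capture by script rather than by eye. The following is a minimal Python sketch, assuming the line format seen in the log above (`[ <seconds>s] LEVEL(module): message`). The names `parse_line` and `kernel_values` are made up for illustration and are not part of ARTIQ.

```python
import re

# Matches runtime log lines such as
# "[ 10.615796s] INFO(kernel): 10000000.000000"
LOG_RE = re.compile(r"^\[\s*(\d+\.\d+)s\]\s+(\w+)\(([\w:]+)\):\s*(.*)$")


def parse_line(line):
    """Return (timestamp, level, module, message), or None if no match."""
    m = LOG_RE.match(line.strip())
    if m is None:
        return None
    ts, level, module, msg = m.groups()
    return float(ts), level, module, msg


def kernel_values(lines):
    """Collect the numeric values printed by the kernel, in order."""
    values = []
    for line in lines:
        parsed = parse_line(line)
        if parsed and parsed[2] == "kernel":
            try:
                values.append(float(parsed[3]))
            except ValueError:
                pass  # non-numeric kernel output is skipped
    return values


sample = [
    "[ 10.609958s] INFO(runtime::kern_hwreq): resetting RTIO",
    "[ 10.615796s] INFO(kernel): 10000000.000000",
    "[ 10.622048s] INFO(kernel): 10141414.141414",
    "[ 10.628290s] INFO(kernel): 10282828.282828",
]
values = kernel_values(sample)
# Difference between consecutive kernel prints
steps = [b - a for a, b in zip(values, values[1:])]
print(values[-1], steps)
```

In the log above the kernel prints step by a constant amount (about 141414.14), so a missing or irregular step in a capture is a quick hint at where the output stopped.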
NB I haven't built with @jordens' latest finds yet.
I take that back.
A while later, I got some \FF printed to the UART.
Then, I rebooted and the Mem test gave all 1s
~@sbourdeauducq what do you expect to see on the UART if that Kernel runs correctly?~
hmm... after a reboot, the kernel ran (same output on UART); 5 minutes later, no crash afaict. After reloading the AMC FPGA with `artiq_flash -t sayma ... start`, the mem test looks good and the kernel runs again with the same output.
... a while later, an `@` appeared on the UART. But `artiq_flash ... start` still booted correctly afterwards.
ok, same crash here, but it happens faster...
Can you try running the reboot loops longer? @gkasprow Can you test this too? Any idea why this happens?
I tried running the crash kernel on the Sayma DRTIO master, which does not use the HMC clock chips (or anything on the RTM). It does NOT crash. Could it be a power integrity problem?
Tried adding some of the problematic buffers to the DRTIO master - still no crash when running the kernel.
@@ -301,6 +299,22 @@ class Master(MiniSoC, AMPSoC):
         self.submodules += Microscope(platform.request("serial", 1),
                                       self.clk_freq)

+        self.clock_domains.cd_jesd = ClockDomain()
+        refclk_pads = platform.request("dac_refclk", 0)
+        refclk2 = Signal()
+        platform.add_period_constraint(refclk_pads.p, 1e9/150e6)
+        self.specials += [
+            Instance("IBUFDS_GTE3", i_CEB=0, p_REFCLK_HROW_CK_SEL=0b00,
+                i_I=refclk_pads.p, i_IB=refclk_pads.n,
+                o_ODIV2=refclk2),
+            Instance("BUFG_GT", i_I=refclk2, o_O=self.cd_jesd.clk)
+        ]
+        blink_counter = Signal(28)
+        blink = Signal()
+        self.sync.jesd += blink_counter.eq(blink_counter + 1)
+        self.comb += blink.eq(blink_counter[-1])
+        self.submodules += add_probe_async("blink", "blink", blink)
+
         # Si5324 used as a free-running oscillator, to avoid dependency on RTM.
         self.submodules.si5324_rst_n = gpio.GPIOOut(platform.request("si5324").rst_n)
         self.csr_devices.append("si5324_rst_n")
The blink signal behaves erratically however (with the RTM FPGA freshly loaded) - maybe the rework on my board is broken and the counter is "clocked" by the 7043 noise. How is it on other boards?
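For comparison with what the probe shows, here is a rough estimate of how the blink signal should behave with a clean clock. It assumes the `jesd` domain really runs at the 150 MHz used in the period constraint, and that `ODIV2` is undivided with `REFCLK_HROW_CK_SEL=0b00`. Neither assumption has been checked on hardware here.

```python
# Expected behaviour of blink = blink_counter[-1] for a free-running
# 28-bit counter. The 150 MHz figure comes from the period constraint
# in the patch; that the clock reaching cd_jesd actually has this
# frequency is an assumption.
COUNTER_BITS = 28
F_CLK_HZ = 150e6

# The MSB completes one full low/high cycle every 2**28 clock cycles.
blink_period_s = 2**COUNTER_BITS / F_CLK_HZ
blink_freq_hz = 1 / blink_period_s

print(f"period: {blink_period_s:.3f} s, frequency: {blink_freq_hz:.3f} Hz")
```

Under those assumptions the probe should toggle steadily with a period of roughly 1.8 s. Anything much faster or irregular would be consistent with the counter being clocked by noise.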
No crash either when the kernel is run on a `--without-sawg` build. @gkasprow @hartytp @jbqubit Can you reproduce this?
This starts to look like PI or a Vivado issue...
@gkasprow Can you measure the supply voltages as close as possible to the FPGA, with the SAWG running?
> No crash either when the kernel is run on a `--without-sawg` build. @gkasprow @hartytp @jbqubit Can you reproduce this?
I'll have a look at that on Monday.
> or a Vivado issue...
Try building with 2017.x?
> The blink signal behaves erratically however (with the RTM FPGA freshly loaded) - maybe the rework on my board is broken and the counter is "clocked" by the 7043 noise. How is it on other boards?
Did you try disabling the firmware line that sets the HMC7043 RESET low in the standalone build? If that doesn't stop the DAC PLLs locking then there is something wrong with your rework and all bets are off.
Also, are you doing this after @jordens' latest round of reviews, which caught some incorrect logic levels/terminations/etc?
Bad PI would also explain the trashing of the whole FPGA by the 7043. The noise would create increased switching and power consumption...
PI was carefully simulated, so only some transients on the supply rails could be the source of the issues. I'll look at the power rails on Tuesday and also try to reproduce the problem.
@sbourdeauducq We'll build Artiq with your patch and check on Tuesday.
Same behavior with Vivado 2017.4 (meets timing, crashes with the test kernel).
> I'll look at the power rails on Tuesday and also try to reproduce the problem.
@gkasprow What did you find?
@gkasprow @marmeladapk might also be worth double checking that we really have followed all Xilinx user guides on power supplies, decoupling, etc. and that the decoupling capacitors have the correct voltage rating etc.
One workaround is to run the SAWG kernels over DRTIO. Both the Sayma DRTIO master and satellite seem surprisingly unaffected by the crashes, so far. If the master crashes it can in theory be replaced by Kasli. For using the Sayma DRTIO satellite, connect a coax cable between the clock output SMA on the AMC and the clock input SMA on the RTM.
With `boot::jump(0)`, Sayma goes through one full init sequence, reboots, and then hangs after `enabling hmc7043` is printed. Reproduced across many manual restarts and power cycling. Nothing out of the ordinary happens in memtests. Log
@marmeladapk can you look at the HMC7043 outputs on the UFL connectors and see what the signal there looks like during this test?
Can you try removing the AC coupling capacitors that connect the HMC7043 to the FPGAs and see if this fixes the issue? If so, can you try adding the FET switches to the HMC7043?
I will test it in a moment.
Note to self: one potentially significant difference between the DRTIO and standalone Sayma builds is how the RTIO is clocked/reset. In standalone, the RTIO logic is clocked from the HMC7043 output and reset by the same signal that enables the clock input buffers:
So, the DAC init code also brings the RTIO core out of reset.
@sbourdeauducq should there be some delay between enabling the clock input buffers and bringing the RTIO core and other logic out of reset? Should we have two different CSRs? One to enable the CBs and one for the RTIO/JESD reset line.
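The two-CSR idea could look roughly like the following. This is only a sketch of the ordering being asked about. `FakeCSR`, `clock_buffers_en`, `rtio_reset` and the 1 ms settle time are all hypothetical and do not correspond to existing ARTIQ registers or measured values.

```python
import time


class FakeCSR:
    """Stand-in for a memory-mapped register; records every write."""

    def __init__(self, name, log):
        self.name = name
        self.log = log

    def write(self, value):
        self.log.append((self.name, value))


def bring_up(clock_buffers_en, rtio_reset, settle_s=1e-3):
    """Enable the clock input buffers first, then release reset."""
    rtio_reset.write(1)          # hold RTIO/JESD logic in reset
    clock_buffers_en.write(1)    # let the HMC7043 clock into the FPGA
    time.sleep(settle_s)         # hypothetical settle time for the clock
    rtio_reset.write(0)          # only now release the logic


log = []
bring_up(FakeCSR("clock_buffers_en", log), FakeCSR("rtio_reset", log))
print(log)
```

With a single shared CSR, the second and fourth steps collapse into one write, which is the situation described above.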
I removed the caps that connect the HMC7043 with the AMC FPGA: ADC SYSREF, DAC SYSREF, GTP1, GTP2, AMC MASTER AUX
@hartytp It made it worse, now sayma always locks up on hmc830.
I don't see any link between removing the caps and hmc830 locking
Can you post logs?
I'm not sure I follow you. Do you mean that the board now locks up during HMC830 init?
@hartytp With `boot::jump(0)` Sayma would do one full cycle from memtest to DAC init, then reboot and hang on hmc7043 init. Now it always hangs on hmc830 lock.
ed: log
That's not hanging, that's the HMC830 not locking.
@sbourdeauducq True, sorry for confusion.
@hartytp It may be a red herring - now it fails to identify
invalid HMC830 ID: 0x00797500
panic at src/libcore/result.rs:945:5: cannot initialize HMC830/7043: "invalid HMC830 identification"
Perhaps the chip is failing.
The HMC has ID 0xA7975, so it looks like the A character is missing.
With the Forth code the ID is read correctly, but the chip fails to lock.
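Comparing the two values bit by bit makes the mismatch easier to see. This small sketch uses only the two numbers quoted above. Treating the trailing `00` of the observed value as an 8-bit shift is an assumption about how the value was read out, not something confirmed in this thread.

```python
EXPECTED_ID = 0xA7975      # ID the HMC830 should report (from the comment above)
OBSERVED_ID = 0x00797500   # value from the "invalid HMC830 ID" message

# Assumption: the low byte of the observed word is padding, so drop it.
observed_shifted = OBSERVED_ID >> 8

# Bits that differ between the expected and the shifted observed value.
diff = EXPECTED_ID ^ observed_shifted

print(f"observed >> 8 : 0x{observed_shifted:05X}")
print(f"expected      : 0x{EXPECTED_ID:05X}")
print(f"differing bits: 0x{diff:05X}")
```

Under that assumption the lower 16 bits agree and only the top nibble (`0xA`) is missing from the readback.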
Since it only happens on one RTM it seems to be hardware bug/failure. We'll continue discussion in https://github.com/sinara-hw/sinara/issues/472 .
Unfortunately it means we don't have a fully working setup ATM (on the other RTM both DAC chips fail to init), so I cannot test memory in the same config as @sbourdeauducq. I saw that with today's build ARTIQ continues booting even if the DACs are not properly initialised, so I can test with that setup.
Did the DACs fail to init before you removed those resistors?
@hartytp yes.
@sbourdeauducq do you use FPGA_DAC_SYSREF from HMC7043 to AMC FPGA? It has very low amplitude, roughly 200mV while other signals are 1V or more. Another clock that is fed to the AMC FPGA (AMC_MASTER_AUX_CLK) is disabled, only GTP_CLK1 is on.
another question - I did not do any rework on HMC7043 reset pin. Does it affect DAC operation?
> do you use FPGA_DAC_SYSREF from HMC7043 to AMC FPGA?
Yes. What can cause the lower amplitude?
> another question - I did not do any rework on HMC7043 reset pin. Does it affect DAC operation?
It seems to make the crashes worse.
SYSREF is LVDS; the others are LVPECL.
@gkasprow HMC7043 reset seems to help. Connect to FPGA and pull HIGH.
Can we set the output of HMC7043 FPGA SYSREF to LVPECL? It is AC-terminated so it does not matter. At the moment it has two 200R to GND, so the amplitude gets attenuated seriously. I'm worried about the low amplitude.
@gkasprow two options:
Still I don't think that the crashes are related to a low sysref signal
I'm not worried about crashes but errors on JESD links. I observe them on two boards. @marmeladapk just modified registers.
I guess the load is quite high because we also use HMC7043 internal 100R termination which explains the small signal.
Do check the max diff signal for those 1v8 inputs
With LVPECL I observe 0.8Vpp on each output, so we won't break the LVDS input. But it did not help and I still get PRBS errors.
I reverted to the original HMC7043 reset and added a 4k7 pull-up, but this did not help.
Do you have the capacitors connecting the AMC to the HMC7043 depopulated?
Otherwise I'm not sure, haven't had that error.
https://drive.google.com/open?id=1ZyI_0S0IJ-oKc15RgisTW4FABeBQvuub