m-labs / artiq

A leading-edge control system for quantum information experiments
https://m-labs.hk/artiq
GNU Lesser General Public License v3.0

Sayma: memory corruption #1065

Closed hartytp closed 6 years ago

hartytp commented 6 years ago
 __  __ _ ____         ____ 
|  \/  (_) ___|  ___  / ___|
| |\/| | \___ \ / _ \| |    
| |  | | |___) | (_) | |___ 
|_|  |_|_|____/ \___/ \____|

MiSoC Bootloader
Copyright (c) 2017-2018 M-Labs Limited

Bootloader CRC passed
Gateware ident 4.0.dev+1133.g0b086225
Initializing SDRAM...
DQS initial delay: 96 taps
Write leveling scan:
Module 3:
00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000010111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111110000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
Module 2:
00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000011111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111110000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
Module 1:
00000000000000000000000000000000000000000000000000000000000000000000000000000000000000001111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111110000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
Module 0:
00000000000000000000000000000000000000000000000000000000000000000000000001000111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
DQS initial delay: 96 taps
Write leveling: 72 82 102 97 done
Read leveling scan:
Module 3:
00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000011001111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111101001000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
Module 2:
00000000000000000000000000000000000000000000000000000000000000000000000000000101111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111010000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
Module 1:
00000000000000000000000000000000000000000000000000000000000001101111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111110111000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
Module 0:
00000000000000000000000000000000000000000000000000000000111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111101001000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
Read leveling: 172+-75 159+-83 143+-81 136+-80 done
SDRAM initialized
Memory test passed

Booting from flash...
Starting firmware.
[     0.000007s]  INFO(runtime): ARTIQ runtime starting...
[     0.003881s]  INFO(runtime): software version 4.0.dev+1133.g0b086225
[     0.010232s]  INFO(runtime): gateware version 4.0.dev+1133.g0b086225
[     0.016597s]  INFO(runtime): log level set to INFO by default
[     0.022311s]  INFO(runtime): UART log level set to INFO by default
[     0.028454s]  INFO(board_artiq::serwb): waiting for AMC/RTM serwb bridge to be ready...
[     0.063183s]  INFO(board_artiq::serwb): done.
[     0.066213s]  INFO(board_artiq::serwb): RTM to AMC Link test
[     1.548256s]  INFO(board_artiq::serwb): 0 errors
[     1.551560s]  INFO(board_artiq::serwb): AMC to RTM Link test
[     3.033609s]  INFO(board_artiq::serwb): 0 errors
[     3.036907s]  INFO(board_artiq::serwb): Wishbone test...
[     4.968923s]  INFO(board_artiq::serwb): 0 errors
[     4.972230s]  INFO(board_artiq::serwb): AMC serwb settings:
[     4.977779s]  INFO(board_artiq::serwb):   bitslip: 10
[     4.982815s]  INFO(board_artiq::serwb):   ready: 1
[     4.987590s]  INFO(board_artiq::serwb):   error: 0
[     4.992366s]  INFO(board_artiq::serwb): RTM serwb settings:
[     4.997932s]  INFO(board_artiq::serwb):   bitslip: 35
[     5.002968s]  INFO(board_artiq::serwb):   ready: 1
[     5.007743s]  INFO(board_artiq::serwb):   error: 0
[     5.012754s]  INFO(board_artiq::serwb): RTM gateware version 4.0.dev+1133.g0b086225
[     5.020158s]  INFO(runtime): press 'e' to erase startup and idle kernels...
[     6.020007s]  INFO(runtime): continuing boot
[     6.023065s]  INFO(board_artiq::hmc830_7043::hmc830): HMC830 found
[     6.029133s]  INFO(board_artiq::hmc830_7043::hmc830): loading configuration...
[     6.036641s]  INFO(board_artiq::hmc830_7043::hmc830):   ...done
[     6.042230s]  INFO(board_artiq::hmc830_7043::hmc830): waiting for lock...
[     6.049036s]  INFO(board_artiq::hmc830_7043::hmc830):   ...locked
[     6.055080s]  INFO(board_artiq::hmc830_7043::hmc7043): enabling hmc7043
[     6.061955s]  INFO(board_artiq::hmc830_7043::hmc7043): HMC7043 found
[     6.068034s]  INFO(board_artiq::hmc830_7043::hmc7043): loading configuration...
[     6.077314s]  INFO(board_artiq::hmc830_7043::hmc7043):   ...done
@ panic at src/libcore/fmt/mod.rs:1096:40: index out of bounds: the len is 1 but the index is 1074179992
backtrace for software version 4.0.dev+1133.g0b086225:
0x40003334
0x4001513c
0x40015090
0x40015d4c
0x40002e30
0x40002c84
halting.
use `artiq_coreconfig write -s panic_reset 1` to restart instead

https://drive.google.com/open?id=1ZyI_0S0IJ-oKc15RgisTW4FABeBQvuub
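
One observation about the panic above: the out-of-bounds index looks suspicious when written in hexadecimal. A quick check (plain Python, nothing ARTIQ-specific) shows that it falls in the same 0x400xxxxx region as the backtrace addresses. That would be consistent with a pointer-like value ending up where a small index was expected, but this reading is an interpretation and not something the log states.

```python
# Index reported by the panic message, and the backtrace addresses from the log.
index = 1074179992
backtrace = [0x40003334, 0x4001513c, 0x40015090,
             0x40015d4c, 0x40002e30, 0x40002c84]

print(hex(index))  # 0x4006af98

# Compare the upper 12 bits to see whether the index sits in the same
# address region as the code addresses in the backtrace.
same_region = all((addr >> 20) == (index >> 20) for addr in backtrace)
print(same_region)
```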

hartytp commented 6 years ago

Rebuilt without the jump(0) and flashed that as a startup kernel.

~No crash:~ I also see that kernel crash.


 __  __ _ ____         ____ 
|  \/  (_) ___|  ___  / ___|
| |\/| | \___ \ / _ \| |    
| |  | | |___) | (_) | |___ 
|_|  |_|_|____/ \___/ \____|

MiSoC Bootloader
Copyright (c) 2017-2018 M-Labs Limited

Bootloader CRC passed
Gateware ident 4.0.dev+1148.gf385add8.dirty
Initializing SDRAM...
DQS initial delay: 97 taps
Write leveling scan:
Module 3:
0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000001111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
Module 2:
0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111110100000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
Module 1:
0000000000000000000000000000000000000000000000000000000000000000000000000000000000000010101111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111001000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
Module 0:
0000000000000000000000000000000000000000000000000000000000000000000000000010011111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111100010000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
DQS initial delay: 97 taps
Write leveling: 75 87 103 102 done
Read leveling scan:
Module 3:
00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000010011111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111011101000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
Module 2:
00000000000000000000000000000000000000000000000000000000000000000000000000000011111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111101001000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
Module 1:
00000000000000000000000000000000000000000000000000000000000000000010111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111100100100000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
Module 0:
00000000000000000000000000000000000000000000000000000000000001111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111010000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
Read leveling: 170+-74 161+-84 148+-80 140+-78 done
SDRAM initialized
Memory test passed

Booting from flash...
Starting firmware.
[     0.000007s]  INFO(runtime): ARTIQ runtime starting...
[     0.003882s]  INFO(runtime): software version 4.0.dev+1148.gf385add8
[     0.010235s]  INFO(runtime): gateware version 4.0.dev+1148.gf385add8.dirty
[     0.017122s]  INFO(runtime): log level set to INFO by default
[     0.022833s]  INFO(runtime): UART log level set to INFO by default
[     0.028978s]  INFO(board_artiq::serwb): waiting for AMC/RTM serwb bridge to be ready...
[     0.063705s]  INFO(board_artiq::serwb): done.
[     0.066736s]  INFO(board_artiq::serwb): RTM to AMC Link test
[     1.548778s]  INFO(board_artiq::serwb): 0 errors
[     1.552082s]  INFO(board_artiq::serwb): AMC to RTM Link test
[     3.034130s]  INFO(board_artiq::serwb): 0 errors
[     3.037428s]  INFO(board_artiq::serwb): Wishbone test...
[     4.969266s]  INFO(board_artiq::serwb): 0 errors
[     4.972572s]  INFO(board_artiq::serwb): AMC serwb settings:
[     4.978120s]  INFO(board_artiq::serwb):   bitslip: 25
[     4.983157s]  INFO(board_artiq::serwb):   ready: 1
[     4.987933s]  INFO(board_artiq::serwb):   error: 0
[     4.992708s]  INFO(board_artiq::serwb): RTM serwb settings:
[     4.998273s]  INFO(board_artiq::serwb):   bitslip: 20
[     5.003310s]  INFO(board_artiq::serwb):   ready: 1
[     5.008086s]  INFO(board_artiq::serwb):   error: 0
[     5.013157s]  INFO(board_artiq::serwb): RTM gateware version 4.0.dev+1148.gf385add8.dirty
[     5.021022s]  INFO(runtime): press 'e' to erase startup and idle kernels...
[     6.021006s]  INFO(runtime): continuing boot
[     6.024066s]  INFO(board_artiq::hmc830_7043::hmc830): HMC830 found
[     6.030134s]  INFO(board_artiq::hmc830_7043::hmc830): loading HMC830 configuration...
[     6.038181s]  INFO(board_artiq::hmc830_7043::hmc830):   ...done
[     6.043837s]  INFO(board_artiq::hmc830_7043::hmc830): setting HMC830 dividers...
[     6.051299s]  INFO(board_artiq::hmc830_7043::hmc830):   ...done
[     6.057122s]  INFO(board_artiq::hmc830_7043::hmc830): waiting for HMC830 lock...
[     6.064540s]  INFO(board_artiq::hmc830_7043::hmc830):   ...locked
[     6.070582s]  INFO(board_artiq::hmc830_7043::hmc7043): enabling hmc7043
[     6.077457s]  INFO(board_artiq::hmc830_7043::hmc7043): HMC7043 found
[     6.083536s]  INFO(board_artiq::hmc830_7043::hmc7043): loading configuration...
[     6.092816s]  INFO(board_artiq::hmc830_7043::hmc7043):   ...done
[     6.097871s]  INFO(board_artiq::ad9154): AD9154-0 found
[     6.113009s]  INFO(board_artiq::ad9154): AD9154-0 initializing...
[     6.125898s]  INFO(board_artiq::ad9154):   ...done
[     6.199783s]  INFO(board_artiq::ad9154): AD9154-0 running PRBS test...
[     7.206158s]  INFO(board_artiq::ad9154):   ...passed
[     7.209817s]  INFO(board_artiq::ad9154): AD9154-0 running STPL test...
[     7.216606s]  INFO(board_artiq::ad9154):   c0 errors: 0
[     7.221810s]  INFO(board_artiq::ad9154):   c1 errors: 0
[     7.227020s]  INFO(board_artiq::ad9154):   c2 errors: 0
[     7.232230s]  INFO(board_artiq::ad9154):   c3 errors: 0
[     7.237153s]  INFO(board_artiq::ad9154):   ...passed
[     7.242102s]  INFO(board_artiq::ad9154): AD9154-0 SYSREF scan:
[     7.522189s]  INFO(board_artiq::ad9154):   phase: 26, sync error: 481
[     8.177445s]  INFO(board_artiq::ad9154):   phase: 90, sync error: 482
[     8.182583s]  INFO(board_artiq::ad9154):   phase min: Some(26), phase max: Some(89)
[     8.190225s]  INFO(board_artiq::ad9154): AD9154-0 setting SYSREF phase to 88
[     8.207610s]  INFO(board_artiq::ad9154): AD9154-0 initializing...
[     8.220463s]  INFO(board_artiq::ad9154):   ...done
[     8.294670s]  INFO(board_artiq::ad9154): AD9154-1 found
[     8.308868s]  INFO(board_artiq::ad9154): AD9154-1 initializing...
[     8.321826s]  INFO(board_artiq::ad9154):   ...done
[     8.395699s]  INFO(board_artiq::ad9154): AD9154-1 running PRBS test...
[     9.402066s]  INFO(board_artiq::ad9154):   ...passed
[     9.405724s]  INFO(board_artiq::ad9154): AD9154-1 running STPL test...
[     9.412509s]  INFO(board_artiq::ad9154):   c0 errors: 0
[     9.417718s]  INFO(board_artiq::ad9154):   c1 errors: 0
[     9.422928s]  INFO(board_artiq::ad9154):   c2 errors: 0
[     9.428138s]  INFO(board_artiq::ad9154):   c3 errors: 0
[     9.433060s]  INFO(board_artiq::ad9154):   ...passed
[     9.438009s]  INFO(board_artiq::ad9154): AD9154-1 SYSREF scan:
[     9.707938s]  INFO(board_artiq::ad9154):   phase: 25, sync error: 481
[    10.363189s]  INFO(board_artiq::ad9154):   phase: 89, sync error: 482
[    10.368326s]  INFO(board_artiq::ad9154):   phase min: Some(25), phase max: Some(88)
[    10.375967s]  INFO(board_artiq::ad9154): AD9154-1 setting SYSREF phase to 88
[    10.393355s]  INFO(board_artiq::ad9154): AD9154-1 initializing...
[    10.406314s]  INFO(board_artiq::ad9154):   ...done
[    10.480177s]  INFO(board_artiq::hmc542): card 0 channel 0 set to 4 dB
[    10.487427s]  INFO(board_artiq::hmc542): card 0 channel 1 set to 4 dB
[    10.494666s]  INFO(board_artiq::hmc542): card 1 channel 0 set to 4 dB
[    10.501906s]  INFO(board_artiq::hmc542): card 1 channel 1 set to 4 dB
[    10.509146s]  INFO(board_artiq::hmc542): card 2 channel 0 set to 4 dB
[    10.516386s]  INFO(board_artiq::hmc542): card 2 channel 1 set to 4 dB
[    10.523625s]  INFO(board_artiq::hmc542): card 3 channel 0 set to 4 dB
[    10.530865s]  INFO(board_artiq::hmc542): card 3 channel 1 set to 4 dB
[    10.538137s]  WARN(runtime): using default MAC address 02-00-00-00-00-01; consider changing it
[    10.545546s]  INFO(runtime): using default IP address 192.168.1.50
[    10.553297s]  INFO(runtime::mgmt): management interface active
[    10.566496s]  INFO(runtime::session): accepting network sessions
[    10.580699s]  INFO(runtime::session): running startup kernel
[    10.609958s]  INFO(runtime::kern_hwreq): resetting RTIO
[    10.615796s]  INFO(kernel): 10000000.000000
[    10.622048s]  INFO(kernel): 10141414.141414
[    10.628290s]  INFO(kernel): 10282828.282828
[    10.634519s]  INFO(kernel): 10424242.424242
[    10.640768s]  INFO(kernel): 10565656.565656
[    10.647010s]  INFO(kernel): 10707070.707070
[    10.653258s]  INFO(kernel): 10848484.848484
[    10.659508s]  INFO(kernel): 10989898.989898
[    10.665755s]  INFO(kernel): 11131313.131313
[    10.672004s]  INFO(kernel): 11272727.272727
[    10.678253s]  INFO(kernel): 11414141.414141
[    10.684495s]  INFO(kernel): 11555555.555555
[    10.690749s]  INFO(kernel): 11696969.696969
[    10.696997s]  INFO(kernel): 11838383.838383
[    10.703349s]  INFO(kernel): 11979797.979797
[    10.709604s]  INFO(kernel): 12121212.121212
[    10.715959s]  INFO(kernel): 12262626.262626
[    10.722203s]  INFO(kernel): 12404040.404040
[    10.728553s]  INFO(kernel): 12545454.545454
[    10.734901s]  INFO(kernel): 12686868.686868
[    10.741249s]  INFO(kernel): 12828282.828282
[    10.747599s]  INFO(kernel): 12969696.969696
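
The values printed by the kernel are evenly spaced by about 141414.14. That spacing is what a 100-point linear sweep from 10 MHz to 24 MHz would give (step 14e6/99). The end point and the point count are inferred from the spacing only, since the log is cut off and the kernel source is not shown here. A quick check in plain Python:

```python
# Hypothetical reconstruction of the sweep: 100 points from 10 MHz to 24 MHz.
start, stop, points = 10e6, 24e6, 100
step = (stop - start) / (points - 1)

# A few of the values printed on the UART (taken from the log above),
# together with their position in the sequence of printed lines.
logged = [10000000.000000, 10141414.141414, 10989898.989898, 12969696.969696]
indices = [0, 1, 7, 21]

for k, value in zip(indices, logged):
    expected = start + k * step
    # The log appears truncated to six decimals, so allow a small tolerance.
    assert abs(expected - value) < 1e-3, (k, expected, value)

print(f"step = {step:.6f}")
```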
hartytp commented 6 years ago

NB: I haven't built with @jordens' latest finds yet.

hartytp commented 6 years ago

I take that back.

A while later, I got some \FF printed to the UART.

Then I rebooted, and the memory test gave all 1s.

hartytp commented 6 years ago

~@sbourdeauducq what do you expect to see on the UART if that Kernel runs correctly?~

hartytp commented 6 years ago

Hmm... after a reboot, the kernel ran (same output on the UART) and, 5 minutes later, there was no crash AFAICT. After reloading the AMC FPGA with artiq_flash -t sayma ... start, the memory test looks good and the kernel runs again with the same output.

... a while later, @ appeared on the UART. But artiq_flash ... start still booted correctly afterwards.

sbourdeauducq commented 6 years ago

ok, same crash here, but it happens faster...

sbourdeauducq commented 6 years ago

Can you try running the reboot loops longer? @gkasprow Can you test this too? Any idea why this happens?

sbourdeauducq commented 6 years ago

I tried running the crash kernel on the Sayma DRTIO master, which does not use the HMC clock chips (or anything on the RTM). It does NOT crash. Could it be a power integrity problem?

sbourdeauducq commented 6 years ago

Tried adding some of the problematic buffers to the DRTIO master - still no crash when running the kernel.

@@ -301,6 +299,22 @@ class Master(MiniSoC, AMPSoC):
         self.submodules += Microscope(platform.request("serial", 1),
                                       self.clk_freq)

+        self.clock_domains.cd_jesd = ClockDomain()
+        refclk_pads = platform.request("dac_refclk", 0)
+        refclk2 = Signal()
+        platform.add_period_constraint(refclk_pads.p, 1e9/150e6)
+        self.specials += [
+            Instance("IBUFDS_GTE3", i_CEB=0, p_REFCLK_HROW_CK_SEL=0b00,
+                     i_I=refclk_pads.p, i_IB=refclk_pads.n,
+                     o_ODIV2=refclk2),
+            Instance("BUFG_GT", i_I=refclk2, o_O=self.cd_jesd.clk)
+        ]
+        blink_counter = Signal(28)
+        blink = Signal()
+        self.sync.jesd += blink_counter.eq(blink_counter + 1)
+        self.comb += blink.eq(blink_counter[-1])
+        self.submodules += add_probe_async("blink", "blink", blink)
+
         # Si5324 used as a free-running oscillator, to avoid dependency on RTM.
         self.submodules.si5324_rst_n = gpio.GPIOOut(platform.request("si5324").rst_n)
         self.csr_devices.append("si5324_rst_n")

The blink signal behaves erratically however (with the RTM FPGA freshly loaded) - maybe the rework on my board is broken and the counter is "clocked" by the 7043 noise. How is it on other boards?
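
For comparison, the expected behaviour of blink can be estimated from the patch above. The sketch below assumes the reference clock really is 150 MHz (as in the period constraint) and considers both possibilities for ODIV2, undivided or divided by two, since which one applies depends on REFCLK_HROW_CK_SEL and should be checked against the Xilinx documentation. With a 28-bit counter, the MSB should toggle steadily with a period of a few seconds; anything much faster or irregular would suggest the counter is not seeing a clean clock.

```python
REFCLK_HZ = 150e6      # from add_period_constraint(refclk_pads.p, 1e9/150e6)
COUNTER_BITS = 28      # blink_counter = Signal(28)


def blink_period_s(clock_hz, bits=COUNTER_BITS):
    """Period of the counter MSB: one full counter wrap takes 2**bits cycles."""
    return 2**bits / clock_hz


# Both cases are listed because the ODIV2 divide ratio is an assumption here.
for label, clock_hz in [("ODIV2 undivided", REFCLK_HZ),
                        ("ODIV2 divided by 2", REFCLK_HZ / 2)]:
    print(f"{label}: {blink_period_s(clock_hz):.2f} s")
```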

sbourdeauducq commented 6 years ago

No crash either when the kernel is run on a --without-sawg build. @gkasprow @hartytp @jbqubit Can you reproduce this? This starts to look like PI or a Vivado issue...

sbourdeauducq commented 6 years ago

@gkasprow Can you measure the supply voltages as close as possible to the FPGA, with the SAWG running?

hartytp commented 6 years ago

> No crash either when the kernel is run on a --without-sawg build. @gkasprow @hartytp @jbqubit Can you reproduce this?

I'll have a look at that on Monday.

> or a Vivado issue...

Try building with 2017.x?

hartytp commented 6 years ago

> The blink signal behaves erratically however (with the RTM FPGA freshly loaded) - maybe the rework on my board is broken and the counter is "clocked" by the 7043 noise. How is it on other boards?

Did you disable the firmware line that sets the HMC7043 RESET LOW in the standalone build? If that doesn't stop the DAC PLLs locking, then there is something wrong with your rework and all bets are off.

Also, are you doing this after @jordens' latest round of reviews, which caught some incorrect logic levels/terminations/etc.?

sbourdeauducq commented 6 years ago

Bad PI would also explain the trashing of the whole FPGA by the 7043. The noise would create increased switching and power consumption...

gkasprow commented 6 years ago

PI was carefully simulated, so only transients on the supply rails could be the source of the issues. I'll look at the power rails on Tuesday and also try to reproduce the problem.

marmeladapk commented 6 years ago

@sbourdeauducq We'll build Artiq with your patch and check on Tuesday.

sbourdeauducq commented 6 years ago

Same behavior with Vivado 2017.4 (meets timing, crashes with the test kernel).

sbourdeauducq commented 6 years ago

> I'll look at the power rails on Tuesday and also try to reproduce the problem.

@gkasprow What did you find?

hartytp commented 6 years ago

@gkasprow @marmeladapk it might also be worth double-checking that we really have followed all Xilinx user guides on power supplies, decoupling, etc., and that the decoupling capacitors have the correct voltage rating, etc.

sbourdeauducq commented 6 years ago

One workaround is to run the SAWG kernels over DRTIO. Both the Sayma DRTIO master and satellite seem surprisingly unaffected by the crashes, so far. If the master crashes it can in theory be replaced by Kasli. For using the Sayma DRTIO satellite, connect a coax cable between the clock output SMA on the AMC and the clock input SMA on the RTM.

marmeladapk commented 6 years ago

With boot::jump(0) Sayma goes through one full init sequence, reboots and then hangs after enabling hmc7043 is printed. Reproduced across many manual restarts and power cycling. Nothing out of the ordinary happens in memtests. Log

hartytp commented 6 years ago

@marmeladapk can you look at the HMC7043 outputs on the UFL connectors and see what the signal there looks like during this test?

Can you try removing the AC coupling capacitors that connect the HMC7043 to the FPGAs and see if this fixes the issue? If so, can you try adding the FET switches to the HMC7043?

gkasprow commented 6 years ago

I will test it in a moment.

hartytp commented 6 years ago

Note to self: one potentially significant difference between the DRTIO and standalone Sayma builds is how the RTIO is clocked/reset. In standalone, the RTIO logic is clocked from the HMC7043 output and reset by the same signal that enables the clock input buffers:

https://github.com/m-labs/artiq/blob/0c32d07e8b59e60ee304753e732bfa0c07e0daa6/artiq/gateware/targets/sayma_amc.py#L60

So, the DAC init code also brings the RTIO core out of reset.

@sbourdeauducq should there be some delay between enabling the clock input buffers and bringing the RTIO core and other logic out of reset? Should we have two different CSRs? One to enable the CBs and one for the RTIO/JESD reset line.
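
To illustrate the question, here is a purely behavioural Python model (not gateware, and not how sayma_amc.py currently works) of the two-step sequence suggested above: enable the clock buffers first, then keep the RTIO/JESD reset asserted for a fixed number of cycles before releasing it. The class name, the two control bits and the 16-cycle hold-off are all hypothetical, and in real gateware the hold-off counter would have to run from a clock that is already stable.

```python
class ResetSequencer:
    """Hypothetical model: reset is released only once release has been
    requested AND the clock buffer has been enabled for `holdoff` cycles."""

    def __init__(self, holdoff=16):
        self.holdoff = holdoff
        self.buffer_enable = False   # would map to one CSR
        self.reset_release = False   # would map to a second CSR
        self._stable_cycles = 0

    def tick(self):
        """Advance one clock cycle; return True while reset is asserted."""
        if self.buffer_enable:
            self._stable_cycles = min(self._stable_cycles + 1, self.holdoff)
        else:
            self._stable_cycles = 0
        released = self.reset_release and self._stable_cycles >= self.holdoff
        return not released


seq = ResetSequencer(holdoff=16)
seq.buffer_enable = True
seq.reset_release = True           # requested at the same time as the enable
trace = [seq.tick() for _ in range(20)]
print(trace.index(False))          # first cycle at which reset is released
```

Even when both bits are set together, the model keeps the reset asserted until the hold-off has elapsed, which is the kind of delay being asked about.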

gkasprow commented 6 years ago

I removed the caps that connect the HMC7043 to the AMC FPGA: ADC SYSREF, DAC SYSREF, GTP1, GTP2, AMC MASTER AUX.

marmeladapk commented 6 years ago

@hartytp It made it worse; now Sayma always locks up on the HMC830.

gkasprow commented 6 years ago

I don't see any link between removing the caps and hmc830 locking

hartytp commented 6 years ago

Can you post logs?

I'm not sure I follow you. Do you mean that the board now locks up during HMC830 init?

marmeladapk commented 6 years ago

@hartytp With boot::jump(0), Sayma would do one full cycle from memtest to DAC init, then reboot and hang on HMC7043 init. Now it always hangs on HMC830 lock.

ed: log

sbourdeauducq commented 6 years ago

That's not hanging, that's the HMC830 not locking.

marmeladapk commented 6 years ago

@sbourdeauducq True, sorry for confusion.

marmeladapk commented 6 years ago

@hartytp It may be a red herring - now it fails to identify

invalid HMC830 ID: 0x00797500
panic at src/libcore/result.rs:945:5: cannot initialize HMC830/7043: "invalid HMC830 identification"

Perhaps the chip is failing.

gkasprow commented 6 years ago

The HMC830 has ID 0xA7975, so it looks like the A character is missing.
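
Comparing the two values bit by bit supports this. In the sketch below (plain Python), the value from the log matches the expected ID shifted left by eight bits, except for the leading A nibble, which reads as zero. The reason for the eight-bit offset is not established in this thread.

```python
expected_id = 0xA7975          # ID quoted above
logged_id = 0x00797500         # from "invalid HMC830 ID: 0x00797500"

shifted = expected_id << 8     # align the expected ID with the logged value
differing = shifted ^ logged_id  # bits that differ between the two

print(f"shifted expected: 0x{shifted:08x}")
print(f"differing bits:   0x{differing:08x}")
```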

marmeladapk commented 6 years ago

With the Forth code, the ID is read correctly, but the chip fails to lock.

marmeladapk commented 6 years ago

Since it only happens on one RTM, it seems to be a hardware bug/failure. We'll continue the discussion in https://github.com/sinara-hw/sinara/issues/472 .

Unfortunately, it means we don't have a fully working setup ATM (on the other RTM both DAC chips fail to init), so I cannot test memory in the same config as @sbourdeauducq. I saw that with today's build ARTIQ continues booting even if the DACs are not properly initialised, so I can test with that setup.

hartytp commented 6 years ago

Did the DACs fail to init before you removed those resistors?

marmeladapk commented 6 years ago

@hartytp yes.

gkasprow commented 6 years ago

@sbourdeauducq do you use FPGA_DAC_SYSREF from HMC7043 to AMC FPGA? It has very low amplitude, roughly 200mV while other signals are 1V or more. Another clock that is fed to the AMC FPGA (AMC_MASTER_AUX_CLK) is disabled, only GTP_CLK1 is on.

another question - I did not do any rework on HMC7043 reset pin. Does it affect DAC operation?

sbourdeauducq commented 6 years ago

> do you use FPGA_DAC_SYSREF from HMC7043 to AMC FPGA?

Yes. What can cause the lower amplitude?

> another question - I did not do any rework on HMC7043 reset pin. Does it affect DAC operation?

It seems to make the crashes worse.

hartytp commented 6 years ago

SYSREF is LVDS; the others are LVPECL.

hartytp commented 6 years ago

@gkasprow HMC7043 reset seems to help. Connect to FPGA and pull HIGH.

gkasprow commented 6 years ago

Can we set the HMC7043 FPGA SYSREF output to LVPECL? It is AC-terminated, so it does not matter. At the moment it has two 200R to GND, so the amplitude gets attenuated seriously. I'm worried about the low amplitude.

hartytp commented 6 years ago

@gkasprow two options:

  1. Revert the recent firmware commit that changed that output to LVDS. But do check that we don't exceed the max input on the MAX FPGA.
  2. Remove bias resistors and remeasure the voltage
hartytp commented 6 years ago

Still, I don't think that the crashes are related to a low SYSREF signal.

gkasprow commented 6 years ago

I'm not worried about crashes but errors on JESD links. I observe them on two boards. @marmeladapk just modified registers.

hartytp commented 6 years ago

I guess the load is quite high because we also use the HMC7043 internal 100R termination, which explains the small signal.

Do check the max diff signal for those 1v8 inputs.

gkasprow commented 6 years ago

With LVPECL I observe 0.8Vpp on each output, so we won't break the LVDS input. But it did not help and I still get PRBS errors.

gkasprow commented 6 years ago

I reverted to the original HMC7043 reset and added a 4k7 pull-up, but this did not help.

hartytp commented 6 years ago

Do you have the capacitors connecting the AMC to the HMC7043 depopulated?

hartytp commented 6 years ago

Otherwise I'm not sure; I haven't had that error.