m-labs / artiq

A leading-edge control system for quantum information experiments
https://m-labs.hk/artiq
GNU Lesser General Public License v3.0
437 stars 201 forks source link

Sayma: memory corruption #1065

Closed hartytp closed 6 years ago

hartytp commented 6 years ago
 __  __ _ ____         ____ 
|  \/  (_) ___|  ___  / ___|
| |\/| | \___ \ / _ \| |    
| |  | | |___) | (_) | |___ 
|_|  |_|_|____/ \___/ \____|

MiSoC Bootloader
Copyright (c) 2017-2018 M-Labs Limited

Bootloader CRC passed
Gateware ident 4.0.dev+1133.g0b086225
Initializing SDRAM...
DQS initial delay: 96 taps
Write leveling scan:
Module 3:
00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000010111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111110000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
Module 2:
00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000011111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111110000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
Module 1:
00000000000000000000000000000000000000000000000000000000000000000000000000000000000000001111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111110000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
Module 0:
00000000000000000000000000000000000000000000000000000000000000000000000001000111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
DQS initial delay: 96 taps
Write leveling: 72 82 102 97 done
Read leveling scan:
Module 3:
00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000011001111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111101001000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
Module 2:
00000000000000000000000000000000000000000000000000000000000000000000000000000101111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111010000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
Module 1:
00000000000000000000000000000000000000000000000000000000000001101111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111110111000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
Module 0:
00000000000000000000000000000000000000000000000000000000111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111101001000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
Read leveling: 172+-75 159+-83 143+-81 136+-80 done
SDRAM initialized
Memory test passed

Booting from flash...
Starting firmware.
[     0.000007s]  INFO(runtime): ARTIQ runtime starting...
[     0.003881s]  INFO(runtime): software version 4.0.dev+1133.g0b086225
[     0.010232s]  INFO(runtime): gateware version 4.0.dev+1133.g0b086225
[     0.016597s]  INFO(runtime): log level set to INFO by default
[     0.022311s]  INFO(runtime): UART log level set to INFO by default
[     0.028454s]  INFO(board_artiq::serwb): waiting for AMC/RTM serwb bridge to be ready...
[     0.063183s]  INFO(board_artiq::serwb): done.
[     0.066213s]  INFO(board_artiq::serwb): RTM to AMC Link test
[     1.548256s]  INFO(board_artiq::serwb): 0 errors
[     1.551560s]  INFO(board_artiq::serwb): AMC to RTM Link test
[     3.033609s]  INFO(board_artiq::serwb): 0 errors
[     3.036907s]  INFO(board_artiq::serwb): Wishbone test...
[     4.968923s]  INFO(board_artiq::serwb): 0 errors
[     4.972230s]  INFO(board_artiq::serwb): AMC serwb settings:
[     4.977779s]  INFO(board_artiq::serwb):   bitslip: 10
[     4.982815s]  INFO(board_artiq::serwb):   ready: 1
[     4.987590s]  INFO(board_artiq::serwb):   error: 0
[     4.992366s]  INFO(board_artiq::serwb): RTM serwb settings:
[     4.997932s]  INFO(board_artiq::serwb):   bitslip: 35
[     5.002968s]  INFO(board_artiq::serwb):   ready: 1
[     5.007743s]  INFO(board_artiq::serwb):   error: 0
[     5.012754s]  INFO(board_artiq::serwb): RTM gateware version 4.0.dev+1133.g0b086225
[     5.020158s]  INFO(runtime): press 'e' to erase startup and idle kernels...
[     6.020007s]  INFO(runtime): continuing boot
[     6.023065s]  INFO(board_artiq::hmc830_7043::hmc830): HMC830 found
[     6.029133s]  INFO(board_artiq::hmc830_7043::hmc830): loading configuration...
[     6.036641s]  INFO(board_artiq::hmc830_7043::hmc830):   ...done
[     6.042230s]  INFO(board_artiq::hmc830_7043::hmc830): waiting for lock...
[     6.049036s]  INFO(board_artiq::hmc830_7043::hmc830):   ...locked
[     6.055080s]  INFO(board_artiq::hmc830_7043::hmc7043): enabling hmc7043
[     6.061955s]  INFO(board_artiq::hmc830_7043::hmc7043): HMC7043 found
[     6.068034s]  INFO(board_artiq::hmc830_7043::hmc7043): loading configuration...
[     6.077314s]  INFO(board_artiq::hmc830_7043::hmc7043):   ...done
@ panic at src/libcore/fmt/mod.rs:1096:40: index out of bounds: the len is 1 but the index is 1074179992
backtrace for software version 4.0.dev+1133.g0b086225:
0x40003334
0x4001513c
0x40015090
0x40015d4c
0x40002e30
0x40002c84
halting.
use `artiq_coreconfig write -s panic_reset 1` to restart instead

https://drive.google.com/open?id=1ZyI_0S0IJ-oKc15RgisTW4FABeBQvuub

jordens commented 6 years ago

Or a legitimate bug in the stack. ~Could you addr2line it?~

/home/tph/scratch/artiq/artiq/firmware/libunwind_backtrace/lib.rs:42
 (inlined by) /home/tph/scratch/artiq/artiq/firmware/runtime/main.rs:342
/var/lib/buildbot/slaves/conda-lin64/miniconda/conda-bld/rust-core-or1k_1526382696973/work/src/libcore/panicking.rs:72
/var/lib/buildbot/slaves/conda-lin64/miniconda/conda-bld/rust-core-or1k_1526382696973/work/src/libcore/panicking.rs:58
/var/lib/buildbot/slaves/conda-lin64/miniconda/conda-bld/rust-core-or1k_1526382696973/work/src/libcore/fmt/mod.rs:1096
 (inlined by) /var/lib/buildbot/slaves/conda-lin64/miniconda/conda-bld/rust-core-or1k_1526382696973/work/src/libcore/fmt/mod.rs:1047
/var/lib/buildbot/slaves/conda-lin64/miniconda/conda-bld/rust-core-or1k_1526382696973/work/src/libcore/fmt/mod.rs:226
 (inlined by) /home/tph/scratch/artiq/artiq/firmware/runtime/main.rs:309
/home/tph/scratch/artiq/artiq/firmware/runtime/main.rs:319

This is hexdump(pc) crashing on the first println(). @whitequark

hartytp commented 6 years ago

also: https://hastebin.com/xabopifuwu.go

hartytp commented 6 years ago

I also had one occasion where Sayma crashed after loading HMC7043...done

whitequark commented 6 years ago

No, this is definitely memory corruption. Look at this line:

@ panic at src/libcore/fmt/mod.rs:1096:40: index out of bounds: the len is 1 but the index is 1074179992

It started printing a memory dump for a hardware exception in this function:

            fn hexdump(addr: u32) {
                let addr = (addr - addr % 4) as *const u32;
                let mut ptr  = addr;
                println!("@ {:08p}", ptr); // here
                for _ in 0..4 {
                    print!("+{:04x}: ", ptr as usize - addr as usize);
                    print!("{:08x} ",   unsafe { *ptr }); ptr = ptr.wrapping_offset(1);
                    print!("{:08x} ",   unsafe { *ptr }); ptr = ptr.wrapping_offset(1);
                    print!("{:08x} ",   unsafe { *ptr }); ptr = ptr.wrapping_offset(1);
                    print!("{:08x}\n",  unsafe { *ptr }); ptr = ptr.wrapping_offset(1);
                }
            }

and crashed. The UART write function looks like:

        for c in s.bytes() {
            unsafe {
                while csr::uart::txfull_read() != 0 {}
                csr::uart::rxtx_write(c)
            }
        }

so it crashed somewhere between the space after @ and the leading 0 of the pointer 0x. The fmt/mod.rs:1096 line looks like:

        let value = match arg.position {
            rt::v1::Position::Next => { *self.curarg.next().unwrap() }
            rt::v1::Position::At(i) => self.args[i], // here
        };

so the arg.position is wrong, and arg.position is generated by rustc. The specific value stored in it, 0x4006af98, would never be written there even if rustc or linker had a bug, since there is no relocation pointing at that word and no way for rustc to put a pointer there.

The only place where 0x4006af98 is loaded is:

4003c6a8 <_ZN11board_artiq11hmc830_70436hmc8304init17h2a543f37436723abE>:
...
4003cce8: l.movhi r3,0x4006
4003ccec: l.ori r3,r3,0xaf98
4003ccf0: l.sw -88(r2),r3
4003ccf4: l.sw -48(r2),r18
4003ccf8: l.sw -84(r2),r18
4003ccfc: l.movhi r3,0x4006 ; overwritten here

and located at that address is a " ...locked":

 4006af80 6c6f636b 2074696d 656f7574 20202e2e  lock timeout  ..
 4006af90 2e6c6f63 6b656400 4006af8c 0000000b  .locked.@.......

Here's the annotated assembly for hexdump:

/home/tph/scratch/artiq/artiq/firmware/runtime/main.rs:309
40002dcc:       18 60 40 02     l.movhi r3,0x4002
40002dd0:       a8 63 d1 ec     l.ori r3,r3,0xd1ec
40002dd4:       d7 e2 1f b0     l.sw -80(r2),r3
40002dd8:       9c 62 ff a8     l.addi r3,r2,-88
40002ddc:       d7 e2 1f ac     l.sw -84(r2),r3
40002de0:       9c 62 ff d4     l.addi r3,r2,-44
_ZN4core3fmt5Write9write_fmt17he4ecc1ba5d6c8c02E():
/var/lib/buildbot/slaves/conda-lin64/miniconda/conda-bld/rust-core-or1k_1526382696973/work/src/libcore/fmt/mod.rs:226
40002de4:       d7 e2 1f b8     l.sw -72(r2),r3
40002de8:       9c 60 00 02     l.addi r3,r0,2
40002dec:       d7 e2 1f c0     l.sw -64(r2),r3 ; pieces.len=2
40002df0:       18 60 40 06     l.movhi r3,0x4006
40002df4:       a8 63 47 34     l.ori r3,r3,0x4734
40002df8:       d7 e2 1f bc     l.sw -68(r2),r3 ; pieces.ptr=&[&str] at 0x40064734
40002dfc:       1b 80 40 06     l.movhi r28,0x4006
40002e00:       a8 7c 47 44     l.ori r3,r28,0x4744
_ZN7runtime9exception7hexdump17h3ae7890725df825dE():
40002e04:       d7 e2 1f 98     l.sw -104(r2),r3
_ZN4core3fmt5Write9write_fmt17he4ecc1ba5d6c8c02E():
/var/lib/buildbot/slaves/conda-lin64/miniconda/conda-bld/rust-core-or1k_1526382696973/work/src/libcore/fmt/mod.rs:226
40002e08:       d7 e2 1f c4     l.sw -60(r2),r3 ; fmt.ptr=&[fmt::rt::v1::Argument] at 0x40064744
40002e0c:       9c 62 ff ac     l.addi r3,r2,-84
40002e10:       d7 e2 1f cc     l.sw -52(r2),r3 ; args.ptr=&[ArgumentV1] on stack
40002e14:       9c 60 00 01     l.addi r3,r0,1
40002e18:       d7 e2 1f c8     l.sw -56(r2),r3 ; fmt.len=1
40002e1c:       d7 e2 1f d0     l.sw -48(r2),r3 ; args.len=1
40002e20:       1b 40 40 06     l.movhi r26,0x4006
40002e24:       a8 9a 8d 5c     l.ori r4,r26,0x8d5c
40002e28:       9c 62 ff b8     l.addi r3,r2,-72
40002e2c:       9c a2 ff bc     l.addi r5,r2,-68
40002e30:       04 00 4a f7     l.jal 40015a0c <_ZN4core3fmt5write17h031cf49cf7af6ca6E>
40002e34:       d7 e2 27 94     l.sw -108(r2),r4

As can be seen, there is only a 3-instruction window where r3 is 0x4006af98. Assuming no CPU bugs this implies that the store that happens right after hmc830 locks (there are no other RAM stores in the loop that polls the lock) misfires and it's written somewhere in rodata instead of stack.

Hope this helps to pinpoint the issue.

hartytp commented 6 years ago

@whitequark is there anything useful in the second crash I posted?

hartytp commented 6 years ago

@whitequark @sbourdeauducq How do we go about debugging this?

Can we hook up gdb to Sayma? Or, do you have a better/different idea for tracking this down?

sbourdeauducq commented 6 years ago

If it's memory corruption, gdb cannot help.

whitequark commented 6 years ago

@whitequark is there anything useful in the second crash I posted?

Hmm. What's at PC definitely doesn't look like code:

   0:   0a 01 f4 41     *unknown*
   4:   82 28 a0 28     *unknown*
   8:   80 2a 04 b0     *unknown*
   c:   44 04 10 91     *unknown*
  10:   0d 00 41 41     l.bnf 4010514 <_binary_x_bin_end+0x40104d4>
  14:   28 22 8a 88     *unknown*
  18:   08 88 00 80     *unknown*
  1c:   41 10 44 14     *unknown*
  20:   04 10 55 01     l.jal 415424 <_binary_x_bin_end+0x4153e4>
  24:   c2 0a 0a aa     l.mtspr r10,r1,0x82aa
  28:   28 a2 a6 28     *unknown*
  2c:   21 44 0b 5d     *unknown*
  30:   40 45 00 00     *unknown*
  34:   88 80 22 ea     l.lws r4,8938(r0)
  38:   22 82 88 c2     *unknown*
  3c:   00 0b 11 51     l.j 2c4580 <_binary_x_bin_end+0x2c4540>

It doesn't look like data either:

 0000 41f4010a 28a02882 b0042a80 91100444  A...(.(...*....D
 0010 4141000d 888a2228 80008808 14441041  AA...."(.....D.A
 0020 01551004 aa0a0ac2 28a6a228 5d0b4421  .U......(..(].D!
 0030 00004540 ea228088 c2888222 51110b00  ..E@."....."Q...

What it does look like is line noise with an average of ~1 bit per nibble set.

Can we hook up gdb to Sayma?

I have the code for doing this but I do not believe it will be useful in any way here.

jordens commented 6 years ago

Remnants from the memory test at boot? Maybe clear all sdram or set it to some known value after testing it.

jordens commented 6 years ago

But either way. Probably just noise. 0x440249cc is also pretty high.

hartytp commented 6 years ago

So....what do we do about this?

gkasprow commented 6 years ago

we can run more advanced memory test that scans all memory. I've noticed that on one of our boards the memory test does not pass anymore. And on another board it simply hangs without warning. We didn't debug it yet.

hartytp commented 6 years ago

I've noticed that on one of our boards the memory test does not pass anymore

That's interesting. Does unplugging the RTM make a difference? It would be great to try to reproduce this in the simplest possible configuration.

NB I get the impression that there are some build-build variations (vivado issues?) so if you find a build that reproduces this reliably, please post it!

marmeladapk commented 6 years ago

@hartytp We'll check without RTM tommorow. We used the same build on several boards and only two showed such issues so it's more likely to be some kind of hardware failure.

hartytp commented 6 years ago

:+1:

gkasprow commented 6 years ago

@marmeladapk we should load some old version of the gateware. Both boards used to work. It's hard to believe they got broken by themself.

hartytp commented 6 years ago

Here is my thinking on this:

How does that sound?

hartytp commented 6 years ago

hmmm...it seems there is definitely a memory corruption issue that's not caused by the HMC7043 https://github.com/m-labs/artiq/pull/1068#issuecomment-396542824

sbourdeauducq commented 6 years ago

Running this as a startup kernel crashes Sayma in 100% of the cases:

from artiq.experiment import *

class SAWGTestSines(EnvExperiment):
    def build(self):
        self.setattr_device("core")
        self.sawgs = [self.get_device("sawg"+str(i)) for i in range(8)]
        self.freq_list = self.linspace(10.0*MHz, 24.0*MHz, 100)

    def linspace(self, start, stop, num):
        dx = (stop-start)/(num-1) 
        return [start+n*dx for n in range(num)]

    @kernel
    def run(self):
        self.core.reset()
        delay(1*ms)

        while True:
            self.k1()            

    @kernel
    def k1(self):
        for freq in self.freq_list:
            delay(300*ms)
            core_log(freq)
            for sawg in self.sawgs:
                sawg.amplitude1.set(0.99)
                sawg.frequency0.set(freq)
                sawg.frequency1.set(5*MHz)
                sawg.frequency2.set(-5*MHz)
                delay(1*ms)

I have not found yet what exactly crashes; making minor changes tends to make the crashes disappear.

Symptoms of the crash vary between reboot, freeze, and garbage printed on the UART.

According to Joe the crash also occurs when this is run over Ethernet.

hartytp commented 6 years ago

While leaving Sayma idling, I see panic at runtime/main.rs:321:13: exception Alignment at PC 0x4002ee84, EA 0x4015a184

https://hastebin.com/efofewifey.go

Startup kernel is: https://hastebin.com/ilawajuqog.py

sbourdeauducq commented 6 years ago

Running this as a startup kernel crashes Sayma in 100% of the cases:

The same kernel does not crash the KC705, so it's not a compiler or runtime problem.

sbourdeauducq commented 6 years ago

Another data point. Making a reboot loop by jumping to the bootloader right after HMC830 and HMC7043 initialization does not crash; the board went on for over an hour and behaved normally.

diff --git a/artiq/firmware/runtime/main.rs b/artiq/firmware/runtime/main.rs
index f5c0a0202..560aa0de2 100644
--- a/artiq/firmware/runtime/main.rs
+++ b/artiq/firmware/runtime/main.rs
@@ -108,6 +108,7 @@ fn startup() {
     #[cfg(has_hmc830_7043)]
     /* must be the first SPI init because of HMC830 SPI mode selection */
     board_artiq::hmc830_7043::init().expect("cannot initialize HMC830/7043");
+    unsafe { boot::jump(0); }
     #[cfg(has_ad9154)]
     board_artiq::ad9154::init().expect("cannot initialize AD9154");
     #[cfg(has_allaki_atts)]

Moving the jump after the DAC initialization:

diff --git a/artiq/firmware/runtime/main.rs b/artiq/firmware/runtime/main.rs
index f5c0a0202..6a580d37d 100644
--- a/artiq/firmware/runtime/main.rs
+++ b/artiq/firmware/runtime/main.rs
@@ -110,6 +110,7 @@ fn startup() {
     board_artiq::hmc830_7043::init().expect("cannot initialize HMC830/7043");
     #[cfg(has_ad9154)]
     board_artiq::ad9154::init().expect("cannot initialize AD9154");
+    unsafe { boot::jump(0); }
     #[cfg(has_allaki_atts)]
     board_artiq::hmc542::program_all(8/*=4dB*/);

causes insane behavior in parts not related to the DAC after a few reboots, e.g.:

Initializing SDRAM...
DQS initial delay: 98 taps
Write leveling scan:
Module 3:
000000000000000000000000000000000000000000000000000000000000000000000000000000010001011001111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111100101111101000100010000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000100
Module 2:
000000000000000000000000000000000000000000000000000000000000000000100111111110111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111110100100100000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000100011101
Module 1:
000000000000000000000000000000000000000000000000000000000000000000101000101111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111000110000001000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000010000000000100010000000101001
Module 0:
000000000000000000000000000000000000000000000000000000100101001011111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111010000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000010000000000000000000000000000000101010
DQS initial delay: 98 taps
Write leveling: 60 69 67 79 done
Read leveling scan:
Module 3:
00000000000000000000000000000000000010000000000000000000101100000110100011101101111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111110111101011111011010110100100010110000000010010000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
Module 2:
00000000000000000000000000000000000000000000101000001010110000001011111111111111011111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111101111101101011101000101111000000100000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
Module 1:
00000000000000000000000000010101100100010011111111011011111111011111111111111111111111111111111111111111111111111111111111111111111111111111111111111110111111101111101111110110010100001110000100000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
Module 0:
00000000000000000000000010110010101000110011111111101111111111111111111111111111111111111111111111111101111111111111111111111111111111111111111111011011100010010110100000011100100000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
Read leveling: 132+-23 122+-27 101+-27 86+-25 done
SDRAM initialized
Memory test failed (83/4458496 words incorrect)
Halting.

(the errors are varied; sometimes it's "Memory test failed" as above, sometimes it's "illegal instruction" during serwb initialization, etc.)

sbourdeauducq commented 6 years ago

@hartytp @gkasprow @jbqubit @jordens Can you reproduce those "reboot loop" results?

sbourdeauducq commented 6 years ago

It's jesd_unreset() that causes the crashy behavior.

This causes mayhem:

diff --git a/artiq/firmware/libboard_artiq/ad9154.rs b/artiq/firmware/libboard_artiq/ad9154.rs
index 02f2d7a77..6fd556c49 100644
--- a/artiq/firmware/libboard_artiq/ad9154.rs
+++ b/artiq/firmware/libboard_artiq/ad9154.rs
@@ -633,13 +633,13 @@ fn dac_stpl(dacno: u8, m: u8, s: u8) -> Result<(), &'static str> {

 fn dac_cfg(dacno: u8) -> Result<(), &'static str> {
     spi_setup(dacno);
-    jesd_enable(dacno, false);
+    /*jesd_enable(dacno, false);
     jesd_prbs(dacno, false);
     jesd_stpl(dacno, false);
     clock::spin_us(10000);
-    jesd_enable(dacno, true);
+    jesd_enable(dacno, true);*/
     dac_setup(dacno, 6_000_000_000)?;
-    jesd_enable(dacno, false);
+    /*jesd_enable(dacno, false);
     clock::spin_us(10000);
     jesd_enable(dacno, true);
     dac_monitor(dacno);
@@ -665,7 +665,7 @@ fn dac_cfg(dacno: u8) -> Result<(), &'static str> {
     }
     if read(ad9154_reg::INITLANESYNCFLG) != 0xff {
         return Err("bad INITLANESYNCFLG")
-    }
+    }*/
     Ok(())
 }

@@ -730,12 +730,12 @@ pub fn init() -> Result<(), &'static str> {
         dac_detect(dacno)?;
         dac_cfg_retry(dacno)?;
         // Run the PRBS, STPL and SYSREF scan tests
-        dac_prbs(dacno)?;
+        /*dac_prbs(dacno)?;
         dac_stpl(dacno, 4, 2)?;
         dac_sysref_scan(dacno);
         // Set SYSREF phase and reconfigure the DAC
         dac_sysref_cfg(dacno, 88);
-        dac_cfg_retry(dacno)?;
+        dac_cfg_retry(dacno)?;*/
     }

     Ok(())
diff --git a/artiq/firmware/runtime/main.rs b/artiq/firmware/runtime/main.rs
index f5c0a0202..6a580d37d 100644
--- a/artiq/firmware/runtime/main.rs
+++ b/artiq/firmware/runtime/main.rs
@@ -110,6 +110,7 @@ fn startup() {
     board_artiq::hmc830_7043::init().expect("cannot initialize HMC830/7043");
     #[cfg(has_ad9154)]
     board_artiq::ad9154::init().expect("cannot initialize AD9154");
+    unsafe { boot::jump(0); }
     #[cfg(has_allaki_atts)]
     board_artiq::hmc542::program_all(8/*=4dB*/);

Comment out jesd_unreset() and the board behaves normally.

enjoy-digital commented 6 years ago

@sbourdeauducq: in your first case, the HMC830 and HMC7043 are initialized, but the buffers on the FPGA inputs are still disabled. (We enable them when initializing the DAC). If HMC7043 is now no longer supposed to generate broadband noise, we could remove the enable on the buffers: https://github.com/m-labs/artiq/blob/master/artiq/gateware/targets/sayma_amc.py#L62 https://github.com/m-labs/artiq/blob/master/artiq/gateware/targets/sayma_amc.py#L72

I think that would be interesting to do your first test with the buffers always enabled.

enjoy-digital commented 6 years ago

@sbourdeauducq: so let's remove the enables on the buffers (or add a separate control for them) and redo your first test. If that's still working fine, then it's related to the JESD/DACs. If not, then to the HMC830/HMC7043.

hartytp commented 6 years ago

If HMC7043 is now no longer supposed to generate broadband noise, we could remove the enable on the buffers:

With the pull-ups and RESET line, it doesn't generate noise. However, it does generate a clock at the wrong frequency for a short time during the HMC7043 init. Not sure how much of a problem that is.

I think that would be interesting to do your first test with the buffers always enabled.

Yes.

sbourdeauducq commented 6 years ago

With this patch the board simply never boots.

diff --git a/artiq/firmware/runtime/main.rs b/artiq/firmware/runtime/main.rs
index f5c0a0202..560aa0de2 100644
--- a/artiq/firmware/runtime/main.rs
+++ b/artiq/firmware/runtime/main.rs
@@ -108,6 +108,7 @@ fn startup() {
     #[cfg(has_hmc830_7043)]
     /* must be the first SPI init because of HMC830 SPI mode selection */
     board_artiq::hmc830_7043::init().expect("cannot initialize HMC830/7043");
+    unsafe { boot::jump(0); }
     #[cfg(has_ad9154)]
     board_artiq::ad9154::init().expect("cannot initialize AD9154");
     #[cfg(has_allaki_atts)]
diff --git a/artiq/gateware/targets/sayma_amc.py b/artiq/gateware/targets/sayma_amc.py
index 6f34e8663..17f5b175d 100755
--- a/artiq/gateware/targets/sayma_amc.py
+++ b/artiq/gateware/targets/sayma_amc.py
@@ -59,7 +59,7 @@ class AD9154CRG(Module, AutoCSR):

         platform.add_period_constraint(refclk_pads.p, 1e9/self.refclk_freq)
         self.specials += [
-            Instance("IBUFDS_GTE3", i_CEB=self.jreset.storage, p_REFCLK_HROW_CK_SEL=0b00,
+            Instance("IBUFDS_GTE3", i_CEB=0, p_REFCLK_HROW_CK_SEL=0b00,
                      i_I=refclk_pads.p, i_IB=refclk_pads.n,
                      o_O=self.refclk, o_ODIV2=refclk2),
             Instance("BUFG_GT", i_I=refclk2, o_O=self.cd_jesd.clk),
@@ -67,9 +67,7 @@ class AD9154CRG(Module, AutoCSR):
         ]

         jref = platform.request("dac_sysref")
-        self.specials += Instance("IBUFDS_IBUFDISABLE",
-            p_USE_IBUFDISABLE="TRUE", p_SIM_DEVICE="ULTRASCALE",
-            i_IBUFDISABLE=self.jreset.storage,
+        self.specials += Instance("IBUFDS",
             i_I=jref.p, i_IB=jref.n,
             o_O=self.jref)

Bootloader CRC passed
Gateware ident 4.0.dev+1148.gf385add8.dirty
Initializing SDRAM...
DQS initial delay: 90 taps
Write leveling scan:
Module 3:
00000000000000000000000000000000000000000000000000000000000000000000000001011111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111101111001001000001000010000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000001000100000000000001100001000111111011111111
Module 2:
00000000000000000000000000000000000000000000000000000000100011111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111101011000010000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000001000001000011011101110111111111111111111111111
Module 1:
00000000000000000000000000000000000000000000000000000000110001011111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111110110111001100000011100000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000100000001101001111001110111111111111111111111
Module 0:
00000000000000000000000000000000000000000000000000011111110111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111010000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000010000000100000011011000001111011101110111111111111111111111
DQS initial delay: 90 taps
Write leveling: 46 51 56 62 done
Read leveling scan:
Module 3:
00000000000000000000000000000000010000000000010110101100011101010101011111100111011111111111011111111111111111111111111111111111111111111111111111111111011111111101010111111111000010110000010010000000000000010000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
Module 2:
00000000000000010100000000001000100000010010010010100110011111111111111111011111111111111111111111111111111111111111111111111111111110111111111111111111101100011001111110110111010010000000010000000000000000000000100000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
Module 1:
00000000000000000100000010000000001011111111110111111111111111111111111111111111111111111111111111111111111111111111111111111111111101111111011110111100010010010100001010000000100010000010100000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
Module 0:
11001011101111000101011101011101111111110111111101111111111111111111111111111111111111111111111111111111111111111111111011111111011001000001100100100000100000001000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
Read leveling: 113+-7 98+-15 90+-7 74+-11 done
SDRAM initialized
Memory test failed (4670/4458496 words incorrect)
Halting.
hartytp commented 6 years ago

@sbourdeauducq interesting! The conclusion here is that the HMC7043 is still interfering with the FPGA, at least after recovering from a crash -- even with the RESET connected and pulled to 3V3.

The next obvious thing to do is to look at the HMC7043 outuputs on a scope. I'm happy to do that now if you can give me complete instructions for how to reproduce this, including relevant binaries.

sbourdeauducq commented 6 years ago

Simply apply the patch above - then it doesn't boot at all and fails memtest! No need to have a prior crash, this also happens right after a power cycle.

sbourdeauducq commented 6 years ago

And loading the RTM FPGA, which is supposed to hold the 7043 in reset, does not help.

sbourdeauducq commented 6 years ago

BTW, all the tests above were done with the LOCs and CLOCK_ROOTs removed, so it's not that interfering either.

sbourdeauducq commented 6 years ago

https://m-labs.hk/sayma/sayma_crash.tar.bz2 is with the patch above, the LOCs/CLOCK_ROOTs removed, and Vivado 2018.1 But it may be better that you try compiling yourself as well.

sbourdeauducq commented 6 years ago

Tomorrow I'll try with the RTM disconnected; can you try that too?

jbqubit commented 6 years ago

I'm at a meeting away from Maryland so can't test further until Monday.

hartytp commented 6 years ago

What startup Kernel are you using? And, other than the patch, this is with the current ARTIQ master, right?

sbourdeauducq commented 6 years ago

What startup Kernel are you using?

Irrelevant.

And, other than the patch, this is with the current ARTIQ master, right?

Yes. f385add8b1

hartytp commented 6 years ago

And, without SAWG is ok?

sbourdeauducq commented 6 years ago

Not tried. I only built with SAWG.

hartytp commented 6 years ago

@sbourdeauducq using the binaries you posted above, I see the following: https://drive.google.com/open?id=1UFnlC5iAUXjeOq-2ejaEp8Zt9Ez93hm-

sbourdeauducq commented 6 years ago

Can you try the to reproduce the reboot loops (and the crashes) here: https://github.com/m-labs/artiq/issues/1065#issuecomment-397527119

hartytp commented 6 years ago

@sbourdeauducq Sure.

To be clear:

hartytp commented 6 years ago

Also, are you sure that you have the HMC7043 rework done correctly, and that you're holding the chip in reset mode during boot? Can you try removing the AC coupling caps that connect the HMC7043 to the two FPGAs on your board?

sbourdeauducq commented 6 years ago

The next test you want me to perform is moving the jump(0) to after the DAC reset, while still keeping the FPGA input buffers permanently enabled, right?

No, unmodified bitstream.

BTW: does the test kernel above crash your board?

sbourdeauducq commented 6 years ago

I don't have the pullup, but if I keep the RTM FPGA loaded, in theory, it should not matter...

hartytp commented 6 years ago

No, unmodified bitstream.

You mean taking ARTIQ master and applying only this patch:

diff --git a/artiq/firmware/runtime/main.rs b/artiq/firmware/runtime/main.rs
index f5c0a0202..6a580d37d 100644
--- a/artiq/firmware/runtime/main.rs
+++ b/artiq/firmware/runtime/main.rs
@@ -110,6 +110,7 @@ fn startup() {
     board_artiq::hmc830_7043::init().expect("cannot initialize HMC830/7043");
     #[cfg(has_ad9154)]
     board_artiq::ad9154::init().expect("cannot initialize AD9154");
+    unsafe { boot::jump(0); }
     #[cfg(has_allaki_atts)]
     board_artiq::hmc542::program_all(8/*=4dB*/);

BTW: does the test kernel above crash your board?

will try that soon.

I don't have the pullup, but if I keep the RTM FPGA loaded, in theory, it should not matter...

true.

sbourdeauducq commented 6 years ago

You mean taking ARTIQ master and applying only this patch:

Correct.

hartytp commented 6 years ago

Okay, I'll build that now (with SAWG).

hartytp commented 6 years ago

@sbourdeauducq done. No crashes with that patch either. Log: https://drive.google.com/open?id=1ds1_6zj5BBNqK26HGczKXWkuby6dTMHy

Are you sure that you've applied the HMC7043 rework correctly. Can you patch the runtime to get the RTM to keep the 7043 in reset mode constantly and check that the DAC JESD no links break?

sbourdeauducq commented 6 years ago

Does this kernel crash on your board? https://github.com/m-labs/artiq/issues/1065#issuecomment-396824032