Closed hartytp closed 6 years ago
Or a legitimate bug in the stack. ~Could you addr2line
it?~
/home/tph/scratch/artiq/artiq/firmware/libunwind_backtrace/lib.rs:42
(inlined by) /home/tph/scratch/artiq/artiq/firmware/runtime/main.rs:342
/var/lib/buildbot/slaves/conda-lin64/miniconda/conda-bld/rust-core-or1k_1526382696973/work/src/libcore/panicking.rs:72
/var/lib/buildbot/slaves/conda-lin64/miniconda/conda-bld/rust-core-or1k_1526382696973/work/src/libcore/panicking.rs:58
/var/lib/buildbot/slaves/conda-lin64/miniconda/conda-bld/rust-core-or1k_1526382696973/work/src/libcore/fmt/mod.rs:1096
(inlined by) /var/lib/buildbot/slaves/conda-lin64/miniconda/conda-bld/rust-core-or1k_1526382696973/work/src/libcore/fmt/mod.rs:1047
/var/lib/buildbot/slaves/conda-lin64/miniconda/conda-bld/rust-core-or1k_1526382696973/work/src/libcore/fmt/mod.rs:226
(inlined by) /home/tph/scratch/artiq/artiq/firmware/runtime/main.rs:309
/home/tph/scratch/artiq/artiq/firmware/runtime/main.rs:319
This is hexdump(pc)
crashing on the first println()
.
@whitequark
I also had one occasion where Sayma crashed after loading HMC7043...done
No, this is definitely memory corruption. Look at this line:
@ panic at src/libcore/fmt/mod.rs:1096:40: index out of bounds: the len is 1 but the index is 1074179992
It started printing a memory dump for a hardware exception in this function:
fn hexdump(addr: u32) {
let addr = (addr - addr % 4) as *const u32;
let mut ptr = addr;
println!("@ {:08p}", ptr); // here
for _ in 0..4 {
print!("+{:04x}: ", ptr as usize - addr as usize);
print!("{:08x} ", unsafe { *ptr }); ptr = ptr.wrapping_offset(1);
print!("{:08x} ", unsafe { *ptr }); ptr = ptr.wrapping_offset(1);
print!("{:08x} ", unsafe { *ptr }); ptr = ptr.wrapping_offset(1);
print!("{:08x}\n", unsafe { *ptr }); ptr = ptr.wrapping_offset(1);
}
}
and crashed. The UART write function looks like:
for c in s.bytes() {
unsafe {
while csr::uart::txfull_read() != 0 {}
csr::uart::rxtx_write(c)
}
}
so it crashed somewhere between the space after @
and the leading 0
of the pointer 0x
. The fmt/mod.rs:1096
line looks like:
let value = match arg.position {
rt::v1::Position::Next => { *self.curarg.next().unwrap() }
rt::v1::Position::At(i) => self.args[i], // here
};
so the arg.position
is wrong, and arg.position
is generated by rustc. The specific value stored in it, 0x4006af98
, would never be written there even if rustc or linker had a bug, since there is no relocation pointing at that word and no way for rustc to put a pointer there.
The only place where 0x4006af98
is loaded is:
4003c6a8 <_ZN11board_artiq11hmc830_70436hmc8304init17h2a543f37436723abE>:
...
4003cce8: l.movhi r3,0x4006
4003ccec: l.ori r3,r3,0xaf98
4003ccf0: l.sw -88(r2),r3
4003ccf4: l.sw -48(r2),r18
4003ccf8: l.sw -84(r2),r18
4003ccfc: l.movhi r3,0x4006 ; overwritten here
and located at that address is a " ...locked"
:
4006af80 6c6f636b 2074696d 656f7574 20202e2e lock timeout ..
4006af90 2e6c6f63 6b656400 4006af8c 0000000b .locked.@.......
Here's the annotated assembly for hexdump:
/home/tph/scratch/artiq/artiq/firmware/runtime/main.rs:309
40002dcc: 18 60 40 02 l.movhi r3,0x4002
40002dd0: a8 63 d1 ec l.ori r3,r3,0xd1ec
40002dd4: d7 e2 1f b0 l.sw -80(r2),r3
40002dd8: 9c 62 ff a8 l.addi r3,r2,-88
40002ddc: d7 e2 1f ac l.sw -84(r2),r3
40002de0: 9c 62 ff d4 l.addi r3,r2,-44
_ZN4core3fmt5Write9write_fmt17he4ecc1ba5d6c8c02E():
/var/lib/buildbot/slaves/conda-lin64/miniconda/conda-bld/rust-core-or1k_1526382696973/work/src/libcore/fmt/mod.rs:226
40002de4: d7 e2 1f b8 l.sw -72(r2),r3
40002de8: 9c 60 00 02 l.addi r3,r0,2
40002dec: d7 e2 1f c0 l.sw -64(r2),r3 ; pieces.len=2
40002df0: 18 60 40 06 l.movhi r3,0x4006
40002df4: a8 63 47 34 l.ori r3,r3,0x4734
40002df8: d7 e2 1f bc l.sw -68(r2),r3 ; pieces.ptr=&[&str] at 0x40064734
40002dfc: 1b 80 40 06 l.movhi r28,0x4006
40002e00: a8 7c 47 44 l.ori r3,r28,0x4744
_ZN7runtime9exception7hexdump17h3ae7890725df825dE():
40002e04: d7 e2 1f 98 l.sw -104(r2),r3
_ZN4core3fmt5Write9write_fmt17he4ecc1ba5d6c8c02E():
/var/lib/buildbot/slaves/conda-lin64/miniconda/conda-bld/rust-core-or1k_1526382696973/work/src/libcore/fmt/mod.rs:226
40002e08: d7 e2 1f c4 l.sw -60(r2),r3 ; fmt.ptr=&[fmt::rt::v1::Argument] at 0x40064744
40002e0c: 9c 62 ff ac l.addi r3,r2,-84
40002e10: d7 e2 1f cc l.sw -52(r2),r3 ; args.ptr=&[ArgumentV1] on stack
40002e14: 9c 60 00 01 l.addi r3,r0,1
40002e18: d7 e2 1f c8 l.sw -56(r2),r3 ; fmt.len=1
40002e1c: d7 e2 1f d0 l.sw -48(r2),r3 ; args.len=1
40002e20: 1b 40 40 06 l.movhi r26,0x4006
40002e24: a8 9a 8d 5c l.ori r4,r26,0x8d5c
40002e28: 9c 62 ff b8 l.addi r3,r2,-72
40002e2c: 9c a2 ff bc l.addi r5,r2,-68
40002e30: 04 00 4a f7 l.jal 40015a0c <_ZN4core3fmt5write17h031cf49cf7af6ca6E>
40002e34: d7 e2 27 94 l.sw -108(r2),r4
As can be seen, there is only a 3-instruction window where r3 is 0x4006af98. Assuming no CPU bugs this implies that the store that happens right after hmc830 locks (there are no other RAM stores in the loop that polls the lock) misfires and it's written somewhere in rodata instead of stack.
Hope this helps to pinpoint the issue.
@whitequark is there anything useful in the second crash I posted?
@whitequark @sbourdeauducq How do we go about debugging this?
Can we hook up gdb to Sayma? Or, do you have a better/different idea for tracking this down?
If it's memory corruption, gdb cannot help.
@whitequark is there anything useful in the second crash I posted?
Hmm. What's at PC definitely doesn't look like code:
0: 0a 01 f4 41 *unknown*
4: 82 28 a0 28 *unknown*
8: 80 2a 04 b0 *unknown*
c: 44 04 10 91 *unknown*
10: 0d 00 41 41 l.bnf 4010514 <_binary_x_bin_end+0x40104d4>
14: 28 22 8a 88 *unknown*
18: 08 88 00 80 *unknown*
1c: 41 10 44 14 *unknown*
20: 04 10 55 01 l.jal 415424 <_binary_x_bin_end+0x4153e4>
24: c2 0a 0a aa l.mtspr r10,r1,0x82aa
28: 28 a2 a6 28 *unknown*
2c: 21 44 0b 5d *unknown*
30: 40 45 00 00 *unknown*
34: 88 80 22 ea l.lws r4,8938(r0)
38: 22 82 88 c2 *unknown*
3c: 00 0b 11 51 l.j 2c4580 <_binary_x_bin_end+0x2c4540>
It doesn't look like data either:
0000 41f4010a 28a02882 b0042a80 91100444 A...(.(...*....D
0010 4141000d 888a2228 80008808 14441041 AA...."(.....D.A
0020 01551004 aa0a0ac2 28a6a228 5d0b4421 .U......(..(].D!
0030 00004540 ea228088 c2888222 51110b00 ..E@."....."Q...
What it does look like is line noise with an average of ~1 bit per nibble set.
Can we hook up gdb to Sayma?
I have the code for doing this but I do not believe it will be useful in any way here.
Remnants from the memory test at boot? Maybe clear all sdram or set it to some known value after testing it.
But either way. Probably just noise. 0x440249cc is also pretty high.
So....what do we do about this?
we can run more advanced memory test that scans all memory. I've noticed that on one of our boards the memory test does not pass anymore. And on another board it simply hangs without warning. We didn't debug it yet.
I've noticed that on one of our boards the memory test does not pass anymore
That's interesting. Does unplugging the RTM make a difference? It would be great to try to reproduce this in the simplest possible configuration.
NB I get the impression that there are some build-build variations (vivado issues?) so if you find a build that reproduces this reliably, please post it!
@hartytp We'll check without RTM tommorow. We used the same build on several boards and only two showed such issues so it's more likely to be some kind of hardware failure.
:+1:
@marmeladapk we should load some old version of the gateware. Both boards used to work. It's hard to believe they got broken by themself.
Here is my thinking on this:
How does that sound?
hmmm...it seems there is definitely a memory corruption issue that's not caused by the HMC7043 https://github.com/m-labs/artiq/pull/1068#issuecomment-396542824
Running this as a startup kernel crashes Sayma in 100% of the cases:
from artiq.experiment import *
class SAWGTestSines(EnvExperiment):
def build(self):
self.setattr_device("core")
self.sawgs = [self.get_device("sawg"+str(i)) for i in range(8)]
self.freq_list = self.linspace(10.0*MHz, 24.0*MHz, 100)
def linspace(self, start, stop, num):
dx = (stop-start)/(num-1)
return [start+n*dx for n in range(num)]
@kernel
def run(self):
self.core.reset()
delay(1*ms)
while True:
self.k1()
@kernel
def k1(self):
for freq in self.freq_list:
delay(300*ms)
core_log(freq)
for sawg in self.sawgs:
sawg.amplitude1.set(0.99)
sawg.frequency0.set(freq)
sawg.frequency1.set(5*MHz)
sawg.frequency2.set(-5*MHz)
delay(1*ms)
I have not found yet what exactly crashes; making minor changes tends to make the crashes disappear.
Symptoms of the crash vary between reboot, freeze, and garbage printed on the UART.
According to Joe the crash also occurs when this is run over Ethernet.
While leaving Sayma idling, I see panic at runtime/main.rs:321:13: exception Alignment at PC 0x4002ee84, EA 0x4015a184
https://hastebin.com/efofewifey.go
Startup kernel is: https://hastebin.com/ilawajuqog.py
Running this as a startup kernel crashes Sayma in 100% of the cases:
The same kernel does not crash the KC705, so it's not a compiler or runtime problem.
Another data point. Making a reboot loop by jumping to the bootloader right after HMC830 and HMC7043 initialization does not crash; the board went on for over an hour and behaved normally.
diff --git a/artiq/firmware/runtime/main.rs b/artiq/firmware/runtime/main.rs
index f5c0a0202..560aa0de2 100644
--- a/artiq/firmware/runtime/main.rs
+++ b/artiq/firmware/runtime/main.rs
@@ -108,6 +108,7 @@ fn startup() {
#[cfg(has_hmc830_7043)]
/* must be the first SPI init because of HMC830 SPI mode selection */
board_artiq::hmc830_7043::init().expect("cannot initialize HMC830/7043");
+ unsafe { boot::jump(0); }
#[cfg(has_ad9154)]
board_artiq::ad9154::init().expect("cannot initialize AD9154");
#[cfg(has_allaki_atts)]
Moving the jump after the DAC initialization:
diff --git a/artiq/firmware/runtime/main.rs b/artiq/firmware/runtime/main.rs
index f5c0a0202..6a580d37d 100644
--- a/artiq/firmware/runtime/main.rs
+++ b/artiq/firmware/runtime/main.rs
@@ -110,6 +110,7 @@ fn startup() {
board_artiq::hmc830_7043::init().expect("cannot initialize HMC830/7043");
#[cfg(has_ad9154)]
board_artiq::ad9154::init().expect("cannot initialize AD9154");
+ unsafe { boot::jump(0); }
#[cfg(has_allaki_atts)]
board_artiq::hmc542::program_all(8/*=4dB*/);
causes insane behavior in parts not related to the DAC after a few reboots, e.g.:
Initializing SDRAM...
DQS initial delay: 98 taps
Write leveling scan:
Module 3:
000000000000000000000000000000000000000000000000000000000000000000000000000000010001011001111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111100101111101000100010000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000100
Module 2:
000000000000000000000000000000000000000000000000000000000000000000100111111110111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111110100100100000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000100011101
Module 1:
000000000000000000000000000000000000000000000000000000000000000000101000101111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111000110000001000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000010000000000100010000000101001
Module 0:
000000000000000000000000000000000000000000000000000000100101001011111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111010000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000010000000000000000000000000000000101010
DQS initial delay: 98 taps
Write leveling: 60 69 67 79 done
Read leveling scan:
Module 3:
00000000000000000000000000000000000010000000000000000000101100000110100011101101111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111110111101011111011010110100100010110000000010010000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
Module 2:
00000000000000000000000000000000000000000000101000001010110000001011111111111111011111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111101111101101011101000101111000000100000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
Module 1:
00000000000000000000000000010101100100010011111111011011111111011111111111111111111111111111111111111111111111111111111111111111111111111111111111111110111111101111101111110110010100001110000100000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
Module 0:
00000000000000000000000010110010101000110011111111101111111111111111111111111111111111111111111111111101111111111111111111111111111111111111111111011011100010010110100000011100100000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
Read leveling: 132+-23 122+-27 101+-27 86+-25 done
SDRAM initialized
Memory test failed (83/4458496 words incorrect)
Halting.
(the errors are varied; sometimes it's "Memory test failed" as above, sometimes it's "illegal instruction" during serwb initialization, etc.)
@hartytp @gkasprow @jbqubit @jordens Can you reproduce those "reboot loop" results?
It's jesd_unreset()
that causes the crashy behavior.
This causes mayhem:
diff --git a/artiq/firmware/libboard_artiq/ad9154.rs b/artiq/firmware/libboard_artiq/ad9154.rs
index 02f2d7a77..6fd556c49 100644
--- a/artiq/firmware/libboard_artiq/ad9154.rs
+++ b/artiq/firmware/libboard_artiq/ad9154.rs
@@ -633,13 +633,13 @@ fn dac_stpl(dacno: u8, m: u8, s: u8) -> Result<(), &'static str> {
fn dac_cfg(dacno: u8) -> Result<(), &'static str> {
spi_setup(dacno);
- jesd_enable(dacno, false);
+ /*jesd_enable(dacno, false);
jesd_prbs(dacno, false);
jesd_stpl(dacno, false);
clock::spin_us(10000);
- jesd_enable(dacno, true);
+ jesd_enable(dacno, true);*/
dac_setup(dacno, 6_000_000_000)?;
- jesd_enable(dacno, false);
+ /*jesd_enable(dacno, false);
clock::spin_us(10000);
jesd_enable(dacno, true);
dac_monitor(dacno);
@@ -665,7 +665,7 @@ fn dac_cfg(dacno: u8) -> Result<(), &'static str> {
}
if read(ad9154_reg::INITLANESYNCFLG) != 0xff {
return Err("bad INITLANESYNCFLG")
- }
+ }*/
Ok(())
}
@@ -730,12 +730,12 @@ pub fn init() -> Result<(), &'static str> {
dac_detect(dacno)?;
dac_cfg_retry(dacno)?;
// Run the PRBS, STPL and SYSREF scan tests
- dac_prbs(dacno)?;
+ /*dac_prbs(dacno)?;
dac_stpl(dacno, 4, 2)?;
dac_sysref_scan(dacno);
// Set SYSREF phase and reconfigure the DAC
dac_sysref_cfg(dacno, 88);
- dac_cfg_retry(dacno)?;
+ dac_cfg_retry(dacno)?;*/
}
Ok(())
diff --git a/artiq/firmware/runtime/main.rs b/artiq/firmware/runtime/main.rs
index f5c0a0202..6a580d37d 100644
--- a/artiq/firmware/runtime/main.rs
+++ b/artiq/firmware/runtime/main.rs
@@ -110,6 +110,7 @@ fn startup() {
board_artiq::hmc830_7043::init().expect("cannot initialize HMC830/7043");
#[cfg(has_ad9154)]
board_artiq::ad9154::init().expect("cannot initialize AD9154");
+ unsafe { boot::jump(0); }
#[cfg(has_allaki_atts)]
board_artiq::hmc542::program_all(8/*=4dB*/);
Comment out jesd_unreset()
and the board behaves normally.
@sbourdeauducq: in your first case, the HMC830 and HMC7043 are initialized, but the buffers on the FPGA inputs are still disabled. (We enable them when initializing the DAC). If HMC7043 is now no longer supposed to generate broadband noise, we could remove the enable on the buffers: https://github.com/m-labs/artiq/blob/master/artiq/gateware/targets/sayma_amc.py#L62 https://github.com/m-labs/artiq/blob/master/artiq/gateware/targets/sayma_amc.py#L72
I think that would be interesting to do your first test with the buffers always enabled.
@sbourdeauducq: so let's remove the enables on the buffers (or add a separate control for them) and redo your first test. If that's still working fine, then it's related to the JESD/DACs. If not, then to the HMC830/HMC7043.
If HMC7043 is now no longer supposed to generate broadband noise, we could remove the enable on the buffers:
With the pull-ups and RESET line, it doesn't generate noise. However, it does generate a clock at the wrong frequency for a short time during the HMC7043 init. Not sure how much of a problem that is.
I think that would be interesting to do your first test with the buffers always enabled.
Yes.
With this patch the board simply never boots.
diff --git a/artiq/firmware/runtime/main.rs b/artiq/firmware/runtime/main.rs
index f5c0a0202..560aa0de2 100644
--- a/artiq/firmware/runtime/main.rs
+++ b/artiq/firmware/runtime/main.rs
@@ -108,6 +108,7 @@ fn startup() {
#[cfg(has_hmc830_7043)]
/* must be the first SPI init because of HMC830 SPI mode selection */
board_artiq::hmc830_7043::init().expect("cannot initialize HMC830/7043");
+ unsafe { boot::jump(0); }
#[cfg(has_ad9154)]
board_artiq::ad9154::init().expect("cannot initialize AD9154");
#[cfg(has_allaki_atts)]
diff --git a/artiq/gateware/targets/sayma_amc.py b/artiq/gateware/targets/sayma_amc.py
index 6f34e8663..17f5b175d 100755
--- a/artiq/gateware/targets/sayma_amc.py
+++ b/artiq/gateware/targets/sayma_amc.py
@@ -59,7 +59,7 @@ class AD9154CRG(Module, AutoCSR):
platform.add_period_constraint(refclk_pads.p, 1e9/self.refclk_freq)
self.specials += [
- Instance("IBUFDS_GTE3", i_CEB=self.jreset.storage, p_REFCLK_HROW_CK_SEL=0b00,
+ Instance("IBUFDS_GTE3", i_CEB=0, p_REFCLK_HROW_CK_SEL=0b00,
i_I=refclk_pads.p, i_IB=refclk_pads.n,
o_O=self.refclk, o_ODIV2=refclk2),
Instance("BUFG_GT", i_I=refclk2, o_O=self.cd_jesd.clk),
@@ -67,9 +67,7 @@ class AD9154CRG(Module, AutoCSR):
]
jref = platform.request("dac_sysref")
- self.specials += Instance("IBUFDS_IBUFDISABLE",
- p_USE_IBUFDISABLE="TRUE", p_SIM_DEVICE="ULTRASCALE",
- i_IBUFDISABLE=self.jreset.storage,
+ self.specials += Instance("IBUFDS",
i_I=jref.p, i_IB=jref.n,
o_O=self.jref)
Bootloader CRC passed
Gateware ident 4.0.dev+1148.gf385add8.dirty
Initializing SDRAM...
DQS initial delay: 90 taps
Write leveling scan:
Module 3:
00000000000000000000000000000000000000000000000000000000000000000000000001011111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111101111001001000001000010000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000001000100000000000001100001000111111011111111
Module 2:
00000000000000000000000000000000000000000000000000000000100011111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111101011000010000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000001000001000011011101110111111111111111111111111
Module 1:
00000000000000000000000000000000000000000000000000000000110001011111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111110110111001100000011100000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000100000001101001111001110111111111111111111111
Module 0:
00000000000000000000000000000000000000000000000000011111110111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111010000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000010000000100000011011000001111011101110111111111111111111111
DQS initial delay: 90 taps
Write leveling: 46 51 56 62 done
Read leveling scan:
Module 3:
00000000000000000000000000000000010000000000010110101100011101010101011111100111011111111111011111111111111111111111111111111111111111111111111111111111011111111101010111111111000010110000010010000000000000010000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
Module 2:
00000000000000010100000000001000100000010010010010100110011111111111111111011111111111111111111111111111111111111111111111111111111110111111111111111111101100011001111110110111010010000000010000000000000000000000100000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
Module 1:
00000000000000000100000010000000001011111111110111111111111111111111111111111111111111111111111111111111111111111111111111111111111101111111011110111100010010010100001010000000100010000010100000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
Module 0:
11001011101111000101011101011101111111110111111101111111111111111111111111111111111111111111111111111111111111111111111011111111011001000001100100100000100000001000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
Read leveling: 113+-7 98+-15 90+-7 74+-11 done
SDRAM initialized
Memory test failed (4670/4458496 words incorrect)
Halting.
@sbourdeauducq interesting! The conclusion here is that the HMC7043 is still interfering with the FPGA, at least after recovering from a crash -- even with the RESET connected and pulled to 3V3.
The next obvious thing to do is to look at the HMC7043 outuputs on a scope. I'm happy to do that now if you can give me complete instructions for how to reproduce this, including relevant binaries.
Simply apply the patch above - then it doesn't boot at all and fails memtest! No need to have a prior crash, this also happens right after a power cycle.
And loading the RTM FPGA, which is supposed to hold the 7043 in reset, does not help.
BTW, all the tests above were done with the LOCs and CLOCK_ROOTs removed, so it's not that interfering either.
https://m-labs.hk/sayma/sayma_crash.tar.bz2 is with the patch above, the LOCs/CLOCK_ROOTs removed, and Vivado 2018.1 But it may be better that you try compiling yourself as well.
Tomorrow I'll try with the RTM disconnected; can you try that too?
I'm at a meeting away from Maryland so can't test further until Monday.
What startup Kernel are you using? And, other than the patch, this is with the current ARTIQ master, right?
What startup Kernel are you using?
Irrelevant.
And, other than the patch, this is with the current ARTIQ master, right?
Yes. f385add8b1
And, without SAWG is ok?
Not tried. I only built with SAWG.
@sbourdeauducq using the binaries you posted above, I see the following: https://drive.google.com/open?id=1UFnlC5iAUXjeOq-2ejaEp8Zt9Ez93hm-
Can you try the to reproduce the reboot loops (and the crashes) here: https://github.com/m-labs/artiq/issues/1065#issuecomment-397527119
@sbourdeauducq Sure.
To be clear:
Also, are you sure that you have the HMC7043 rework done correctly, and that you're holding the chip in reset mode during boot? Can you try removing the AC coupling caps that connect the HMC7043 to the two FPGAs on your board?
The next test you want me to perform is moving the jump(0) to after the DAC reset, while still keeping the FPGA input buffers permanently enabled, right?
No, unmodified bitstream.
BTW: does the test kernel above crash your board?
I don't have the pullup, but if I keep the RTM FPGA loaded, in theory, it should not matter...
No, unmodified bitstream.
You mean taking ARTIQ master and applying only this patch:
diff --git a/artiq/firmware/runtime/main.rs b/artiq/firmware/runtime/main.rs
index f5c0a0202..6a580d37d 100644
--- a/artiq/firmware/runtime/main.rs
+++ b/artiq/firmware/runtime/main.rs
@@ -110,6 +110,7 @@ fn startup() {
board_artiq::hmc830_7043::init().expect("cannot initialize HMC830/7043");
#[cfg(has_ad9154)]
board_artiq::ad9154::init().expect("cannot initialize AD9154");
+ unsafe { boot::jump(0); }
#[cfg(has_allaki_atts)]
board_artiq::hmc542::program_all(8/*=4dB*/);
BTW: does the test kernel above crash your board?
will try that soon.
I don't have the pullup, but if I keep the RTM FPGA loaded, in theory, it should not matter...
true.
You mean taking ARTIQ master and applying only this patch:
Correct.
Okay, I'll build that now (with SAWG).
@sbourdeauducq done. No crashes with that patch either. Log: https://drive.google.com/open?id=1ds1_6zj5BBNqK26HGczKXWkuby6dTMHy
Are you sure that you've applied the HMC7043 rework correctly. Can you patch the runtime to get the RTM to keep the 7043 in reset mode constantly and check that the DAC JESD no links break?
Does this kernel crash on your board? https://github.com/m-labs/artiq/issues/1065#issuecomment-396824032
https://drive.google.com/open?id=1ZyI_0S0IJ-oKc15RgisTW4FABeBQvuub