m-labs / artiq

A leading-edge control system for quantum information experiments
https://m-labs.hk/artiq
GNU Lesser General Public License v3.0

Release-7: I2C comms failure with Si5324 on Kasli v1.1 #2567

Open b-bondurant opened 2 months ago

b-bondurant commented 2 months ago

Bug Report

One-Line Summary

Newer release-7 gateware/firmware fails to initialize Si5324 on Kasli v1.1, reportedly because of an I2C failure.

Issue Details

We have a Kasli v1.1 running hardware-based unit tests for DAX. I recently updated its gateware (no change in major version, just a newer rev) and was met with the following:

 __  __ _ ____         ____
|  \/  (_) ___|  ___  / ___|
| |\/| | \___ \ / _ \| |
| |  | | |___) | (_) | |___
|_|  |_|_|____/ \___/ \____|

MiSoC Bootloader
Copyright (c) 2017-2022 M-Labs Limited

Bootloader CRC passed
Gateware ident 7.8208.38c72fd;tester_11
Initializing SDRAM...
Read leveling scan:
Module 1:
00000000000111111111100000000000
Module 0:
00000000000111111111110000000000
Read leveling: 15+-5 16+-5 done
SDRAM initialized
Memory test passed

Booting from flash...
Starting firmware.
[     0.000013s]  INFO(runtime): ARTIQ runtime starting...
[     0.003930s]  INFO(runtime): software ident 7.8208.38c72fd;tester_11
[     0.010295s]  INFO(runtime): gateware ident 7.8208.38c72fd;tester_11
[     0.016653s]  INFO(runtime): log level set to INFO by default
[     0.022390s]  INFO(runtime): UART log level set to INFO by default
[     0.028778s]  WARN(runtime::rtio_clocking): rtio_clock setting not recognised. Falling back to default.
[     0.037967s]  INFO(runtime::rtio_clocking): using internal 125MHz RTIO clock
[     0.314469s]  INFO(board_artiq::si5324): waiting for Si5324 lock...
panic at runtime/rtio_clocking.rs:246:55: cannot initialize Si5324: "Si5324 failed to ack write address"
backtrace for software version 7.8208.38c72fd;tester_11:
0x4002ebbc
0x400083ac
0x40007cd0
0x4002d64c
0x40005b88
0x4001f0f4
0x4001f09c
0x4002dd48
halting.
use `artiq_coremgmt config write -s panic_reset 1` to restart instead

I replicated the same behavior on a second Kasli v1.1. Haven't checked with any newer hardware but I assume it isn't an issue since no one has reported this yet. Haven't checked release-8 yet either.

Searching backward through the release-7 commits, it looks like 25346780bfffe7d6155e58ae2b01403f93eedcf1 is where things break.

Previous commit, c81280174c6e6bd11ce4b6043811f7030f0f5b0c:

 __  __ _ ____         ____
|  \/  (_) ___|  ___  / ___|
| |\/| | \___ \ / _ \| |
| |  | | |___) | (_) | |___
|_|  |_|_|____/ \___/ \____|

MiSoC Bootloader
Copyright (c) 2017-2022 M-Labs Limited

Bootloader CRC passed
Gateware ident 7.8193.c812801;tester_11
Initializing SDRAM...
Read leveling scan:
Module 1:
00000000000111111111100000000000
Module 0:
00000000000111111111110000000000
Read leveling: 15+-5 16+-5 done
SDRAM initialized
Memory test passed

Booting from flash...
Starting firmware.
[     0.000013s]  INFO(runtime): ARTIQ runtime starting...
[     0.003929s]  INFO(runtime): software ident 7.8193.c812801;tester_11
[     0.010292s]  INFO(runtime): gateware ident 7.8193.c812801;tester_11
[     0.016650s]  INFO(runtime): log level set to INFO by default
[     0.022386s]  INFO(runtime): UART log level set to INFO by default
[     0.028775s]  WARN(runtime::rtio_clocking): rtio_clock setting not recognised. Falling back to default.
[     0.037965s]  INFO(runtime::rtio_clocking): using internal 125MHz RTIO clock
[     0.314464s]  INFO(board_artiq::si5324): waiting for Si5324 lock...
[     4.538995s]  INFO(board_artiq::si5324):   ...locked
[     4.568118s]  INFO(runtime): network addresses: MAC=54-10-ec-34-dd-65 IPv4=192.168.1.70 IPv6-LL=fe80::5610:ecff:fe34:dd65 IPv6=no configured address
[     4.581919s]  INFO(runtime::mgmt): management interface active
[     4.594070s]  INFO(runtime::session): accepting network sessions
[     4.607324s]  INFO(runtime::session): running startup kernel
[     4.611780s]  INFO(runtime::session): no startup kernel found
[     4.617601s]  INFO(runtime::session): no connection, starting idle kernel
[     4.624432s]  INFO(runtime::session): no idle kernel found

@ 25346780bfffe7d6155e58ae2b01403f93eedcf1:

 __  __ _ ____         ____
|  \/  (_) ___|  ___  / ___|
| |\/| | \___ \ / _ \| |
| |  | | |___) | (_) | |___
|_|  |_|_|____/ \___/ \____|

MiSoC Bootloader
Copyright (c) 2017-2022 M-Labs Limited

Bootloader CRC passed
Gateware ident 7.8194.2534678;tester_11
Initializing SDRAM...
Read leveling scan:
Module 1:
00000000000111111111100000000000
Module 0:
00000000000111111111110000000000
Read leveling: 15+-5 16+-5 done
SDRAM initialized
Memory test passed

Booting from flash...
Starting firmware.
[     0.000013s]  INFO(runtime): ARTIQ runtime starting...
[     0.003932s]  INFO(runtime): software ident 7.8194.2534678;tester_11
[     0.010297s]  INFO(runtime): gateware ident 7.8194.2534678;tester_11
[     0.016655s]  INFO(runtime): log level set to INFO by default
[     0.022392s]  INFO(runtime): UART log level set to INFO by default
[     0.028779s]  WARN(runtime::rtio_clocking): rtio_clock setting not recognised. Falling back to default.
[     0.037969s]  INFO(runtime::rtio_clocking): using internal 125MHz RTIO clock
[     0.314469s]  INFO(board_artiq::si5324): waiting for Si5324 lock...
panic at runtime/rtio_clocking.rs:246:55: cannot initialize Si5324: "Si5324 failed to ack write address"
backtrace for software version 7.8194.2534678;tester_11:
0x4002ebbc
0x40008530
0x40007cd0
0x4002d64c
0x40005b88
0x4001f0f4
0x4001f09c
0x4002dd48
halting.
use `artiq_coremgmt config write -s panic_reset 1` to restart instead

Full logs with each revision tested: https://pastebin.com/fyDyV8vA
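When testing many revisions, classifying each captured serial log by hand gets tedious. Below is a minimal sketch of an automated check, assuming the boot output has already been saved as text. The function name `classify_boot_log` and the choice of "accepting network sessions" as the success marker are assumptions made for illustration; the marker strings themselves are copied from the logs quoted above.

```python
import re

# Patterns copied from the serial logs quoted in this issue.
# The bootloader prints "Gateware ident", the runtime "gateware ident".
IDENT_RE = re.compile(r"gateware ident (?P<ident>\S+)", re.IGNORECASE)
PANIC_RE = re.compile(r"^panic at (?P<where>\S+): (?P<msg>.*)$")
# Assumption: reaching this line means the runtime got past clock setup.
BOOTED_MARKER = "accepting network sessions"


def classify_boot_log(text):
    """Return (ident, verdict, detail) for one captured boot log.

    verdict is "panic", "ok", or "incomplete" (neither marker seen).
    detail holds the panic message when verdict is "panic", else None.
    """
    ident = None
    booted = False
    for raw in text.splitlines():
        line = raw.strip()
        found = IDENT_RE.search(line)
        if found and ident is None:
            ident = found.group("ident")
        panic = PANIC_RE.match(line)
        if panic:
            return ident, "panic", panic.group("msg")
        if BOOTED_MARKER in line:
            booted = True
    return ident, ("ok" if booted else "incomplete"), None


if __name__ == "__main__":
    sample = (
        "Gateware ident 7.8194.2534678;tester_11\n"
        "[     0.314469s]  INFO(board_artiq::si5324): waiting for Si5324 lock...\n"
        "panic at runtime/rtio_clocking.rs:246:55: cannot initialize Si5324: "
        '"Si5324 failed to ack write address"\n'
    )
    print(classify_boot_log(sample))
```

Running this over several boots of the same build (rather than one boot per revision) would also show whether the failure is intermittent, which matters if the breaking commit turns out to be a false positive.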

Steps to Reproduce

  1. $ nix develop 'git+https://github.com/m-labs/artiq?ref=release-7&rev=<rev-to-test>'
  2. $ python -m artiq.gateware.targets.kasli_generic tester_11.json (json here)
  3. $ artiq_flash --srcbuild -d artiq_kasli/tester_11/
  4. Monitor serial output on boot

Expected Behavior

The system initializes.

Actual (undesired) Behavior

The system doesn't initialize.


dnadlinger commented 2 months ago

Commit https://github.com/m-labs/artiq/commit/25346780bfffe7d6155e58ae2b01403f93eedcf1 looks like a false positive; a change to the compiler shouldn't have any effect on the Rust runtime.

b-bondurant commented 2 months ago

Yeah, I noticed the lack of relevant changes in that commit and thought it odd as well. If anything, I would have expected the previous commit to be the culprit, but I've done two builds of that gateware (thanks to my rm -rfing between tests) deployed to two different Kaslis and both worked. Haven't established the breaking commit with the same rigor though.



dnadlinger commented 2 months ago

Is the gateware bitstream/firmware build even different at all? I guess with gateware there is always the chance of two non-deterministic optimisation runs resulting in subtly different outcomes…

b-bondurant commented 2 months ago

Had to leave early to go out of town for the weekend but I'll compare once I get back. I guess I could ramp up nix's sandboxing (--pure and --restrict-eval off the top of my head) as well.



b-bondurant commented 2 months ago

Bitstream:

$ diff -q artiq_kasli_7.8193.c812801/tester_11/gateware/top.bit artiq_kasli_7.8194.2534678/tester_11/gateware/top.bit
Files artiq_kasli_7.8193.c812801/tester_11/gateware/top.bit and artiq_kasli_7.8194.2534678/tester_11/gateware/top.bit differ

Runtime:

$ diff -q artiq_kasli_7.8193.c812801/tester_11/software/runtime/runtime.bin artiq_kasli_7.8194.2534678/tester_11/software/runtime/runtime.bin
Files artiq_kasli_7.8193.c812801/tester_11/software/runtime/runtime.bin and artiq_kasli_7.8194.2534678/tester_11/software/runtime/runtime.bin differ

Building in a more strict environment, nix develop ... --sandbox --pure-eval --ignore-environment --keep HOME (sandboxing should be on by default, but just in case; HOME required to make Vivado happy):

$ diff -q artiq_kasli_7.8193.c812801_pure/tester_11/gateware/top.bit artiq_kasli_7.8194.2534678_pure/tester_11/gateware/top.bit
Files artiq_kasli_7.8193.c812801_pure/tester_11/gateware/top.bit and artiq_kasli_7.8194.2534678_pure/tester_11/gateware/top.bit differ

$ diff -q artiq_kasli_7.8193.c812801_pure/tester_11/software/runtime/runtime.bin artiq_kasli_7.8194.2534678_pure/tester_11/software/runtime/runtime.bin
Files artiq_kasli_7.8193.c812801_pure/tester_11/software/runtime/runtime.bin and artiq_kasli_7.8194.2534678_pure/tester_11/software/runtime/runtime.bin differ

No clue why :man_shrugging:
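`diff -q` only reports that the files differ, not by how much. Since the ident string (for example `7.8193.c812801` vs `7.8194.2534678`) is printed by both the bootloader and the runtime, it is plausibly embedded in both artifacts, in which case they would differ between revisions even if nothing else changed; that is an assumption, not something verified here. A minimal sketch for quantifying the difference, with hypothetical helper names `differing_runs` and `summarize_diff`:

```python
import hashlib
from pathlib import Path


def sha256_of(path):
    """Hex SHA-256 of a file, handy for comparing repeated builds."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def differing_runs(a, b):
    """Return half-open (start, end) offsets of contiguous differing bytes.

    Only the common prefix length of the two inputs is compared; any
    size difference is reported separately by summarize_diff.
    """
    runs = []
    start = None
    n = min(len(a), len(b))
    for i in range(n):
        if a[i] != b[i]:
            if start is None:
                start = i
        elif start is not None:
            runs.append((start, i))
            start = None
    if start is not None:
        runs.append((start, n))
    return runs


def summarize_diff(a, b):
    """Summarize how much two byte strings differ."""
    runs = differing_runs(a, b)
    return {
        "size_a": len(a),
        "size_b": len(b),
        "runs": len(runs),
        "bytes_differing": sum(end - start for start, end in runs),
    }


if __name__ == "__main__":
    old = b"ident 7.8193.c812801 payload"
    new = b"ident 7.8194.2534678 payload"
    print(differing_runs(old, new))
    print(summarize_diff(old, new))
```

For the real artifacts, the inputs would be `Path(".../runtime.bin").read_bytes()` for each revision. A handful of short runs near an ident string would suggest an otherwise identical runtime; large scattered differences in `top.bit` would be more consistent with a different place-and-route result. Comparing `sha256_of` across two builds of the same revision would show whether the build is reproducible at all.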

b-bondurant commented 2 months ago

Latest release-8 works fine:

 __  __ _ ____         ____
|  \/  (_) ___|  ___  / ___|
| |\/| | \___ \ / _ \| |
| |  | | |___) | (_) | |___
|_|  |_|_|____/ \___/ \____|

MiSoC Bootloader
Copyright (c) 2017-2024 M-Labs Limited

Bootloader CRC passed
Gateware ident 8.8955+0ac9e77;tester_11
Initializing SDRAM...
Read leveling scan:
Module 1:
00000001111111110000000000000000
Module 0:
00000011111111111000000000000000
Read leveling: 11+-4 11+-5 done
SDRAM initialized
Memory test passed

Booting from flash...
Starting firmware.
[     0.000012s]  INFO(runtime): ARTIQ runtime starting...
[     0.003899s]  INFO(runtime): software ident 8.8955+0ac9e77;tester_11
[     0.010245s]  INFO(runtime): gateware ident 8.8955+0ac9e77;tester_11
[     0.016594s]  INFO(runtime): log level set to INFO by default
[     0.022312s]  INFO(runtime): UART log level set to INFO by default
[     0.028683s]  WARN(runtime::rtio_clocking): rtio_clock setting not recognised. Falling back to default.
[     0.037850s]  INFO(runtime::rtio_clocking): Clocking has already been set up.
[     0.070364s]  INFO(runtime): network addresses: MAC=54-10-ec-34-dd-65 IPv4=10.236.88.210/0 IPv6-LL=fe80::5610:ecff:fe34:dd65/10 IPv6=no configured address
[     0.083182s]  WARN(runtime::rtio_mgt): error reading device map (key not found), device names will not be available in RTIO error messages
[     0.095441s]  INFO(runtime::rtio_mgt): SED spreading disabled by default
[     0.103423s]  INFO(runtime::mgmt): management interface active
[     0.114721s]  INFO(runtime::session): accepting network sessions
[     0.119457s]  INFO(runtime::session): running startup kernel
[     0.125114s]  INFO(runtime::session): no startup kernel found
[     0.130832s]  INFO(runtime::session): no connection, starting idle kernel
[     0.145124s]  INFO(runtime::session): no idle kernel found