m-labs / artiq

A leading-edge control system for quantum information experiments
https://m-labs.hk/artiq
GNU Lesser General Public License v3.0
427 stars 199 forks

kasli-soc runtime hangs #2251

Closed jbqubit closed 12 months ago

jbqubit commented 1 year ago

I have been successfully building and loading my own firmware.bin and top.bit for several days now. My recent builds involve a more elaborate .json file that's used in one of the production lab setups. With my most recent builds using this .json, the runtime boots but then hangs after 7 seconds. In the output below you can see repeated calls to artiq_coremgmt log that succeed, followed by a final request made at t > 18.962370s that returns nothing.

$ artiq_coremgmt log
[     0.000067s]  INFO(runtime): NAR3/Zynq7000 starting...
[     0.005240s]  INFO(runtime): gateware ident: brittonlab-legacy-trap
[     0.016613s]  INFO(libboard_zynq::i2c): PCA9548 detected
[     0.175945s]  WARN(runtime::rtio_clocking): error reading configuration. Falling back to default.
[     0.184799s]  WARN(runtime::rtio_clocking): Using default configuration - internal 125MHz RTIO clock.
[     0.193999s]  INFO(runtime::rtio_clocking): using internal 125MHz RTIO clock
[     0.584719s]  INFO(libboard_artiq::si5324): waiting for Si5324 lock...
[     7.166576s]  INFO(libboard_artiq::si5324):   ...locked
[     7.175986s]  INFO(runtime::rtio_clocking): RTIO PLL locked
[     7.186734s]  INFO(libboard_zynq::i2c): PCA9548 detected
[     7.222095s]  INFO(runtime::comms): network addresses: MAC=fc-0f-e7-07-6b-8c IPv4=192.168.1.76 IPv6-LL=fe80::fe0f:e7ff:fe07:6b8c IPv6: no configured address
[     7.239655s]  INFO(libboard_artiq::drtio_routing): could not read routing table from configuration, using default
[     7.249902s]  INFO(libboard_artiq::drtio_routing): routing table: RoutingTable { 0: 0; 1: 1 0; 2: 2 0; 3: 3 0; 4: 4 0; }
[     7.263916s]  INFO(runtime::rtio_mgt::drtio): [DEST#0] destination is up
[    11.263081s]  INFO(libboard_zynq::eth): eth: got Link { speed: S1000, duplex: Full }
[    12.016173s]  INFO(runtime::moninj): received connection
[    14.595996s]  INFO(runtime::mgmt): received connection
[    15.323907s]  INFO(runtime::mgmt): received connection
[    16.107089s]  INFO(runtime::mgmt): received connection
[    16.823447s]  INFO(runtime::mgmt): received connection
[    17.580442s]  INFO(runtime::mgmt): received connection
[    18.259679s]  INFO(runtime::mgmt): received connection
[    18.962370s]  INFO(runtime::mgmt): received connection
(base) britton@brittonlabbuild:~/m-labs/artiq-zynq$ artiq_coremgmt log

The board continues to respond to ping. The Err LED is not illuminated. How should I proceed to debug this?
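One way to pin down when the log request stops responding is to poll it with a timeout instead of re-running it by hand. The following is only a minimal sketch: the helper names `run_with_timeout` and `poll_until_hang`, the 5-second limit, and the polling interval are all assumptions made for illustration and are not part of ARTIQ. The only thing taken from the session above is that `artiq_coremgmt log` is the command being run.

```python
import subprocess
import time


def run_with_timeout(cmd, timeout_s=5.0):
    """Run cmd and return (finished, stdout).

    finished is False if the command did not exit within timeout_s;
    in that case the child process is killed and stdout is "".
    """
    try:
        result = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        return False, ""
    return True, result.stdout


def poll_until_hang(cmd, interval_s=1.0, timeout_s=5.0, max_polls=100):
    """Poll cmd repeatedly.

    Returns the number of polls that completed before the first timeout,
    or max_polls if no poll ever timed out.
    """
    for n in range(max_polls):
        finished, _ = run_with_timeout(cmd, timeout_s)
        if not finished:
            return n
        time.sleep(interval_s)
    return max_polls


# Hypothetical usage (assumes artiq_coremgmt is on PATH and the core
# device address is already configured):
#   count = poll_until_hang(["artiq_coremgmt", "log"])
#   print(f"log request stopped responding after {count} polls")
```

Recording the wall-clock time of the first timeout alongside the last timestamp in the returned log would make it easier to compare the time-to-hang across boards.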

Your System

jbqubit commented 1 year ago

I see the same behavior on a second kasli-soc with an identical time-to-hang.

jbqubit commented 1 year ago

Actually, it's the logging system that's failing. While artiq_coremgmt log is unresponsive, I can still use artiq_run. Some time later, artiq_coremgmt log starts returning output again, but the log is truncated. Here's an example.

$ artiq_coremgmt log
[   372.411847s]  WARN(runtime::mgmt): connection terminated: NetworkError(Truncated)
[   386.046133s]  INFO(runtime::mgmt): received connection
[   387.158852s]  INFO(runtime::mgmt): received connection

I've seen this pattern repeat twice now between reboots.
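To check mechanically whether a returned log has lost its beginning, the timestamps can be parsed and the first one compared with the boot time. The sketch below is an assumption-laden illustration: the regular expression is inferred only from the log lines quoted in this thread, and `parse_log` and `first_timestamp` are hypothetical helper names, not ARTIQ functions.

```python
import re

# Matches lines such as:
# [   372.411847s]  WARN(runtime::mgmt): connection terminated: ...
# The pattern is inferred from the lines quoted in this issue only.
LOG_LINE = re.compile(
    r"^\[\s*(?P<t>\d+\.\d+)s\]\s+(?P<level>[A-Z]+)"
    r"\((?P<target>[^)]*)\):\s*(?P<msg>.*)$"
)


def parse_log(text):
    """Return a list of (seconds, level, target, message) tuples.

    Lines that do not match the expected shape are skipped.
    """
    entries = []
    for line in text.splitlines():
        m = LOG_LINE.match(line.strip())
        if m:
            entries.append((float(m["t"]), m["level"], m["target"], m["msg"]))
    return entries


def first_timestamp(text):
    """Return the timestamp of the first parsed entry, or None."""
    entries = parse_log(text)
    return entries[0][0] if entries else None
```

In the boot log from the first report the first entry is at 0.000067s, while the example above starts at 372.411847s, so a large first timestamp is a simple indicator that earlier entries are no longer being returned.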

marmeladapk commented 1 year ago

On the USB console (/dev/ttyUSB2) you'll be able to see any core panics; these may give you a better idea of why it's crashing.
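A rough sketch of watching that console from a script follows. It assumes the third-party pyserial package is installed, that the port runs at 115200 baud, and that a panic can be recognised by the word "panic" appearing in a line; all three are assumptions to verify against your own setup, and `find_panics` and `watch_console` are hypothetical names.

```python
def find_panics(lines, keyword="panic"):
    """Return the lines that contain keyword (case-insensitive)."""
    needle = keyword.lower()
    return [line for line in lines if needle in line.lower()]


def watch_console(port="/dev/ttyUSB2", baudrate=115200, keyword="panic"):
    """Print console output forever, flagging lines that mention keyword.

    Requires pyserial. The baud rate and the keyword are assumptions
    and may need adjusting.
    """
    import serial  # pyserial; imported lazily so find_panics works without it

    with serial.Serial(port, baudrate, timeout=1) as console:
        while True:
            raw = console.readline()
            if not raw:
                continue  # read timed out with no data
            line = raw.decode("utf-8", errors="replace").rstrip()
            marker = "!!! " if find_panics([line], keyword) else "    "
            print(marker + line)
```

Leaving `watch_console()` running in a separate terminal while reproducing the hang keeps a record of anything printed on the serial port even when the network log is unavailable.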

On Wed, 11 Oct 2023 at 15:52, Joe Britton ***@***.***> wrote:

Your System

Kasli-SOC v1.1.1
ARTIQ version: ARTIQ v7.8180.21c6f57
Running in nix development environment for kasli-soc release-7.
Boot is via USB-JTAG (via run_local.sh)

jbqubit commented 12 months ago

I've tried several times to reproduce this error but it no longer appears. Closing.