m-labs / artiq

A leading-edge control system for quantum information experiments
https://m-labs.hk/artiq
GNU Lesser General Public License v3.0

Unreliable network connection on Kasli v2.0.2 and ARTIQ-7 #2286

Open kaolpr opened 11 months ago

kaolpr commented 11 months ago

Bug Report

One-Line Summary

With ARTIQ-7 (release-7 branch), the Ethernet connection to Kasli v2.0.2 fails most of the time, whereas with ARTIQ-8 (master) it works every time.

Issue Details

Steps to Reproduce

  1. Build Kasli firmware with ARTIQ-7.8185.cc81464
  2. Flash Kasli with the firmware and configure it with a storage file that sets a predefined IP address (only the IP is written in the storage area file)
  3. Observe ping output
  4. Power cycle several times

Expected Behavior

Kasli responds to ping at all times.

Actual (undesired) Behavior

MiSoC Bootloader
Copyright (c) 2017-2022 M-Labs Limited

Bootloader CRC passed
Gateware ident 7.8185.cc81464;test
Initializing SDRAM...
Read leveling scan:
Module 1: 00000000000001111111110000000000
Module 0: 00000000000001111111111000000000
Read leveling: 17+-4 17+-5 done
SDRAM initialized
Memory test passed

Booting from flash...
Starting firmware.
[ 0.000016s] INFO(runtime): ARTIQ runtime starting...
[ 0.003935s] INFO(runtime): software ident 7.8185.cc81464;test
[ 0.009864s] INFO(runtime): gateware ident 7.8185.cc81464;test
[ 0.015801s] INFO(runtime): log level set to INFO by default
[ 0.021531s] INFO(runtime): UART log level set to INFO by default
[ 0.179750s] WARN(runtime::rtio_clocking): rtio_clock setting not recognised. Falling back to default.
[ 0.187853s] INFO(runtime::rtio_clocking): using internal 125MHz RTIO clock
[ 0.464337s] INFO(board_artiq::si5324): waiting for Si5324 lock...
[ 8.341259s] INFO(board_artiq::si5324): ...locked
[ 8.371941s] INFO(runtime): network addresses: MAC=fc-0f-e7-07-33-ce IPv4=192.168.1.70 IPv6-LL=fe80::fe0f:e7ff:fe07:33ce IPv6=no configured address
[ 8.385637s] INFO(board_artiq::drtio_routing): could not read routing table from configuration, using default
[ 8.394351s] INFO(board_artiq::drtio_routing): routing table: RoutingTable { 0: 0; 1: 1 0; 2: 2 0; 3: 3 0; }
[ 8.407962s] INFO(runtime::mgmt): management interface active
[ 8.420022s] INFO(runtime::session): accepting network sessions
[ 8.433112s] INFO(runtime::session): running startup kernel
[ 8.437568s] INFO(runtime::session): no startup kernel found
[ 8.443382s] INFO(runtime::session): no connection, starting idle kernel
[ 8.450215s] INFO(runtime::session): no idle kernel found
[ 8.455639s] INFO(runtime::rtio_mgt::drtio): [DEST#0] destination is up


* The link status LED on the connected switch lights up every time, but the status LED on Kasli mostly remains off. Sometimes it lights up, but that does not necessarily mean Kasli will respond to ping (on some occasions it did not respond even with the status LED on).
* **The same hardware setup works perfectly with ARTIQ-8.8573+b168f0b.beta: tested with 100 power cycles, it responded to ping every time.**
* By power cycle I mean: power on, wait for the device to boot (15 s), ping 3 times (`ping -c 4 -B -W 30`), note the result, power off, wait 30 s.
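
For anyone wanting to repeat the power-cycle test, the procedure in the last bullet could be automated roughly as follows. This is only a sketch: `set_power` is a hypothetical callback standing in for whatever switches the supply (a programmable PSU, a PDU, a relay), and the function names are made up for illustration. The ping flags mirror the command quoted above and assume Linux iputils `ping`.

```python
import subprocess
import time
from typing import Callable, List


def ping_ok(host: str, count: int = 4, timeout_s: int = 30) -> bool:
    """Return True if ping exits with status 0, i.e. at least one reply arrived."""
    # Flags as in the report: -c (request count), -B (keep source address),
    # -W (reply timeout in seconds). Linux iputils ping is assumed.
    result = subprocess.run(
        ["ping", "-c", str(count), "-B", "-W", str(timeout_s), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def run_power_cycles(
    host: str,
    cycles: int,
    set_power: Callable[[bool], None],  # hypothetical: True = on, False = off
    check: Callable[[str], bool] = ping_ok,
    boot_wait_s: float = 15.0,
    off_wait_s: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[bool]:
    """Power cycle the board `cycles` times and record whether it answered ping."""
    results = []
    for _ in range(cycles):
        set_power(True)
        sleep(boot_wait_s)           # wait for the device to boot
        results.append(check(host))  # note the result
        set_power(False)
        sleep(off_wait_s)            # keep the board off before the next cycle
    return results
```

Calling `run_power_cycles("192.168.1.70", 100, set_power=my_switch)` and counting the `False` entries gives the number of failed cycles; `my_switch` has to be supplied by the user.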

Your System (omit irrelevant parts)

* Operating System: Ubuntu 22.04
* ARTIQ version: ARTIQ v7.8185.cc81464 / master
* Hardware involved:
  * Kasli v2.0.2
  * set of peripheral boards (Urukuls, Mirnys, DIO)

kaolpr commented 11 months ago

I've changed base to standalone. Out of 30 power cycles, 4 failed.

marmeladapk commented 8 months ago

This may or may not be the same error that I have faced several times; it seemed as if some part of the Ethernet chain required an additional "reset".

In my case I ran a continuous ping on Kasli's address and power cycled the board. Usually (>~90% of the time) Kasli would not respond to pings until I ran `artiq_flash start` or disconnected and reconnected the Ethernet cable. The most recent case was with a master configuration on release-7, but I have also encountered this with standalone configurations.

It also seemed to be device-specific: in my case the same gateware flashed to another Kasli worked fine.
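
To capture when the board starts or stops answering during such a continuous-ping test (for example right after `artiq_flash start` or a cable replug), a small monitor along these lines might help. It is a sketch under assumptions: the function names are invented for illustration, one echo request is sent per second, and Linux iputils `ping` is assumed.

```python
import subprocess
import time
from typing import Callable, Iterable, Iterator, Tuple


def probe(host: str) -> bool:
    """Send a single echo request; True if a reply arrived within 1 s."""
    return subprocess.run(
        ["ping", "-c", "1", "-W", "1", host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    ).returncode == 0


def poll(
    host: str,
    interval_s: float = 1.0,
    check: Callable[[str], bool] = probe,
) -> Iterator[Tuple[float, bool]]:
    """Endlessly yield (monotonic timestamp, reachable) samples."""
    while True:
        yield time.monotonic(), check(host)
        time.sleep(interval_s)


def transitions(
    samples: Iterable[Tuple[float, bool]],
) -> Iterator[Tuple[float, bool]]:
    """Yield a sample only when reachability differs from the previous one."""
    last = None
    for t, up in samples:
        if up != last:
            yield t, up
            last = up


def main(host: str) -> None:
    """Print a line each time the host changes between UP and DOWN."""
    for t, up in transitions(poll(host)):
        print(f"{t:10.1f}s  {'UP' if up else 'DOWN'}")
```

Running `main("192.168.1.70")` in one terminal while power cycling the board prints only the state changes, which makes it easier to see whether the link recovers on its own or only after a manual intervention.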