m-labs / artiq

A leading-edge control system for quantum information experiments
https://m-labs.hk/artiq
GNU Lesser General Public License v3.0

Connection reset by peer error breaks DDS/TTL panel #1134

Open philipkent opened 6 years ago

philipkent commented 6 years ago

Running ARTIQ version 3.6 on Ubuntu 14.04.5 LTS (64-bit), we periodically get a "connection reset by peer" error (error number 104). After the error occurs, the currently running experiment is killed, and the connection between the core device and the DDS/TTL panel appears to be lost, because DDSs and TTLs can no longer be controlled using the panel in the dashboard. If we reset the FPGA and restart the artiq_master and artiq_dashboard scripts, we regain control of the DDSs and TTLs with the panel. Could this be related to the network stack issues that were occurring with earlier versions of ARTIQ 3?
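For readers matching this against their own logs: on Linux, error number 104 is `ECONNRESET`, which Python raises as `ConnectionResetError`. The sketch below shows one way to recognize that error in host-side code. The helper name `is_connection_reset` is hypothetical and is not part of ARTIQ.

```python
import errno


def is_connection_reset(exc: BaseException) -> bool:
    """Return True if exc is an OS-level "connection reset by peer" error.

    Hypothetical helper for illustration; not an ARTIQ API.
    Compares against errno.ECONNRESET (104 on Linux) rather than a
    hard-coded number, since the value differs between platforms.
    """
    return isinstance(exc, OSError) and exc.errno == errno.ECONNRESET
```

Such a check could be used in an `except OSError as exc:` handler to log resets separately from other network failures such as timeouts.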

sbourdeauducq commented 6 years ago

Is restarting the dashboard sufficient, or do you have to restart the core device as well? Most certainly, you don't have to restart the master. What is the error when trying to reconnect to the device (and e.g. run an experiment) after this problem has occurred? What is in the core device log when the problem occurs, and when trying to reconnect?

philipkent commented 6 years ago

We will gather some data from the core logs. It may be that there are multiple things going on that are uncorrelated. I'll post better diagnostics when we have them.

sbourdeauducq commented 5 years ago

@philipkent Do you have more information?

philipkent commented 5 years ago

We were able to fix this by running a single network cable directly from the computer’s network adapter to the KC705’s network port, bypassing a network switch we were originally using.

We tested a setup that bypassed the NIST network by using the network switch with only the KC705 and the Linux system running ARTIQ attached. We would still see connection reset errors fairly frequently with that setup. Once we removed the switch and ran a direct connection between the core device and the host, the connection reset errors stopped. So, where we originally suspected the NIST network, the culprit turned out to be the switch.
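A comparison like this (switch versus direct cable) can be made a little more repeatable with a small probe that tallies connection outcomes. The following is a minimal sketch, not what was actually run here: `probe_tcp` is a hypothetical helper, and the host and port are placeholders for whatever TCP service the core device exposes. It only exercises connection setup, so it may miss resets that occur in the middle of a long-lived connection.

```python
import socket
import time
from collections import Counter


def probe_tcp(host, port, attempts=5, timeout=2.0, pause=0.0):
    """Open and close a TCP connection `attempts` times and tally outcomes.

    Hypothetical diagnostic helper; host and port are supplied by the caller.
    Returns a Counter keyed by "ok", "reset", or the exception class name.
    """
    results = Counter()
    for _ in range(attempts):
        try:
            # create_connection performs the TCP handshake; the with-block
            # closes the socket again immediately afterwards.
            with socket.create_connection((host, port), timeout=timeout):
                results["ok"] += 1
        except ConnectionResetError:
            results["reset"] += 1
        except OSError as exc:
            # Covers timeouts, refused connections, unreachable hosts, etc.
            results[type(exc).__name__] += 1
        time.sleep(pause)
    return results
```

Running the same probe once through the switch and once over the direct cable, with identical `attempts`, gives two tallies that can be compared side by side.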

Last I checked, we were still periodically getting moninj errors that break the DDS/TTL panel. We found that simply restarting the dashboard restores the DDS/TTL panel, as you said; neither the core device nor the master needs to be reset. We can live with the moninj error for now, and we are going to move to ARTIQ v3.7 at some point to see if that resolves the moninj problem.

sbourdeauducq commented 5 years ago

Can we get our hands on that switch? I was asking because it would be good to get bugs like this fixed and not worked around.

philipkent commented 5 years ago

I'm out for the next few days, but we still have the switch. I'll look into sending it when I get back later this week.

sbourdeauducq commented 5 years ago

Thanks! Please send to:

M-Labs Limited
G/F, 31 Pan Hoi Street
Kam Hoi Mansion
Quarry Bay, Hong Kong

Phone: +852 59362721