Open philipkent opened 6 years ago
Is restarting the dashboard sufficient or do you have to restart the core device as well? Most certainly, you don't have to restart the master. What is the error when trying to reconnect to the device (and e.g. run an experiment) after this problem has occured? What is in the core device log when the problem occurs, and when trying to reconnect?
We will gather some data from the core logs. It may be that there are multiple things going on that are uncorrelated. I'll post better diagnostics when we have them.
@philipkent Do you have more information?
We were able to fix this by running a single network cable directly from the computer’s network adapter to the KC705’s network port, bypassing a network switch we were originally using.
We tested a setup that bypassed the NIST network by using the network switch with only the kc705 and the Linux system running ARTIQ attached. We would still see connection reset errors fairly frequently with that setup. Once we removed the switch and ran a direct connection between the core device and the host the connection reset errors stopped. So, where we originally suspected the NIST network, the culprit turned out to be the switch.
Last I checked we were still periodically getting moninj errors that break the dds/ttl panel. We found that simply restarting the dashboard restores the dds/ttl panel as you said; neither the core device or the master needs to be reset. We can live with the moninj error for now, and we are going move to ARTIQ v3.7 at some point to see if that resolves the moninj problem.
Can we get our hands on that switch? I was asking because it would be good to get bugs like this fixed and not worked around.
I'm out for the next few days, but we still have the switch. I'll look into sending when I get back later this week.
Thanks! Please send to:
M-Labs Limited G/F, 31 Pan Hoi Street Kam Hoi Mansion Quarry Bay, Hong Kong
Phone: +852 59362721
Running artiq version 3.6 on Ubuntu 14.04.5 LTS 64 bit we periodically get a "connection reset by peer" error (error number 104). After the error occurs, the currently running experiment is killed and it seems like the connection is lost between the core device and the DDS/TTL panel because DDS' and TTLs can no longer be controlled using the panel in the dashboard. If we reset the FPGA and restart the artiq_master and artiq_dashboard scripts we then regain control of DDS' and TTLs with the panel. Could this be related to the network stack issues that we're occurring with earlier versions of artiq 3?