m-labs / artiq

A leading-edge control system for quantum information experiments
https://m-labs.hk/artiq

Connection reset by peer error #398

Closed: dleibrandt closed this issue 8 years ago

dleibrandt commented 8 years ago

The experiment below successfully goes through the n_scan_points for loop a random number of times (of the order of a few dozen), then terminates with the following error message. I'm running 1.0rc2 on Linux.

root:Terminating with exception (ConnectionResetError: [Errno 104] Connection reset by peer)
Traceback (most recent call last):
  File "/home/rabi/anaconda3/envs/artiq-2016-04-04/lib/python3.5/site-packages/artiq/master/worker_impl.py", line 231, in main
    exp_inst.run()
  File "/home/rabi/artiq-work/detect.py", line 227, in run
    self.run_experiments()
  File "/home/rabi/anaconda3/envs/artiq-2016-04-04/lib/python3.5/site-packages/artiq/language/core.py", line 192, in run_on_core
    return getattr(self, arg).run(run_on_core, ((self,) + k_args), k_kwargs)
  File "/home/rabi/anaconda3/envs/artiq-2016-04-04/lib/python3.5/site-packages/artiq/coredevice/core.py", line 106, in run
    self.comm.load(kernel_library)
  File "/home/rabi/anaconda3/envs/artiq-2016-04-04/lib/python3.5/site-packages/artiq/coredevice/comm_generic.py", line 293, in load
    self._read_empty(_D2HMsgType.LOAD_COMPLETED)
  File "/home/rabi/anaconda3/envs/artiq-2016-04-04/lib/python3.5/site-packages/artiq/coredevice/comm_generic.py", line 132, in _read_empty
    self._read_header()
  File "/home/rabi/anaconda3/envs/artiq-2016-04-04/lib/python3.5/site-packages/artiq/coredevice/comm_generic.py", line 104, in _read_header
    (sync_byte, ) = struct.unpack("B", self.read(1))
  File "/home/rabi/anaconda3/envs/artiq-2016-04-04/lib/python3.5/site-packages/artiq/coredevice/comm_tcp.py", line 58, in read
    rn = self.socket.recv(min(8192, length - len(r)))
ConnectionResetError: [Errno 104] Connection reset by peer

from artiq.experiment import *
import numpy as np

class TestHandover(EnvExperiment):
    """Test handover"""

    def build(self):
        # Get the hardware devices
        self.setattr_device("scheduler")

        self.setattr_device("core")
        self.setattr_device("core_dds")

        self.setattr_argument("n_bins", NumberValue(50, step=1, ndecimals=0))
        self.setattr_argument("n_experiments", NumberValue(100, step=1, ndecimals=0))
        self.setattr_argument("n_scan_points", NumberValue(1000, step=1, ndecimals=0))

    def run(self):
        for i in range(int(self.n_scan_points)):
            print(i)

            self.hist = [0 for _ in range(int(self.n_bins))]
            self.total = 0

            self.random_numbers = [np.random.poisson(20) for i in range(int(self.n_experiments))]

            self.run_experiments()
            self.core.comm.close()

            self.set_dataset("detect_photon_histogram", np.array(self.hist),
                             broadcast=True, save=True)

            self.scheduler.pause()

    @kernel
    def run_experiments(self):
        for i in range(int(self.n_experiments)):
            delay(1*ms)

            n = int(self.random_numbers[i])
            if n >= int(self.n_bins):
                n = int(self.n_bins) - 1
            self.hist[n] += 1
            self.total += n

r-srinivas commented 8 years ago

I ran into a similar error as well with the following experiment,

from artiq.experiment import *
import numpy as np

class detect_test(EnvExperiment):

    def build(self):
        self.setattr_device("core")
        self.setattr_device("ttl1")
        self.setattr_device("ttl6")
        self.points = 1
        self.input = np.random.rand(self.points)
        self.results = np.zeros((self.points, 1))

    def run(self):
        for i in range(self.points):
            ct = int(20*self.input[i])
            self.results[i] = self.detect(ct)
        print(self.results)

    @kernel
    def detect(self, rand_input):
        with parallel:
            self.ttl6.gate_rising(100*us)
            with sequential:
                for i in range(rand_input):                
                    self.ttl1.pulse(1*us)
                    delay(1*us)
        counts = self.ttl6.count()
        return counts

With the error message,

ERROR:worker(1461,detect_test.py):root:Terminating with exception (ConnectionResetError: [WinError 10054] An existing connection was forcibly closed by the remote host)
Traceback (most recent call last):
  File "C:\Anaconda3\envs\artiq-2016-04-15\lib\site-packages\artiq\master\worker_impl.py", line 231, in main
    exp_inst.run()
  File "C:\Anaconda3\artiq_test\repository\test_experiments\detect_test.py", line 24, in run
    self.results[i] = self.detect(ct)
  File "C:\Anaconda3\envs\artiq-2016-04-15\lib\site-packages\artiq\language\core.py", line 192, in run_on_core
    return getattr(self, arg).run(run_on_core, ((self,) + k_args), k_kwargs)
  File "C:\Anaconda3\envs\artiq-2016-04-15\lib\site-packages\artiq\coredevice\core.py", line 108, in run
    self.comm.serve(object_map, symbolizer)
  File "C:\Anaconda3\envs\artiq-2016-04-15\lib\site-packages\artiq\coredevice\comm_generic.py", line 523, in serve
    self._read_header()
  File "C:\Anaconda3\envs\artiq-2016-04-15\lib\site-packages\artiq\coredevice\comm_generic.py", line 104, in _read_header
    (sync_byte, ) = struct.unpack("B", self.read(1))
  File "C:\Anaconda3\envs\artiq-2016-04-15\lib\site-packages\artiq\coredevice\comm_tcp.py", line 58, in read
    rn = self.socket.recv(min(8192, length - len(r)))
ConnectionResetError: [WinError 10054] An existing connection was forcibly closed by the remote host
ERROR:master:artiq.master.scheduler:got worker exception in run stage, deleting RID 1461

Restarting the FPGA seemed to fix it. Not sure what caused it.

whitequark commented 8 years ago

When you get this kind of error, it is helpful to examine the coredevice log (via artiq_corelog).

I think we should do this automatically.

r-srinivas commented 8 years ago

Okay, it's not quite the same. I just get a reset error when calling self.core.comm.close(), which I guess is different from what Dave got.

    @kernel
    def detect(self, rand_input):
        with parallel:
            self.ttl6.gate_rising(100*us)
            with sequential:
                for i in range(rand_input):                
                    self.ttl1.pulse(1*us)
                    delay(1*us)
        counts = self.ttl6.count()
        self.core.comm.close()
        return counts

Causes the error.

dleibrandt commented 8 years ago

Restarting the FPGA doesn't fix my problem. Running corelog after the error spits out:

Startup RTIO clock: internal
sbourdeauducq commented 8 years ago

@r-srinivas What do you expect closing the core device connection from a kernel to do?
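
For contrast, the host-side pattern used in the first post is the intended usage. Below is a minimal sketch, not code from this thread: the class name and kernel body are hypothetical, and it assumes the 1.0-era self.core.comm.close() API seen above. Called from host code between kernel invocations, the close is harmless because the connection is reopened on the next kernel call; called from inside a kernel, it is executed as an RPC that tears down the very TCP session the core device is serving, which would be consistent with the reset seen here.

from artiq.experiment import *

class HostSideClose(EnvExperiment):
    """Hypothetical example: close the core device connection on the host."""

    def build(self):
        self.setattr_device("core")

    @kernel
    def noop(self):
        delay(1*ms)  # placeholder kernel body

    def run(self):
        for _ in range(10):
            self.noop()              # kernel runs on the core device
            self.core.comm.close()   # host side: safe; the connection is
                                     # reopened on the next kernel call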

sbourdeauducq commented 8 years ago

@dleibrandt Just went through >700 iterations without any problem (using ARTIQ 2.0/master). Is your network generally working well?

sbourdeauducq commented 8 years ago

@dleibrandt Are you able to reproduce the problem? With 1.0rc3? Can you monitor the connections with wireshark when it happens?

dleibrandt commented 8 years ago

The problem is still present with 1.0rc3.

Wireshark seems kind of complicated to set up, so here are some simple tests for now:

My setup currently has the computer's Ethernet port going to a switch (D-Link DGS-2205). Other ports of the switch are connected to the local network and the FPGA. Disconnecting the local network from the switch doesn't fix the problem. Plugging the FPGA directly into the computer's Ethernet port (bypassing the switch) does fix the problem (I just ran 1000 successful iterations). Getting rid of the switch and plugging both the computer and the FPGA directly into the local network doesn't fix the problem either.

So the problem seems to be related to the NIST network. Any ideas? Perhaps the simplest fix is to get a second Ethernet card for my computer?

jordens commented 8 years ago

tcpdump -s0 -w kc705.pcap ip host YOUR-KC705-IP-OR-HOSTNAME should do it. You can load that kc705.pcap into wireshark and/or send it to us.

dleibrandt commented 8 years ago

OK, I just emailed the dump from the above command to @jordens and @sbourdeauducq.

sbourdeauducq commented 8 years ago

FWIW, we are using a TP-Link TL-WR841N (running OpenWrt) here; the KC705 is on one port and the control PC is on another, or connects over WiFi.

jordens commented 8 years ago

It looks like the jumbo frames are a problem, but I don't know yet whose fault it is or what the best solution is. Disabling jumbo frames on that interface on Windows should be a workaround.

sbourdeauducq commented 8 years ago

Jumbo frames are definitely going to break in LiteEth, which uses this for sizing the packet buffers: https://github.com/m-labs/misoc/blob/8c0e0ff43d8937aac2a71fb4eca077b0795825dc/misoc/cores/liteeth_mini/common.py#L7

Cc @enjoy-digital
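
As an illustration of the failure mode only, not the LiteEth gateware: if the receive buffers are sized by a fixed MTU constant, as in the misoc file linked above, a frame longer than that constant cannot be stored whole. The buffer size below is a made-up stand-in for the real constant.

# Illustration only; ETH_MTU is a hypothetical stand-in for the
# buffer-sizing constant in misoc's liteeth_mini.
ETH_MTU = 1530

def receive_into_fixed_buffer(frame: bytes, buf_size: int = ETH_MTU) -> bytes:
    # A buffer of buf_size octets silently loses the tail of anything
    # larger, so a jumbo frame comes out corrupted.
    return frame[:buf_size]

jumbo = bytes(4395)  # a 4395-octet jumbo frame, as in the capture discussed below
assert len(receive_into_fixed_buffer(jumbo)) < len(jumbo)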

dnadlinger commented 8 years ago

Shouldn't the maximum MTU be discovered automatically first even when jumbo frames are enabled on the interface?

jordens commented 8 years ago

That relies on everybody doing the right thing.

sbourdeauducq commented 8 years ago

For this to work, it seems the core device should send back an ICMP Type 3 (Destination Unreachable, Fragmentation Needed) message upon receiving a jumbo frame. Right now liteeth corrupts jumbo frames, and the upper layers drop them (silently).

jordens commented 8 years ago

It doesn't drop them. It actually acks the first 1475 and then 2935 octets of the 4395-octet frame, and then the TCP machinery times out and picks up again. That's wrong, and it also costs 200 ms on every kernel upload. The fact that the difference between the first two acks is 1460 octets could hint towards some wrapping. There is probably another bug that then, in some cases, causes the second ack to not appear and lwip to reset the session.

sbourdeauducq commented 8 years ago

The connection should not fail anymore. But it's not sending back ICMP frames yet when the MTU is exceeded, so there may still be a significant latency increase when jumbo frames are enabled.

sbourdeauducq commented 8 years ago

If the control PC's operating system honors TCP MSS, then there should be no latency increase (MSS was also broken before this patch).
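
As a host-side illustration of MSS, a sketch rather than anything ARTIQ does: on Linux, a client can clamp the maximum segment size on a socket before connecting, keeping its segments within a standard 1500-octet MTU even when the interface has jumbo frames enabled. TCP_MAXSEG is a standard Linux socket option; the core device address and port below are hypothetical placeholders.

import socket

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
# Clamp the MSS this connection will advertise (Linux-specific behavior).
s.setsockopt(socket.IPPROTO_TCP, socket.TCP_MAXSEG, 1460)  # standard Ethernet MSS
s.connect(("192.168.1.50", 1381))  # hypothetical core device address and port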

sbourdeauducq commented 8 years ago

1.0rc4 is in conda and should fix this.