m-labs / artiq

A leading-edge control system for quantum information experiments
https://m-labs.hk/artiq
GNU Lesser General Public License v3.0
434 stars 201 forks source link

FPGA refuses connection #1100

Closed dhslichter closed 6 years ago

dhslichter commented 6 years ago

Running artiq 3.6 with a KC705 at NIST, we get errors every ~10-20 minutes where communication between core device and PC fails and the current kernel is ejected. Usually the next experiment submitted is OK. Today it's gone farther and will not receive new kernels. Rebooting the FPGA fixes the issue. Traceback below:

ERROR:worker(18758,single_dds_setter.py):root:Terminating` with exception (ConnectionRefusedError: [WinError 10061] No connection could be made because the target machine actively refused it)
Traceback (most recent call last):
  File "C:\Anaconda3\envs\artiq3.6\lib\site-packages\artiq\master\worker_impl.py", line 279, in main
    exp_inst.run()
  File "C:\Anaconda3\envs\artiq3.6\lib\site-packages\artiq\language\core.py", line 53, in run_on_core
    return getattr(self, arg).run(run_on_core, ((self,) + k_args), k_kwargs)
  File "C:\Anaconda3\envs\artiq3.6\lib\site-packages\artiq\coredevice\core.py", line 127, in run
    self.comm.check_system_info()
  File "C:\Anaconda3\envs\artiq3.6\lib\site-packages\artiq\coredevice\comm_kernel.py", line 277, in check_system_info
    self._write_empty(_H2DMsgType.SYSTEM_INFO_REQUEST)
  File "C:\Anaconda3\envs\artiq3.6\lib\site-packages\artiq\coredevice\comm_kernel.py", line 242, in _write_empty
    self._write_header(ty)
  File "C:\Anaconda3\envs\artiq3.6\lib\site-packages\artiq\coredevice\comm_kernel.py", line 234, in _write_header
    self.open()
  File "C:\Anaconda3\envs\artiq3.6\lib\site-packages\artiq\coredevice\comm_kernel.py", line 147, in open
    self.socket = initialize_connection(self.host, self.port)
  File "C:\Anaconda3\envs\artiq3.6\lib\site-packages\artiq\coredevice\comm_kernel.py", line 105, in initialize_connection
    sock = socket.create_connection((host, port), 5.0)
  File "C:\Anaconda3\envs\artiq3.6\lib\socket.py", line 712, in create_connection
    raise err
  File "C:\Anaconda3\envs\artiq3.6\lib\socket.py", line 703, in create_connection
    sock.connect(sa)
ConnectionRefusedError: [WinError 10061] No connection could be made because the target machine actively refused it
ERROR:master:artiq.master.scheduler:got worker exception in run stage, deleting RID 18758
sbourdeauducq commented 6 years ago

Anything in the board log/UART?

jordens commented 6 years ago

Sounds like https://github.com/m-labs/artiq/issues/986 . doesn't look like it was cherry-picked for release-3

sbourdeauducq commented 6 years ago

It was cherry-picked (9589decd54), but the fix also contained a LLVM update and caused https://github.com/m-labs/artiq/issues/1007. There has been no release since.

dhslichter commented 6 years ago

@sbourdeauducq haven't looked at the board log or UART, will do so when the event repeats. Does this need to wait for a release until #1007 is addressed or can it come out earlier?

dhslichter commented 6 years ago

Another traceback, for reference. This has not caused a total crash requiring an FPGA reboot, just killed the currently running kernel, exiting with the error message below:

ERROR:worker(18839,ion_monitor.py):root:Terminating with exception (ConnectionResetError: [WinError 10054] An existing connection was forcibly closed by the remote host)
Traceback (most recent call last):
  File "C:\Anaconda3\envs\artiq3.6\lib\site-packages\artiq\master\worker_impl.py", line 279, in main
    exp_inst.run()
  File "c:\artiq_cryo1\repository\experiments\ion_monitor.py", line 61, in run
    self.detloop()
  File "C:\Anaconda3\envs\artiq3.6\lib\site-packages\artiq\language\core.py", line 53, in run_on_core
    return getattr(self, arg).run(run_on_core, ((self,) + k_args), k_kwargs)
  File "C:\Anaconda3\envs\artiq3.6\lib\site-packages\artiq\coredevice\core.py", line 133, in run
    self.comm.serve(embedding_map, symbolizer, demangler)
  File "C:\Anaconda3\envs\artiq3.6\lib\site-packages\artiq\coredevice\comm_kernel.py", line 599, in serve
    self._read_header()
  File "C:\Anaconda3\envs\artiq3.6\lib\site-packages\artiq\coredevice\comm_kernel.py", line 179, in _read_header
    (sync_byte, ) = struct.unpack("B", self.read(1))
  File "C:\Anaconda3\envs\artiq3.6\lib\site-packages\artiq\coredevice\comm_kernel.py", line 160, in read
    rn = self.socket.recv(min(8192, length - len(r)))
ConnectionResetError: [WinError 10054] An existing connection was forcibly closed by the remote host
ERROR:master:artiq.master.scheduler:got worker exception in run stage, deleting RID 18839

Last few minutes of the corelog record downloaded here:

[ 23803.546057s]  INFO(runtime::kern_hwreq): resetting RTIO
[ 23804.364343s]  WARN(runtime): network error: malformed packet
[ 23804.574733s]  WARN(runtime): network error: malformed packet
[ 23804.606177s]  WARN(runtime): network error: malformed packet
[ 23810.118792s]  INFO(runtime::session): no connection, starting idle kernel
[ 23810.124838s]  INFO(runtime::session): no idle kernel found
[ 23811.912328s]  INFO(runtime::session): new connection from 132.163.182.162:51464
[ 23811.963641s]  INFO(runtime::kern_hwreq): resetting RTIO
[ 23833.387247s]  WARN(runtime): network error: malformed packet
[ 23833.498708s]  WARN(runtime): network error: malformed packet
[ 23834.569358s]  WARN(runtime): network error: malformed packet
[ 23856.281737s]  INFO(runtime::session): no connection, starting idle kernel
[ 23856.288163s]  INFO(runtime::session): no idle kernel found
[ 23859.489129s]  INFO(runtime::session): new connection from 132.163.182.162:51465
[ 23859.538776s]  INFO(runtime::kern_hwreq): resetting RTIO
[ 23862.665890s]  WARN(runtime): network error: malformed packet
[ 23863.238021s]  WARN(runtime): network error: malformed packet
[ 23864.435143s]  WARN(runtime): network error: malformed packet
[ 23873.966295s]  INFO(runtime::session): no connection, starting idle kernel
[ 23873.972099s]  INFO(runtime::session): no idle kernel found
[ 23900.164897s]  WARN(runtime): network error: malformed packet
[ 23901.018328s]  WARN(runtime): network error: malformed packet
[ 23901.445735s]  WARN(runtime): network error: malformed packet
[ 23929.535799s]  WARN(runtime): network error: malformed packet
[ 23930.094184s]  WARN(runtime): network error: malformed packet
[ 23930.730374s]  WARN(runtime): network error: malformed packet
[ 23958.804001s]  WARN(runtime): network error: malformed packet
[ 23958.842125s]  WARN(runtime): network error: malformed packet
[ 23960.656888s]  WARN(runtime): network error: malformed packet
[ 24320.891686s]  INFO(runtime::mgmt): new connection from 132.163.182.162:51509
[ 24351.798967s]  INFO(runtime::mgmt): new connection from 132.163.182.162:51510

This "malformed packet" deal tends to show up often...

sbourdeauducq commented 6 years ago

Does this need to wait for a release until #1007 is addressed or can it come out earlier?

You can try a release-3 dev package.

dhslichter commented 6 years ago

Another corelog dump, in case it is useful @whitequark. I started it from a bit before when the error occurred.

[105431.845797s] DEBUG(runtime::session): comm->host RpcRequest { async: false }
[105431.845961s] DEBUG(proto::rpc_proto): send<2>(Object)->Bool
[105431.849261s] DEBUG(runtime::session): comm<-host RpcReply { tag: [98] }
[105431.849961s] DEBUG(runtime::session): comm<-kern RpcRecvRequest(0x4fffd654)
[105431.850109s] DEBUG(proto::rpc_proto): recv ...->Bool
[105431.850234s] DEBUG(runtime::session): comm->kern RpcRecvReply(Ok(0))
[105432.652003s] DEBUG(runtime::session): comm<-kern (async RPC)
[105432.652111s] DEBUG(runtime::session): comm->host RpcRequest { async: true }
[105432.652269s] DEBUG(runtime::session): [0, 0, 0, 3, 79, 0, 0, 0, 8, 115, 0, 0, 0, 6, 99, 111, 117, 110, 116, 115, 102, 64, 44, 14, 86, 4, 24, 147, 117, 107, 0, 0, 0, 9, 98, 114, 111, 97, 100, 99, 97, 115, 116, 98, 64, 0, 0, 0, 0, 1, 110]
[105432.653328s] DEBUG(runtime::session): comm<-kern RpcSend { async: false, service: 4, tag: [79, 115, 97, 105, 107, 98, 107, 98, 107, 98, 58, 110], data: 0x4fffd618 }
[105432.653658s] DEBUG(runtime::session): comm->host RpcRequest { async: false }
[105432.653821s] DEBUG(proto::rpc_proto): send<4>(Object, String, Array(Int32), Keyword(Bool), Keyword(Bool), Keyword(Bool))->None
[105432.662564s] DEBUG(runtime::session): comm<-host RpcReply { tag: [110] }
[105432.663265s] DEBUG(runtime::session): comm<-kern RpcRecvRequest(0x4fffd638)
[105432.663416s] DEBUG(proto::rpc_proto): recv ...->None
[105432.663525s] DEBUG(runtime::session): comm->kern RpcRecvReply(Ok(0))
[105432.664370s] DEBUG(runtime::session): comm<-kern RpcSend { async: false, service: 2, tag: [79, 58, 98], data: 0x4fffd644 }
[105432.664622s] DEBUG(runtime::session): comm->host RpcRequest { async: false }
[105432.664785s] DEBUG(proto::rpc_proto): send<2>(Object)->Bool
[105432.668086s] DEBUG(runtime::session): comm<-host RpcReply { tag: [98] }
[105432.668785s] DEBUG(runtime::session): comm<-kern RpcRecvRequest(0x4fffd644)
[105432.668934s] DEBUG(proto::rpc_proto): recv ...->Bool
[105432.669057s] DEBUG(runtime::session): comm->kern RpcRecvReply(Ok(0))
[105433.470723s] DEBUG(runtime::session): comm<-kern (async RPC)
[105433.470832s] DEBUG(runtime::session): comm->host RpcRequest { async: true }
[105433.470989s] DEBUG(runtime::session): [0, 0, 0, 3, 79, 0, 0, 0, 8, 115, 0, 0, 0, 6, 99, 111, 117, 110, 116, 115, 102, 64, 43, 57, 88, 16, 98, 77, 211, 107, 0, 0, 0, 9, 98, 114, 111, 97, 100, 99, 97, 115, 116, 98, 64, 0, 0, 0, 0, 1, 110]
[105433.472048s] DEBUG(runtime::session): comm<-kern RpcSend { async: false, service: 4, tag: [79, 115, 97, 105, 107, 98, 107, 98, 107, 98, 58, 110], data: 0x4fffd608 }
[105433.472378s] DEBUG(runtime::session): comm->host RpcRequest { async: false }
[105433.472542s] DEBUG(proto::rpc_proto): send<4>(Object, String, Array(Int32), Keyword(Bool), Keyword(Bool), Keyword(Bool))->None
[105433.481280s] DEBUG(runtime::session): comm<-host RpcReply { tag: [110] }
[105433.481982s] DEBUG(runtime::session): comm<-kern RpcRecvRequest(0x4fffd628)
[105433.482133s] DEBUG(proto::rpc_proto): recv ...->None
[105433.482243s] DEBUG(runtime::session): comm->kern RpcRecvReply(Ok(0))
[105433.482978s] DEBUG(runtime::session): comm<-kern RpcSend { async: false, service: 2, tag: [79, 58, 98], data: 0x4fffd634 }
[105433.483228s] DEBUG(runtime::session): comm->host RpcRequest { async: false }
[105433.483392s] DEBUG(proto::rpc_proto): send<2>(Object)->Bool
[105433.486692s] DEBUG(runtime::session): comm<-host RpcReply { tag: [98] }
[105433.487393s] DEBUG(runtime::session): comm<-kern RpcRecvRequest(0x4fffd634)
[105433.487541s] DEBUG(proto::rpc_proto): recv ...->Bool
[105433.487665s] DEBUG(runtime::session): comm->kern RpcRecvReply(Ok(0))
[105433.987165s] DEBUG(smoltcp::iface::ethernet): address 132.163.182.162 not in neighbor cache, sending ARP request
[105434.289494s] DEBUG(runtime::session): comm<-kern (async RPC)
[105434.289602s] DEBUG(runtime::session): comm->host RpcRequest { async: true }
[105434.289760s] DEBUG(runtime::session): [0, 0, 0, 3, 79, 0, 0, 0, 8, 115, 0, 0, 0, 6, 99, 111, 117, 110, 116, 115, 102, 64, 42, 248, 212, 253, 243, 182, 70, 107, 0, 0, 0, 9, 98, 114, 111, 97, 100, 99, 97, 115, 116, 98, 64, 0, 0, 0, 0, 1, 110]

[105434.290667s] DEBUG(runtime::session): comm<-kern RpcSend { async: false, service: 4, tag: [79, 115, 97, 105, 107, 98, 107, 98, 107, 98, 58, 110], data: 0x4fffd5f8 }
[105434.290996s] DEBUG(runtime::session): comm->host RpcRequest { async: false }
[105434.291159s] DEBUG(proto::rpc_proto): send<4>(Object, String, Array(Int32), Keyword(Bool), Keyword(Bool), Keyword(Bool))->None
[105436.486324s] DEBUG(smoltcp::socket::tcp): #1:132.163.182.107:1381:132.163.182.162:56400: segment not in receive window (1254199323..1254199324 not intersecting 1254199324..1254264859), will send challenge ACK
[105436.486892s] DEBUG(smoltcp::socket::tcp): #1:132.163.182.107:1381:132.163.182.162:56400: timeout exceeded
[105436.487498s]  INFO(runtime::session): no connection, starting idle kernel
[105436.493298s]  INFO(runtime::session): no idle kernel found
[105463.020899s] DEBUG(smoltcp::iface::ethernet): non-unicast source address
[105463.021011s] DEBUG(smoltcp::iface::ethernet): cannot process ingress packet: malformed packet
[105463.021137s] DEBUG(smoltcp::iface::ethernet): packet dump follows:
EthernetII src=00-07-7d-c5-d1-ff dst=00-0a-35-03-18-d4 type=ARP
\ ARP type=Ethernet+IPv4 src=00-07-7d-c5-d1-ff/0.0.0.0 tgt=00-0a-35-03-18-d4/132.163.182.107 op=Request
[105463.021871s]  WARN(runtime): network error: malformed packet
[105463.210595s] DEBUG(smoltcp::iface::ethernet): non-unicast source address
[105463.210705s] DEBUG(smoltcp::iface::ethernet): cannot process ingress packet: malformed packet
[105463.210828s] DEBUG(smoltcp::iface::ethernet): packet dump follows:
EthernetII src=f0-7f-06-e4-0e-ff dst=00-0a-35-03-18-d4 type=ARP
\ ARP type=Ethernet+IPv4 src=f0-7f-06-e4-0e-ff/0.0.0.0 tgt=00-0a-35-03-18-d4/132.163.182.107 op=Request
[105463.211548s]  WARN(runtime): network error: malformed packet
[105463.937398s] DEBUG(smoltcp::iface::ethernet): non-unicast source address
[105463.937508s] DEBUG(smoltcp::iface::ethernet): cannot process ingress packet: malformed packet
[105463.937631s] DEBUG(smoltcp::iface::ethernet): packet dump follows:
EthernetII src=f0-7f-06-e4-0e-7f dst=00-0a-35-03-18-d4 type=ARP
\ ARP type=Ethernet+IPv4 src=f0-7f-06-e4-0e-7f/0.0.0.0 tgt=00-0a-35-03-18-d4/132.163.182.107 op=Request
[105463.938350s]  WARN(runtime): network error: malformed packet
[105478.488442s]  INFO(runtime::mgmt): new connection from 132.163.182.162:56411
whitequark commented 6 years ago

This looks like the culprit:

[105436.486324s] DEBUG(smoltcp::socket::tcp): #1:132.163.182.107:1381:132.163.182.162:56400: segment not in receive window (1254199323..1254199324 not intersecting 1254199324..1254264859), will send challenge ACK
[105436.486892s] DEBUG(smoltcp::socket::tcp): #1:132.163.182.107:1381:132.163.182.162:56400: timeout exceeded

There have been some fixes upstream in smoltcp, I'll port ARTIQ to a newer version of it.

whitequark commented 6 years ago

I've updated smoltcp in master, please retest.

sbourdeauducq commented 6 years ago

@dhslichter Note that the potential fix isn't in release-3 yet.

dhslichter commented 6 years ago

@sbourdeauducq @whitequark sorry, too many things on the plate here. I will try to get this tested this week, would be nice to have the fix.

sbourdeauducq commented 6 years ago

@dhslichter Is the problem still present? It should be fixed in release-3 as well now.

sbourdeauducq commented 6 years ago

@dhslichter It takes us a lot of effort to fix those things; can you please test?

dhslichter commented 6 years ago

@sbourdeauducq sorry, Jonas Keller has tested it here and we didn't see any more of the previous communications issues. Let's close.

sbourdeauducq commented 6 years ago

Good, thanks!

dhslichter commented 6 years ago

Will update/reopen if this rears its head again, but we will call it good for now.