Closed dhslichter closed 6 years ago
Anything in the board log/UART?
Sounds like https://github.com/m-labs/artiq/issues/986 . doesn't look like it was cherry-picked for release-3
It was cherry-picked (9589decd54), but the fix also contained a LLVM update and caused https://github.com/m-labs/artiq/issues/1007. There has been no release since.
@sbourdeauducq haven't looked at the board log or UART, will do so when the event repeats. Does this need to wait for a release until #1007 is addressed or can it come out earlier?
Another traceback, for reference. This has not caused a total crash requiring an FPGA reboot, just killed the currently running kernel, exiting with the error message below:
ERROR:worker(18839,ion_monitor.py):root:Terminating with exception (ConnectionResetError: [WinError 10054] An existing connection was forcibly closed by the remote host)
Traceback (most recent call last):
File "C:\Anaconda3\envs\artiq3.6\lib\site-packages\artiq\master\worker_impl.py", line 279, in main
exp_inst.run()
File "c:\artiq_cryo1\repository\experiments\ion_monitor.py", line 61, in run
self.detloop()
File "C:\Anaconda3\envs\artiq3.6\lib\site-packages\artiq\language\core.py", line 53, in run_on_core
return getattr(self, arg).run(run_on_core, ((self,) + k_args), k_kwargs)
File "C:\Anaconda3\envs\artiq3.6\lib\site-packages\artiq\coredevice\core.py", line 133, in run
self.comm.serve(embedding_map, symbolizer, demangler)
File "C:\Anaconda3\envs\artiq3.6\lib\site-packages\artiq\coredevice\comm_kernel.py", line 599, in serve
self._read_header()
File "C:\Anaconda3\envs\artiq3.6\lib\site-packages\artiq\coredevice\comm_kernel.py", line 179, in _read_header
(sync_byte, ) = struct.unpack("B", self.read(1))
File "C:\Anaconda3\envs\artiq3.6\lib\site-packages\artiq\coredevice\comm_kernel.py", line 160, in read
rn = self.socket.recv(min(8192, length - len(r)))
ConnectionResetError: [WinError 10054] An existing connection was forcibly closed by the remote host
ERROR:master:artiq.master.scheduler:got worker exception in run stage, deleting RID 18839
Last few minutes of the corelog record downloaded here:
[ 23803.546057s] INFO(runtime::kern_hwreq): resetting RTIO
[ 23804.364343s] WARN(runtime): network error: malformed packet
[ 23804.574733s] WARN(runtime): network error: malformed packet
[ 23804.606177s] WARN(runtime): network error: malformed packet
[ 23810.118792s] INFO(runtime::session): no connection, starting idle kernel
[ 23810.124838s] INFO(runtime::session): no idle kernel found
[ 23811.912328s] INFO(runtime::session): new connection from 132.163.182.162:51464
[ 23811.963641s] INFO(runtime::kern_hwreq): resetting RTIO
[ 23833.387247s] WARN(runtime): network error: malformed packet
[ 23833.498708s] WARN(runtime): network error: malformed packet
[ 23834.569358s] WARN(runtime): network error: malformed packet
[ 23856.281737s] INFO(runtime::session): no connection, starting idle kernel
[ 23856.288163s] INFO(runtime::session): no idle kernel found
[ 23859.489129s] INFO(runtime::session): new connection from 132.163.182.162:51465
[ 23859.538776s] INFO(runtime::kern_hwreq): resetting RTIO
[ 23862.665890s] WARN(runtime): network error: malformed packet
[ 23863.238021s] WARN(runtime): network error: malformed packet
[ 23864.435143s] WARN(runtime): network error: malformed packet
[ 23873.966295s] INFO(runtime::session): no connection, starting idle kernel
[ 23873.972099s] INFO(runtime::session): no idle kernel found
[ 23900.164897s] WARN(runtime): network error: malformed packet
[ 23901.018328s] WARN(runtime): network error: malformed packet
[ 23901.445735s] WARN(runtime): network error: malformed packet
[ 23929.535799s] WARN(runtime): network error: malformed packet
[ 23930.094184s] WARN(runtime): network error: malformed packet
[ 23930.730374s] WARN(runtime): network error: malformed packet
[ 23958.804001s] WARN(runtime): network error: malformed packet
[ 23958.842125s] WARN(runtime): network error: malformed packet
[ 23960.656888s] WARN(runtime): network error: malformed packet
[ 24320.891686s] INFO(runtime::mgmt): new connection from 132.163.182.162:51509
[ 24351.798967s] INFO(runtime::mgmt): new connection from 132.163.182.162:51510
This "malformed packet" deal tends to show up often...
Does this need to wait for a release until #1007 is addressed or can it come out earlier?
You can try a release-3 dev package.
Another corelog dump, in case it is useful @whitequark. I started it from a bit before when the error occurred.
[105431.845797s] DEBUG(runtime::session): comm->host RpcRequest { async: false }
[105431.845961s] DEBUG(proto::rpc_proto): send<2>(Object)->Bool
[105431.849261s] DEBUG(runtime::session): comm<-host RpcReply { tag: [98] }
[105431.849961s] DEBUG(runtime::session): comm<-kern RpcRecvRequest(0x4fffd654)
[105431.850109s] DEBUG(proto::rpc_proto): recv ...->Bool
[105431.850234s] DEBUG(runtime::session): comm->kern RpcRecvReply(Ok(0))
[105432.652003s] DEBUG(runtime::session): comm<-kern (async RPC)
[105432.652111s] DEBUG(runtime::session): comm->host RpcRequest { async: true }
[105432.652269s] DEBUG(runtime::session): [0, 0, 0, 3, 79, 0, 0, 0, 8, 115, 0, 0, 0, 6, 99, 111, 117, 110, 116, 115, 102, 64, 44, 14, 86, 4, 24, 147, 117, 107, 0, 0, 0, 9, 98, 114, 111, 97, 100, 99, 97, 115, 116, 98, 64, 0, 0, 0, 0, 1, 110]
[105432.653328s] DEBUG(runtime::session): comm<-kern RpcSend { async: false, service: 4, tag: [79, 115, 97, 105, 107, 98, 107, 98, 107, 98, 58, 110], data: 0x4fffd618 }
[105432.653658s] DEBUG(runtime::session): comm->host RpcRequest { async: false }
[105432.653821s] DEBUG(proto::rpc_proto): send<4>(Object, String, Array(Int32), Keyword(Bool), Keyword(Bool), Keyword(Bool))->None
[105432.662564s] DEBUG(runtime::session): comm<-host RpcReply { tag: [110] }
[105432.663265s] DEBUG(runtime::session): comm<-kern RpcRecvRequest(0x4fffd638)
[105432.663416s] DEBUG(proto::rpc_proto): recv ...->None
[105432.663525s] DEBUG(runtime::session): comm->kern RpcRecvReply(Ok(0))
[105432.664370s] DEBUG(runtime::session): comm<-kern RpcSend { async: false, service: 2, tag: [79, 58, 98], data: 0x4fffd644 }
[105432.664622s] DEBUG(runtime::session): comm->host RpcRequest { async: false }
[105432.664785s] DEBUG(proto::rpc_proto): send<2>(Object)->Bool
[105432.668086s] DEBUG(runtime::session): comm<-host RpcReply { tag: [98] }
[105432.668785s] DEBUG(runtime::session): comm<-kern RpcRecvRequest(0x4fffd644)
[105432.668934s] DEBUG(proto::rpc_proto): recv ...->Bool
[105432.669057s] DEBUG(runtime::session): comm->kern RpcRecvReply(Ok(0))
[105433.470723s] DEBUG(runtime::session): comm<-kern (async RPC)
[105433.470832s] DEBUG(runtime::session): comm->host RpcRequest { async: true }
[105433.470989s] DEBUG(runtime::session): [0, 0, 0, 3, 79, 0, 0, 0, 8, 115, 0, 0, 0, 6, 99, 111, 117, 110, 116, 115, 102, 64, 43, 57, 88, 16, 98, 77, 211, 107, 0, 0, 0, 9, 98, 114, 111, 97, 100, 99, 97, 115, 116, 98, 64, 0, 0, 0, 0, 1, 110]
[105433.472048s] DEBUG(runtime::session): comm<-kern RpcSend { async: false, service: 4, tag: [79, 115, 97, 105, 107, 98, 107, 98, 107, 98, 58, 110], data: 0x4fffd608 }
[105433.472378s] DEBUG(runtime::session): comm->host RpcRequest { async: false }
[105433.472542s] DEBUG(proto::rpc_proto): send<4>(Object, String, Array(Int32), Keyword(Bool), Keyword(Bool), Keyword(Bool))->None
[105433.481280s] DEBUG(runtime::session): comm<-host RpcReply { tag: [110] }
[105433.481982s] DEBUG(runtime::session): comm<-kern RpcRecvRequest(0x4fffd628)
[105433.482133s] DEBUG(proto::rpc_proto): recv ...->None
[105433.482243s] DEBUG(runtime::session): comm->kern RpcRecvReply(Ok(0))
[105433.482978s] DEBUG(runtime::session): comm<-kern RpcSend { async: false, service: 2, tag: [79, 58, 98], data: 0x4fffd634 }
[105433.483228s] DEBUG(runtime::session): comm->host RpcRequest { async: false }
[105433.483392s] DEBUG(proto::rpc_proto): send<2>(Object)->Bool
[105433.486692s] DEBUG(runtime::session): comm<-host RpcReply { tag: [98] }
[105433.487393s] DEBUG(runtime::session): comm<-kern RpcRecvRequest(0x4fffd634)
[105433.487541s] DEBUG(proto::rpc_proto): recv ...->Bool
[105433.487665s] DEBUG(runtime::session): comm->kern RpcRecvReply(Ok(0))
[105433.987165s] DEBUG(smoltcp::iface::ethernet): address 132.163.182.162 not in neighbor cache, sending ARP request
[105434.289494s] DEBUG(runtime::session): comm<-kern (async RPC)
[105434.289602s] DEBUG(runtime::session): comm->host RpcRequest { async: true }
[105434.289760s] DEBUG(runtime::session): [0, 0, 0, 3, 79, 0, 0, 0, 8, 115, 0, 0, 0, 6, 99, 111, 117, 110, 116, 115, 102, 64, 42, 248, 212, 253, 243, 182, 70, 107, 0, 0, 0, 9, 98, 114, 111, 97, 100, 99, 97, 115, 116, 98, 64, 0, 0, 0, 0, 1, 110]
[105434.290667s] DEBUG(runtime::session): comm<-kern RpcSend { async: false, service: 4, tag: [79, 115, 97, 105, 107, 98, 107, 98, 107, 98, 58, 110], data: 0x4fffd5f8 }
[105434.290996s] DEBUG(runtime::session): comm->host RpcRequest { async: false }
[105434.291159s] DEBUG(proto::rpc_proto): send<4>(Object, String, Array(Int32), Keyword(Bool), Keyword(Bool), Keyword(Bool))->None
[105436.486324s] DEBUG(smoltcp::socket::tcp): #1:132.163.182.107:1381:132.163.182.162:56400: segment not in receive window (1254199323..1254199324 not intersecting 1254199324..1254264859), will send challenge ACK
[105436.486892s] DEBUG(smoltcp::socket::tcp): #1:132.163.182.107:1381:132.163.182.162:56400: timeout exceeded
[105436.487498s] INFO(runtime::session): no connection, starting idle kernel
[105436.493298s] INFO(runtime::session): no idle kernel found
[105463.020899s] DEBUG(smoltcp::iface::ethernet): non-unicast source address
[105463.021011s] DEBUG(smoltcp::iface::ethernet): cannot process ingress packet: malformed packet
[105463.021137s] DEBUG(smoltcp::iface::ethernet): packet dump follows:
EthernetII src=00-07-7d-c5-d1-ff dst=00-0a-35-03-18-d4 type=ARP
\ ARP type=Ethernet+IPv4 src=00-07-7d-c5-d1-ff/0.0.0.0 tgt=00-0a-35-03-18-d4/132.163.182.107 op=Request
[105463.021871s] WARN(runtime): network error: malformed packet
[105463.210595s] DEBUG(smoltcp::iface::ethernet): non-unicast source address
[105463.210705s] DEBUG(smoltcp::iface::ethernet): cannot process ingress packet: malformed packet
[105463.210828s] DEBUG(smoltcp::iface::ethernet): packet dump follows:
EthernetII src=f0-7f-06-e4-0e-ff dst=00-0a-35-03-18-d4 type=ARP
\ ARP type=Ethernet+IPv4 src=f0-7f-06-e4-0e-ff/0.0.0.0 tgt=00-0a-35-03-18-d4/132.163.182.107 op=Request
[105463.211548s] WARN(runtime): network error: malformed packet
[105463.937398s] DEBUG(smoltcp::iface::ethernet): non-unicast source address
[105463.937508s] DEBUG(smoltcp::iface::ethernet): cannot process ingress packet: malformed packet
[105463.937631s] DEBUG(smoltcp::iface::ethernet): packet dump follows:
EthernetII src=f0-7f-06-e4-0e-7f dst=00-0a-35-03-18-d4 type=ARP
\ ARP type=Ethernet+IPv4 src=f0-7f-06-e4-0e-7f/0.0.0.0 tgt=00-0a-35-03-18-d4/132.163.182.107 op=Request
[105463.938350s] WARN(runtime): network error: malformed packet
[105478.488442s] INFO(runtime::mgmt): new connection from 132.163.182.162:56411
This looks like the culprit:
[105436.486324s] DEBUG(smoltcp::socket::tcp): #1:132.163.182.107:1381:132.163.182.162:56400: segment not in receive window (1254199323..1254199324 not intersecting 1254199324..1254264859), will send challenge ACK
[105436.486892s] DEBUG(smoltcp::socket::tcp): #1:132.163.182.107:1381:132.163.182.162:56400: timeout exceeded
There have been some fixes upstream in smoltcp, I'll port ARTIQ to a newer version of it.
I've updated smoltcp in master, please retest.
@dhslichter Note that the potential fix isn't in release-3 yet.
@sbourdeauducq @whitequark sorry, too many things on the plate here. I will try to get this tested this week, would be nice to have the fix.
@dhslichter Is the problem still present? It should be fixed in release-3 as well now.
@dhslichter It takes us a lot of effort to fix those things; can you please test?
@sbourdeauducq sorry, Jonas Keller has tested it here and we didn't see any more of the previous communications issues. Let's close.
Good, thanks!
Will update/reopen if this rears its head again, but we will call it good for now.
Running artiq 3.6 with a KC705 at NIST, we get errors every ~10-20 minutes where communication between core device and PC fails and the current kernel is ejected. Usually the next experiment submitted is OK. Today it's gone farther and will not receive new kernels. Rebooting the FPGA fixes the issue. Traceback below: