m-labs / artiq

A leading-edge control system for quantum information experiments
https://m-labs.hk/artiq
GNU Lesser General Public License v3.0
434 stars 201 forks source link

RPC issue causing kernel to stuck in running state under certain conditions #1799

Closed stevefan1999 closed 2 years ago

stevefan1999 commented 2 years ago

Bug Report

Issue Details

This bug is seen plaguing the build #75623 and build #73252 previously, and it is seemingly repeatable under specific conditions. The bug lead us to here, which is where the unit test stopped in both Hydra executions: https://github.com/m-labs/artiq/blob/4f723e19a6bc9e84c0eb627461fe625007307504/artiq/test/coredevice/test_embedding.py#L90

Indeed, after commenting out that line of code, the test pipeline continues smoothly. But after some close inspections, it seems like this probably isn't the entire story. I have written a Minimal Reproducible Code that replicated this issue pretty consistently.

Minimal reproducible code: ```python3 import numpy from artiq.experiment import * class KernelRoundtripStuck(EnvExperiment): def build(self): self.setattr_device("core") @kernel def roundtrip(self, obj, fn): fn(obj) def run(self): lam = lambda _: None for i in range(1, 5): self.roundtrip(None, lam) self.roundtrip(numpy.array([1, 2, 3], dtype=numpy.int32), lam) self.roundtrip(numpy.array([1.0, 2.0, 3.0]), lam) self.roundtrip(numpy.array(["a", "b", "c"]), lam) self.roundtrip(numpy.array([[1, 2], [3, 4]], dtype=numpy.int32), lam) print("passed") ```

Expected Behavior

It should pass no matter what

Actual (undesired) Behavior

It stuck like this in the log below.

Click here to see the UART log ``` [ 36104.746180s] INFO(runtime::session): no connection, starting idle kernel [ 36104.751942s] INFO(runtime::session): no idle kernel found [ 36128.716513s] INFO(runtime::session): new connection from :47884 [ 36128.723274s] DEBUG(runtime::session): comm<-host SystemInfo [ 36128.727924s] DEBUG(runtime::session): comm->host SystemInfo { ident: "7.7901.4f723e19.beta;nist_clock", finished_cleanly: true } [ 36128.741635s] DEBUG(runtime::session): comm<-host LoadLibrary(...) [ 36128.770242s] DEBUG(runtime::session): comm->kern LoadRequest(...) [ 36128.775442s] DEBUG(runtime::session): comm<-kern LoadReply(Ok(())) [ 36128.781197s] DEBUG(runtime::session): comm->host LoadCompleted [ 36128.788080s] DEBUG(runtime::session): comm<-host RunKernel [ 36128.792818s] DEBUG(runtime::session): comm<-kern RpcSend { async: false, service: 2, tag: [102, 58, 110], data: 0x4ffffeb4 } [ 36128.803688s] DEBUG(runtime::session): comm->host RpcRequest { async: false } [ 36128.810740s] DEBUG(proto_artiq::rpc_proto): send<2>(Float64)->None [ 36129.818758s] DEBUG(runtime::session): comm<-host RpcReply { tag: [110] } [ 36129.824387s] DEBUG(runtime::session): comm<-kern RpcRecvRequest(0x4ffffec0) [ 36129.831094s] DEBUG(proto_artiq::rpc_proto): recv ...->None [ 36129.836569s] DEBUG(runtime::session): comm->kern RpcRecvReply(Ok(0)) [ 36129.843004s] DEBUG(runtime::session): comm<-kern (async RPC) [ 36129.848546s] DEBUG(runtime::session): comm->host RpcRequest { async: true } [ 36129.855515s] DEBUG(runtime::session): [1, 0, 0, 0, 110, 0, 1, 0, 0, 0, 110] [ 36129.862691s] DEBUG(runtime::session): comm<-kern RpcFlush [ 36129.868170s] DEBUG(runtime::session): comm<-kern RunFinished [ 36129.873476s] DEBUG(runtime::session): comm->host KernelFinished { async_errors: 0 } [ 36129.924269s] DEBUG(runtime::session): comm<-host LoadLibrary(...) [ 36129.952773s] DEBUG(runtime::session): comm->kern LoadRequest(...) [ 36129.957975s] DEBUG(runtime::session): comm<-kern LoadReply(Ok(())) [ 36129.963728s] DEBUG(runtime::session): comm->host LoadCompleted [ 36129.970724s] DEBUG(runtime::session): comm<-host RunKernel [ 36129.975345s] DEBUG(runtime::session): comm<-kern RpcSend { async: false, service: 2, tag: [97, 1, 105, 58, 110], data: 0x4ffffea4 } [ 36129.986826s] DEBUG(runtime::session): comm->host RpcRequest { async: false } [ 36129.993878s] DEBUG(proto_artiq::rpc_proto): send<2>(Array(Int32, 1))->None [ 36130.002046s] DEBUG(runtime::session): comm<-host RpcReply { tag: [110] } [ 36130.007677s] DEBUG(runtime::session): comm<-kern RpcRecvRequest(0x4ffffeb0) [ 36130.014383s] DEBUG(proto_artiq::rpc_proto): recv ...->None [ 36130.019859s] DEBUG(runtime::session): comm->kern RpcRecvReply(Ok(0)) [ 36130.026294s] DEBUG(runtime::session): comm<-kern (async RPC) [ 36130.031835s] DEBUG(runtime::session): comm->host RpcRequest { async: true } [ 36130.038805s] DEBUG(runtime::session): [1, 0, 0, 0, 110, 0, 1, 0, 0, 0, 110] [ 36130.045980s] DEBUG(runtime::session): comm<-kern RpcFlush [ 36130.051460s] DEBUG(runtime::session): comm<-kern RunFinished [ 36130.056765s] DEBUG(runtime::session): comm->host KernelFinished { async_errors: 0 } [ 36130.150699s] DEBUG(runtime::session): comm<-host LoadLibrary(...) [ 36130.179207s] DEBUG(runtime::session): comm->kern LoadRequest(...) [ 36130.184410s] DEBUG(runtime::session): comm<-kern LoadReply(Ok(())) [ 36130.190162s] DEBUG(runtime::session): comm->host LoadCompleted [ 36130.197159s] DEBUG(runtime::session): comm<-host RunKernel [ 36130.201782s] DEBUG(runtime::session): comm<-kern RpcSend { async: false, service: 2, tag: [97, 1, 102, 58, 110], data: 0x4ffffe94 } [ 36130.213260s] DEBUG(runtime::session): comm->host RpcRequest { async: false } [ 36130.220312s] DEBUG(proto_artiq::rpc_proto): send<2>(Array(Float64, 1))->None [ 36130.229038s] DEBUG(runtime::session): comm<-host RpcReply { tag: [110] } [ 36130.234665s] DEBUG(runtime::session): comm<-kern RpcRecvRequest(0x4ffffea0) [ 36130.241375s] DEBUG(proto_artiq::rpc_proto): recv ...->None [ 36130.246850s] DEBUG(runtime::session): comm->kern RpcRecvReply(Ok(0)) [ 36130.253285s] DEBUG(runtime::session): comm<-kern (async RPC) [ 36130.258826s] DEBUG(runtime::session): comm->host RpcRequest { async: true } [ 36130.265796s] DEBUG(runtime::session): [1, 0, 0, 0, 110, 0, 1, 0, 0, 0, 110] [ 36130.272971s] DEBUG(runtime::session): comm<-kern RpcFlush [ 36130.278295s] DEBUG(runtime::session): comm<-kern RunFinished [ 36130.283756s] DEBUG(runtime::session): comm->host KernelFinished { async_errors: 0 } [ 36130.375952s] DEBUG(runtime::session): comm<-host LoadLibrary(...) [ 36130.404463s] DEBUG(runtime::session): comm->kern LoadRequest(...) [ 36130.409666s] DEBUG(runtime::session): comm<-kern LoadReply(Ok(())) [ 36130.415418s] DEBUG(runtime::session): comm->host LoadCompleted [ 36130.422475s] DEBUG(runtime::session): comm<-host RunKernel [ 36130.427039s] DEBUG(runtime::session): comm<-kern RpcSend { async: false, service: 2, tag: [97, 1, 115, 58, 110], data: 0x4ffffe94 } [ 36130.438517s] DEBUG(runtime::session): comm->host RpcRequest { async: false } [ 36130.445569s] DEBUG(proto_artiq::rpc_proto): send<2>(Array(String, 1))->None [ 36130.454447s] DEBUG(runtime::session): comm<-host RpcReply { tag: [110] } [ 36130.460078s] DEBUG(runtime::session): comm<-kern RpcRecvRequest(0x4ffffea0) [ 36130.466784s] DEBUG(proto_artiq::rpc_proto): recv ...->None [ 36130.472260s] DEBUG(runtime::session): comm->kern RpcRecvReply(Ok(0)) [ 36130.478694s] DEBUG(runtime::session): comm<-kern (async RPC) [ 36130.484236s] DEBUG(runtime::session): comm->host RpcRequest { async: true } [ 36130.491206s] DEBUG(runtime::session): [1, 0, 0, 0, 110, 0, 1, 0, 0, 0, 110] [ 36130.498569s] DEBUG(runtime::session): comm<-kern RpcFlush [ 36130.503705s] DEBUG(runtime::session): comm<-kern RunFinished [ 36130.509166s] DEBUG(runtime::session): comm->host KernelFinished { async_errors: 0 } [ 36130.608351s] DEBUG(runtime::session): comm<-host LoadLibrary(...) [ 36130.636850s] DEBUG(runtime::session): comm->kern LoadRequest(...) [ 36130.642115s] DEBUG(runtime::session): comm<-kern LoadReply(Ok(())) [ 36130.647805s] DEBUG(runtime::session): comm->host LoadCompleted [ 36130.654682s] DEBUG(runtime::session): comm<-host RunKernel [ 36130.659426s] DEBUG(runtime::session): comm<-kern RpcSend { async: false, service: 2, tag: [97, 2, 105, 58, 110], data: 0x4ffffe74 } [ 36130.670904s] DEBUG(runtime::session): comm->host RpcRequest { async: false } [ 36130.677956s] DEBUG(proto_artiq::rpc_proto): send<2>(Array(Int32, 2))->None [ 36130.686695s] DEBUG(runtime::session): comm<-host RpcReply { tag: [110] } [ 36130.692325s] DEBUG(runtime::session): comm<-kern RpcRecvRequest(0x4ffffe80) [ 36130.699032s] DEBUG(proto_artiq::rpc_proto): recv ...->None [ 36130.704507s] DEBUG(runtime::session): comm->kern RpcRecvReply(Ok(0)) [ 36130.710942s] DEBUG(runtime::session): comm<-kern (async RPC) [ 36130.716484s] DEBUG(runtime::session): comm->host RpcRequest { async: true } [ 36130.723453s] DEBUG(runtime::session): [1, 0, 0, 0, 110, 0, 1, 0, 0, 0, 110] [ 36130.730628s] DEBUG(runtime::session): comm<-kern RpcFlush [ 36130.735952s] DEBUG(runtime::session): comm<-kern RunFinished [ 36130.741413s] DEBUG(runtime::session): comm->host KernelFinished { async_errors: 0 } [ 36130.825304s] DEBUG(runtime::session): comm<-host LoadLibrary(...) [ 36130.853818s] DEBUG(runtime::session): comm->kern LoadRequest(...) [ 36130.859019s] DEBUG(runtime::session): comm<-kern LoadReply(Ok(())) [ 36130.864774s] DEBUG(runtime::session): comm->host LoadCompleted [ 36130.871831s] DEBUG(runtime::session): comm<-host RunKernel [ 36130.876394s] DEBUG(runtime::session): comm<-kern RpcSend { async: false, service: 2, tag: [102, 58, 110], data: 0x4ffffeb4 } [ 36130.887264s] DEBUG(runtime::session): comm->host RpcRequest { async: false } [ 36130.894316s] DEBUG(proto_artiq::rpc_proto): send<2>(Float64)->None [ 36131.902933s] DEBUG(runtime::session): comm<-host RpcReply { tag: [110] } [ 36131.908560s] DEBUG(runtime::session): comm<-kern RpcRecvRequest(0x4ffffec0) [ 36131.915269s] DEBUG(proto_artiq::rpc_proto): recv ...->None [ 36131.920744s] DEBUG(runtime::session): comm->kern RpcRecvReply(Ok(0)) [ 36131.927179s] DEBUG(runtime::session): comm<-kern (async RPC) [ 36131.932722s] DEBUG(runtime::session): comm->host RpcRequest { async: true } [ 36131.939690s] DEBUG(runtime::session): [1, 0, 0, 0, 110, 0, 1, 0, 0, 0, 110] [ 36131.946866s] DEBUG(runtime::session): comm<-kern RpcFlush [ 36131.952346s] DEBUG(runtime::session): comm<-kern RunFinished [ 36131.957651s] DEBUG(runtime::session): comm->host KernelFinished { async_errors: 0 } [ 36132.016146s] DEBUG(runtime::session): comm<-host LoadLibrary(...) [ 36132.044650s] DEBUG(runtime::session): comm->kern LoadRequest(...) [ 36132.049855s] DEBUG(runtime::session): comm<-kern LoadReply(Ok(())) [ 36132.055607s] DEBUG(runtime::session): comm->host LoadCompleted [ 36132.062605s] DEBUG(runtime::session): comm<-host RunKernel [ 36132.067226s] DEBUG(runtime::session): comm<-kern RpcSend { async: false, service: 2, tag: [97, 1, 105, 58, 110], data: 0x4ffffea4 } [ 36132.078706s] DEBUG(runtime::session): comm->host RpcRequest { async: false } [ 36132.085758s] DEBUG(proto_artiq::rpc_proto): send<2>(Array(Int32, 1))->None [ 36132.094466s] DEBUG(runtime::session): comm<-host RpcReply { tag: [110] } [ 36132.100096s] DEBUG(runtime::session): comm<-kern RpcRecvRequest(0x4ffffeb0) [ 36132.106803s] DEBUG(proto_artiq::rpc_proto): recv ...->None [ 36132.112279s] DEBUG(runtime::session): comm->kern RpcRecvReply(Ok(0)) [ 36132.118714s] DEBUG(runtime::session): comm<-kern (async RPC) [ 36132.124255s] DEBUG(runtime::session): comm->host RpcRequest { async: true } [ 36132.131225s] DEBUG(runtime::session): [1, 0, 0, 0, 110, 0, 1, 0, 0, 0, 110] [ 36132.138400s] DEBUG(runtime::session): comm<-kern RpcFlush [ 36132.143880s] DEBUG(runtime::session): comm<-kern RunFinished [ 36132.149186s] DEBUG(runtime::session): comm->host KernelFinished { async_errors: 0 } [ 36132.239708s] DEBUG(runtime::session): comm<-host LoadLibrary(...) [ 36132.268219s] DEBUG(runtime::session): comm->kern LoadRequest(...) [ 36132.273422s] DEBUG(runtime::session): comm<-kern LoadReply(Ok(())) [ 36132.279174s] DEBUG(runtime::session): comm->host LoadCompleted [ 36132.286174s] DEBUG(runtime::session): comm<-host RunKernel [ 36132.290795s] DEBUG(runtime::session): comm<-kern RpcSend { async: false, service: 2, tag: [97, 1, 102, 58, 110], data: 0x4ffffe94 } [ 36132.302272s] DEBUG(runtime::session): comm->host RpcRequest { async: false } [ 36132.309325s] DEBUG(proto_artiq::rpc_proto): send<2>(Array(Float64, 1))->None [ 36132.318054s] DEBUG(runtime::session): comm<-host RpcReply { tag: [110] } [ 36132.323680s] DEBUG(runtime::session): comm<-kern RpcRecvRequest(0x4ffffea0) [ 36132.330390s] DEBUG(proto_artiq::rpc_proto): recv ...->None [ 36132.335865s] DEBUG(runtime::session): comm->kern RpcRecvReply(Ok(0)) [ 36132.342300s] DEBUG(runtime::session): comm<-kern (async RPC) [ 36132.347842s] DEBUG(runtime::session): comm->host RpcRequest { async: true } [ 36132.354812s] DEBUG(runtime::session): [1, 0, 0, 0, 110, 0, 1, 0, 0, 0, 110] [ 36132.361987s] DEBUG(runtime::session): comm<-kern RpcFlush [ 36132.367310s] DEBUG(runtime::session): comm<-kern RunFinished [ 36132.372771s] DEBUG(runtime::session): comm->host KernelFinished { async_errors: 0 } [ 36132.465532s] DEBUG(runtime::session): comm<-host LoadLibrary(...) [ 36132.494040s] DEBUG(runtime::session): comm->kern LoadRequest(...) [ 36132.499243s] DEBUG(runtime::session): comm<-kern LoadReply(Ok(())) [ 36132.504995s] DEBUG(runtime::session): comm->host LoadCompleted [ 36132.512051s] DEBUG(runtime::session): comm<-host RunKernel [ 36132.516612s] DEBUG(runtime::session): comm<-kern RpcSend { async: false, service: 2, tag: [97, 1, 115, 58, 110], data: 0x4ffffe94 } [ 36132.528093s] DEBUG(runtime::session): comm->host RpcRequest { async: false } [ 36132.535145s] DEBUG(proto_artiq::rpc_proto): send<2>(Array(String, 1))->None [ 36132.544214s] DEBUG(runtime::session): comm<-host RpcReply { tag: [110] } [ 36132.549843s] DEBUG(runtime::session): comm<-kern RpcRecvRequest(0x4ffffea0) [ 36132.556550s] DEBUG(proto_artiq::rpc_proto): recv ...->None [ 36132.562025s] DEBUG(runtime::session): comm->kern RpcRecvReply(Ok(0)) [ 36132.568460s] DEBUG(runtime::session): comm<-kern (async RPC) [ 36132.574002s] DEBUG(runtime::session): comm->host RpcRequest { async: true } [ 36132.580972s] DEBUG(runtime::session): [1, 0, 0, 0, 110, 0, 1, 0, 0, 0, 110] [ 36132.588147s] DEBUG(runtime::session): comm<-kern RpcFlush [ 36132.593470s] DEBUG(runtime::session): comm<-kern RunFinished [ 36132.598931s] DEBUG(runtime::session): comm->host KernelFinished { async_errors: 0 } [ 36132.695840s] DEBUG(runtime::session): comm<-host LoadLibrary(...) [ 36132.724339s] DEBUG(runtime::session): comm->kern LoadRequest(...) [ 36132.729603s] DEBUG(runtime::session): comm<-kern LoadReply(Ok(())) [ 36132.735294s] DEBUG(runtime::session): comm->host LoadCompleted [ 36132.742351s] DEBUG(runtime::session): comm<-host RunKernel ```

It seems like this issue is from numpy.

Your System (omit irrelevant parts)

It is also reported there is some hardware fluke on KC705. Maybe we should take this to other hardwares for sure.

Notice

~Commit up to df6aeb99f6da58dc76deb30f92a32a2bcf1d6589 also succumbed to this~ It seems fine after I rebuilt the gateware, the new problematic commit is 20e079a381b777550ce7872151b17fb4a11cd7ae

7209e6f2798c70333e115d86c631a34f19b20be5 is seemingly a good commit

sbourdeauducq commented 2 years ago

It is also reported there is some hardware fluke on KC705.

Other than the problems due to the IP settings?

stevefan1999 commented 2 years ago

Somehow this is related to #1798, and I can pass the MRC building with PR #1802, but I should try more rounds of testing first.

stevefan1999 commented 2 years ago

Case resolved with #1802