microsoft / iqsharp

Microsoft's IQ# Server.
https://docs.microsoft.com/quantum
MIT License

IQ# incompatible with pyzmq>=20.0.0 #376

Closed rmshaffer closed 2 years ago

rmshaffer commented 3 years ago

Describe the bug

When running the IQ# kernel in an environment with pyzmq>=20.0.0, we see frequent intermittent hangs where the kernel seems to stop responding.

It seems there is an unhandled exception being thrown from NetMQ code, which is consumed by Microsoft.Jupyter.Core. For example, in one instance I saw the following unhandled exception stack trace:

Unhandled exception. System.Net.Sockets.SocketException (10035): A non-blocking socket operation could not be completed immediately.
   at NetMQ.Core.Mailbox.TryRecv(Int32 timeout, Command& command)
   at NetMQ.Core.SocketBase.ProcessCommands(Int32 timeout, Boolean throttle)
   at NetMQ.Core.SocketBase.TryRecv(Msg& msg, TimeSpan timeout)
   at NetMQ.NetMQSocket.TryReceive(Msg& msg, TimeSpan timeout)
   at NetMQ.ReceivingSocketExtensions.Receive(IReceivingSocket socket, Msg& msg)
   at NetMQ.ReceivingSocketExtensions.ReceiveMultipartBytes(IReceivingSocket socket, List`1& frames, Int32 expectedFrameCount)
   at Microsoft.Jupyter.Core.Extensions.ReceiveMessage(NetMQSocket socket, KernelContext context, Encoding encoding)
   at Microsoft.Jupyter.Core.ShellServer.EventLoop(NetMQSocket socket)
   at Microsoft.Jupyter.Core.ShellServer.<Start>b__12_1()
   at System.Threading.ThreadHelper.ThreadStart_Context(Object state)
   at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state)
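For readers unfamiliar with the error in this trace: socket error 10035 is WSAEWOULDBLOCK, the Windows code reported when a non-blocking socket has nothing ready yet. The following is a minimal Python sketch of that condition using only the standard library. It is an illustration, not NetMQ's or Microsoft.Jupyter.Core's actual code, and the helper names `recv_nonblocking` and `demo` are hypothetical.

```python
import select
import socket


def recv_nonblocking(sock, size=1024):
    """Try to read from a non-blocking socket; return None if no data is ready."""
    try:
        return sock.recv(size)
    except BlockingIOError:
        # "Would block": EAGAIN/EWOULDBLOCK on POSIX, WSAEWOULDBLOCK (10035)
        # on Windows. It means "try again later", not a broken connection.
        return None


def demo():
    """Return (result before any data was sent, result after data arrived)."""
    a, b = socket.socketpair()
    try:
        a.setblocking(False)
        before = recv_nonblocking(a)          # nothing has been sent yet
        b.sendall(b"ping")
        select.select([a], [], [], 1.0)       # wait until the data is readable
        after = recv_nonblocking(a)
        return before, after
    finally:
        a.close()
        b.close()
```

In the sketch the would-block condition is caught and treated as "no data yet". In the trace above, the same condition escaped as an unhandled exception on the shell server thread.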

To Reproduce

Steps to reproduce the behavior:

  1. Install IQ# in an environment with pyzmq>=20.0.0.
  2. Run some complex Q# notebook or Python script with a large number of cell executions/calls to the IQ# kernel.
  3. Observe that sometimes the notebook/script execution will simply hang and never complete.
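Because the hang is intermittent, one way to catch it in a script is to run each kernel call under a watchdog. The sketch below is an assumption about how one might do that, not part of the original report. `run_with_timeout` and `stress` are hypothetical helpers, and `func` stands in for whatever call into the IQ# kernel is being exercised.

```python
import threading


def run_with_timeout(func, timeout_s, *args, **kwargs):
    """Run func in a daemon thread; report 'ok', 'error', or 'hung'."""
    outcome = {}

    def target():
        try:
            outcome["value"] = func(*args, **kwargs)
        except Exception as exc:  # record the failure instead of losing it
            outcome["error"] = exc

    worker = threading.Thread(target=target, daemon=True)
    worker.start()
    worker.join(timeout_s)
    if worker.is_alive():
        # The call did not return in time; the daemon thread is abandoned.
        return "hung", None
    if "error" in outcome:
        return "error", outcome["error"]
    return "ok", outcome["value"]


def stress(func, iterations, timeout_s):
    """Call func repeatedly; return (iteration, status) for the first failure, else None."""
    for i in range(iterations):
        status, _ = run_with_timeout(func, timeout_s)
        if status != "ok":
            return i, status
    return None
```

A daemon thread is used so that a call that never returns does not keep the test process alive. The trade-off is that the stuck call is abandoned rather than cancelled.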

We especially observed this when our CI runs were installing pyzmq>=20.0.0.

Expected behavior

No hangs. All commands should complete.

System information

Additional context

This has been mitigated for now in #371 by adding pyzmq<20.0.0 to the requirements of the IQ# Python and Conda packages, as well as pinning our CI builds to pyzmq==19.0.2. But it would be nice to fix this underlying issue so that we can remove that restriction and consume newer versions of pyzmq as needed.
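To check whether an environment falls inside the pyzmq<20.0.0 pin described above, a small standard-library check can help. This is a sketch under stated assumptions: the helper names are hypothetical, and the version parsing only compares leading numeric components rather than implementing full version-specifier semantics.

```python
import re


def parse_version(text):
    """Return the leading numeric components of a version string as a tuple of ints."""
    parts = []
    for piece in text.split("."):
        match = re.match(r"\d+", piece)
        if not match:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def satisfies_pin(version, upper_bound="20.0.0"):
    """True if version is strictly below upper_bound (the pyzmq<20.0.0 pin)."""
    return parse_version(version) < parse_version(upper_bound)


def installed_pyzmq_version():
    """Return the installed pyzmq version string, or None if it is not installed."""
    try:
        from importlib.metadata import PackageNotFoundError, version
    except ImportError:  # Python older than 3.8
        return None
    try:
        return version("pyzmq")
    except PackageNotFoundError:
        return None
```

For example, `satisfies_pin("19.0.2")` is true for the version the CI builds were pinned to, while `satisfies_pin("20.0.0")` is false.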

rmshaffer commented 3 years ago

One possibility to investigate: Upgrade Microsoft.Jupyter.Core to consume a recent version of NetMQ and take advantage of new thread-safe socket implementation from https://github.com/zeromq/netmq/pull/871.

cgranade commented 3 years ago

> One possibility to investigate: Upgrade Microsoft.Jupyter.Core to consume a recent version of NetMQ and take advantage of new thread-safe socket implementation from zeromq/netmq#871.

I wonder if this may also be the root cause of #510?

rmshaffer commented 3 years ago

> > One possibility to investigate: Upgrade Microsoft.Jupyter.Core to consume a recent version of NetMQ and take advantage of new thread-safe socket implementation from zeromq/netmq#871.
>
> I wonder if this may also be the root cause of #510?

It does indeed sound very similar. I don't know that the proposed upgrade of NetMQ would actually resolve this, but it's my best guess as to what might be going wrong here since, at least from what I observed, the process appeared to be hung in that part of the NetMQ stack.

anpaz commented 2 years ago

This should be fixed by #708, which has been released as part of QDK 0.25.228311 and later.