jupyter-xeus / xeus-python

Jupyter kernel for the Python programming language

Hang when starting debugging #275

Open lassoan opened 4 years ago

lassoan commented 4 years ago

When using 3D Slicer's xeus-python kernel and enabling debugging in Jupyterlab, the kernel hangs in xdebugger::process_request_impl after it sends a header and waits for acknowledgment:

https://github.com/jupyter-xeus/xeus-python/blob/5d95533d51f28e4f156c3fda6a349090749078d3/src/xdebugger.cpp#L88

The debugger works when using the official xpython kernel, so maybe it has something to do with how xeus-python is used in Slicer, i.e., we process zmq messages by polling with a timer:

https://github.com/Slicer/SlicerJupyter/blob/cb5505785f604d7d3af665dee3fea53bf052e404/JupyterKernel/xSlicerServer.cxx#L31
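
For reference, such timer-driven polling looks roughly like the sketch below. The socket names, the empty handler bodies, and the 10 ms interval are illustrative only, not the actual SlicerJupyter code; the zero poll timeout is what keeps the Qt event loop from blocking.

```cpp
// Hypothetical sketch of timer-driven ZeroMQ polling on the Qt event loop.
// Names and interval are illustrative, not the actual SlicerJupyter code.
#include <chrono>
#include <QObject>
#include <QTimer>
#include <zmq.hpp>

void start_polling(zmq::socket_t& shell, zmq::socket_t& control, QObject* parent)
{
    auto* timer = new QTimer(parent);
    QObject::connect(timer, &QTimer::timeout, [&shell, &control]() {
        zmq::pollitem_t items[] = {
            { static_cast<void*>(shell),   0, ZMQ_POLLIN, 0 },
            { static_cast<void*>(control), 0, ZMQ_POLLIN, 0 }
        };
        // Zero timeout: never block the Qt event loop, only drain messages
        // that are already pending on the sockets.
        zmq::poll(items, 2, std::chrono::milliseconds(0));
        if (items[0].revents & ZMQ_POLLIN) { /* read and dispatch shell message */ }
        if (items[1].revents & ZMQ_POLLIN) { /* read and dispatch control message */ }
    });
    timer->start(10); // poll every ~10 ms
}
```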

Have you experienced a problem like this? Do you have any suggestions on how to address it?

SylvainCorlay commented 4 years ago

I would need to look into this, but it is possible that the (current) concurrency model of the Slicer kernel is the reason for this issue.

Long story short, in order to be able to process debug messages while Python code is running, we run the control channel and the shell channel in two different native threads.
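
Roughly, that model looks like the following sketch, where the control channel gets its own native thread while the shell channel stays on the main thread. Class and member names here are simplified placeholders, not the actual xeus xserver API.

```cpp
// Simplified sketch of the two-thread concurrency model described above.
// Names are placeholders; the real implementation lives in xeus' xserver classes.
#include <atomic>
#include <thread>
#include <zmq.hpp>

class two_thread_server
{
public:
    void run()
    {
        // The control channel gets its own native thread so that debug and
        // interrupt requests can be handled while the shell is busy executing code.
        m_control_thread = std::thread([this]() { poll_channel(m_control); });
        // The shell channel (execute_request, etc.) is processed on the main thread.
        poll_channel(m_shell);
        m_control_thread.join();
    }

private:
    void poll_channel(zmq::socket_t& socket)
    {
        while (!m_stopped)
        {
            zmq::message_t msg;
            if (socket.recv(msg, zmq::recv_flags::none))
            {
                // dispatch the message to the appropriate handler
            }
        }
    }

    zmq::socket_t m_shell;
    zmq::socket_t m_control;
    std::thread m_control_thread;
    std::atomic<bool> m_stopped{false};
};
```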

Johan put together two xserver implementations:

I presume that, theoretically, Slicer Jupyter could process control messages on the Qt event loop. The only limitation is that, for example, the message for adding a breakpoint would only be processed after the current execution request has finished running, so adding a breakpoint in the middle of a long-running loop would not be possible.

So, without having looked into the details, we may need to check:

- that the Slicer kernel processes control messages,
- if yes, whether we end up in a deadlock situation for some reason,
- then, if the debugging experience is not perfect, whether the control channel can be split into a different thread like in xpython.

lassoan commented 4 years ago

Thanks for the information, it was very useful. I did some more debugging.

> So, without having looked into the details, we may need to check that the Slicer kernel processes control messages

Yes, it does. It happens on the main thread.

> if yes, whether we end up in a deadlock situation for some reason

I think it is a deadlock. Probably ptvsd wants to do something on the main thread, but since the main thread is blocked (it is waiting for the response from ptvsd), that does not get executed.

The deadlock could be resolved by not blocking execution while we are waiting for an answer: let the execution continue and keep checking, in the regularly called poll() method, whether ptvsd has responded.
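
As an illustration, the blocking wait could be replaced by something like the following non-blocking check, retried from the regularly called poll() method. The helper and its signature are hypothetical; only m_ptvsd_header mirrors the member used in xdebugger.cpp.

```cpp
// Hypothetical non-blocking variant of the header acknowledgment wait.
// try_receive_header is not real xeus-python code; m_ptvsd_header mirrors
// the member used in xdebugger.cpp.
#include <chrono>
#include <zmq.hpp>

bool try_receive_header(zmq::socket_t& m_ptvsd_header, zmq::message_t& raw_header)
{
    zmq::pollitem_t item{ static_cast<void*>(m_ptvsd_header), 0, ZMQ_POLLIN, 0 };
    // Zero timeout: do not block the main thread while waiting for the acknowledgment.
    zmq::poll(&item, 1, std::chrono::milliseconds(0));
    if (item.revents & ZMQ_POLLIN)
    {
        m_ptvsd_header.recv(raw_header, zmq::recv_flags::none);
        return true; // the acknowledgment has arrived, the request can continue
    }
    return false;    // nothing yet, let the caller keep servicing the event loop
}
```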

> then, if the debugging experience is not perfect, to check whether the control channel can be split into a different thread like in xpython

While the code is running, we could call poll() every now and then, which would allow control messages to be received and processed.

JohanMabille commented 4 years ago

When xeus-python starts the debugger, two things happen: the debug adapter process (ptvsd) is spawned, and the debugger server is started within the same process (on the main thread). Then the debug adapter connects to the server. Once that is done, it is ready to reply to requests sent by the client.
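
In outline, the startup order described above is roughly the following. All function names below are illustrative stubs, not actual xeus-python code.

```cpp
// Rough outline of the debugger startup order described above.
// Every function here is an illustrative stub, not actual xeus-python code.
#include <iostream>

void spawn_ptvsd_process()         { std::cout << "spawn the debug adapter (ptvsd) child process\n"; }
void start_debug_server_socket()   { std::cout << "start the debugger server in this process, on the main thread\n"; }
void wait_for_adapter_connection() { std::cout << "wait for the adapter to connect back to the server\n"; }

void start_debugger()
{
    spawn_ptvsd_process();          // 1. launch the debug adapter
    start_debug_server_socket();    // 2. start the server within the same process
    wait_for_adapter_connection();  // 3. the adapter connects; client requests can now be answered
}
```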

If you block the main thread before the debug server has started and notified the debug adapter, I guess the latter cannot process requests from the client, hence the "deadlock".

lassoan commented 4 years ago

The application does not block the main thread; it is blocked by the m_ptvsd_header.recv(raw_header); call in xeus-python/src/xdebugger.cpp.

Could you update xdebugger so that it does not block the thread while waiting for this response? It could work the same way as other command processing done on the main thread: check with poll whether a message has arrived, and if so, read and process it.

Thank you very much for all your excellent work. Having this interactive debugging working would be just the cherry on top.

JohanMabille commented 4 years ago

So at this point the debugger server should have started (otherwise we would have the same issue in xpython itself). Also notice that the main thread does not talk directly to ptvsd, but to an intermediate thread instantiated here.

This call is supposed to be blocking: the debugger needs to set the header before sending the body of the debugging request (this makes it possible to keep track of the message chain in the Jupyter protocol). Until it receives an answer from ptvsd (or, strictly speaking, from the thread talking to ptvsd), it should not process any other message or perform any other task. I would prefer to understand what is going on before complicating the design here, because there might be another, deeper reason for having this lock; replacing this call with a non-blocking one might not solve the issue in the end.
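
To make the intended flow concrete, the handshake is roughly the following. This is a simplified sketch: m_ptvsd_header and m_ptvsd_socket mirror members used in xdebugger.cpp, but the function name and signature are illustrative, not the real process_request_impl.

```cpp
// Simplified sketch of the blocking header handshake described above.
// m_ptvsd_header / m_ptvsd_socket mirror members used in xdebugger.cpp;
// the function itself is illustrative, not the real process_request_impl.
#include <string>
#include <zmq.hpp>

std::string send_debug_request(zmq::socket_t& m_ptvsd_header,
                               zmq::socket_t& m_ptvsd_socket,
                               const std::string& header,
                               const std::string& request)
{
    // 1. Send the Jupyter message header so the message chain can be tracked.
    m_ptvsd_header.send(zmq::buffer(header), zmq::send_flags::none);

    // 2. Block until the thread talking to ptvsd acknowledges the header.
    //    This is the recv() call on which the Slicer kernel hangs.
    zmq::message_t raw_header;
    m_ptvsd_header.recv(raw_header, zmq::recv_flags::none);

    // 3. Only then send the body of the debug request and wait for the reply.
    m_ptvsd_socket.send(zmq::buffer(request), zmq::send_flags::none);
    zmq::message_t raw_reply;
    m_ptvsd_socket.recv(raw_reply, zmq::recv_flags::none);
    return raw_reply.to_string();
}
```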

Can you check that the thread running ptvsd_client never blocks? I remember we had some random deadlocks on the CI, but I could never reproduce them locally, and I was not sure whether they were due to the kernel or to the rough implementation of the client in the tests.

lassoan commented 4 years ago

OK, thanks, I'll check this and get back to you.