Python 3.5.2 on Windows 10 select.select bogs down other threads

vessellaj commented 7 years ago

Using Connection.serve_all() in its own thread on Windows 10 results in calls to select.select choking out the execution of other threads when select.select is given a timeout.

In order to work around this issue, I've implemented my own serve_all which calls Connection.serve(0) and uses time.sleep(0.1) to limit CPU spinning. This approach does not have the issue.
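A minimal sketch of the kind of polling loop described above, not the exact code from the project. The function name `polling_serve_all`, the `stop_event` parameter, and the choice to sleep only when nothing was handled are assumptions made for illustration. It relies only on `conn.serve(0)` returning a truthy value when something was processed and on `conn.closed`.

```python
import threading
import time


def polling_serve_all(conn, stop_event, interval=0.1):
    """Serve `conn` by polling instead of blocking in select() with a timeout.

    `conn` is expected to behave like an rpyc Connection: `serve(0)` polls
    once without waiting, and `closed` reports whether the connection is gone.
    `stop_event` (hypothetical) lets another thread end the loop.
    """
    try:
        while not stop_event.is_set() and not conn.closed:
            handled = conn.serve(0)  # non-blocking poll
            if not handled:
                # Nothing arrived: sleep to avoid spinning the CPU.
                time.sleep(interval)
    except EOFError:
        # The peer closed the connection; leave the loop quietly.
        pass
```

With a real connection this could be started with `threading.Thread(target=polling_serve_all, args=(conn, stop), daemon=True).start()`, where `stop` is a `threading.Event`. The trade-off is added latency of up to `interval` seconds per idle request.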

I've tested the issue with Python 3.4.3 on Fedora 23, and threading Connection.serve_all does not cause this issue there. I don't know if this is the select.select function causing this in general, or if it's specifically the Windows system call since the Linux polling method does not use select.select.

coldfix commented 7 years ago

> Using Connection.serve_all() in its own thread on Windows 10 results in calls to select.select choking out the execution of other threads when select.select is given a timeout.

What do you mean by "choking out"? According to the Python source code, the GIL is released in select.select, so do you mean it cycles at 100% CPU?

Judging from the code in rpyc.core.stream.Win32PipeStream.poll, someone had the same idea as you (using time.sleep(timeout)).

IMO, time.sleep(0.1) is not viable, so I'd prefer to have something better. However, I haven't looked into Windows specifics so far.

What Python version are you running on Windows, and do you use any special server code?

coldfix commented 7 years ago

Can you provide an example server+client so I can check if I can reproduce the issue?

Best, Thomas

vessellaj commented 7 years ago

Using Python 3.5.2 on Windows 10.

By "choking out" I mean that the calls to select.select are somehow blocking the main thread from executing at a reasonable pace. CPU usage seems to remain at or near 0% the whole time, yet the main thread's processing slows down massively. A task that normally takes about 5-7 seconds without the Connection.serve_all thread running (or with my workaround) takes a minute or more when I have Connection.serve_all in its own thread.
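One way to check this symptom independently of rpyc is to time a small task with and without a background thread sitting in select.select with a timeout. The sketch below is only an illustration, not the project's code: the names `busy_work` and `measure` are hypothetical, and it uses `socket.socketpair()` because select on Windows only accepts sockets. It is not guaranteed to trigger the slowdown.

```python
import select
import socket
import threading
import time


def busy_work(n=200000):
    """Stand-in for the main thread's task (pure Python, holds the GIL)."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def timed(fn):
    start = time.perf_counter()
    fn()
    return time.perf_counter() - start


def measure(select_timeout=0.1):
    """Return (baseline_seconds, seconds_with_select_thread)."""
    baseline = timed(busy_work)

    a, b = socket.socketpair()  # nothing is ever sent, so select times out
    stop = threading.Event()

    def waiter():
        # Mimics a serving thread that repeatedly waits with a timeout.
        while not stop.is_set():
            select.select([a], [], [], select_timeout)

    t = threading.Thread(target=waiter, daemon=True)
    t.start()
    try:
        with_select = timed(busy_work)
    finally:
        stop.set()
        t.join()
        a.close()
        b.close()
    return baseline, with_select
```

If the two numbers are close, the select call alone is probably not the cause, and the interaction lies elsewhere in the application.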

vessellaj commented 7 years ago

Well this is interesting...

I'm not able to reproduce the issue on a small example at all. I suppose there's some strange interaction happening in the main project that doesn't behave well with the default Connection.serve_all.

Unfortunately, the main project is currently internal and hasn't gone through the legal hoops for releasing the source yet, so I can't even give you that to try.

The server code I'm using is a placeholder for now, and its code is the following:

import threading

import rpyc
from rpyc.utils import server

class ServerService(rpyc.Service):
    @staticmethod
    def exposed_push_feedback(status_dict, exception):
        print(status_dict)
        print(exception)

    @staticmethod
    def exposed_register(name, operating_system):
        print('New connection! %s running on %s.' % (name, operating_system))

srv = server.ThreadPoolServer(ServerService, port=18812)
t = threading.Thread(target=srv.start)
t.daemon = True
t.start()

I start the server with python3 -i server.py. I'm just interacting with the client by calling srv.fd_to_conn[<fd>].root.new_task(<args>), and of course the client has a service providing the new_task method.

For now I can get by using my workaround, and if it does end up released I could potentially reopen this issue later.

coldfix commented 7 years ago

Quick question: are the other threads I/O-bound or CPU-bound? I could imagine multiple select calls in parallel not working because of some weird issue. And how many file descriptors are you waiting on? I think there can be performance issues if you have many. It's hard to say anything definite without a reproducible example.

Otherwise, feel free to keep it open or closed as you like.

vessellaj commented 7 years ago

I would say it's mostly I/O bound, as it's most often waiting on external programs. We're using the Windows COM system to control MS Office programs - in particular, Outlook. We're also using Selenium to control Firefox, which I think uses sockets.

I did notice that the Outlook and Firefox tasks are where this problem is readily apparent, but I was also seeing the slowdown in parts of those tasks where there should be no I/O with those external programs. Each task has a method that just validates the configuration input, and after I had started the task through RPC, the lines executed in that method were spaced several seconds apart. This is in the client program.

None of that happens when I just load a configuration file from the disk, or use our (outdated) Boost IPC method, and of course it stops happening when I directly call the Connection.serve(0) method in a loop with time.sleep(0.1) instead of Connection.serve_all.

At the time I discovered this, I was only waiting on one file descriptor in both client and server.

comrumino commented 5 years ago

So I checked out 2689759 and was able to reproduce the issue.

Confirmed the minor improvement using suggestions from this thread.

Checked out current master and tested again.

This has been resolved already; it is most likely a duplicate of #306 (where I found the test case).

tomerfiliba-org / rpyc

Python 3.5.2 on Windows 10 select.select bogs down other threads #226