knipknap / exscript

A Python module making Telnet and SSH easy
MIT License

Fundamental incompatibility of Threading and Multiprocessing (Python) #230

Open martinpakosch opened 11 months ago

martinpakosch commented 11 months ago

Hi @knipknap, hi community,

Over the last few days - after switching some old environments from Python 2.7 with Exscript 2.6.3/2.6.22 to Python 3.11 with Exscript 2.6.28 - we have encountered some very strange and unpredictable deadlocks while using the exscript script to run some Exscript templates. At first we thought these issues were related to our new environment, then maybe to some bugs with locks handled by Exscript, but finally we found out that this is a fundamental issue in Python! More about this later...

Exscript is internally designed to rely on threading for some of its major functionalities, like the MainLoop in the WorkQueue, Jobs, and some pipeline-related code. See https://github.com/search?q=repo%3Aknipknap%2Fexscript%20threading.Thread&type=code. Additionally, around 12 years ago, support for multiprocessing was introduced (e.g., e43f6b7e4fc979eca69245befb8218bbb7f240fb, 0baf858270eba2fbc2517d4d6b368567d53b0789). From that point on, Exscript by default behaved differently for the two major use-cases:

  1. When using the exscript script to just execute Exscript templates, Exscript was started in multiprocessing mode. This means essentially:
     a. one main process
     b. many internal threads
     c. one multiprocessing.Process for each job (=host)
  2. When using the Exscript API in Python for more complex scenarios, everything was pinned to threading mode by default, which was good, because then we essentially get:
     a. one main process
     b. many internal threads
     c. one threading.Thread for each job (=host)
     (a minimal sketch of the two models follows this list)
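
To make those two per-job models concrete, here is a minimal standard-library sketch (illustrative only, not Exscript's actual code; the job function is just a placeholder) of "one thread per host" versus "one process per host":

    # Illustrative sketch of the two per-job models described above.
    # Not Exscript code; "job" stands in for "log in and run the template".
    import threading
    import multiprocessing

    def job(host):
        print("processing", host)

    hosts = ["router1", "router2", "router3"]

    if __name__ == "__main__":
        # Use-case 2 (API default): one threading.Thread per job, all inside one process.
        threads = [threading.Thread(target=job, args=(h,)) for h in hosts]
        for t in threads:
            t.start()
        for t in threads:
            t.join()

        # Use-case 1 (exscript script default): one multiprocessing.Process per job,
        # forked from a main process that already runs internal threads.
        procs = [multiprocessing.Process(target=job, args=(h,)) for h in hosts]
        for p in procs:
            p.start()
        for p in procs:
            p.join()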

So now you might wonder why all this is a big deal. ;-) Well, the problem is that - prompted by the latest changes in Python 3.12 - we found out that the combination of multiprocessing and threading is not safe/stable on any POSIX system and is also not considered stable by the CPython implementation! Rather than replicating everything I found out about this fundamental issue in Python, I would like to share these articles/posts that explain the problem (a contrived sketch of the hazard follows the list):

  1. https://pythonspeed.com/articles/faster-multiprocessing-pickle/
  2. https://pythonspeed.com/articles/python-multiprocessing/
  3. https://docs.python.org/3/library/multiprocessing.html#contexts-and-start-methods
  4. https://docs.python.org/3/library/os.html#os.fork
  5. https://discuss.python.org/t/concerns-regarding-deprecation-of-fork-with-alive-threads/33555/2
  6. https://github.com/python/cpython/issues/84559
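
To make the hazard tangible, here is a deliberately contrived standard-library sketch (POSIX only, not Exscript code): a worker thread holds a lock at the moment the main thread forks, so the child inherits a locked lock but not the thread that would release it, and blocks forever the first time it tries to acquire it.

    # Contrived illustration of the fork-with-threads hazard (POSIX only, not Exscript code).
    # The forked child inherits the locked lock, but not the thread that would release it.
    import os
    import threading
    import time

    lock = threading.Lock()

    def holder():
        with lock:            # worker thread grabs the lock and keeps it for a while
            time.sleep(5)

    if __name__ == "__main__":
        t = threading.Thread(target=holder)
        t.start()
        time.sleep(0.5)       # make sure the worker really holds the lock right now

        pid = os.fork()       # child is a copy of this process: locked lock, no holder thread
        if pid == 0:
            got_it = lock.acquire(timeout=3)
            print("child acquired lock:", got_it)   # False: nothing in the child ever releases it
            os._exit(0)
        else:
            os.waitpid(pid, 0)
            t.join()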

Quick solution: Coming back to our issue, we were able to eliminate the deadlocks by changing the default in the exscript script to threading. See: https://github.com/knipknap/exscript/blob/9d5b035f3de4237dc6ecb7437b3ebd0c162bb6ec/scripts/exscript#L182 If most of you have not encountered such deadlocks in use-case 1 until now, then you may simply have been lucky, as we were for the past 10 years running our older environments! But that does not change anything about the fundamental issue here.
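
For users of the Python API, the equivalent safeguard is to pin the mode explicitly when constructing the queue. A hedged sketch, assuming the documented Queue(mode=..., max_threads=...) / add_account() / run() pattern still matches your installed version (please check the signatures before copying):

    # Hedged sketch: pin Exscript to threading mode explicitly instead of relying
    # on whatever default the entry point picks. Assumes the documented Queue API
    # (mode/max_threads keywords, add_account, run, destroy) matches your version.
    from Exscript import Queue, Account

    def do_something(job, host, conn):
        # placeholder per-host callback; signature as used in the Exscript docs
        conn.execute('show version')

    queue = Queue(mode='threading', max_threads=20, verbose=1)
    queue.add_account(Account('user', 'password'))
    queue.run(['ssh://host1', 'ssh://host2'], do_something)
    queue.destroy()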

Long-term solution: The change above affects only the exscript script and should be good enough as a quick fix. In the long term, @knipknap @bigmars86, you should think about a fundamental fix - if that is possible at all - or about dropping multiprocessing support for Exscript. What I found out so far regarding possible solutions:

  1. The default start method for new processes in Python on Linux is fork, relying on os.fork(), which creates a copy of the parent process with almost(!) all of its state/data - see the articles above, they explain everything. This default will be dropped in Python 3.14, making spawn the new default. But spawn and forkserver create clean child processes and fully rely on pickling (among other mechanisms) to move data down to these children - a problem if Exscript fundamentally relies on full process copies.
  2. I tried this in the exscript script with the current code base according to https://docs.python.org/3/library/multiprocessing.html#multiprocessing.set_start_method with both alternate methods and failed. Exscript is then unable to pickle some local-scope objects - among possibly other issues, as I stopped digging here... Example (a minimal standalone reproduction of this class of error is sketched after this list):

    Traceback (most recent call last):
    File "/home/username/tmp/deployed/Exscript/workqueue/job.py", line 52, in run
      self.child.start(to_self)
    File "/home/username/tmp/deployed/Exscript/workqueue/job.py", line 94, in start
      base.start(self)
    File "/usr/lib64/python3.11/multiprocessing/process.py", line 121, in start
      self._popen = self._Popen(self)
                    ^^^^^^^^^^^^^^^^^
    File "/usr/lib64/python3.11/multiprocessing/context.py", line 224, in _Popen
      return _default_context.get_context().Process._Popen(process_obj)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/usr/lib64/python3.11/multiprocessing/context.py", line 300, in _Popen
      return Popen(process_obj)
             ^^^^^^^^^^^^^^^^^^
    File "/usr/lib64/python3.11/multiprocessing/popen_forkserver.py", line 35, in __init__
      super().__init__(process_obj)
    File "/usr/lib64/python3.11/multiprocessing/popen_fork.py", line 19, in __init__
      self._launch(process_obj)
    File "/usr/lib64/python3.11/multiprocessing/popen_forkserver.py", line 47, in _launch
      reduction.dump(process_obj, buf)
    File "/usr/lib64/python3.11/multiprocessing/reduction.py", line 60, in dump
      ForkingPickler(file, protocol).dump(obj)
    AttributeError: Can't pickle local object '_make_process_class.<locals>.process_cls'
    
    # After "fixing" this by creating static public custom classes for Process and Thread in Exscript/workqueue/job.py,
    # I encountered the next one... stopped digging here. ;-)
    
    Traceback (most recent call last):
    File "/home/username/tmp/deployed/Exscript/workqueue/job.py", line 52, in run
      self.child.start(to_self)
    File "/home/username/tmp/deployed/Exscript/workqueue/job.py", line 152, in start
      super(Process, self).start()
    File "/usr/lib64/python3.11/multiprocessing/process.py", line 121, in start
      self._popen = self._Popen(self)
                    ^^^^^^^^^^^^^^^^^
    File "/usr/lib64/python3.11/multiprocessing/context.py", line 224, in _Popen
      return _default_context.get_context().Process._Popen(process_obj)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/usr/lib64/python3.11/multiprocessing/context.py", line 300, in _Popen
      return Popen(process_obj)
             ^^^^^^^^^^^^^^^^^^
    File "/usr/lib64/python3.11/multiprocessing/popen_forkserver.py", line 35, in __init__
      super().__init__(process_obj)
    File "/usr/lib64/python3.11/multiprocessing/popen_fork.py", line 19, in __init__
      self._launch(process_obj)
    File "/usr/lib64/python3.11/multiprocessing/popen_forkserver.py", line 47, in _launch
      reduction.dump(process_obj, buf)
    File "/usr/lib64/python3.11/multiprocessing/reduction.py", line 60, in dump
      ForkingPickler(file, protocol).dump(obj)
    AttributeError: Can't pickle local object '_prepare_connection.<locals>._wrapped'
  3. This is beyond my skills - to be honest, I am happy to have understood the issue up to this point. :D So I assume that fixing this - basically making Exscript compatible with the spawn or forkserver start method for multiprocessing - will take some bigger effort, unless you already know where to get your hands dirty.
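
The pickling errors above are easy to reproduce outside of Exscript. Here is a minimal standalone sketch (not Exscript code) of the same class of failure: spawn and forkserver must pickle the whole Process object, and the stdlib pickler cannot reference anything defined in a local scope.

    # Minimal reproduction of the class of error shown above (not Exscript code).
    # spawn/forkserver must pickle the Process object, and the stdlib pickler
    # cannot reference locally defined functions or classes by name.
    import multiprocessing

    def make_worker():
        # mirrors the factory pattern: the callable only exists in a local scope
        def worker():
            print("working")
        return worker

    if __name__ == "__main__":
        # Works under the (pre-3.14) Linux default, because fork just copies memory:
        p = multiprocessing.get_context("fork").Process(target=make_worker())
        p.start()
        p.join()

        # Fails under spawn/forkserver, which have to pickle the Process object:
        p = multiprocessing.get_context("spawn").Process(target=make_worker())
        p.start()   # AttributeError: Can't pickle local object 'make_worker.<locals>.worker'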

That's from my side. Hope this helps in any way. Cheers, Martin

knipknap commented 11 months ago

Yeah, I doubt this can or should be solved without a major redesign of the whole queuing mechanism. By now, Python comes with good support for pooling, and the whole Queue class could be redesigned on top of that (or on top of another third-party library for handling pools).
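
For reference, the kind of stdlib pooling meant here is concurrent.futures, which exposes interchangeable thread and process pools behind one executor interface; a sketch only, not a proposal for the actual redesign:

    # Sketch of the pooling support Python ships with today (concurrent.futures).
    # Not a concrete redesign of Exscript's Queue; handle_host is a placeholder.
    from concurrent.futures import ThreadPoolExecutor, as_completed

    def handle_host(host):
        # stand-in for "log in and run the template against one host"
        return host + ": done"

    hosts = ["router1", "router2", "router3"]

    if __name__ == "__main__":
        # I/O-bound SSH/Telnet sessions: a thread pool is usually sufficient.
        with ThreadPoolExecutor(max_workers=20) as pool:
            futures = {pool.submit(handle_host, h): h for h in hosts}
            for fut in as_completed(futures):
                print(fut.result())
        # A ProcessPoolExecutor offers the same interface for CPU-heavy work,
        # but everything submitted must then be picklable (same constraint as above).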

But I would guess it simply isn't worth it. Multiprocessing support was originally introduced to solve a performance bottleneck, which happened in two scenarios:

That's what I suspect, at least. Whether that turns out to be true in practice remains to be tested. If not - it's redesign time.

martinpakosch commented 11 months ago

Hi Samuel,

Thanks for your explanation.

  1. Currently, the Exscript CLI help suggests using only 1-20 concurrent connections. https://github.com/knipknap/exscript/blob/9d5b035f3de4237dc6ecb7437b3ebd0c162bb6ec/scripts/exscript#L350 But you are right, surprisingly there seems to be no hard limit in the internals. :D
  2. You are right, using a 1:1 mapping of Accounts to external Workers does not scale very well, but at least it seems to work.
     a. Regarding an external sync for TACACS accounts: sounds like a queue encapsulating an Exscript.Queue. 🤔🙈
     b. Regarding RabbitMQ: that's a topic for @bigmars86. :D

Well, I can tell you that we once did a config rollout across 600 routers in 5 minutes, with 20 concurrent connections and an account pool of 20 accounts - and yes, that was on our old environment, which worked fine with "multiprocessing" for the CLI tool.

On the other hand, I would agree that nowadays threading shouldn't be much of a bottleneck in most I/O-bound scenarios when automating tasks on network elements - maybe other users can comment on their setups? I would also expect that, these days, high-cost pre- or post-processing of data would be outsourced to dedicated multiprocessing tools/scripts instead of trying to solve everything with the "old" (but still fine ;-)) Exscript Queue.

So yes, I also doubt that going for a redesign would make much sense nowadays - at least for our use-cases - and I would also reject any quick hacks just to save effort. So the consequence would be to drop official multiprocessing support in Exscript for now (simply making the "multiprocessing" mode invalid) - maybe until someone has the time/passion for such a redesign... ;-)

What do you think?