kquick / Thespian

Python Actor concurrency library
MIT License

Benchmarking application fails to run with different ActorSystem Implementations #91

Closed ProlucidDavid closed 1 year ago

ProlucidDavid commented 1 year ago

Intro

I'm benchmarking the Thespian Actor Framework for an application with the goal of understanding how performance varies based on:

The specific test can be found here. The test defined in that repo highlights challenges with the multiprocQueueBase and multiprocTCPBase ActorSystems, which are summarized below:

multiprocQueueBase

The following configuration (for main.py in the repo) has been observed to cause the application to freeze.

# Configure the test
NUM_ACTORS = 20
BENCHMARK_MESSAGE_SEND_PERIOD_S = 0.01
NUM_BENCHMARK_MESSAGES_PER_ACTOR = 50
COOLDOWN_PERIOD_S = 2
ACTOR_SYS = 'multiprocQueueBase'

Generally, this ActorSystem has been observed to freeze under some combination of more parallel actors, a higher frequency of message sends, or a larger number of test messages being sent.
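To put the configuration above in perspective, here is a rough back-of-the-envelope sketch of the load it implies. It assumes that every actor gets its own stream of benchmark messages, one per send period, and that all streams run concurrently; the benchmark in the repo may schedule sends differently. The function name `offered_load` is made up for this illustration.

```python
# Values copied from the configuration above.
NUM_ACTORS = 20
BENCHMARK_MESSAGE_SEND_PERIOD_S = 0.01
NUM_BENCHMARK_MESSAGES_PER_ACTOR = 50


def offered_load(num_actors, send_period_s, messages_per_actor):
    """Return (total_messages, aggregate_rate_per_s, send_phase_duration_s).

    Assumption (not confirmed by the benchmark source): each actor receives
    one message per send period, and all actors are driven concurrently.
    """
    total_messages = num_actors * messages_per_actor
    aggregate_rate_per_s = num_actors / send_period_s
    send_phase_duration_s = messages_per_actor * send_period_s
    return total_messages, aggregate_rate_per_s, send_phase_duration_s


total, rate, duration = offered_load(
    NUM_ACTORS,
    BENCHMARK_MESSAGE_SEND_PERIOD_S,
    NUM_BENCHMARK_MESSAGES_PER_ACTOR,
)
print(f"{total} messages at roughly {rate:.0f} msg/s over {duration:.2f} s")
```

Under those assumptions, each of the three knobs mentioned above (actor count, send frequency, message count) scales either the aggregate rate or the total volume, which is consistent with the freeze being load-related.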

Worth noting, in section 9.3.1 of the Thespian project docs, there is an issue with multiprocQueueBase described as "an unexplained, core-level concern about dropped messages/deadlocks for the queue messaging in overload conditions." It is possible this is the described behaviour.

multiprocTCPBase

This ActorSystem has not been observed to successfully run. The following error is observed each time it is executed:

2023-05-07 16:54:27,412 WARNING =>  Unable to get address info for address *************** (AddressFamily.AF_INET, SocketKind.SOCK_DGRAM, 17, 0): <class 'socket.gaierror'> [Errno 11001] getaddrinfo failed  [IPBase.py:16]
2023-05-07 16:54:27,415 WARNING =>  Unable to get address info for address *************** (AddressFamily.AF_INET, SocketKind.SOCK_DGRAM, 17, AddressInfo.AI_PASSIVE): <class 'socket.gaierror'> [Errno 11001] getaddrinfo failed  [IPBase.py:16]
WARNING:root:Unable to get address info for address *************** (AddressFamily.AF_INET, SocketKind.SOCK_DGRAM, 17, 0): <class 'socket.gaierror'> [Errno 11001] getaddrinfo failed
WARNING:root:Unable to get address info for address *************** (AddressFamily.AF_INET, SocketKind.SOCK_DGRAM, 17, AddressInfo.AI_PASSIVE): <class 'socket.gaierror'> [Errno 11001] getaddrinfo failed
WARNING:root:Unable to get address info for address *************** (AddressFamily.AF_INET, SocketKind.SOCK_DGRAM, 17, 0): <class 'socket.gaierror'> [Errno 11001] getaddrinfo failed
WARNING:root:Unable to get address info for address *************** (AddressFamily.AF_INET, SocketKind.SOCK_DGRAM, 17, AddressInfo.AI_PASSIVE): <class 'socket.gaierror'> [Errno 11001] getaddrinfo failed
Traceback (most recent call last):
  File "C:\Git\Scratch\ThespianBenchmarking\Main.py", line 23, in <module>
    actor_sys = ActorSystem(ACTOR_SYS)
  File "C:\Users\***************\.virtualenvs\ThespianBenchmarking-5ebENAd8\lib\site-packages\thespian\actors.py", line 637, in __init__
    systemBase = self._startupActorSys(
  File "C:\Users\***************\.virtualenvs\ThespianBenchmarking-5ebENAd8\lib\site-packages\thespian\actors.py", line 678, in _startupActorSys
    systemBase = sbc(self, logDefs=logDefs)
  File "C:\Users\***************\.virtualenvs\ThespianBenchmarking-5ebENAd8\lib\site-packages\thespian\system\multiprocTCPBase.py", line 28, in __init__
    super(ActorSystemBase, self).__init__(system, logDefs)
  File "C:\Users\***************\.virtualenvs\ThespianBenchmarking-5ebENAd8\lib\site-packages\thespian\system\multiprocCommon.py", line 83, in __init__
    super(multiprocessCommon, self).__init__(system, logDefs)
  File "C:\Users\***************\.virtualenvs\ThespianBenchmarking-5ebENAd8\lib\site-packages\thespian\system\systemBase.py", line 326, in __init__
    self._startAdmin(self.adminAddr,
  File "C:\Users\***************\.virtualenvs\ThespianBenchmarking-5ebENAd8\lib\site-packages\thespian\system\multiprocCommon.py", line 115, in _startAdmin
    raise InvalidActorAddress(adminAddr,
thespian.actors.InvalidActorAddress: ActorAddr-(T|:1900) is not a valid ActorSystem admin

Process finished with exit code 1

This behaviour is observed even when different ActorSystem capabilities are defined. The same results were obtained with the Windows firewall disabled.
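Since the warnings come from `getaddrinfo`, one way to narrow this down outside of Thespian might be to try the same kind of lookup directly with the standard library. The sketch below is only a diagnostic aid, not Thespian code: the helper name `can_resolve` is made up here, the hints (IPv4, datagram, UDP) mirror the ones printed in the log, and port 1900 is taken from the admin address in the traceback. Which name Thespian actually tried to resolve is masked in the log, so the names tested here are guesses.

```python
import socket


def can_resolve(host, port=1900):
    """Return True if getaddrinfo succeeds for host with IPv4/UDP hints.

    Hypothetical helper for diagnosis only; it does not call Thespian.
    """
    try:
        socket.getaddrinfo(
            host, port, socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP
        )
    except (socket.gaierror, UnicodeError):
        # gaierror is what the log shows; UnicodeError covers malformed names.
        return False
    return True


if __name__ == "__main__":
    # Guess: the local hostname is a likely candidate for the masked address.
    for name in ("127.0.0.1", socket.gethostname()):
        print(f"{name!r}: resolvable={can_resolve(name)}")
```

If the local hostname fails here as well, the problem is likely in the machine's name resolution rather than in the firewall or the ActorSystem capabilities.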

Followup Questions

kquick commented 1 year ago

Thanks for the detailed information. The issue you are having on the multiprocTCPBase is discussed separately in issue 89, so it would probably be best to address them there.

In regards to the multiprocQueueBase, that is built upon the Python Queue library (https://docs.python.org/3/library/queue.html). The behavior you are seeing is similar to what I was seeing during development and which led to that particular comment. I am particularly hesitant to lay the blame on Python's Queue library since it's probably had much more extensive development and usage than Thespian itself, but since Thespian relies on the scheduling from the Queue library, it is hard to see where the hangs could be caused by Thespian code. Due to the other deficiencies of the multiprocQueueBase (including: no convention support, no multi-system support) and because there were no user-reported concerns, I did not pursue debugging this in more detail. If you have a specific use case that would benefit from an effort to further research this issue then I would be open to further discussion on how to identify and resolve it (or work around it).

Although you didn't ask here, you do note in your referenced repository some quirkiness with the multiprocUDPBase. Those are fundamentally issues with the UDP protocol itself, and the multiprocUDPBase is a fairly thin layer that does not provide a lot of additional code on top of the UDP transport. While it would be possible to add additional code to overcome the various UDP limitations (e.g. perform fragmentation/de-fragmentation, message delivery verification with retransmits, etc.), that would be a fair amount of work and the usual way to address these concerns would be to use the multiprocTCPBase.
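For a sense of what the fragmentation/de-fragmentation part alone would involve, here is a minimal, transport-agnostic sketch. It is not taken from Thespian; the header layout and the names `fragment` and `Reassembler` are invented for illustration. It deliberately leaves out delivery verification, retransmits, and timeouts for incomplete messages, which is where most of the real work would be.

```python
import struct

# Hypothetical per-datagram header: message id, fragment index, fragment count.
HEADER = struct.Struct("!IHH")


def fragment(message_id, payload, max_chunk):
    """Split payload into datagrams that each carry at most max_chunk bytes."""
    chunks = [
        payload[i:i + max_chunk] for i in range(0, len(payload), max_chunk)
    ] or [b""]
    if len(chunks) > 0xFFFF:
        raise ValueError("payload needs more fragments than the header allows")
    return [
        HEADER.pack(message_id, index, len(chunks)) + chunk
        for index, chunk in enumerate(chunks)
    ]


class Reassembler:
    """Collect fragments, in any order, until a message is complete."""

    def __init__(self):
        self._partial = {}  # message id -> {fragment index: chunk}

    def add(self, datagram):
        """Return the full payload once all fragments arrived, else None."""
        message_id, index, count = HEADER.unpack_from(datagram)
        parts = self._partial.setdefault(message_id, {})
        parts[index] = datagram[HEADER.size:]
        if len(parts) < count:
            return None
        del self._partial[message_id]
        return b"".join(parts[i] for i in range(count))


if __name__ == "__main__":
    data = bytes(range(256)) * 10
    reassembler = Reassembler()
    result = None
    # Feed fragments in reverse to mimic out-of-order delivery.
    for datagram in reversed(fragment(1, data, 100)):
        result = reassembler.add(datagram)
    print("round trip ok:", result == data)
```

Even this toy version has to track per-message state on the receiver; a lost fragment would leave that state behind forever, which is why acknowledgements and retransmits would have to come along with it.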

In regards to performance in general, please see https://thespianpy.com/doc/using.html#hH-5e670e59-8334-477e-a9d2-0c2dc39d82dd for more information. Also see thespian/test/test_load.py for a simple comparative load-testing utility.

ProlucidDavid commented 1 year ago

Well thank you for the detailed (and quick) response!

I like the featureset in Thespian and am interested in using it as a framework to build applications around. My current application doesn't have a multi-system requirement, but we are concerned about being able to have actors run independently as separate processes without constraints on the message size. For that reason we're interested in getting either multiprocQueueBase or multiprocTCPBase running.

Just so we're on the same page, I'm going to try to summarize the state of the investigation

Thanks again! I really appreciate your help!

kquick commented 1 year ago

That assessment sounds correct to me. If you are comfortable with the status (pending work in issue 89), feel free to close this issue.