kquick / Thespian

Python Actor concurrency library
MIT License

Different questions about Thespian and its functionalities #77

Open htarnacki opened 3 years ago

htarnacki commented 3 years ago

hi, I hope you do not mind this type of issue, collecting all my questions in one thread


  1. What is the preferred way of ensuring that the actor system keeps some named actor alive at all times? So:
    • we create an actor system
    • then we create the first main named actor (A), which we want to exist forever
    • even in case of an "A" actor crash, we need some mechanism to bring it back to life

Looking into the documentation, I see that named actors do not have parents, and therefore there is no actor that could receive the "ChildActorExited" message. But I guess the actor system itself receives such a notification? So I am looking for something like this: ActorSystem().createActor(actor_class, globalName=attr, keep_alive=True)


  2. Why is there no "ask" method on an actor class? There is an actor system method "ask" but no corresponding method on an actor class. What is the reason for this? Without it, what should normal processing look like?

Without something like "await", are we heading into some kind of callback hell (or a message/response hell? ;-))? Or is there something like: self.await(BankAccountData, self.send(BankActorAddress, AskForBankAccountData(client_number)))

self.await would pause the actor's processing until it receives a message of type BankAccountData from BankActorAddress.


kquick commented 3 years ago

I don't mind multiple questions, although it might get somewhat confusing to have multiple threads of responses, so let us both feel free to split a question into a separate issue if things get confusing here.

1: Keeping an actor alive

There are a couple of ways you could do this:

A. Create an actor P on startup whose only responsibility is to start up actor A. Actor A is no longer a top-level actor and has parent P, so P will be able to restart A if A exits. Clearly, this moves the concern to actor P, but if actor P does nothing else, then it's not likely to exit until the ActorSystem shuts down.

B. Use the Thespian Director (https://thespianpy.com/doc/director.html) to manage the startup and functioning of your Actor System. This also requires shifting to loadable actor implementations, but there are significant other advantages to be gained by that shift as well. In this mode, the Thespian Director provides the functionality of actor P described above, with the ability to perform additional message sending on startup, shutdown, and restart to help initialize or re-initialize your configuration.
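
A minimal sketch of approach A, using Thespian's standard Actor API (KeepAliveParent and ImportantActor are illustrative names):

from thespian.actors import Actor, ActorSystem, ChildActorExited

class ImportantActor(Actor):
    # The actor that must always exist.
    def receiveMessage(self, message, sender):
        if isinstance(message, str):
            self.send(sender, 'handled: %s' % message)

class KeepAliveParent(Actor):
    # Sole responsibility: create ImportantActor and recreate it whenever it exits.
    def receiveMessage(self, message, sender):
        if message == 'start':
            self.child = self.createActor(ImportantActor)
        elif isinstance(message, ChildActorExited):
            # The child died or was shut down; start a replacement.
            self.child = self.createActor(ImportantActor)

# Startup: the keeper owns the lifetime of ImportantActor.
asys = ActorSystem('multiprocTCPBase')
keeper = asys.createActor(KeepAliveParent, globalName='keeper')
asys.tell(keeper, 'start')

The globalName on the keeper is only there to make it easy to find from outside; the important part is that the keeper, not the ActorSystem, is the parent of ImportantActor and therefore receives the ChildActorExited notification.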

2: No ask for the Actor class

The reason there is no ask is that ask is a blocking operation. At the fundamental level, an actor should:

  1. receive a message
  2. internally process the message
  3. send one or more messages
  4. exit the receive code (preparing it to return to #1)

While it is performing steps 1-3, no other messages are delivered to the actor; this is what keeps actor code simple and avoids the need for mutexes or other parallelization synchronization techniques.

If the message received in step 1 cannot be handled in that single invocation, but must get a response from another actor to continue the processing, it should arrange to resume processing of the request when that response is received, but it should not block waiting for that response because that prevents it from handling other messages. To do this, you can either (a) store the original message internally in the actor, where it can be retrieved and processing can continue when the other actor's response is received, or (b) attach the original message to the request to the other actor and ensure that the other actor returns that original message as an attachment to the response.

In general, (b) is the preferred mechanism, because it keeps the state management in the messages and not in the actors themselves; this is important because if an actor dies and is restarted, it can simply resume processing messages and doesn't require re-initialization. Method (b) is also preferred because if the second actor never responds for some reason, the original request is not still consuming resources in the first actor (although an alternative is to use a self.wakeupAfter to send a failure on old messages that have not been responded to).
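
A rough sketch of approach (b), borrowing the bank-account names from the question above (the message classes are illustrative, not part of Thespian):

from thespian.actors import Actor

class AskForBankAccountData:
    # Request to the bank actor; carries the original request and requester
    # so that the asking actor does not need to keep any state.
    def __init__(self, client_number, original_request, original_sender):
        self.client_number = client_number
        self.original_request = original_request
        self.original_sender = original_sender

class BankAccountData:
    # Response from the bank actor; echoes the original request and requester back.
    def __init__(self, balance, original_request, original_sender):
        self.balance = balance
        self.original_request = original_request
        self.original_sender = original_sender

class FrontActor(Actor):
    def receiveMessage(self, message, sender):
        if isinstance(message, dict) and 'client_number' in message:
            # Original client request (here just a dict); it cannot be answered
            # yet, so forward it to the bank actor with the original attached.
            bank = self.createActor(BankActor)
            self.send(bank, AskForBankAccountData(message['client_number'], message, sender))
        elif isinstance(message, BankAccountData):
            # The response carries everything needed to finish the original request.
            self.send(message.original_sender, message.balance)

class BankActor(Actor):
    def receiveMessage(self, message, sender):
        if isinstance(message, AskForBankAccountData):
            balance = 100  # stand-in for a real account lookup
            self.send(sender, BankAccountData(balance, message.original_request, message.original_sender))

The bank actor simply echoes back whatever context it was given, so the front actor never has to remember anything between messages.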

In short, the "await" technique is a mechanism for process-based or thread-based synchronization, whereas an actor-based approach does not use those types of mechanisms and is fundamentally oriented towards receive-respond message handling, so you will need to deal with "message/response hell", but you will never need to deal with synchronization and deadlocks.
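
Regarding the self.wakeupAfter alternative mentioned above, here is a minimal timeout sketch, assuming the pending requester is simply stashed in the actor (i.e. approach (a)):

from datetime import timedelta
from thespian.actors import Actor, WakeupMessage

class RequestWithTimeout(Actor):
    # Stash the pending requester and give up on the request if no reply
    # arrives before the scheduled wakeup fires.
    def receiveMessage(self, message, sender):
        if isinstance(message, WakeupMessage):
            if getattr(self, 'pending_sender', None) is not None:
                self.send(self.pending_sender, 'request timed out')
                self.pending_sender = None
        elif message == 'reply':
            # Stand-in for the other actor's response.
            if getattr(self, 'pending_sender', None) is not None:
                self.send(self.pending_sender, 'request completed')
                self.pending_sender = None
        else:
            # Stand-in for the original request: remember the requester and
            # schedule a WakeupMessage 30 seconds from now.
            self.pending_sender = sender
            self.wakeupAfter(timedelta(seconds=30))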

htarnacki commented 3 years ago

hi, thanks for the answers

I will look into the Thespian Director documentation. It looks very promising. BTW, the documentation has some parts that need to be corrected: https://thespianpy.com/doc/director.html

this part is not correct:

{ 'version': 1,
  'formatters': {
    'normal': {
      'fmt': '%(asctime)s,%(msecs)d %(actorAddress}s/PID:%(process)d %(name)s %(levelname)s %(message)s',
      'datefmt': '%Y-%m-%d %H:%M:%S'
    }
  },
  'root': {'level': 20, 'handlers': ['foo_log_handler']},
  'loggers': {
    'subsysA': {'level':30, 'propagate': 0, 'handlers': ['foo_log_handler']}
  },
  'handlers': {
    'foo_log_handler': {
       'filename': 'fooapp.log',
       'class': 'logging.handlers.TimedRotatingFileHandler', 
       'formatter': 'normal', 
       'level': 20, 'backupCount':3, 
       'filters': ['isActorLog'], 
       'when': 'midnight'
    }
  }
  'filters': {'isActorLog': {'()', 'thespian.director.ActorAddressLogFilter'}}
}

it should be:

{ 'version': 1,
  'formatters': {
    'normal': {
      'fmt': '%(asctime)s,%(msecs)d %(actorAddress}s/PID:%(process)d %(name)s %(levelname)s %(message)s',
      'datefmt': '%Y-%m-%d %H:%M:%S'
    }
  },
  'root': {'level': 20, 'handlers': ['foo_log_handler']},
  'loggers': {
    'subsysA': {'level':30, 'propagate': 0, 'handlers': ['foo_log_handler']}
  },
  'handlers': {
    'foo_log_handler': {
       'filename': 'fooapp.log',
       'class': 'logging.handlers.TimedRotatingFileHandler', 
       'formatter': 'normal', 
       'level': 20, 'backupCount':3, 
       'filters': ['isActorLog'], 
       'when': 'midnight'
    }
  },
  'filters': {'isActorLog': {'()': 'thespian.director.ActorAddressLogFilter'}}
}
htarnacki commented 3 years ago

Is there any possibility to start the actor system in the foreground? Something like: python -m thespian.director start --foreground

I am building an application where there is one main "manager" node and an undefined number of worker nodes. Each worker node is a Docker image with Thespian inside. I need to keep the Docker container alive, and it would be best for the main actor process (430 1 0 09:04 ? 00:00:00 MultiProcAdmin ActorAddr-(T|:1900)) to be the main container process.

htarnacki commented 3 years ago

Do I correctly understand this scenario?

I have one main Thespian node running in a Docker container. Thespian inside it is started by: python -m thespian.director start

I have several Thespian worker nodes running in their own Docker containers. Thespian in each of them is started by: python -m thespian.director start

All of the nodes have joined a convention with the main Thespian node as the leader.

If I prepare a Python package with some actor implementations and preprocess it with the "gensrc" command, and then load this package in the main node container with: python -m thespian.director load foo-201710261938.tls, then if one of my main actors running on the leader node invokes createActor with capabilities that match exactly one of the worker nodes, will the sources from foo-201710261938.tls be automatically loaded onto that worker node? And will the actor be created from the class loaded in the foo-201710261938.tls "package"?

kquick commented 3 years ago

Thank you for catching the issue in the director documentation. Just to be sure I didn't miss anything, the main issue was the missing comma between the handlers and filters sections?

Currently there is no provision to run the actor system as the main process. I would recommend just using a dummy process as your main container process.
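
For example, a tiny wrapper along these lines could serve as the container's main process (illustrative only; a clean shutdown of the actor system on exit is omitted):

# entrypoint.py -- illustrative Docker entrypoint sketch
import signal
import subprocess
import sys

# Start the actor system via the Director; this returns once the system is up.
subprocess.run(['python', '-m', 'thespian.director', 'start'], check=True)

# Keep this process (and therefore the container) alive until signalled.
signal.signal(signal.SIGTERM, lambda signum, frame: sys.exit(0))
signal.pause()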

The scenario you describe is correct: the loaded sources will automatically be transferred between ActorSystems running in different containers on an as-needed basis to satisfy createActor requests. The director also provides support for running multiple versions of the same foo sources simultaneously; this can be useful for doing zero-downtime upgrades.
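
As a rough sketch of how the capability matching drives placement (WorkerActor and the capability names are illustrative; actorSystemCapabilityCheck and targetActorRequirements are the standard Thespian hooks):

from thespian.actors import Actor

class WorkerActor(Actor):
    @staticmethod
    def actorSystemCapabilityCheck(capabilities, requirements):
        # Evaluated against each candidate ActorSystem's capabilities; return
        # True only where this actor may be created.  Here the node must
        # advertise 'worker_node' and, if a specific 'node_id' was requested,
        # it must match.
        if not capabilities.get('worker_node', False):
            return False
        wanted = (requirements or {}).get('node_id')
        return wanted is None or capabilities.get('node_id') == wanted

    def receiveMessage(self, message, sender):
        self.send(sender, 'working on: %s' % message)

# From an actor on the leader node (within the same loaded sources):
#   worker = self.createActor(WorkerActor,
#                             targetActorRequirements={'node_id': 'some-node-id'})

When the chosen ActorSystem is in a different container, the loaded source bundle containing WorkerActor is shipped there automatically as described above.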

htarnacki commented 3 years ago

hi, I see the following statement in the documentation:

The contents of each TLI file is a Python dictionary that is loaded via a Python eval()

Can I ask for the same behaviour for other config files too? I have a file othercaps.cfg with the content:

{
    'worker_node': True,
    'node_id': os.getenv('NODE_ID'),
    os.getenv('NODE_ID'): True
}

and after starting the actor system I see:

bash-5.1$ python -m thespian.director config
   Sources location: /home/worker/thespian
        System Base: multiprocTCPBase
         Admin Port: 1901
  Logging directory: /home/worker
  Convention Leader: 11.132.51.001
 Other Capabilities: {
                         'worker_node': True,
                         'node_id': os.getenv('NODE_ID'),
                         os.getenv('NODE_ID'): True
                     }

so either eval is not applied here, or this command just shows the plain text of the loaded configuration

kquick commented 3 years ago

The eval() is the actual behavior, it's just that the display at startup shows the raw file contents prior to eval() (in case the eval fails with an error). The configuration applied will be the results of the eval operation.
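
Roughly speaking (this is not the Director's actual code, just an illustration of the effect), the behaviour is as if the raw file text were evaluated like this, which is why the os.getenv() calls end up resolved in the applied configuration while the display shows the raw text:

import os

# Illustration only: read the raw config text, then eval() it with os available.
with open('othercaps.cfg') as f:
    raw_text = f.read()

capabilities = eval(raw_text, {'os': os})

print(raw_text)       # roughly what the config display shows
print(capabilities)   # what actually gets applied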

htarnacki commented 3 years ago

Yes, exactly ;-) I saw this after I was finally able to run two nodes in Docker containers and connect them in a convention:

receiveMsg_ActorSystemConventionUpdate: ActorSystemConventionUpdate(remoteAdminAddress=ActorAddr-(T|173.81.0.3:1901), remoteAdded=True, remoteCapabilities={'worker_node': True, 'node_id': '8c1eb8e2-8807-4719-a7dc-0b3604794a56', '8c1eb8e2-8807-4719-a7dc-0b3604794a56': True,

So I suggest updating the documentation to emphasize the fact that config files are also evaluated ;-) It was not clear to me after reading the current docs.

OK, so now I have a big success:

And a very, very minor problem: after starting the master node, I see some strange messages:

Log transmit record @ 20, level = 20
Log transmit record @ 20, level = 0
Log transmit record @ 20, level = 0
Log transmit record @ 20, level = 20
Log transmit record @ 20, level = 20
Log transmit record @ 20, level = 20
Log transmit record @ 20, level = 20
Log transmit record @ 20, level = 20
Log transmit record @ 20, level = 20
Log transmit record @ 20, level = 20

I don't know where those lines come from. OK, maybe I know they come from the Thespian Director, but I don't know why ;-)

And the documentation needs one more correction. Here is the final, working logging configuration for the Thespian Director:

{
    'version': 1,
    'formatters': {
        'normal': {
            'format': '%(asctime)s,%(msecs)d %(actorAddress)s/PID:%(process)d %(name)s %(levelname)s %(message)s',
            'datefmt': '%Y-%m-%d %H:%M:%S'
        }
    },
    'root': {'level': 20, 'handlers': ['file_log_handler']},
    'loggers': {
        'master': {'level': 30, 'propagate': 0, 'handlers': ['file_log_handler']},
        'slave': {'level': 30, 'propagate': 0, 'handlers': ['file_log_handler']}
    },
    'handlers': {
        'file_log_handler': {
            'filename': os.getenv('LOG_FILE', '~/logs/main.log'),
            'class': 'logging.handlers.TimedRotatingFileHandler',
            'formatter': 'normal',
            'level': 20,
            'backupCount': 3,
            'filters': ['isActorLog'],
            'when': 'midnight'
        }
    },
    'filters': {'isActorLog': {'()': 'thespian.director.ActorAddressLogFilter'}}
}

What did I have to change additionally to make this work?

  1. 'fmt': -> 'format': (perhaps 'fmt' was valid in Python 2?)
  2. %(actorAddress}s -> %(actorAddress)s (closing parenthesis)
htarnacki commented 3 years ago

I have a question: I did a test of starting the worker node first and then the convention leader node. I wanted to test whether convention update messages are delivered correctly in such a situation. I was expecting that the worker node would actively look for a convention leader to join after start. I observed a very long interval (10 minutes) of retrying to join the convention:

2021-05-24 08:19:33,795 ActorAddr-(T|:40721)/PID:16 root INFO receiveMsg_str: START ACTOR
2021-05-24 08:29:20,57 ActorAddr-(T|:40721)/PID:16 root INFO receiveMsg_ActorSystemConventionUpdate: ActorSystemConventionUpdate(remoteAdminAddress=ActorAddr-(T|172.80.0.2:1901), remoteAdded=True, remoteCapabilities={'ala': True, 'ala_id': '8c1eb8e2-8807-4719-a7dc-0b3604794a56', '8c1eb8e2-8807-4719-a7dc-0b3604794a56': True, 'Admin Port': 1901, 'DirectorFmt': [1], 'Convention Address.IPv4': '10.132.41.226:1900', 'Thespian ActorSystem Name': 'multiprocTCPBase', 'Thespian ActorSystem Version': 2, 'Thespian Watch Supported': True, 'Python Version': (3, 8, 10, 'final', 0), 'Thespian Generation': (3, 10), 'Thespian Version': '1621844353271'}

Is the value of this interval somehow configurable? I would like to lower it to 5 minutes.

kquick commented 3 years ago

I think the Log transmit record @ 20, level = 20 lines are debug output that I've since removed? It's vaguely familiar, but I don't see anywhere in the current code where something like that would be generated. I'm going to be creating a new release shortly, so please let me know if you still see this in that release.

And you have my continued appreciation and apologies for the bad logging example in the director documentation: my guess is that perhaps I hand-copied that (badly!) from somewhere and introduced several syntax errors in the process. The upcoming release will include all your fixes to this.

If the convention member starts after the convention leader, there will be a delay. This is not currently configurable, although it's not unreasonable to consider. The current timing values are managed here: https://github.com/kquick/Thespian/blob/master/thespian/system/admin/convention.py#L20-L23. The current functionality is designed to be relatively conservative to avoid excessive network load during convention startup/shutdown; the assumption is that the convention can be quite large (these values were tested in a configuration with ~10,000 nodes) and that the convention tends to be long-running and reasonably well established prior to use (as opposed to your use case, which is a fast startup-and-join scenario).

Please feel free to experiment with changing the above values in your local copy and we can then determine the best way to support making these configurable if the modifications provide the behavior you are looking for.

kquick commented 3 years ago

I've released version 3.10.5; let me know if that still has unusual output about log transmit records.

htarnacki commented 3 years ago

I've released version 3.10.5; let me know if that still has unusual output about log transmit records.

It looks like this version has resolved this problem.

But I have another one ;-) Below you can find output from the example project I have published in https://github.com/kquick/Thespian/issues/78. I noticed strange behaviour every 10 minutes while running this example. It looks like every 10 minutes convention update events are emitted, where the first is of type removed and the second of type added:

2021-05-27 13:11:22,121 ActorAddr-(T|:35759)/PID:16 root INFO receiveMsg_str: START ACTOR
2021-05-27 13:11:22,123 ActorAddr-(T|:34031)/PID:17 root INFO receiveMsg_str: START ACTOR
2021-05-27 13:11:22,126 ActorAddr-(T|:34031)/PID:17 root INFO receiveMsg_SayHello: {"id": "87a1458d-60df-41a4-bc64-4667ccfaa7ea", "class": "{{eac377a7370eb0f8697a98e00f497a4d}}example.lib.Messages.implementations.SayHello.SayHello"}
2021-05-27 13:11:22,126 ActorAddr-(T|:34031)/PID:17 root INFO Hello from leader node
2021-05-27 13:11:28,152 ActorAddr-(T|:35759)/PID:16 root INFO receiveMsg_ActorSystemConventionUpdate: ActorSystemConventionUpdate(remoteAdminAddress=ActorAddr-(T|172.80.0.3:1901), remoteAdded=True, remoteCapabilities={'worker': True, 'node_id': '8c1eb8e2-8807-4719-a7dc-0b3604794a56', '8c1eb8e2-8807-4719-a7dc-0b3604794a56': True, 'Admin Port': 1901, 'DirectorFmt': [1], 'Convention Address.IPv4': '10.132.41.226:1900', 'Thespian ActorSystem Name': 'multiprocTCPBase', 'Thespian ActorSystem Version': 2, 'Thespian Watch Supported': True, 'Python Version': (3, 8, 10, 'final', 0), 'Thespian Generation': (3, 10), 'Thespian Version': '1622121088134'}
2021-05-27 13:11:28,157 ActorAddr-(T|:34031)/PID:17 root INFO receiveMsg_WorkerNodeAdded: {"id": "de4070db-0340-4140-ac56-f6c1c5376824", "class": "{{eac377a7370eb0f8697a98e00f497a4d}}example.lib.Messages.implementations.WorkerNodeAdded.WorkerNodeAdded", "node_id": "8c1eb8e2-8807-4719-a7dc-0b3604794a56"}
2021-05-27 13:12:22,179 ActorAddr-(T|:34031)/PID:17 root INFO receiveMsg_WakeupMessage: WakeupMessage(60, I am alive)
2021-05-27 13:13:22,239 ActorAddr-(T|:34031)/PID:17 root INFO receiveMsg_WakeupMessage: WakeupMessage(60, I am alive)
2021-05-27 13:14:22,299 ActorAddr-(T|:34031)/PID:17 root INFO receiveMsg_WakeupMessage: WakeupMessage(60, I am alive)
2021-05-27 13:15:22,359 ActorAddr-(T|:34031)/PID:17 root INFO receiveMsg_WakeupMessage: WakeupMessage(60, I am alive)
2021-05-27 13:16:22,419 ActorAddr-(T|:34031)/PID:17 root INFO receiveMsg_WakeupMessage: WakeupMessage(60, I am alive)
2021-05-27 13:17:22,479 ActorAddr-(T|:34031)/PID:17 root INFO receiveMsg_WakeupMessage: WakeupMessage(60, I am alive)
2021-05-27 13:18:22,539 ActorAddr-(T|:34031)/PID:17 root INFO receiveMsg_WakeupMessage: WakeupMessage(60, I am alive)
2021-05-27 13:19:22,599 ActorAddr-(T|:34031)/PID:17 root INFO receiveMsg_WakeupMessage: WakeupMessage(60, I am alive)
2021-05-27 13:20:22,659 ActorAddr-(T|:34031)/PID:17 root INFO receiveMsg_WakeupMessage: WakeupMessage(60, I am alive)
2021-05-27 13:21:22,719 ActorAddr-(T|:34031)/PID:17 root INFO receiveMsg_WakeupMessage: WakeupMessage(60, I am alive)
2021-05-27 13:21:35,178 ActorAddr-(T|:35759)/PID:16 root INFO receiveMsg_ActorSystemConventionUpdate: ActorSystemConventionUpdate(remoteAdminAddress=ActorAddr-(T|172.80.0.3:1901), remoteAdded=False, remoteCapabilities={'worker': True, 'node_id': '8c1eb8e2-8807-4719-a7dc-0b3604794a56', '8c1eb8e2-8807-4719-a7dc-0b3604794a56': True, 'Admin Port': 1901, 'DirectorFmt': [1], 'Convention Address.IPv4': '10.132.41.226:1900', 'Thespian ActorSystem Name': 'multiprocTCPBase', 'Thespian ActorSystem Version': 2, 'Thespian Watch Supported': True, 'Python Version': (3, 8, 10, 'final', 0), 'Thespian Generation': (3, 10), 'Thespian Version': '1622121088134'}
2021-05-27 13:21:35,184 ActorAddr-(T|:35759)/PID:16 root INFO receiveMsg_ActorSystemConventionUpdate: ActorSystemConventionUpdate(remoteAdminAddress=ActorAddr-(T|172.80.0.3:1901), remoteAdded=True, remoteCapabilities={'worker': True, 'node_id': '8c1eb8e2-8807-4719-a7dc-0b3604794a56', '8c1eb8e2-8807-4719-a7dc-0b3604794a56': True, 'Admin Port': 1901, 'DirectorFmt': [1], 'Convention Address.IPv4': '10.132.41.226:1900', 'Thespian ActorSystem Name': 'multiprocTCPBase', 'Thespian ActorSystem Version': 2, 'Thespian Watch Supported': True, 'Python Version': (3, 8, 10, 'final', 0), 'Thespian Generation': (3, 10), 'Thespian Version': '1622121088134'}
2021-05-27 13:21:35,196 ActorAddr-(T|:34031)/PID:17 root INFO receiveMsg_WorkerNodeRemoved: {"id": "457fd66f-6a36-4582-8ac6-2da110cf6128", "class": "{{eac377a7370eb0f8697a98e00f497a4d}}example.lib.Messages.implementations.WorkerNodeRemoved.WorkerNodeRemoved", "node_id": "8c1eb8e2-8807-4719-a7dc-0b3604794a56"}
2021-05-27 13:21:35,201 ActorAddr-(T|:34031)/PID:17 root INFO receiveMsg_WorkerNodeAdded: {"id": "bc3a32d8-e44f-4659-bcab-96b7e24d27fa", "class": "{{eac377a7370eb0f8697a98e00f497a4d}}example.lib.Messages.implementations.WorkerNodeAdded.WorkerNodeAdded", "node_id": "8c1eb8e2-8807-4719-a7dc-0b3604794a56"}

Of course I do not stop my worker node. It is still working, so why does it seem to leave the convention every 10 minutes and immediately rejoin it? For example:

2021-05-27 13:21:35,196 ActorAddr-(T|:34031)/PID:17 root INFO receiveMsg_WorkerNodeRemoved: {"id": "457fd66f-6a36-4582-8ac6-2da110cf6128", "class": "{{eac377a7370eb0f8697a98e00f497a4d}}example.lib.Messages.implementations.WorkerNodeRemoved.WorkerNodeRemoved", "node_id": "8c1eb8e2-8807-4719-a7dc-0b3604794a56"}
2021-05-27 13:21:35,201 ActorAddr-(T|:34031)/PID:17 root INFO receiveMsg_WorkerNodeAdded: {"id": "bc3a32d8-e44f-4659-bcab-96b7e24d27fa", "class": "{{eac377a7370eb0f8697a98e00f497a4d}}example.lib.Messages.implementations.WorkerNodeAdded.WorkerNodeAdded", "node_id": "8c1eb8e2-8807-4719-a7dc-0b3604794a56"}

@kquick could you run my example project and confirm that you can observe the same behaviour?

This behaviour exists in both versions 3.10.5 and 3.10.4.