populse / capsul

Collaborative Analysis Platform : Simple, Unifying, Lean

Should we start a new Python for each process? #250

Closed sapetnioc closed 1 year ago

sapetnioc commented 1 year ago

A short test with the new v3 API, using the builtin serial engine on a very small pipeline (TinyMorphologist) iterated 180 times, showed that the overhead of the Capsul infrastructure is greatly reduced (7.5 times) if the same Python is reused instead of launching a new one for each Process. The overhead per process drops from 0.93 seconds to 0.223 seconds.

Almost 1 s per process (i.e. per pipeline node) starts to be a significant overhead if processes are short. Should we keep the rule of starting a new Python for each process? If not, we could by default reuse the client Python (in the builtin serial engine) or the Python of the workers (Celery, Soma-Workflow, etc.). We would then have to define a way to ask for a new Python for each process. This definition could live either at process/pipeline creation level or in the engine configuration.
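As an illustration of those two levels, here is a minimal sketch with hypothetical names (fresh_python as a process attribute, fresh_python_per_process as an engine configuration key); neither exists in Capsul today, only the idea of a per-process request overriding an engine-wide default is taken from this issue:

    # Option 1: declared by the process developer, at process creation level.
    class SomeHeavyProcess:              # stand-in for a Capsul Process subclass
        # Hypothetical attribute: this process asks for its own interpreter,
        # whatever the engine default is.
        fresh_python = True

    # Option 2: declared by the user, in the engine configuration.
    engine_config = {
        # Hypothetical key: default policy applied to every job of an execution.
        'fresh_python_per_process': False,
    }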

denisri commented 1 year ago

Well, this will not be easy to avoid in all cases:

We could think of doing it in very precise situations, but this would make special cases...

On the other hand, starting Python is normally fast (at least when it has already been run on a machine/node and is still in cache: for instance time python -c '' is between 0.03 and 0.05 s on my laptop). What takes time is importing and initializing multiple complex modules, which we probably do in Capsul. Maybe one possible track is to avoid importing unnecessary modules when they are not required (I mean, for instance, not importing the config machinery and modules when a job doesn't need any config, or the completion system, etc.).

Moving "too much" of the machinery to jobs is maybe a cause of the overheads: for instance, having jobs open the database and register their outputs themselves, then start downstream jobs themselves, requires loading database modules and performing database I/O in every job... What was the necessity for this, again? Is it required for Celery? It's not even sure all this will be possible/allowed in all cluster infrastructures... What's more important: using Celery, or being able to run jobs on a cluster? We never took time to seriously discuss all these points...
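One way to check how much of the per-job time is module loading rather than interpreter startup is CPython's built-in import profiler (available since Python 3.7); for instance, assuming "import capsul" stands for whatever the job command really imports:

    python -X importtime -c ''
    python -X importtime -c 'import capsul'

Both commands print, on stderr, the self and cumulative import time of every module loaded, so comparing the two runs shows how much of the ~0.9 s per job is bare interpreter startup versus Capsul module imports.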

sapetnioc commented 1 year ago

There is no difference in Capsul machinery between my two tests. The exact same function (with the same database connection, etc.) is called either directly or via python -m. The only difference lies in starting Python, loading modules and getting three string parameters (to date: two from env vars and one from sys.argv). The only thing I can say is that on this serial example there is a mean time per process of 0.22 s with a single Python. I cannot tell the proportion of real processing time vs. Capsul overhead yet (each process reads its input and writes to its output).

It will always be possible to start a new command for each process. The question of this issue is irrelevant for cases relying on a job submission system. But for local usage (either serial or parallel), we could choose to reuse an already running Python unless the process developer (via a process attribute) or the user (via engine configuration) asks otherwise.

The use of a database was chosen for communication between jobs and with the client. It was not possible to just adapt Soma-Workflow (which already has a database and a communication protocol) in order to port it to existing workflow systems. For this goal, it was necessary to completely rethink inter-process communication. This is what I have done in Capsul. But it has nothing to do with Celery. Celery is just a POC implementation of a local parallel engine. The Celery engine requires a Redis database because I chose to use it for both Celery and Capsul communication/storage, but we could separate them and allow the Celery engine to use any database for the Capsul part. For instance, the builtin serial engine can accept any implemented database system (to date SQLite or Redis).

sapetnioc commented 1 year ago

Now the running of a process goes through the run_job function, which takes a same_python boolean parameter that decides whether to reuse the worker's Python or start a new one. To date, only the default value is used, but I plan to connect this value to the engine configuration. The only remaining question would be: "What should be the default?"
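For reference, a minimal sketch of what such a run_job could look like; only the function name and the same_python parameter come from the comment above, while the argument list, the capsul.run module name and the environment variable names are assumptions used for illustration:

    import os
    import subprocess
    import sys

    def run_job(database_url, execution_id, job_id, same_python=False):
        if same_python:
            # Reuse the interpreter of the worker (or of the client, in the
            # builtin serial engine): call the job function directly.
            execute_job(database_url, execution_id, job_id)
        else:
            # Start a fresh interpreter for this job: two parameters passed
            # through environment variables and one through sys.argv, as
            # described earlier in the thread.
            env = dict(os.environ,
                       CAPSUL_DATABASE=database_url,       # assumed variable name
                       CAPSUL_EXECUTION_ID=execution_id)   # assumed variable name
            subprocess.run([sys.executable, '-m', 'capsul.run', job_id],  # assumed module name
                           env=env, check=True)

    def execute_job(database_url, execution_id, job_id):
        # Placeholder for the real job execution: connect to the database,
        # read the job definition, run the process, register its outputs.
        ...

With that shape, the default could be chosen per engine (for instance same_python=True for the builtin local engines) without changing the job-side code.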