impresso / impresso-pycommons

Python module with bits of code (objects, functions) highly reusable within impresso.
http://impresso-pycommons.rtfd.io/
GNU Affero General Public License v3.0

Running rebuilt on runai #77

Closed piconti closed 10 months ago

piconti commented 11 months ago

In the past, the rebuilder script was run on Kubernetes to significantly speed up the processing. This was done on the EPFL Kubernetes ICcluster, using the dask-k8 package, since Dask did not have a Kubernetes integration at the time.

Now, two main things have changed:

Hence, both the code and the configuration for running the rebuilt need to be adapted to these two changes. The solution should allow us to scale the processing to many workers and to use quite large amounts of RAM.

piconti commented 11 months ago

Update:

Both dask-k8 and dask-kubernetes need to interact directly with Kubernetes, either to define a workload or to define custom resources. Unfortunately, this is not possible with the new parametrisation of the ICcluster and RunAI. As a result, it will not be possible to use either of these libraries for the rebuilt text processing.

However, after some experiments it appears that when using RunAi it is not necessary to use dask-kubernetes to scale the processing to our needs. Indeed, simply using a dask LocalCluster on a pod from a RunAi job seems to allow sufficient parallelization and scaling to match our needs.
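To make the LocalCluster idea concrete, here is a minimal sketch of how a single pod could size and start a Dask LocalCluster. It is not the code used in impresso-pycommons: the helper names `plan_local_cluster` and `run_on_local_cluster`, the `reserve_gb` headroom, and the one-thread-per-worker default are all assumptions made for illustration. Only `LocalCluster`, `Client`, `client.map` and `client.gather` are actual `dask.distributed` calls.

```python
def plan_local_cluster(n_cores, total_memory_gb, threads_per_worker=1, reserve_gb=2.0):
    """Derive LocalCluster keyword arguments from the pod's resources.

    Hypothetical helper: splits the memory left after a small reserve
    evenly across worker processes, since Dask's memory_limit is per worker.
    """
    if n_cores < 1 or threads_per_worker < 1:
        raise ValueError("n_cores and threads_per_worker must be >= 1")
    n_workers = max(1, n_cores // threads_per_worker)
    usable_gb = max(total_memory_gb - reserve_gb, 0.5)
    per_worker_gb = usable_gb / n_workers
    return {
        "n_workers": n_workers,
        "threads_per_worker": threads_per_worker,
        "memory_limit": f"{per_worker_gb:.2f}GB",
    }


def run_on_local_cluster(func, items, **cluster_kwargs):
    """Apply func to every item on a LocalCluster and return the results.

    dask.distributed is imported lazily so that the planning helper above
    can be used (and tested) without Dask installed.
    """
    from dask.distributed import Client, LocalCluster

    # Both context managers shut down cleanly when the block exits.
    with LocalCluster(**cluster_kwargs) as cluster, Client(cluster) as client:
        futures = client.map(func, items)
        return client.gather(futures)
```

On a pod with, say, 16 cores and 66 GB of RAM, `plan_local_cluster(16, 66, threads_per_worker=2)` would give 8 workers with 8 GB each, and the resulting dictionary can be passed straight to `run_on_local_cluster(some_function, some_items, **kwargs)`. The same code runs unchanged on a laptop with smaller numbers, which matches the point made in this thread that local small-scale runs remain possible.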

As a result, the following modifications need to be made to the current impresso-commons package:

Ideally, these changes will be made concurrently with changes removing the deprecated boto library from the codebase, but that will be the subject of another issue.

simon-clematide commented 11 months ago

Being able to work on a single machine (with enough CPU cores) is very preferable in my opinion. The overhead from distributed computing often seemed unnecessarily high. This looks like an important simplification to me.

e-maud commented 11 months ago

True. If I am not mistaken, the rebuilt code had, and will continue to have, the two possibilities: running on a single machine, and running on kubernetes-run-ai. The latter is more a way to define and package task executions, which does not affect the capacity for single-core execution. tbc by Pauline, though.

piconti commented 11 months ago

What I was mentioning in my comment is running on a single node with many GPU cores. There will still be the possibility to run locally at small scale, as the code is the same. RunAi also offers the possibility of multi-node workloads, which I will look into further next week, but this does not seem to be strictly necessary in this context.

e-maud commented 11 months ago

Just a memo on epfl side: after experiments with MPI, let's not forget to give an update to Carlos (ticket).