Running rebuilt on runai

piconti commented 11 months ago

In the past, the rebuilder script was run on Kubernetes to significantly speed up processing. This was done on the EPFL Kubernetes ICcluster using the dask-k8 package, since Dask did not have a Kubernetes integration at the time.

Now, two main things have changed:

Dask now has dask-kubernetes, which allows managing a Dask client on a Kubernetes cluster.
EPFL's Kubernetes cluster has an added scheduling layer, called RunAI, through which we have to launch jobs on the Kubernetes cluster.

Hence, both the code and the configuration for running the rebuilt need to be adapted to these two changes. The solution should allow us to scale the processing to many workers and to use fairly large amounts of RAM.

piconti commented 11 months ago

Update:

Both dask-k8 and dask-kubernetes need to interact directly with Kubernetes, either to define a workload or to create custom resources. Unfortunately, this is not possible with the new configuration of the ICcluster and RunAI. As a result, it will not be possible to use either of these libraries for the rebuilt text processing.

However, after some experiments, it appears that with RunAI it is not necessary to use dask-kubernetes to scale the processing to our needs. Simply using a Dask LocalCluster on a pod from a RunAI job seems to allow sufficient parallelization and scaling.
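To make the LocalCluster idea concrete, here is a minimal sketch, not the actual impresso-commons code. The helper names (`plan_local_cluster`, `run_on_local_cluster`) and the `POD_MEMORY_GB` environment variable are hypothetical and only serve the illustration; the real scripts may size the cluster differently.

```python
import os


def plan_local_cluster(n_cores, memory_gb, threads_per_worker=1, reserve_gb=2.0):
    """Derive LocalCluster keyword arguments from the resources granted to the pod.

    Hypothetical helper: one worker per `threads_per_worker` cores, with the
    pod memory (minus a small reserve for the scheduler) split evenly.
    """
    n_workers = max(1, n_cores // threads_per_worker)
    per_worker_gb = max(0.5, (memory_gb - reserve_gb) / n_workers)
    return {
        "n_workers": n_workers,
        "threads_per_worker": threads_per_worker,
        # LocalCluster interprets memory_limit as a per-worker limit.
        "memory_limit": f"{per_worker_gb:.2f}GiB",
    }


def run_on_local_cluster(items):
    """Sum the squares of `items` on a LocalCluster started inside the pod."""
    # Imported here so that the sizing helper stays usable without dask installed.
    import dask.bag as db
    from dask.distributed import Client, LocalCluster

    kwargs = plan_local_cluster(
        n_cores=os.cpu_count() or 1,
        # POD_MEMORY_GB is an assumed variable name, not something RunAI sets.
        memory_gb=float(os.environ.get("POD_MEMORY_GB", "8")),
    )
    with LocalCluster(**kwargs) as cluster, Client(cluster):
        bag = db.from_sequence(items, npartitions=kwargs["n_workers"])
        # Placeholder workload standing in for the rebuilt processing.
        return bag.map(lambda x: x * x).sum().compute()
```

Because LocalCluster starts worker processes by default, `run_on_local_cluster` should be called from a script under an `if __name__ == "__main__":` guard. The same code runs unchanged on a laptop or on a large pod; only the number of cores and the memory available differ.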

The following modifications therefore need to be made to the current impresso-commons package:

[x] Remove the requirement of dask-k8 for the installation of the package
[x] Remove all mentions and uses of the library from the code
[x] Create scripts that can be run directly from the docker image, to streamline the process of launching a processing job
[x] Modify the Dockerfile (and image in the registry) to match these changes
[x] Include documentation and howtos for this new approach

Ideally, these changes will be made alongside the removal of the deprecated boto library from the codebase, but that will be the subject of another issue.

simon-clematide commented 11 months ago

Being able to work on a single machine (with enough CPU cores) is much preferable in my opinion. The overhead from distributed computing often seemed unnecessarily high. This looks like an important simplification to me.

e-maud commented 11 months ago

True. If I am not mistaken, the rebuilt code had, and will continue to have, two possibilities: running on a single machine, and running on Kubernetes with RunAI. The latter is more a way to define and package task executions, which does not affect the capacity for single-core execution. tbc by Pauline, though.

piconti commented 11 months ago

What I was referring to in my comment is a single node with many GPU cores. It will still be possible to run locally at small scale, as the code is the same. RunAI also offers the possibility of multi-node workloads, which I will look into more next week, but this does not seem to be strictly necessary in this context.

e-maud commented 11 months ago

Just a memo on the EPFL side: after experiments with MPI, let's not forget to give an update to Carlos (ticket).

impresso / impresso-pycommons

Running rebuilt on runai #77