powsybl / powsybl-hpc

High Performance Computing modules for powsybl
https://www.powsybl.org
Mozilla Public License 2.0

[Slurm] Alternatives for job completion monitoring #25

Open sylvlecl opened 4 years ago

sylvlecl commented 4 years ago

Feature

To monitor the completion of jobs submitted to Slurm, we use flag files and filesystem polling. Depending on the polling frequency, this introduces some latency (the delay between the end of a task and the moment the computation manager identifies it as completed) and some load on the underlying filesystem, particularly when multiple processes using a computation manager are running.

It should be possible to configure how completion monitoring is performed. Polling would be one implementation of this functionality.
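As a rough illustration of what "configurable" could mean here, the sketch below puts completion monitoring behind a small interface, with polling as one implementation. The names `CompletionMonitor` and `PollingCompletionMonitor`, and the convention that a job is done when a flag file named after its id exists, are assumptions made for the example, not the current powsybl-hpc API.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/** Hypothetical abstraction: how the computation manager learns that a job has finished. */
interface CompletionMonitor extends AutoCloseable {
    /** Returns a future that completes once the job identified by jobId is done. */
    CompletableFuture<Void> awaitCompletion(String jobId);

    @Override
    void close();
}

/** Polling implementation: a job is considered done when a flag file named after it exists. */
final class PollingCompletionMonitor implements CompletionMonitor {
    private final Path flagDir;
    private final Map<String, CompletableFuture<Void>> pending = new ConcurrentHashMap<>();
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor(r -> {
                Thread t = new Thread(r, "completion-poller");
                t.setDaemon(true);
                return t;
            });

    PollingCompletionMonitor(Path flagDir, long periodMillis) {
        this.flagDir = flagDir;
        // The period is the trade-off discussed above: shorter means less delay, more filesystem load.
        scheduler.scheduleWithFixedDelay(this::poll, 0, periodMillis, TimeUnit.MILLISECONDS);
    }

    @Override
    public CompletableFuture<Void> awaitCompletion(String jobId) {
        return pending.computeIfAbsent(jobId, id -> new CompletableFuture<>());
    }

    /** One polling pass: complete and forget every job whose flag file is present. */
    private void poll() {
        pending.forEach((jobId, future) -> {
            if (Files.exists(flagDir.resolve(jobId))) {
                future.complete(null);
                pending.remove(jobId, future);
            }
        });
    }

    @Override
    public void close() {
        scheduler.shutdownNow();
    }
}
```

With this shape, a socket-based or broker-based monitor would only need to implement `awaitCompletion`, and the choice between implementations could come from configuration.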

Other interesting implementations would be:

  1. A very simple in-house networking protocol, for example implemented with Netty.
  2. Using a message broker (Kafka, RabbitMQ, ...): this should probably be left for implementation by client projects.
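To make option 1 more concrete, here is a minimal sketch of such a protocol. It uses plain JDK sockets instead of Netty to stay self-contained. The `CompletionListener` class, the one-line `DONE <jobId>` message and the `notifyDone` helper are all hypothetical, chosen only for illustration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.net.ServerSocket;
import java.net.Socket;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical listener that would run inside the computation manager process. */
final class CompletionListener implements AutoCloseable {
    private final ServerSocket server;
    private final Map<String, CompletableFuture<Void>> jobs = new ConcurrentHashMap<>();

    CompletionListener(int port) throws IOException {
        server = new ServerSocket(port); // port 0 lets the OS pick a free port
        Thread acceptor = new Thread(this::acceptLoop, "completion-listener");
        acceptor.setDaemon(true);
        acceptor.start();
    }

    int port() {
        return server.getLocalPort();
    }

    /** Future completed when "DONE <jobId>" is received, even if the message arrived first. */
    CompletableFuture<Void> awaitCompletion(String jobId) {
        return jobs.computeIfAbsent(jobId, id -> new CompletableFuture<>());
    }

    /** Handles connections one at a time; each connection carries a single line. */
    private void acceptLoop() {
        while (!server.isClosed()) {
            try (Socket socket = server.accept();
                 BufferedReader in = new BufferedReader(
                         new InputStreamReader(socket.getInputStream(), StandardCharsets.UTF_8))) {
                String line = in.readLine();
                if (line != null && line.startsWith("DONE ")) {
                    awaitCompletion(line.substring(5).trim()).complete(null);
                }
            } catch (IOException e) {
                // accept() fails once the server socket is closed; broken clients are ignored
            }
        }
    }

    @Override
    public void close() throws IOException {
        server.close();
    }

    /** What a job wrapper could call on the compute node once the task has ended. */
    static void notifyDone(String host, int port, String jobId) throws IOException {
        try (Socket socket = new Socket(host, port);
             PrintWriter out = new PrintWriter(
                     new OutputStreamWriter(socket.getOutputStream(), StandardCharsets.UTF_8), true)) {
            out.println("DONE " + jobId);
        }
    }
}
```

The sketch leaves out everything a real deployment would need to decide: how compute nodes learn the manager's host and port, what happens if the manager is unreachable when a job ends, and authentication. A fallback to the flag file when the notification cannot be delivered would probably still be needed.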

The goal is to improve perceived performance while relieving the filesystem.

yichen88 commented 4 years ago

If Slurm is in local mode, we can simply register a `WatchService` on `flagDir`.
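A minimal sketch of that idea, using the standard `java.nio.file.WatchService` API. The `FlagDirWatcher` class and the assumption that completion is signalled by a file named `flagName` appearing in `flagDir` are hypothetical. It only makes sense when the directory is on a filesystem for which the JVM receives native change notifications.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardWatchEventKinds;
import java.nio.file.WatchKey;
import java.nio.file.WatchService;
import java.util.concurrent.TimeUnit;

final class FlagDirWatcher {
    /**
     * Blocks until a file named flagName exists in flagDir or the timeout elapses.
     * Returns true if the flag was seen, false on timeout or if the directory became unwatchable.
     */
    static boolean awaitFlag(Path flagDir, String flagName, long timeout, TimeUnit unit)
            throws IOException, InterruptedException {
        long deadline = System.nanoTime() + unit.toNanos(timeout);
        Path flag = flagDir.resolve(flagName);
        try (WatchService watcher = flagDir.getFileSystem().newWatchService()) {
            flagDir.register(watcher, StandardWatchEventKinds.ENTRY_CREATE);
            // The flag may have been written before registration, so check once up front.
            if (Files.exists(flag)) {
                return true;
            }
            while (true) {
                long remaining = deadline - System.nanoTime();
                if (remaining <= 0) {
                    return false;
                }
                WatchKey key = watcher.poll(remaining, TimeUnit.NANOSECONDS);
                if (key == null) {
                    return false; // timed out without any event
                }
                key.pollEvents(); // drain the events; re-checking the file also covers OVERFLOW
                if (Files.exists(flag)) {
                    return true;
                }
                if (!key.reset()) {
                    return false; // directory is no longer accessible
                }
            }
        }
    }
}
```

A call such as `FlagDirWatcher.awaitFlag(flagDir, "job-7", 10, TimeUnit.MINUTES)` would then replace the periodic directory scan for that job.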

sylvlecl commented 4 years ago

Yes, but the problem is that even in "local" mode, there is a good chance that the flag dir is actually on a shared filesystem, for instance an NFS mount, so that Slurm nodes can access it. In that case, the watch service will probably not work (or will be implemented with polling).