troycomi / reportseff

Tabular seff
MIT License

Feature: Handle multi-cluster environments #44

Open angel-devicente opened 11 months ago

angel-devicente commented 11 months ago

I'm using reportseff in a multi-cluster environment, from a machine that runs slurmdbd while the slurmctld daemons run on other machines. With sacct this is not an issue, since I can run sacct -M <server> or even sacct -M all, which is very handy for central administration of the multi-cluster environment.

With reportseff I can do something similar, reportseff --extra-args "-M <server>", but as it stands this does not work, because db_inquirer.py issues the command command_args = "scontrol show partition".split(), which should get the -M <server> specification as well.

Passing the whole extra-args to the scontrol command is probably not a good option, since sacct and scontrol don't share the same arguments. Perhaps a new "-M" parameter could be added, so that, if provided, it is simply added to both the sacct and scontrol commands? I would add it myself, but I'm not sure what reportseff does with the partition information, or how you would handle the case where "-M" refers to more than one server.
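To make the proposal concrete, here is a minimal sketch of how a single cluster option could be forwarded to both commands. The helper name build_commands and the particular sacct flags are assumptions for illustration only, not reportseff's actual code; the only things taken from the thread are the "scontrol show partition" call and the "-M" option.

```python
from typing import List, Optional, Tuple


def build_commands(
    jobs: List[str], cluster: Optional[str] = None
) -> Tuple[List[str], List[str]]:
    """Return (sacct_args, scontrol_args); hypothetical helper, not reportseff code."""
    # Illustrative sacct call: parsable output, no header, selected jobs.
    sacct_args = ["sacct", "-P", "-n", "--jobs=" + ",".join(jobs)]
    scontrol_args = ["scontrol", "show", "partition"]
    if cluster:
        cluster_opt = ["-M", cluster]
        sacct_args += cluster_opt
        # scontrol expects its options before the subcommand.
        scontrol_args[1:1] = cluster_opt
    return sacct_args, scontrol_args
```

With cluster=None both commands are unchanged, so single-cluster users would see no difference in behavior.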

troycomi commented 11 months ago

Can you specify what is "not working"? Currently, scontrol is only used to get the time limits for each partition. If you don't use partition time limits it should be a no-op, and with multiple clusters it could cause some inaccuracies if the same partition is defined multiple times with different time limits.

Here are my first thoughts:

The main issue I foresee is that a call with -M all may clobber job ids. reportseff uses the jobid as a unique identifier, and if several clusters have the same job id, the sacct parsing will get mangled.

Feel free to add the server option. Handling duplicate job ids is more challenging as it would require a significant rewrite of jobs and job collections (which is needed to handle retries anyways).
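As a rough illustration of the duplicate job id problem, the sketch below keys records on the pair (cluster, job id) instead of the job id alone. It assumes sacct is asked for the Cluster field alongside JobID in pipe-separated output; the function name, the field list, and the sample lines are made up for the example and do not reflect how reportseff stores jobs today.

```python
from typing import Dict, List, Tuple

# Assumed field order, e.g. from `sacct -P -n --format=Cluster,JobID,State`.
FIELDS = ["Cluster", "JobID", "State"]


def index_by_cluster_and_job(
    lines: List[str],
) -> Dict[Tuple[str, str], Dict[str, str]]:
    """Index pipe-separated sacct lines by (cluster, job id); hypothetical helper."""
    jobs: Dict[Tuple[str, str], Dict[str, str]] = {}
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        record = dict(zip(FIELDS, line.strip().split("|")))
        jobs[(record["Cluster"], record["JobID"])] = record
    return jobs


# Made-up sample: the same job id exists on two clusters.
sample = ["alpha|1001|COMPLETED", "beta|1001|FAILED"]
indexed = index_by_cluster_and_job(sample)
```

Keyed on job id alone, the second line would overwrite the first; with the compound key both records survive.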

angel-devicente commented 11 months ago

By "not working" I meant that, since I'm running reportseff on a machine where no slurmctld is running, the scontrol command just hangs.
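Independently of the "-M" question, a hang like this could be bounded with a timeout around the external call. The following is only a sketch under the assumption that the command is launched through subprocess; run_with_timeout is a hypothetical helper, not something reportseff provides.

```python
import subprocess
from typing import List, Optional


def run_with_timeout(args: List[str], timeout: float = 10.0) -> Optional[str]:
    """Run a command and return its stdout, or None if it does not finish in time."""
    try:
        result = subprocess.run(
            args,
            capture_output=True,
            text=True,
            timeout=timeout,  # seconds; the child is killed when this expires
        )
    except subprocess.TimeoutExpired:
        return None
    return result.stdout
```

A caller could then treat None from run_with_timeout(["scontrol", "show", "partition"]) as "no partition limits available" instead of blocking indefinitely.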

Option two is what I had in mind: being able to add something like "-M <server>", which would then be passed on to both the sacct and scontrol commands inside reportseff.

When I have some time, I will add this and submit a PR.