rivosinc / prometheus-slurm-exporter

Export select slurm metrics to prometheus
Apache License 2.0

Support for multiple clusters #20

Closed isaac-aa closed 8 months ago

isaac-aa commented 9 months ago

Thanks for continuing the development of prometheus-slurm-exporter.

We have a set of independent clusters, which can be controlled from a single machine using the --clusters option (e.g., in squeue or sinfo). This allows us to centralize the management whilst maintaining completely independent installations. We would like to use prometheus-slurm-exporter from this single machine to get information from all the clusters.

Our current homemade solution has been to add an extra option (-slurm.cluster <clustername>) to specify which cluster to export, and then run multiple instances of prometheus-slurm-exporter. As you can guess, this hardly scales beyond a few clusters and requires a lot of manual configuration (setting up ports, etc.).
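For illustration, a minimal Go sketch of this per-instance approach: a -slurm.cluster flag whose value is forwarded to squeue through --clusters. The function names (parseCluster, squeueArgs) and the format string are hypothetical and are not taken from the exporter's code or from our fork.

```go
package main

import (
	"flag"
	"fmt"
)

// parseCluster reads the hypothetical -slurm.cluster flag from args.
// An empty result means "do not pass --clusters at all".
func parseCluster(args []string) (string, error) {
	fs := flag.NewFlagSet("exporter", flag.ContinueOnError)
	cluster := fs.String("slurm.cluster", "", "cluster to export (forwarded to --clusters)")
	if err := fs.Parse(args); err != nil {
		return "", err
	}
	return *cluster, nil
}

// squeueArgs builds the squeue argument list, adding --clusters only
// when a cluster name was given.
func squeueArgs(cluster string) []string {
	args := []string{"--noheader", "-o", "%i|%T"} // illustrative format only
	if cluster != "" {
		args = append(args, "--clusters="+cluster)
	}
	return args
}

func main() {
	cluster, err := parseCluster([]string{"-slurm.cluster", "alpha"})
	if err != nil {
		panic(err)
	}
	fmt.Println(squeueArgs(cluster))
}
```

Each exporter instance would then be started with its own cluster name and its own listen port, which is exactly the manual configuration that becomes tedious as the number of clusters grows.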

Thus, the ideal solution for us would be for prometheus-slurm-exporter itself to export the cluster information. We have explored this a little, and it seems trivial for --json, as the cluster field is already there: the metrics would only need to be expanded with a "cluster" field. For the fallback CLI, this may not be as easy, as there is no cluster output when formatting with -o.
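As a rough sketch of the JSON idea, the Go snippet below counts jobs per cluster and renders one sample per cluster with a "cluster" label in the Prometheus text format. The JSON shape is a heavily simplified assumption (the real squeue --json output is much larger and depends on the Slurm version), and the metric name example_slurm_jobs and the helper names are hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
)

// job mirrors only the field this sketch needs (assumed schema).
type job struct {
	Cluster string `json:"cluster"`
}

type squeueResponse struct {
	Jobs []job `json:"jobs"`
}

// jobsPerCluster counts jobs by their "cluster" field.
func jobsPerCluster(raw []byte) (map[string]int, error) {
	var resp squeueResponse
	if err := json.Unmarshal(raw, &resp); err != nil {
		return nil, err
	}
	counts := make(map[string]int)
	for _, j := range resp.Jobs {
		counts[j.Cluster]++
	}
	return counts, nil
}

// render emits one line per cluster, sorted by cluster name so the
// output is deterministic. %q is adequate for simple cluster names.
func render(counts map[string]int) []string {
	names := make([]string, 0, len(counts))
	for name := range counts {
		names = append(names, name)
	}
	sort.Strings(names)
	lines := make([]string, 0, len(names))
	for _, name := range names {
		lines = append(lines, fmt.Sprintf("example_slurm_jobs{cluster=%q} %d", name, counts[name]))
	}
	return lines
}

func main() {
	raw := []byte(`{"jobs":[{"cluster":"alpha"},{"cluster":"beta"},{"cluster":"alpha"}]}`)
	counts, err := jobsPerCluster(raw)
	if err != nil {
		panic(err)
	}
	for _, line := range render(counts) {
		fmt.Println(line)
	}
}
```

With a label like this, a single exporter could serve all clusters on one port, and queries could filter or aggregate by cluster.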

I wouldn't mind dedicating some time to this, but as I'm not very familiar with Go and this code, I think that someone with a deeper understanding of Go could implement this easily.

Thanks

isaac-aa commented 9 months ago

Just an update: in the end, we have chosen to simply deploy a separate exporter in each cluster and then gather the data in a single Prometheus. It still requires deploying a few instances, but for now, it is manageable.

It would be okay if this issue is closed or marked low-priority.

abhinavDhulipala commented 9 months ago

Oh snap, sorry I missed this. For some reason, I didn't get a notification. We can add static labels for multi-cluster support. If you'd like to give it a go, be my guest. Currently, we believe launching an exporter per cluster is ideal, as you can then configure each exporter independently. For example, if you need different poll limits or custom CLI opts for each cluster, it'd be better to keep them independent. How do you guys deploy the exporter currently? With -json or the cli-fallback?

isaac-aa commented 8 months ago

No problem :)

Yes, I think that in the end it is more customizable to have an exporter per cluster. The installation of the exporter is now part of our procedures when deploying new clusters, so it is not very time-consuming to set up (and so far quite stable). We have been using the cli-fallback so far as we do not compile slurm with json.

abhinavDhulipala commented 8 months ago

Sounds good! I think the preferable approach here would be to use slurmrestd and poll a list of URLs. Do you have slurmrestd set up? I have no idea how to switch targeted clusters on the CLI. If that's super easy, it'd be great if you could share the command line (or a reference) for doing that. If it's a CLI opt, I think it'd be as simple as modifying the configs struct for each cluster and then registering each config. I could give it a go late next week if you think that approach makes sense.
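Roughly, the per-cluster config idea could look like the Go sketch below. It assumes the switch is the --clusters option mentioned in the first comment. ClusterConfig, Registry, and their fields are hypothetical names for illustration, not the exporter's actual types, and a plain map stands in for whatever registration the real code would do.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// ClusterConfig is a hypothetical per-cluster configuration, so each
// cluster can keep its own poll limit and extra CLI options.
type ClusterConfig struct {
	Name      string
	PollLimit time.Duration
	ExtraOpts []string
}

// Registry is a stand-in for registering one collector per cluster.
type Registry struct {
	configs map[string]ClusterConfig
}

func NewRegistry() *Registry {
	return &Registry{configs: make(map[string]ClusterConfig)}
}

// Register stores a config and rejects empty or duplicate cluster names.
func (r *Registry) Register(c ClusterConfig) error {
	if c.Name == "" {
		return errors.New("cluster name must not be empty")
	}
	if _, dup := r.configs[c.Name]; dup {
		return fmt.Errorf("cluster %q already registered", c.Name)
	}
	r.configs[c.Name] = c
	return nil
}

// Args returns the CLI arguments for one registered cluster.
func (r *Registry) Args(name string) ([]string, bool) {
	c, ok := r.configs[name]
	if !ok {
		return nil, false
	}
	return append([]string{"--clusters=" + c.Name}, c.ExtraOpts...), true
}

func main() {
	reg := NewRegistry()
	for _, c := range []ClusterConfig{
		{Name: "alpha", PollLimit: 10 * time.Second},
		{Name: "beta", PollLimit: 30 * time.Second, ExtraOpts: []string{"--noheader"}},
	} {
		if err := reg.Register(c); err != nil {
			panic(err)
		}
	}
	args, _ := reg.Args("beta")
	fmt.Println(args)
}
```

This keeps the per-cluster independence (poll limits, custom opts) while still running a single process.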

isaac-aa commented 8 months ago

We do not have slurmrestd set up yet. Switching clusters is quite easy from the CLI with the --clusters option, which defaults to all. I have forked the repo and implemented the selection of the cluster to monitor here. This would still require running multiple exporters on a single system.

As I said before, we moved from this idea to just one exporter in each cluster independently, but I am sharing it in case it is useful for someone else.

abhinavDhulipala commented 8 months ago

Closing for now...