rivosinc / prometheus-slurm-exporter

Export select slurm metrics to prometheus
Apache License 2.0

Docker container #30

Closed KasperSkytte closed 6 months ago

KasperSkytte commented 8 months ago

Hello

I'm happy to see further development of a Slurm Prometheus exporter, since we use Slurm v23.x, which is not supported by https://github.com/vpenso/prometheus-slurm-exporter. I know it's at an early stage, but could you perhaps provide a list of supported Slurm versions somewhere? And secondly, a working Docker container or just a Dockerfile? That would be the most convenient option in my opinion. I'm happy to contribute at some point, but I won't have time before next year.

abhinavDhulipala commented 8 months ago

Howdy! I love the idea of everyone building and running their own Docker container. It's on pause because a container is very platform dependent: it depends on the type of authentication users use with their Slurm cluster and on the platform they plan to run on. The more time I spent on it, the less I thought it was worth it, because I haven't gotten any complaints about deployment. I figured go install was a simple enough route. I reckon when we add slurmrestd support, making a Docker container will be trivial anyway. That's why I mused about it but never dedicated much time to it. If you have any ideas on how to containerize with munge auth and Slurm, I'd love to hear them. Maybe it's easier than I think. Of course, contributions are very welcome.

In terms of supported Slurm versions, we only know that it works for 21.XX and 23.XX, and we recently fixed a bug for 22 as well (#26). Apart from that we aren't sure, as the --json output is iterated on quite often by SchedMD. But our CLI fallback probably works for everything above 18.XX, I think. (That's when TRES last changed; I'm not super confident about that.) Once again, I was kind of just waiting for a complaint and a version number. I know that's not a satisfying answer, but I haven't had a chance to test it on a bunch of Slurm versions, or to go through the changelog and see when the sinfo/squeue output formats have changed.

Thanks again!

KasperSkytte commented 8 months ago

Hi again. Alright. You could maybe do a Slurm version check at startup in the code and use separate implementations depending on the result, but that's up to you. Regarding Docker, I'm willing to give bundling up a container a go at some point when I get to it. Right now I'm adapting the NVIDIA/deepops repo to set up Slurm, and it simply mounts the Slurm binaries and the munge key into the container directly from the host, see https://github.com/NVIDIA/deepops/blob/d248b658321eaf2d1adb8bc88fcb8408b48802e2/roles/prometheus-slurm-exporter/templates/docker.slurm-exporter.service.j2#L12. This is similar to the "original" Slurm exporter, https://github.com/dholt/prometheus-slurm-exporter. To me a container is also ideal because tearing it down or uninstalling it doesn't leave anything behind on the host. And of course it's nice to have locked versions of all dependencies. Anyway, I'll gladly do a PR at some point if you don't get to it before me.
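
To make the version-check idea concrete, here is a minimal Go sketch, not code from this repository. It assumes the version string comes from `sinfo --version`, which prints something like `slurm 23.02.6`. The names `SlurmVersion`, `parseSlurmVersion`, and `chooseBackend` are hypothetical, and the mapping of versions to backends only mirrors the tentative statements in this thread (JSON known to work on 21, 22, and 23; CLI fallback probably fine from 18 up).

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// SlurmVersion holds the major and minor parts of a release such as 23.02.
// Hypothetical type, used only for this illustration.
type SlurmVersion struct {
	Major, Minor int
}

// parseSlurmVersion parses text in the shape printed by `sinfo --version`,
// for example "slurm 23.02.6". Anything after the minor part is ignored.
func parseSlurmVersion(out string) (SlurmVersion, error) {
	fields := strings.Fields(out)
	if len(fields) < 2 || !strings.EqualFold(fields[0], "slurm") {
		return SlurmVersion{}, fmt.Errorf("unexpected version output: %q", out)
	}
	parts := strings.Split(fields[1], ".")
	if len(parts) < 2 {
		return SlurmVersion{}, fmt.Errorf("unexpected version number: %q", fields[1])
	}
	major, err := strconv.Atoi(parts[0])
	if err != nil {
		return SlurmVersion{}, fmt.Errorf("bad major version %q: %w", parts[0], err)
	}
	minor, err := strconv.Atoi(parts[1])
	if err != nil {
		return SlurmVersion{}, fmt.Errorf("bad minor version %q: %w", parts[1], err)
	}
	return SlurmVersion{Major: major, Minor: minor}, nil
}

// chooseBackend picks an implementation for a given version.
// ASSUMPTION: the cut-offs follow the unverified guesses in this thread.
func chooseBackend(v SlurmVersion) string {
	switch {
	case v.Major >= 21 && v.Major <= 23:
		return "json" // --json output reported to work on these releases
	case v.Major >= 18:
		return "cli" // plain sinfo/squeue parsing as a fallback
	default:
		return "unsupported"
	}
}

func main() {
	// In a real exporter this string would come from running `sinfo --version`.
	sample := "slurm 23.02.6\n"
	v, err := parseSlurmVersion(sample)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("slurm %d.%02d -> %s backend\n", v.Major, v.Minor, chooseBackend(v))
}
```

Keeping the parser free of any process execution makes it easy to test against version strings collected from different clusters, which is exactly the information the maintainers say they are missing.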

fluidnumerics-joe commented 8 months ago

I attempted a docker container for this repository and here are the main issues I came up against

This being said, deploying without a container was far easier.