vpenso / prometheus-slurm-exporter

Prometheus exporter for performance metrics from Slurm.
GNU General Public License v3.0

Nested accounts missing from fairshare #93

Open Xaraxia opened 1 year ago

Xaraxia commented 1 year ago

Hi,

We have a nested account hierarchy, and the nested accounts are missing from the exported fairshare metrics.

I dug into the code, and the command the exporter runs is:

$ sshare -n -P -o account,fairshare
root|0.500000
 top_1|0.999998
  nested_1_1|0.999998
  nested_1_2|1.000000
   nested_1_2_1|1.000000
 top_2|0.481723
  nested_2_1|0.858038
   nested_2_2|0.961831

However, when I scrape the metrics, I only get root, top_1 and top_2.

'root' isn't useful. Top-level accounts are useful as an aggregate, but I'd also like to see the nested accounts.

Ideally, we would have "slurm_account_fairshare" as it is, and also offer "slurm_subaccount_fairshare" so that I could graph both.

Looks like ParseFairShareMetrics() is the culprit: it skips any line that starts with two or more spaces.

                if ! strings.HasPrefix(line,"  ") {

I can see the argument for doing it, hence my proposal to gather two sets of metrics.
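A minimal sketch of the two-metric idea, assuming the `sshare -n -P -o account,fairshare` output shown above, where each nesting level adds one leading space. The names `accountShare` and `parseShares` are hypothetical and are not taken from the exporter's code.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// accountShare is a hypothetical record type, not part of the exporter.
type accountShare struct {
	Name      string
	Depth     int // number of leading spaces in the sshare account column
	Fairshare float64
}

// parseShares keeps every account line and records its nesting depth,
// instead of discarding lines that start with two or more spaces.
func parseShares(output string) []accountShare {
	var shares []accountShare
	for _, line := range strings.Split(output, "\n") {
		fields := strings.Split(line, "|")
		if len(fields) < 2 {
			continue // blank or malformed line
		}
		name := strings.TrimLeft(fields[0], " ")
		if name == "" {
			continue
		}
		value, err := strconv.ParseFloat(strings.TrimSpace(fields[1]), 64)
		if err != nil {
			continue // non-numeric fairshare column
		}
		shares = append(shares, accountShare{
			Name:      name,
			Depth:     len(fields[0]) - len(name),
			Fairshare: value,
		})
	}
	return shares
}

func main() {
	out := "root|0.500000\n top_1|0.999998\n  nested_1_1|0.999998\n"
	for _, s := range parseShares(out) {
		// Depth 0 (root) and depth 1 (top-level) keep the existing metric name;
		// deeper accounts go to the proposed second metric.
		metric := "slurm_account_fairshare"
		if s.Depth > 1 {
			metric = "slurm_subaccount_fairshare"
		}
		fmt.Printf("%s{account=%q} %g\n", metric, s.Name, s.Fairshare)
	}
}
```

Splitting on depth keeps the existing `slurm_account_fairshare` series unchanged for current dashboards, while the nested accounts appear under a separate name.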

Xaraxia commented 1 year ago

This is what is actually coming out of the exporter:

slurm_account_fairshare{account="top_1"} 0.999998
slurm_account_fairshare{account="root"} 1
slurm_account_fairshare{account="top_2"} 0.481723

So perhaps the right answer is to export something like:

slurm_account_fairshare{account="root"} 1
slurm_account_fairshare{account="top_1", parent_account="root", account_depth="1"} 0.999998
slurm_account_fairshare{account="nested_1_2", parent_account="top_1", account_depth="2"} 1.000000
slurm_account_fairshare{account="nested_1_2_1", parent_account="nested_1_2", account_depth="3"} 1.000000
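One way the `parent_account` and `account_depth` labels could be derived is by tracking the most recent account seen at each indentation level. This is only a sketch under the assumption that indentation in the sshare output reflects the hierarchy, one space per level; `labeledShare` and `labelShares` are hypothetical names.

```go
package main

import (
	"fmt"
	"strings"
)

// labeledShare is a hypothetical record carrying the proposed labels.
type labeledShare struct {
	Account string
	Parent  string // empty for the root account
	Depth   int
	Value   string // fairshare column, kept as text
}

// labelShares derives each account's parent and depth from its indentation.
func labelShares(output string) []labeledShare {
	var result []labeledShare
	var ancestors []string // ancestors[d] is the latest account seen at depth d
	for _, line := range strings.Split(output, "\n") {
		parts := strings.SplitN(line, "|", 2)
		if len(parts) != 2 {
			continue // blank or malformed line
		}
		name := strings.TrimLeft(parts[0], " ")
		if name == "" {
			continue
		}
		depth := len(parts[0]) - len(name)
		if depth > len(ancestors) {
			depth = len(ancestors) // tolerate a skipped indentation level
		}
		parent := ""
		if depth > 0 {
			parent = ancestors[depth-1]
		}
		// Drop deeper ancestors from the previous branch, then record this account.
		ancestors = append(ancestors[:depth], name)
		result = append(result, labeledShare{
			Account: name,
			Parent:  parent,
			Depth:   depth,
			Value:   strings.TrimSpace(parts[1]),
		})
	}
	return result
}

func main() {
	out := "root|0.500000\n top_1|0.999998\n  nested_1_2|1.000000\n   nested_1_2_1|1.000000\n"
	for _, s := range labelShares(out) {
		fmt.Printf("slurm_account_fairshare{account=%q, parent_account=%q, account_depth=\"%d\"} %s\n",
			s.Account, s.Parent, s.Depth, s.Value)
	}
}
```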

I'm happy to cut some code to do this if you can give me some recommendations.

optiz0r commented 7 months ago

Tangentially related, but noting it here in case anyone else arrives looking for it as I did. I was looking into something similar, where fairshare metrics were missing from all accounts. When the Fair Tree fairshare algorithm is used (the default since Slurm 19.05), sshare makes no attempt to calculate a fairshare value for anything other than users. For accounts, a (double)NO_VAL64 is hardcoded, and this appears to be rendered as a blank: https://github.com/SchedMD/slurm/blob/master/src/sshare/process.c#L261

This manifests as the exporter reporting 0 for all accounts. We considered patching the exporter to report LevelFS instead, which sshare does produce for accounts, but we are not sure how best to deal with infinite values.
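A possible way to handle both the blank column and infinite values, sketched under the assumption that sshare prints an infinite LevelFS as text such as `inf` (not verified here). Go's `strconv.ParseFloat` accepts `inf` and `infinity` case-insensitively, and the Prometheus text exposition format can carry `+Inf` as a sample value. `parseLevelFS` is a hypothetical helper, not part of the exporter.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseLevelFS converts one sshare column to a float.
// ok is false when the column is blank or not numeric, so the caller can
// skip the sample instead of exporting a misleading 0.
func parseLevelFS(field string) (value float64, ok bool) {
	field = strings.TrimSpace(field)
	if field == "" {
		return 0, false // value not calculated for this row
	}
	v, err := strconv.ParseFloat(field, 64)
	if err != nil {
		return 0, false
	}
	return v, true // may be +Inf if the input was "inf"
}

func main() {
	// Sample column values are illustrative, not captured sshare output.
	for _, field := range []string{"0.481723", "", "inf"} {
		if v, ok := parseLevelFS(field); ok {
			fmt.Printf("%q -> export %v\n", field, v)
		} else {
			fmt.Printf("%q -> skip sample\n", field)
		}
	}
}
```

Skipping blank values rather than exporting 0 lets a dashboard distinguish "not calculated" from a genuinely low share. Whether to export `+Inf` as-is or clamp it is a separate design choice for the graphs.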