Open fleimgruber opened 1 year ago
It just parses the output of squeue. So if squeue works, so should turm. Not sure what's going on.
Can you give me a hint on how best to debug this from turm? I triple-checked that squeue without args gives me a job list, but turm does not. Since I only have a CLI available I tried rust-gdb, but its output interferes with the turm TUI.
I am not intending to sound cheeky, but if there were automated tests shipped with turm, I could try running these on the SLURM system...
I would love to have tests, but then you have to somehow set up a clean Slurm environment and start dummy jobs there. Not sure how best to do that.
this is the part you need to debug: https://github.com/kabouzeid/turm/blob/f104c7c646880f3881a99fa183ce5165cbf8c5b3/src/job_watcher.rs#L53-L133
True, maybe this could provide a clean environment for testing? https://hub.docker.com/r/hpcnow/slurm_simulator
Failing that I could also see a set of test job definitions maintained here to be run against an existing production Slurm installation that could be used for very basic testing, e.g. a few sleep jobs that print to stdout so that at least parts of the UI are tested.
Regarding the part to debug: I do not yet have a CLI debugging setup for Rust. Another idea that came to mind: there is a feature of other Slurm TUIs to use SSH to connect to a Slurm host so the TUI would run locally and could then be more easily debugged, e.g. visual debugger in VS Code. Did you think about remote Slurm access? Do you have experience with SSH in Rust?
You can use the remote SSH VS Code extension for running and debugging on the slurm host.
Thanks for mentioning it, good idea! I tried debugging in VS Code, which tells me to install the LLDB extension. After that, LLDB fails with version `GLIBC_2.18' not found. Slurm is running on CentOS 7, which only has glibc 2.17. I think other Rust dev tools also need at least glibc 2.18? See also https://github.com/rust-lang/rust-analyzer/issues/4706.
In the meantime, I would try "printf-debugging", but written to a file, because stdout is already taken over by the TUI main loop. I have this template:
use std::fs::File;
use std::io::Write;
let path = "results.txt";
let mut output = File::create(path)?;
let job_command = ...
write!(output, "{}", job_command)?;
Could you provide guidance on what to insert at ... from jobs to get the full squeue command that will be tried?
Just debug print the Command with
let cmd = Command::new("squeue")
.args(&self.squeue_args)
.arg("--array")
.arg("--noheader")
.arg("--Format")
.arg(&output_format)
println!("{:?}", cmd);
For me it only works with
let cmd = Command::new("squeue")
.args(&self.squeue_args)
.arg("--array")
.arg("--noheader")
.arg("--Format")
.arg(&output_format)
.output();
println!("{:?}", cmd);
which prints a string with the expected comma-separated fields.
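For reference, here is a minimal sketch of the file-based debug print discussed above. It only formats the Command with {:?} and never runs squeue. The function name describe_squeue, its parameters, and the "--user alice" arguments are hypothetical stand-ins for turm's own fields. Binding the Command to a variable before chaining the builder calls matters, because those calls return a mutable reference rather than the Command itself.

```rust
use std::fs::File;
use std::io::Write;
use std::process::Command;

/// Build the squeue invocation and return its Debug representation
/// without executing it. `extra_args` and `output_format` are
/// hypothetical stand-ins for turm's own fields.
fn describe_squeue(extra_args: &[&str], output_format: &str) -> String {
    // Bind the Command first: the builder methods return `&mut Command`,
    // so chaining straight off `Command::new(..)` in a `let` would keep
    // a reference to a temporary.
    let mut cmd = Command::new("squeue");
    cmd.args(extra_args)
        .arg("--array")
        .arg("--noheader")
        .arg("--Format")
        .arg(output_format);
    format!("{:?}", cmd)
}

fn main() -> std::io::Result<()> {
    // Example arguments only; substitute whatever turm was started with.
    let line = describe_squeue(&["--user", "alice"], "jobid:###turm###");
    // Write to a file so the text does not interfere with the TUI.
    let mut out = File::create("results.txt")?;
    writeln!(out, "{}", line)?;
    Ok(())
}
```

Calling .output() on the same cmd afterwards would additionally give the captured stdout of squeue, which is what the variant with .output() prints.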
Ok, I could further narrow it down to this check:
https://github.com/kabouzeid/turm/blob/f104c7c646880f3881a99fa183ce5165cbf8c5b3/src/job_watcher.rs#L67
which always evaluates to true, so it always returns None and never the Job.
And the actual cause, I think, is that:
https://github.com/kabouzeid/turm/blob/f104c7c646880f3881a99fa183ce5165cbf8c5b3/src/job_watcher.rs#L65
does not split at ###turm### because it is not included in the output of squeue.
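To illustrate the failure mode (this is a simplified sketch, not turm's actual code): if a line is split on the sentinel and the result is rejected unless the expected number of fields comes out, then output that lacks the sentinel always collapses into a single piece and every job is dropped. The function name split_fields and the sample lines are made up for the example.

```rust
const SEP: &str = "###turm###";

/// Split one squeue output line on the sentinel and return None unless
/// exactly `n_fields` pieces come out. Hypothetical helper for illustration.
fn split_fields(line: &str, n_fields: usize) -> Option<Vec<&str>> {
    let parts: Vec<&str> = line.split(SEP).map(str::trim).collect();
    if parts.len() != n_fields {
        // This is the branch that is always taken when squeue
        // does not echo the sentinel.
        return None;
    }
    Some(parts)
}

fn main() {
    // Output shaped like a squeue that honours the suffix.
    let with_sentinel = "123###turm###train";
    // Output shaped like a squeue that ignores the suffix and only pads fields.
    let without_sentinel = "123                 train               ";

    println!("{:?}", split_fields(with_sentinel, 2));
    println!("{:?}", split_fields(without_sentinel, 2));
}
```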
It seems that the expectation with respect to Slurm output is not met, i.e.:
squeue --array --noheader --Format jobid:###turm###
prints only the jobids to STDOUT. The manpages of the installed squeue and newer squeue differ:
@@ -1 +1 @@
-The format of each field is "type[:[.][size][suffix]]"
\ No newline at end of file
+The format of each field is "type[:[.][size]]"
\ No newline at end of file
So, as mentioned in the OP, it actually is a compatibility issue with Slurm 18.08. Do you see another way to do the string post-processing? E.g. split on a tab or a certain number of blanks instead of the ###turm### sentinel.
Edit: I see now that the only way to parse the output is to not use the --noheader argument and look for the header column positions to correctly infer the field offsets for the actual output lines.
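A rough sketch of that header-based idea, under the assumptions that squeue pads every field to a fixed width, that header names contain no spaces, and that columns are left-aligned. The helper names column_starts and split_by_columns are hypothetical, and this is not necessarily how any actual implementation does it.

```rust
/// Byte offsets at which each header column starts, i.e. every
/// non-whitespace character that follows whitespace (or the line start).
fn column_starts(header: &str) -> Vec<usize> {
    let mut starts = Vec::new();
    let mut prev_space = true;
    for (i, c) in header.char_indices() {
        if !c.is_whitespace() && prev_space {
            starts.push(i);
        }
        prev_space = c.is_whitespace();
    }
    starts
}

/// Slice a data line at the header's column offsets and trim the padding.
/// Lines shorter than the header yield empty strings for missing columns.
fn split_by_columns<'a>(line: &'a str, starts: &[usize]) -> Vec<&'a str> {
    starts
        .iter()
        .enumerate()
        .map(|(k, &start)| {
            let end = starts.get(k + 1).copied().unwrap_or(line.len()).min(line.len());
            let start = start.min(end);
            // `get` avoids a panic if an offset is not a char boundary.
            line.get(start..end).unwrap_or("").trim()
        })
        .collect()
}

fn main() {
    // Sample lines built with fixed-width padding; the width of 20 is an assumption.
    let header = format!("{:<20}{:<20}{:<20}", "JOBID", "NAME", "STATE");
    let line = format!("{:<20}{:<20}{:<20}", "123", "my job", "RUNNING");

    let starts = column_starts(&header);
    println!("{:?}", split_by_columns(&line, &starts));
}
```

Slicing by offset rather than splitting on runs of blanks keeps values that contain spaces, such as a job name like "my job", in one field.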
Thanks for tracking this down!
Edit: I see now that the only way to parse the output is to not use the --noheader argument and look for the header column positions to correctly infer the field offsets for the actual output lines.
If someone implements this in a robust enough way, I would be willing to merge it. I won't have time to do this myself.
I went ahead and implemented my suggested approach from https://github.com/kabouzeid/turm/issues/17#issuecomment-1768644298 in #20
As a user, running turm shows the TUI with the 3 main panes, but without any jobs. No key press has a visible effect, only q for quitting. I compiled turm myself and we use Slurm 18.08. Is it maybe a compatibility issue?