kabouzeid / turm

TUI for the Slurm Workload Manager
https://crates.io/crates/turm
MIT License

Job list is empty on Slurm 18.08 #17

Open fleimgruber opened 1 year ago

fleimgruber commented 1 year ago

As a user, running turm shows the TUI with its 3 main panes, but without any jobs. No key press has a visible effect except q, which quits.

I compiled turm myself and we use Slurm 18.08. Is it maybe a compatibility issue?

kabouzeid commented 1 year ago

It just parses the output of squeue. So if squeue works, so should turm. Not sure what's going on.

fleimgruber commented 1 year ago

Can you give me a hint on how best to debug this from turm? I triple-checked that squeue without args gives me a job list, but turm does not. Since I only have a CLI available I tried rust-gdb, but its output interferes with the turm TUI.

fleimgruber commented 1 year ago

I am not intending to sound cheeky, but if there were automated tests shipped with turm, I could try running these on the SLURM system...

kabouzeid commented 1 year ago

would love to have tests, but then you have to somehow set up a clean slurm environment and start dummy jobs there. not sure how to best do that.

this is the part you need to debug: https://github.com/kabouzeid/turm/blob/f104c7c646880f3881a99fa183ce5165cbf8c5b3/src/job_watcher.rs#L53-L133

fleimgruber commented 1 year ago

True, maybe this could provide a clean environment for testing? https://hub.docker.com/r/hpcnow/slurm_simulator

Failing that I could also see a set of test job definitions maintained here to be run against an existing production Slurm installation that could be used for very basic testing, e.g. a few sleep jobs that print to stdout so that at least parts of the UI are tested.

Regarding the part to debug: I do not yet have a CLI debugging setup for Rust. Another idea that came to mind: there is a feature of other Slurm TUIs to use SSH to connect to a Slurm host so the TUI would run locally and could then be more easily debugged, e.g. visual debugger in VS Code. Did you think about remote Slurm access? Do you have experience with SSH in Rust?

kabouzeid commented 1 year ago

You can use the remote SSH VS Code extension for running and debugging on the slurm host.

fleimgruber commented 12 months ago

Thanks for mentioning it, good idea! I tried debugging in VS Code, which tells me to install the LLDB extensions. After that, LLDB fails with version `GLIBC_2.18' not found. Slurm is running on CentOS 7, which only has glibc 2.17. I think other Rust dev tools also need at least glibc 2.18? See also https://github.com/rust-lang/rust-analyzer/issues/4706.

fleimgruber commented 12 months ago

In the meantime, I would try "printf-debugging", but writing to a file, because stdout is already being drawn on by the TUI main loop. I have this template:

use std::fs::File;
use std::io::Write; // needed so that write! works on a File

let path = "results.txt";
let mut output = File::create(path)?;
let job_command = ...
write!(output, "{}", job_command)?;

Could you provide guidance on what to insert at ... from jobs to get the full squeue command that will be tried?

kabouzeid commented 12 months ago

Just debug print the Command with

let cmd = Command::new("squeue")
     .args(&self.squeue_args)
     .arg("--array")
     .arg("--noheader")
     .arg("--Format")
     .arg(&output_format);

println!("{:?}", cmd);

fleimgruber commented 12 months ago

For me it only works with

let cmd = Command::new("squeue")
      .args(&self.squeue_args)
      .arg("--array")
      .arg("--noheader")
      .arg("--Format")
      .arg(&output_format)
      .output();
println!("{:?}", cmd);

which prints a string with the expected comma-separated fields.
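
A side note on the two snippets above. In the first one, `Command::new("squeue")` is a temporary and the builder methods return `&mut Command`, so binding the whole chain to `cmd` and printing it afterwards is most likely rejected by the borrow checker. Ending the chain with `.output()` compiles, but it also runs squeue and prints its result rather than the command line. The following is a minimal sketch of printing only the command line, and of writing it to a file so the TUI is not disturbed. The helper names `describe_squeue` and `log_to_file`, the file name `turm_debug.log`, and the sample arguments are hypothetical and not part of turm.

```rust
use std::fs::OpenOptions;
use std::io::Write;
use std::process::Command;

/// Builds the squeue invocation and returns its Debug rendering without running it.
/// (Hypothetical helper for illustration; not part of turm.)
fn describe_squeue(squeue_args: &[String], output_format: &str) -> String {
    // Bind the Command to a name first, so the builder calls borrow a
    // long-lived value instead of a temporary.
    let mut cmd = Command::new("squeue");
    cmd.args(squeue_args)
        .arg("--array")
        .arg("--noheader")
        .arg("--Format")
        .arg(output_format);
    format!("{:?}", cmd)
}

/// Appends one line to a log file, keeping stdout free for the TUI.
fn log_to_file(path: &str, line: &str) -> std::io::Result<()> {
    let mut file = OpenOptions::new().create(true).append(true).open(path)?;
    writeln!(file, "{}", line)
}

fn main() -> std::io::Result<()> {
    // Sample arguments, assumed for the example only.
    let args = vec!["--user".to_string(), "alice".to_string()];
    let rendered = describe_squeue(&args, "jobid:###turm###");
    log_to_file("turm_debug.log", &rendered)
}
```

Running `tail -f turm_debug.log` in a second terminal then shows the logged lines while the TUI keeps the first terminal.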

fleimgruber commented 12 months ago

Ok, I could further narrow it down to this check: https://github.com/kabouzeid/turm/blob/f104c7c646880f3881a99fa183ce5165cbf8c5b3/src/job_watcher.rs#L67 which always evaluates to true so it always returns None and never the Job.

fleimgruber commented 12 months ago

I think the actual cause is that https://github.com/kabouzeid/turm/blob/f104c7c646880f3881a99fa183ce5165cbf8c5b3/src/job_watcher.rs#L65 does not split at ###turm### because that sentinel is not included in the output of squeue.
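
To illustrate this failure mode, here is a sketch of sentinel-based parsing. It is not the actual code in `job_watcher.rs`; it only assumes that every field is requested with `###turm###` as its suffix and that lines with an unexpected number of parts are rejected. The function `split_fields` and both sample lines are hypothetical.

```rust
const SEPARATOR: &str = "###turm###";

/// Splits one squeue output line on the sentinel. Returns None when the
/// line does not end with the sentinel or has the wrong number of fields.
fn split_fields(line: &str, expected: usize) -> Option<Vec<&str>> {
    // Every field is followed by the sentinel, so a well-formed line ends with it.
    let body = line.strip_suffix(SEPARATOR)?;
    let parts: Vec<&str> = body.split(SEPARATOR).map(str::trim).collect();
    if parts.len() != expected {
        return None;
    }
    Some(parts)
}

fn main() {
    // Assumed shape of newer squeue output: the suffix follows every field.
    let with_suffix = "1234###turm###train###turm###RUNNING###turm###";
    // Assumed shape of Slurm 18.08 output: padded columns, no suffix.
    let without_suffix = "1234                train               RUNNING";

    println!("{:?}", split_fields(with_suffix, 3));
    println!("{:?}", split_fields(without_suffix, 3));
}
```

With the second kind of line the sentinel never appears, so every job is dropped and the list stays empty, which matches the behavior reported here.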

It seems that the expectation with respect to Slurm output is not met, i.e.:

squeue --array --noheader --Format jobid:###turm###

prints only the jobids to STDOUT. The manpages of the installed squeue and newer squeue differ:

@@ -1 +1 @@
-The format of each field is "type[:[.][size][suffix]]"
+The format of each field is "type[:[.][size]]"

fleimgruber commented 12 months ago

So as mentioned in OP, it actually is a compatibility issue with Slurm 18.08. Do you see another way to do the string post-processing? E.g. split on a tab or a certain amount of blanks instead of the ###turm### sentinel.

Edit: I see now that the only way to parse the output is to not use the --noheader argument and look for the header column positions to correctly infer the field offsets for the actual output lines.
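
A minimal sketch of the header-based idea, under stated assumptions: squeue pads every column to a fixed width, header names contain no spaces, and the output is ASCII. It is an illustration only and makes no claim about how any eventual patch does it. The names `column_starts` and `split_at_columns` and the sample rows are hypothetical.

```rust
/// Returns the byte offset at which each header word starts.
fn column_starts(header: &str) -> Vec<usize> {
    let mut starts = Vec::new();
    let mut prev_is_space = true;
    for (i, c) in header.char_indices() {
        if !c.is_whitespace() && prev_is_space {
            starts.push(i);
        }
        prev_is_space = c.is_whitespace();
    }
    starts
}

/// Cuts a data line at the given offsets and trims the padding of each cell.
/// Cells that lie beyond the end of a short line come back empty.
fn split_at_columns<'a>(line: &'a str, starts: &[usize]) -> Vec<&'a str> {
    starts
        .iter()
        .enumerate()
        .map(|(i, &start)| {
            let next = starts.get(i + 1).copied().unwrap_or(line.len());
            let end = next.min(line.len());
            let start = start.min(end);
            line.get(start..end).unwrap_or("").trim()
        })
        .collect()
}

fn main() {
    // Sample output built with a width of 20 per column (an assumption).
    let header = format!("{:<20}{:<20}{:<20}", "JOBID", "NAME", "STATE");
    let row = format!("{:<20}{:<20}{:<20}", "1234", "train model", "RUNNING");

    let starts = column_starts(&header);
    println!("{:?}", split_at_columns(&row, &starts));
}
```

Unlike splitting on runs of blanks, this keeps a value such as a job name containing a space in one piece. It cannot recover values that squeue truncated to the column width.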

kabouzeid commented 12 months ago

Thanks for tracking this down!

> Edit: I see now that the only way to parse the output is to not use the --noheader argument and look for the header column positions to correctly infer the field offsets for the actual output lines.

If someone implements this in a robust enough way, I would be willing to merge it. I won't have time to do this myself.

fleimgruber commented 11 months ago

I went ahead and implemented my suggested approach from https://github.com/kabouzeid/turm/issues/17#issuecomment-1768644298 in #20