vsoch opened 2 years ago
Okay, I have the Spindle tests running, and I think I might not have enough resources, because my tiny cluster hangs on:
# ./runTests
Running: ./run_driver --dependency --push
srun: Requested partition configuration not available now
srun: job 3 queued and waiting for resources
What resources does Spindle require for testing under Slurm?
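For what it's worth, a hang like this can be diagnosed from the Slurm side before involving Spindle at all. A rough sketch (the partition name `normal` comes from the squeue output later in this thread; substitute your own):

```shell
# Show node counts and state per partition; a tiny cluster may
# simply not have enough idle nodes to satisfy the request.
sinfo

# Show the partition's limits (MaxNodes, MaxTime, etc.).
# "normal" is an assumed partition name.
scontrol show partition normal

# Ask why a specific queued job (here job 3) is still pending.
squeue -j 3 --format="%i %t %R"
```

The last column of the squeue output is the pending reason, which usually points directly at the missing resource or limit.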
Going to try openmpi now
When I try testing with openmpi:
Spindle Error: Could not identify system job launcher in command line
Running: ./run_driver --dlopen --preload
and then the same error about not being able to connect to a session.
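For reference, Spindle is invoked by prefixing the normal launch line, and it has to recognize a job launcher (srun, mpirun, etc.) on that command line. A hedged sketch of both launcher styles; the binary name `./hello_mpi` and the process counts are placeholders:

```shell
# Spindle wraps the usual launch command and inspects it
# to find a launcher it knows about.
spindle srun -n 2 ./hello_mpi

# With Open MPI, the launcher on the line is mpirun.
spindle mpirun -np 2 ./hello_mpi
```

If the launcher binary is hidden behind a wrapper script, Spindle may fail to identify it and report exactly the "Could not identify system job launcher" error quoted above.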
If you were using Spindle with Slurm 20.11+: I just pushed a fix to devel for running Spindle with that version of Slurm. That issue could have produced the hang you were seeing.
Quick test of a build and I'm seeing:
#18 1.985 checking slurm version for compatibility... no
#18 1.994 configure: error: Slurm support was requested, but slurm 20.11.8, which is later than 20.11, was detected. This version of slurm breaks spindle daemon launch. You can disable this error message and build spindle with slurm-based daemon launching anyways by explicitly passing the --with-slurm-launch option (you might still be able to get spindle to work by running jobs with srun's --overlap option). Or you could switch to having spindle launch daemons with rsh/ssh by passing the --with-rsh-launch option, and ensuring that rsh/ssh to nodes works on your cluster.
I'll listen to the message and try out those various options - probably not right now because I'm tired, but I'll update here with what I find.
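For anyone following along, the two workarounds named in that configure error look roughly like this (the install prefix is a placeholder):

```shell
# Option 1: keep slurm-based daemon launch despite the version check;
# jobs may then also need srun's --overlap flag at run time.
./configure --prefix=/opt/spindle --with-slurm-launch

# Option 2: launch daemons over rsh/ssh instead; this requires that
# passwordless rsh/ssh to the compute nodes works on the cluster.
./configure --prefix=/opt/spindle --with-rsh-launch
```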
Okay - so I gave the rebuild a shot and added the --with-slurm-launch option for 20.11.8. That compiled correctly, removing the previous error message, but I had other issues getting the Slurm cluster working in docker-compose. I didn't want to confound those possible new issues with Spindle, so I fell back to an older version of Slurm, 18.x.x. Since my previous message showed job 3 queued and waiting for resources, I tried this again and then looked at the queue, and I see:
$ docker exec -it slurmdbd bash
[root@slurmdbd /]# squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
3 normal spindle_ root PD 0:00 2 (PartitionNodeLimit)
So let me ask again - how many concurrent nodes does Spindle require to run its tests with Slurm?
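The (PartitionNodeLimit) reason in the squeue output above means the job asked for more nodes than the partition permits - here the test requested 2 nodes. A sketch of checking and raising the limit (partition and node names are assumptions about this particular docker-compose cluster):

```shell
# Inspect the current MaxNodes for the partition the tests use.
scontrol show partition normal | grep -io 'maxnodes=[^ ]*'

# Raise the limit at runtime (reverts when slurmctld restarts) ...
scontrol update PartitionName=normal MaxNodes=2

# ... or make it permanent in slurm.conf, e.g.:
# PartitionName=normal Nodes=c[1-2] MaxNodes=2 Default=YES State=UP
```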
What command do you usually use for pynamic? I can try that instead.
Okay, this appears to work for pynamic, although adding Spindle is still a no-go.
$ time python config_pynamic.py 30 1250 -e -u 350 1250 -n 150
************************************************
summary of pynamic-sdb-pyMPI executable and 10 shared libraries
Size of aggregate total of shared libraries: 2.5MB
Size of aggregate texts of shared libraries: 6.8MB
Size of aggregate data of shared libraries: 408.4KB
Size of aggregate debug sections of shared libraries: 0B
Size of aggregate symbol tables of shared libraries: 0B
Size of aggregate string table size of shared libraries: 0B
************************************************
real 21m33.556s
user 14m54.538s
sys 3m31.206s
What's happening here is that there's a bug/feature in Slurm 20.11+ that prevents Spindle from launching its daemons through Slurm. The "checking slurm version for compatibility... no" means you're hitting that. There are two autoconf-level options, as the configure error describes:
1. --with-slurm-launch: force slurm-based daemon launching anyway (you may also need to run jobs with srun's --overlap option).
2. --with-rsh-launch: have Spindle launch daemons over rsh/ssh, which requires that rsh/ssh to the compute nodes works on your cluster.
You'll probably have to use option 2 (--with-rsh-launch) here. Or you could downgrade your Slurm version.
And I'd usually run pynamic based on the README.md commands in its repo. So something like: srun pyMPI pynamic_driver.py `date +%s`
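Assuming that srun line from the pynamic README, adding Spindle is just a matter of prefixing it. The process count is a placeholder, and the timestamp argument simply labels the run:

```shell
# Plain run, per the pynamic README:
srun -n 2 pyMPI pynamic_driver.py `date +%s`

# Same run with Spindle intercepting the dynamic library loads:
spindle srun -n 2 pyMPI pynamic_driver.py `date +%s`

# On Slurm 20.11+ builds using --with-slurm-launch, srun's
# --overlap flag may also be needed, per the configure message:
spindle srun --overlap -n 2 pyMPI pynamic_driver.py `date +%s`
```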
I'm getting errors in testing and attempted usage that Spindle cannot connect to some session. I'm installing as follows:
And I've tried that with both slurm and openmpi as the "testrm". Then I make the tests,
but no matter what I do (using the slurm or openmpi template, both of which I have) I see this error:
I saw this same error in trying to just use spindle so I've gone back to the tests to debug. Note that I do have a /tmp area:
Update: I think the containers may need to see the same /tmp area - so I'm rebuilding them with a shared /tmp and will report back.
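In case it helps anyone reproducing this, a shared /tmp across node containers can be sketched with a named Docker volume (container and image names here are placeholders; a docker-compose `volumes:` entry would do the same thing):

```shell
# One named volume mounted as /tmp in every node container, so
# Spindle's session files land in the same place cluster-wide.
docker volume create shared-tmp
docker run -d --name node1 -v shared-tmp:/tmp slurm-node
docker run -d --name node2 -v shared-tmp:/tmp slurm-node
```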