hpc / Spindle

Scalable dynamic library and python loading in HPC environments

Spindle could not connect to session #44

vsoch opened this issue 2 years ago (status: Open)

vsoch commented 2 years ago

I'm getting errors, both in the test suite and in attempted usage, saying that Spindle cannot connect to a session. I'm installing as follows:

./configure --with-munge-dir=/etc/munge --enable-sec-munge --with-slurm-dir=/etc/slurm --with-testrm=slurm
make
make install

And I've tried that with both slurm and openmpi as the "testrm". Then I build and run the tests:

cd testsuite
make
./runTests

but no matter what I do (using the slurm or openmpi template, both of which I have) I see this error:

Running: ./run_driver --partial --session
ERROR: Spindle could not connect to session tn2VYQ

I saw this same error when trying to use spindle directly, so I've gone back to the tests to debug. Note that I do have a /tmp area:

 ls /tmp/
ccFjQGLR.s  ks-script-eC059Y  spin.kT6PPu  spin.tn2VYQ  spin.Un7RTL  yum.log

Update: I think it could be that the containers need to see the same /tmp area, so I'm rebuilding them with a shared /tmp and will report back.
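For anyone following along, a minimal sketch of what a shared /tmp across containers could look like, shown with plain docker run for brevity (the real setup uses docker-compose); the host path and image name are placeholders, not from this setup:

# Bind-mount the same host directory as /tmp in each container so the Spindle
# client and daemons see the same spin.XXXXXX session directories.
mkdir -p /srv/shared-tmp
docker run -d --name node1 -v /srv/shared-tmp:/tmp my-slurm-node-image
docker run -d --name node2 -v /srv/shared-tmp:/tmp my-slurm-node-image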

vsoch commented 2 years ago

Okay, I have the spindle tests running now, and I think I might not have enough resources, because my tiny cluster hangs on:

# ./runTests 
Running: ./run_driver --dependency --push
srun: Requested partition configuration not available now
srun: job 3 queued and waiting for resources

What resources does spindle require for the slurm tests?

vsoch commented 2 years ago

Going to try openmpi now

vsoch commented 2 years ago

When I try testing with openmpi:

Spindle Error: Could not identify system job launcher in command line
Running: ./run_driver --dlopen --preload

and then the same error about not being able to connect to a session.

mplegendre commented 2 years ago

If you were using spindle with slurm 20.11+, I've just pushed a fix to devel for running spindle with that version of slurm. That issue could have produced the hang you were seeing.
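A sketch of picking up that fix from the devel branch, reusing the configure flags from the top of this issue (the repo URL is the project's GitHub page):

git clone -b devel https://github.com/hpc/Spindle.git
cd Spindle
# Regenerate the build system first (e.g. autoreconf -fi) if the checkout
# does not ship a configure script.
./configure --with-munge-dir=/etc/munge --enable-sec-munge \
    --with-slurm-dir=/etc/slurm --with-testrm=slurm
make
make install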

vsoch commented 2 years ago

I did a quick test build and I'm seeing:

#18 1.985 checking slurm version for compatibility... no
#18 1.994 configure: error: Slurm support was requested, but slurm 20.11.8, which is later than 20.11, was detected.  This version of slurm breaks spindle daemon launch.  You can disable this error message and build spindle with slurm-based daemon launching anyways by explicitly passing the --with-slurm-launch option (you might still be able to get spindle to work by running jobs with srun's --overlap option).  Or you could switch to having spindle launch daemons with rsh/ssh by passing the --with-rsh-launch option, and ensuring that rsh/ssh to nodes works on your cluster.

I'll listen to the message and try out those various options, probably not right now because I'm tired, but will update here with what I find.
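For reference, a sketch of what those two alternative configure invocations might look like, combining the flags named in the error message with the options used earlier in this issue:

# Option 1: keep slurm-based daemon launch despite the version check
./configure --with-munge-dir=/etc/munge --enable-sec-munge \
    --with-slurm-dir=/etc/slurm --with-testrm=slurm --with-slurm-launch
# Option 2: launch daemons over rsh/ssh instead
./configure --with-munge-dir=/etc/munge --enable-sec-munge \
    --with-slurm-dir=/etc/slurm --with-testrm=slurm --with-rsh-launch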

vsoch commented 2 years ago

Okay - so I gave rebuilding a shot with the --with-slurm-launch option for 20.11.8. That compiled correctly and removed the previous error message, but I had other issues getting the slurm cluster working in docker-compose. I didn't want to confound those possible new issues with spindle, so I fell back to an older version of slurm, 18.x.x. Since in my previous message job 3 was queued and waiting for resources, I tried this again, looked at the queue, and I see:

$ docker exec -it slurmdbd bash
[root@slurmdbd /]# squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
                 3    normal spindle_     root PD       0:00      2 (PartitionNodeLimit)

So can I ask again - how many concurrent nodes does spindle require to run its tests with slurm?
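In case it's useful, a quick way to compare what the test job asked for with what the partition exposes (the partition name comes from the squeue output above):

# Show the node and job-size limits on the "normal" partition
scontrol show partition normal
# Show how many nodes are actually up in that partition
sinfo -p normal
# The spindle_ test job above requests 2 nodes, so the partition needs at
# least 2 available nodes (and MaxNodes >= 2) to leave PartitionNodeLimit.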

vsoch commented 2 years ago

What command do you usually use for pynamic? I can try that instead.

vsoch commented 2 years ago

Okay, this seems to work for pynamic on its own, although adding spindle is still a no-go.

$ time python config_pynamic.py 30 1250 -e -u 350 1250 -n 150

************************************************
summary of pynamic-sdb-pyMPI executable and 10 shared libraries
Size of aggregate total of shared libraries: 2.5MB
Size of aggregate texts of shared libraries: 6.8MB
Size of aggregate data of shared libraries: 408.4KB
Size of aggregate debug sections of shared libraries: 0B
Size of aggregate symbol tables of shared libraries: 0B
Size of aggregate string table size of shared libraries: 0B
************************************************

real    21m33.556s
user    14m54.538s
sys 3m31.206s

mplegendre commented 2 years ago

What's happening here is that there's a bug/feature in Slurm 20.11+ that prevents Spindle from launching its daemons through Slurm. The "checking slurm version for compatibility... no" message means you're hitting that. There are two autoconf-level options:

  1. Build with "--with-slurm-launch", which tells Spindle to build anyways and still try to use slurm. But without a slurm fix, this is unlikely to get anywhere.
  2. Use Spindle's rsh launching mode with the "--with-rsh-launch" option. If you have multiple nodes in your cluster, and configure them so that rsh or ssh can execute commands without passwords across nodes, then Spindle can use this to start its daemons.

You'll probably have to use option 2 here, or roll back your slurm version.
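A minimal sketch of the passwordless ssh that option 2 relies on, with placeholder hostnames:

# Generate a passphrase-less key and install it on another node
ssh-keygen -t ed25519 -N '' -f ~/.ssh/id_ed25519
ssh-copy-id node2
# Spindle's rsh launch mode needs non-interactive remote execution; this
# should return without prompting for a password:
ssh node2 true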

And I'd usually run pynamic based on the README.md commands in its repo. So something like: srun pyMPI pynamic_driver.py `date +%s`
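And for the spindle side, Spindle is normally invoked by prefixing the job launch command, so a run over pynamic would look something like this sketch (the task count is illustrative, not from this thread):

# Sketch: wrap the pynamic launch with spindle
spindle srun -n 16 pyMPI pynamic_driver.py `date +%s`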