There is a problem: after creating a `screen` session, we have to log in to a node from which we launch/monitor the experiment. This login step makes it impossible (?) to auto-execute an I/O call after the experiment completes (the call gets lost along the way). We can circumvent this problem in two ways:
1. Make the whole experiment monitoring so lightweight that we can do it all from the head node. This would mean not creating an individual experiment process for each seed and instead using the `$SGE_TASK_ID` batch submission setup (*), as sketched below.
2. For the SGE cluster, we could use the (newly discovered) SSH forwarding into nodes. But this is not the recommended way and can interfere with the scheduler's resource management.
The key question, again, is at what level we want to monitor the batch of jobs.
(*) What is the Slurm analogue? Never looked it up.
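For what it's worth, a minimal sketch of that batch submission setup (the job script name is made up): one array job covers all seeds, and the Slurm analogue of `$SGE_TASK_ID` is `$SLURM_ARRAY_TASK_ID`, submitted via `sbatch --array`:

```python
import subprocess

# Sketch: submit one SGE array job instead of one process per seed.
# run_experiment.sh is a hypothetical job script that reads its seed
# from $SGE_TASK_ID. The Slurm analogue would be
# `sbatch --array=1-10 run_experiment.sh` with $SLURM_ARRAY_TASK_ID.
num_seeds = 10
subprocess.run(["qsub", "-t", f"1-{num_seeds}", "run_experiment.sh"],
               check=True)
```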
Okaydokey, one Dom exchange later and I have figured out that this should work from the head node via `qrsh 'command'` (after having started a `screen`/`tmux` session). This will establish a connection to a remote host and execute the `'command'`. Importantly, all I/O calls will be piped via the head node. Hence, reattaching the multiplexer will reattach to the head node, which pipes to the specific remote node. Check out more documentation here.
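A minimal sketch of how this could look from Python (the training command is a placeholder):

```python
import subprocess

# Run from within a screen/tmux session on the head node. qrsh executes
# the command on a compute node, and all I/O is piped back through the
# head node, so reattaching the multiplexer later shows the live output.
remote_cmd = "python train.py --config_fname base_config.json"
subprocess.run(["qrsh", remote_cmd], check=True)
```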
Note: it seems like one needs to prepend the `/bin/bash` environment handling:

```python
# The `&& conda activate ...` part has to sit inside the same
# /bin/bash -c invocation; otherwise conda is not initialized in the
# shell that runs the activation.
enable_conda = ('/bin/bash -c "source $(conda info --base)/etc/profile.d/conda.sh'
                ' && conda activate {remote_env_name}"')
enable_venv = '/bin/bash -c "source {}/{}/bin/activate"'
```
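A hypothetical usage sketch (env name and command are placeholders): activation and the experiment command share one `/bin/bash -c` call, and the whole string is forwarded through `qrsh`:

```python
import subprocess

# Keep activation and the experiment command in a single /bin/bash -c
# invocation so that the conda setup actually applies to the command.
train_cmd = "python train.py --seed 0"
bash_cmd = ('/bin/bash -c "source $(conda info --base)/etc/profile.d/conda.sh'
            ' && conda activate mle-env && {}"'.format(train_cmd))
subprocess.run(["qrsh", bash_cmd], check=True)
```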
On Slurm the same should work via `srun <args> --pty bash`. Check out the great SGE-to-Slurm conversion guide here.
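The non-interactive variant of the same forwarding idea would then be a direct `srun` call (flags are placeholders, untested):

```python
import subprocess

# Hypothetical Slurm counterpart of the qrsh forwarding above: srun
# executes the command on an allocated node and pipes I/O back, so the
# same screen/tmux reattachment logic applies.
subprocess.run(["srun", "--time=01:00:00",
                "python", "train.py", "--seed", "0"], check=True)
```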
Addressed in #55. There is still something funky with the Slurm scheduling. It appears that `squeue`-style job monitoring does not work from within an interactive `srun` session. TODO: Figure out how to circumvent this.
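One possible (untested) workaround sketch: poll the queue from the head node instead of from inside the interactive session, e.g. via a hypothetical helper like this:

```python
import subprocess

def job_in_queue(job_id: str) -> bool:
    """Hypothetical helper: query squeue from the head node, outside
    the interactive srun session."""
    out = subprocess.run(["squeue", "-h", "-j", job_id],
                         capture_output=True, text=True)
    # -h suppresses the header, so any remaining output means the job
    # is still pending or running.
    return bool(out.stdout.strip())
```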
Currently, when we launch remote jobs from a local machine, this is done via a nested `qsub`/`sbatch` command. The `reconnect` after a network disconnect then simply reads in the `.txt` CLI output file and prints it. Ideally, we want to instead use `qlogin`/`salloc` and `screen`/`tmux` for the job submission and then reconnect by simply running `screen -r <screen-session-id>` or the tmux equivalent. I remember trying this but giving up at some point. I believe this was because of some challenges with piping commands to a `qlogin` session.
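For reference, a minimal sketch of the intended flow (session name made up); feeding the actual submission commands into the `qlogin` session is exactly the part that remains unsolved:

```python
import subprocess

# Start a detached screen session on the head node that opens the
# interactive allocation (qlogin on SGE, salloc on Slurm).
session_id = "mle-run-0"
subprocess.run(["screen", "-dmS", session_id, "qlogin"], check=True)

# After a network disconnect, reattach from a fresh shell:
#   screen -r mle-run-0
```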