mle-infrastructure / mle-toolbox

Lightweight Tool to Manage Distributed ML Experiments 🛠
https://mle-infrastructure.github.io/mle_toolbox/toolbox/
MIT License

Local remote launch work via screen/tmux #16

Closed: RobertTLange closed this issue 3 years ago

RobertTLange commented 3 years ago

Currently, when we launch remote jobs from a local machine, this is done via a nested qsub/sbatch command. After a network disconnect, the reconnect simply reads in the .txt CLI output file and prints it.

Ideally, we instead want to use qlogin/salloc together with screen/tmux for the job submission, and then reconnect simply via screen -r <screen-session-id> or the tmux equivalent.

I remember trying this but giving up at some point, I believe because of some challenges with piping commands into a qlogin session.
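
For reference, a minimal sketch of the intended flow, assuming a named screen session and a placeholder submit command (none of this is the current toolbox API):

import subprocess

# Hypothetical sketch: start a named, detached screen session that runs the
# interactive submission, so a dropped connection can later be recovered with
# `screen -r mle-remote` instead of re-reading a .txt log file.
session_name = "mle-remote"   # placeholder session id
submit_cmd = "qlogin"         # SGE; on Slurm e.g. "salloc"
subprocess.run(["screen", "-dmS", session_name, submit_cmd], check=True)
# After a disconnect: `screen -r mle-remote` (or the tmux equivalent).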

RobertTLange commented 3 years ago

There is a problem: after creating a screen session, we have to log in to a node from which we launch/monitor the experiment. This login step makes it seemingly impossible to auto-execute a follow-up I/O call once it has completed (the command gets lost along the way). We can circumvent this problem in two ways:

  1. Make the whole experiment monitoring so lightweight that we can do it all from the head node. This would involve not creating an individual experiment process for each seed and instead using the $SGE_TASK_ID batch submission setup (*) (see the sketch at the end of this comment).

  2. For the SGE cluster we could use the (newly discovered) ssh forwarding into the nodes. But this is not the recommended way and can interfere with the scheduler's resource management.

The key question remains at what level we want to monitor the batch of jobs.

(*) What is the Slurm analogue? Never looked it up.
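
As a side note, a tiny sketch of what the array-task mapping in option 1 could look like on the worker side (the seed list is a placeholder; Slurm's counterpart to $SGE_TASK_ID is $SLURM_ARRAY_TASK_ID):

import os

# Hypothetical sketch: derive the seed from the scheduler's array task id so a
# single array submission covers all seeds without per-seed monitor processes.
# SGE exposes SGE_TASK_ID (1-indexed); Slurm exposes SLURM_ARRAY_TASK_ID.
task_id = int(os.environ.get("SGE_TASK_ID",
                             os.environ.get("SLURM_ARRAY_TASK_ID", "1")))
seeds = [0, 1, 2, 3, 4]       # placeholder seed list
seed = seeds[task_id - 1]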

RobertTLange commented 3 years ago

Okaydokey, one Dom exchange later I have figured out that this should work from the head node via qrsh 'command' (after having started a screen/tmux session). This establishes a connection to a remote host and executes the 'command'. Importantly, all I/O calls are piped via the head node. Hence, reattaching the multiplexer reattaches to the head node, which pipes to the specific remote node. Check out more documentation here.

Note: It seems like one needs to prepend the /bin/bash environment handling:

# Command-string templates that get filled in and prepended to the remote
# command so the correct conda env / virtualenv is activated on the node.
enable_conda = ('/bin/bash -c "source $(conda info --base)/etc/profile.d/conda.sh"'
                ' && conda activate {remote_env_name}')
enable_venv = '/bin/bash -c "source {}/{}/bin/activate"'
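
For illustration, a hypothetical way these templates could be filled in and combined with the actual experiment command (env name and command are placeholders, not the final toolbox code):

# Hypothetical usage of the templates above.
activate_cmd = enable_conda.format(remote_env_name="mle-env")
remote_cmd = activate_cmd + " && python run_experiment.py"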

On Slurm the same should work via srun <args> --pty bash. Check out the great conversion guide for SGE to Slurm here.
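
Putting the pieces together, a rough sketch of how the head-node launch could look (session name, experiment command, and the exact qrsh/srun invocation are assumptions, not the final toolbox code):

import subprocess

# Hypothetical sketch: dispatch the remote command via `qrsh '<command>'` from
# inside a detached tmux session on the head node. All I/O is piped through
# the head node, so `tmux attach -t mle-exp` after a disconnect shows the live
# output again. On Slurm, `srun --pty <command>` would play the same role.
experiment_cmd = "python run_experiment.py"        # placeholder command
node_cmd = "qrsh '{}'".format(experiment_cmd)
subprocess.run(["tmux", "new-session", "-d", "-s", "mle-exp", node_cmd], check=True)

In practice, the experiment command would first be prefixed with the formatted enable_conda/enable_venv string from above so that the node activates the right environment.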

RobertTLange commented 3 years ago

Addressed in #55. There is still something funky with the Slurm scheduling: it appears that squeue-style job monitoring does not work from within an interactive srun session. TODO: figure out how to circumvent this.