netj / 3x

3X — a Workbench for eXecutable eXploratory eXperiments
http://netj.github.io/3x/

Execute on servers with dedicated job scheduler #27

Open jewellsean opened 10 years ago

jewellsean commented 10 years ago

Under the current framework, is it possible to send all planned jobs to servers with a dedicated job scheduler? The workflow would likely be as follows:

1) Plan runs locally (through cross product interface).
2) Send and schedule jobs on the remote server.
--- Completely disconnect from server, since large jobs will require at least a week to complete. It is not reasonable to require a consistent connection over that time period. ---
3) Re-establish a connection and sync results.

netj commented 10 years ago

Hi @jewellsean, I'm glad you are looking for what I recently added to the codebase. In my view, you are asking for two features: asynchronous execution of runs and job scheduler support. The former is in the master branch already, but the latter is not there yet.

Using the LATEST version, you can define an ssh-cluster type target to schedule runs on one or more remote machines via plain ssh, then synchronize later whenever you want. If you have a set of machines that you have ssh access to, defining and using an ssh-cluster target will serve your needs well. This feature is not documented yet, but you can get a basic hint on how to define one by running 3x target NAME define ssh-cluster. One thing to remember is that the same version of 3x should be installed on the remote machines.

However, if your cluster has a more sophisticated resource scheduler and/or disallows direct ssh access to individual machines, an ssh-cluster target won't be that useful. If you have a specific job scheduler you want to use, and can provide some info on how you submit and check your jobs, I'd be happy to add some code to 3X to directly support that.

jewellsean commented 10 years ago

@netj, thank you for your detailed and quick response! Unfortunately, I am unable to access individual machines directly via ssh.

If you're willing to add some functionality for systems reliant on a job scheduler, it would be greatly appreciated, and I can help where possible. The most comprehensive description of the scheduling environment can be found here, but I will also summarize. Essentially, I first scp or rsync both input files and executables to the server's filesystem, and then create a simple PBS script that specifies server-specific parameters such as walltime, CPU requirements, etc. This script is then submitted to the job scheduler. The job scheduler has some built-in commands to check the status of jobs; for example, 'showq -u USERNAME' lists all active / queueing / blocked jobs.
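To make the steps above concrete, here is a rough sketch in Python of what the stage / submit / check sequence might look like. It only builds the PBS script text and the command lines; it does not run anything. The function names, paths, host name, and resource values are all hypothetical, and it assumes a Torque/PBS-style setup where the script is submitted with qsub from a login node reachable over ssh. This is not how 3X does or would do it, just an illustration of the flow.

```python
import shlex


def render_pbs_script(job_name, walltime, nodes, ppn, workdir, command):
    """Return the text of a minimal PBS job script (hypothetical helper)."""
    lines = [
        "#!/bin/bash",
        f"#PBS -N {job_name}",               # job name shown in the queue
        f"#PBS -l walltime={walltime}",      # maximum run time, HH:MM:SS
        f"#PBS -l nodes={nodes}:ppn={ppn}",  # nodes and processors per node
        f"cd {shlex.quote(workdir)}",        # run from the staged directory
        command,
    ]
    return "\n".join(lines) + "\n"


def submission_commands(local_dir, user, host, remote_dir, script_name):
    """Build (but do not run) the commands for staging, submitting, checking."""
    remote = f"{user}@{host}"
    return [
        # 1. copy inputs and executables to the server's filesystem
        ["rsync", "-az", local_dir.rstrip("/") + "/", f"{remote}:{remote_dir}/"],
        # 2. submit the PBS script to the scheduler (assumes qsub is available)
        ["ssh", remote,
         f"cd {shlex.quote(remote_dir)} && qsub {shlex.quote(script_name)}"],
        # 3. later, after reconnecting: list this user's jobs
        ["ssh", remote, f"showq -u {shlex.quote(user)}"],
    ]


if __name__ == "__main__":
    # All values below are made up for illustration.
    print(render_pbs_script("run1", "168:00:00", 1, 4,
                            "/scratch/alice/run1", "./run.sh"))
    for cmd in submission_commands("runs/run1", "alice", "login.example.org",
                                   "/scratch/alice/run1", "job.pbs"):
        print(" ".join(shlex.quote(part) for part in cmd))
```

Each command list could be passed to subprocess.run as is; keeping the build step separate from execution makes it easy to inspect what would be sent before touching the server.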

If example scripts are helpful let me know. I will also help test these features. Thanks!