Open bernstei opened 2 years ago
Note - it's unclear, in retrospect, what makes these remote job starts slow. Need to investigate further before determining how to increase rate.
Looks like the staging-in of files and the ssh qsub each take non-negligible time (around 1 s). Both would need to be batched to fully help.
is this really an issue? I guess you are already batching individual configs, so it won't be the case that you'd want to qsub 10,000 individual jobs (many queueing systems would choke as well)
It is when you have 1000 jobs (one per config, to re-evaluate an entire fitting database with tighter DFT parameters) and each one takes 3 seconds, because the rsync to stage in files takes 1.5 s and the ssh to qsub takes 1.5 s. I guess I could set `chunksize=1` and `job_chunksize > 1` to do `job_chunksize` DFT evaluations per job, and reduce the number of rsync/ssh+qsub calls by a factor of `job_chunksize`. Maybe that's the right approach.
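For concreteness, here is the overhead arithmetic implied by the numbers above (1.5 s rsync stage-in plus 1.5 s ssh+qsub per job); `job_chunksize` is used in the same sense as in the comment, and the function name is just illustrative:

```python
import math

PER_JOB_OVERHEAD = 1.5 + 1.5  # seconds: rsync stage-in + ssh qsub

def total_overhead(n_configs, job_chunksize):
    """Total job-start overhead when configs are grouped into jobs."""
    n_jobs = math.ceil(n_configs / job_chunksize)
    return n_jobs * PER_JOB_OVERHEAD
```

With 1000 configs, one config per job gives 3000 s of pure start-up overhead, while `job_chunksize=50` reduces it to 20 jobs and 60 s.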
I have a solution for this, where ExPyRe, system, and scheduler can all be told to store information in a buffer, and then start all the jobs in buffer at once (one ssh to set up the directories, one rsync to stage in the run dirs, and one ssh to submit all the jobs). A PR will be available eventually - it'd be useful if people tested the SGE implementation, which I do not have access to.
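A minimal sketch of the buffering idea described above: job starts are collected in a buffer, then flushed with one rsync for all run dirs and one ssh for all submissions. The class and method names here are purely illustrative, not ExPyRe's actual interface:

```python
import subprocess

class BufferedSubmitter:
    """Collect job starts, then flush them in one rsync + one ssh."""

    def __init__(self, host):
        self.host = host
        self.pending = []  # list of (local_dir, submit_cmd) tuples

    def add(self, local_dir, submit_cmd):
        # buffer the job instead of staging and submitting immediately
        self.pending.append((local_dir, submit_cmd))

    def flush(self):
        if not self.pending:
            return
        # one rsync staging in all buffered run dirs at once
        local_dirs = [d for d, _ in self.pending]
        subprocess.run(["rsync", "-a", *local_dirs, f"{self.host}:jobs/"],
                       check=True)
        # one ssh submitting every buffered job
        submit_script = " && ".join(cmd for _, cmd in self.pending)
        subprocess.run(["ssh", self.host, submit_script], check=True)
        self.pending.clear()
```

This replaces N rsync + N ssh round trips with one of each per flush, which is where the factor-of-N saving in the 1.5 s + 1.5 s per-job overhead comes from.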
On some remote machines just the ssh connection is somewhat slow. It would be nice if multiple job start commands could be combined, perhaps by gathering all the remote commands into an array of strings, and then running all of them in a single ssh connection.
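The suggestion above can be sketched as follows: gather the remote commands into a list of strings and execute them all through a single ssh connection (function names are hypothetical, not part of ExPyRe):

```python
import subprocess

def build_batched_command(commands, stop_on_error=True):
    """Join remote shell commands into one script string."""
    # '&&' aborts the batch on the first failure; ';' runs all regardless
    sep = " && " if stop_on_error else " ; "
    return sep.join(commands)

def run_batched(host, commands):
    """Execute all commands on host with a single ssh invocation."""
    script = build_batched_command(commands)
    return subprocess.run(["ssh", host, script],
                          capture_output=True, text=True, check=True).stdout
```

This pays the per-connection ssh latency once per batch instead of once per command; OpenSSH connection multiplexing (`ControlMaster`/`ControlPersist`) would be an alternative that keeps separate commands but reuses one connection.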