Open bernstei opened 2 years ago
Note - it's unclear, in retrospect, what makes these remote job starts slow. Need to investigate further before determining how to increase rate.
Looks like the staging-in of files and the ssh qsub each take non-negligible time (around 1 s). Both would need to be batched to fully help.
is this really an issue? I guess you are already batching individual configs, so it won't be the case that you'd want to qsub 10,000 individual jobs (many queueing systems would choke as well)
It is when you have 1000 jobs (one per config, to re-evaluate an entire fitting database with tighter DFT parameters) and each one takes 3 seconds, because the rsync to stage in files takes 1.5 s and the ssh to qsub takes 1.5 s. I guess I could set `chunksize=1` and `job_chunksize > 1` to do `job_chunksize` DFT evaluations per job, and reduce the number of rsync/ssh+qsub calls by a factor of `job_chunksize`. Maybe that's the right approach.
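For concreteness, here is the overhead arithmetic implied by the numbers above (1.5 s rsync stage-in plus 1.5 s ssh+qsub per job); `job_chunksize` is used in the same sense as in the comment, and the function name is just illustrative:

```python
import math

PER_JOB_OVERHEAD = 1.5 + 1.5  # seconds: rsync stage-in + ssh qsub

def total_overhead(n_configs, job_chunksize):
    """Total job-start overhead when configs are grouped into jobs."""
    n_jobs = math.ceil(n_configs / job_chunksize)
    return n_jobs * PER_JOB_OVERHEAD
```

With 1000 configs, one config per job gives 3000 s of pure start-up overhead, while `job_chunksize=50` reduces it to 20 jobs and 60 s.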
I have a solution for this, where ExPyRe, system, and scheduler can all be told to store information in a buffer, and then start all the jobs in buffer at once (one ssh to set up the directories, one rsync to stage in the run dirs, and one ssh to submit all the jobs). A PR will be available eventually - it'd be useful if people tested the SGE implementation, which I do not have access to.
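A minimal sketch of the buffering idea described above: job starts are collected in a buffer, then flushed with one rsync for all run dirs and one ssh for all submissions. The class and method names here are purely illustrative, not ExPyRe's actual interface:

```python
import subprocess

class BufferedSubmitter:
    """Collect job starts, then flush them in one rsync + one ssh."""

    def __init__(self, host):
        self.host = host
        self.pending = []  # list of (local_dir, submit_cmd) tuples

    def add(self, local_dir, submit_cmd):
        # buffer the job instead of staging and submitting immediately
        self.pending.append((local_dir, submit_cmd))

    def flush(self):
        if not self.pending:
            return
        # one rsync staging in all buffered run dirs at once
        local_dirs = [d for d, _ in self.pending]
        subprocess.run(["rsync", "-a", *local_dirs, f"{self.host}:jobs/"],
                       check=True)
        # one ssh submitting every buffered job
        submit_script = " && ".join(cmd for _, cmd in self.pending)
        subprocess.run(["ssh", self.host, submit_script], check=True)
        self.pending.clear()
```

This replaces N rsync + N ssh round trips with one of each per flush, which is where the factor-of-N saving in the 1.5 s + 1.5 s per-job overhead comes from.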
On some remote machines just the ssh connection is somewhat slow. It would be nice if multiple job start commands could be combined, perhaps by gathering all the remote commands into an array of strings, and then running all of them in a single ssh connection.
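The suggestion above can be sketched as follows: gather the remote commands into a list of strings and execute them all through a single ssh connection (function names are hypothetical, not part of ExPyRe):

```python
import subprocess

def build_batched_command(commands, stop_on_error=True):
    """Join remote shell commands into one script string."""
    # '&&' aborts the batch on the first failure; ';' runs all regardless
    sep = " && " if stop_on_error else " ; "
    return sep.join(commands)

def run_batched(host, commands):
    """Execute all commands on host with a single ssh invocation."""
    script = build_batched_command(commands)
    return subprocess.run(["ssh", host, script],
                          capture_output=True, text=True, check=True).stdout
```

This pays the per-connection ssh latency once per batch instead of once per command; OpenSSH connection multiplexing (`ControlMaster`/`ControlPersist`) would be an alternative that keeps separate commands but reuses one connection.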