Open uphoffc opened 6 years ago
You shouldn't have any problem running multiple independent tuning runs on multiple machines. At least, when I've done it, I automatically end up with one database file per machine, so there's no contention. But they won't share information.
If instead you want to run a single tuning run that runs trials on multiple machines, you should run OpenTuner on some other machine (doesn't have to be powerful because it won't be used for trials). Then implement `compile` to, besides doing whatever compilation you need for the configuration being tested, submit the job to the cluster, wait, and record the job time/fitness as the compilation result. Then your `run_precompiled` implementation just returns the fitness.
This is a hack around OpenTuner's assumption that compilation/preparation can run in parallel, but trials must run serially to avoid disturbing each other. If you pass `--batch-size N`, OpenTuner will "compile" N trials in parallel then run them sequentially before asking the search techniques for more configurations. The mario example uses a similar hack: trials run in separate, single-threaded emulator instances, so we can run as many of them in parallel as we have cores; we don't try to use multiple machines, but we could. (This is using Python threads so `compile` doesn't actually execute in parallel, but if you're launching and waiting for external processes, those processes can execute in parallel.)
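To make the shape of this hack concrete, here is a minimal sketch in plain Python. It is not OpenTuner code: `ClusterTrialSketch`, `submit_and_wait`, `build_command`, and `run_batch` are hypothetical names, the class does not subclass `opentuner.MeasurementInterface` (so it runs without OpenTuner installed), and the "cluster job" is just a local subprocess that sleeps. The method signatures are assumed to match the ones OpenTuner calls; in a real interface, `run_precompiled` would wrap the value in a `Result` rather than return a bare float.

```python
import subprocess
import sys
import time
from concurrent.futures import ThreadPoolExecutor


def submit_and_wait(command):
    """Hypothetical stand-in for 'submit the job to the cluster and wait'.

    Runs `command` as an external process and returns the elapsed
    wall-clock time in seconds, used here as the fitness.
    """
    start = time.perf_counter()
    subprocess.run(command, check=True)
    return time.perf_counter() - start


class ClusterTrialSketch:
    """Mirrors the compile / run_precompiled split described above."""

    def build_command(self, config_data):
        # Hypothetical job: sleep for config_data["delay"] seconds.
        code = f"import time; time.sleep({config_data['delay']})"
        return [sys.executable, "-c", code]

    def compile(self, config_data, id):
        # "Compilation" does the real work: launch the job, wait for it,
        # and hand back the measured fitness as the compile result.
        return submit_and_wait(self.build_command(config_data))

    def run_precompiled(self, desired_result, input, limit,
                        compile_result, id):
        # Nothing left to run; just report the fitness recorded earlier.
        return compile_result


def run_batch(configs):
    """Call compile from several threads, then collect results serially."""
    sketch = ClusterTrialSketch()
    with ThreadPoolExecutor(max_workers=len(configs)) as pool:
        # Each thread blocks on its own external process, so the
        # processes themselves overlap even though the threads do not.
        fitness = list(pool.map(sketch.compile, configs,
                                range(len(configs))))
    return [sketch.run_precompiled(None, None, None, f, i)
            for i, f in enumerate(fitness)]


if __name__ == "__main__":
    print(run_batch([{"delay": 0.1}, {"delay": 0.2}]))
```

In a real setup, `submit_and_wait` would call whatever your scheduler provides and parse the job's reported time instead of timing a local process.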
Thanks for the quick and detailed answer.
Should `--batch-size` above be `--parallelism`?
Hi,
I'm currently considering replacing our custom auto-tuning implementation with OpenTuner. I was wondering whether it is possible (or should be possible) to run OpenTuner simultaneously on multiple compute nodes, where all nodes share the same database (i.e. the database lies on a shared GPFS or Lustre file system). While there is of course the issue that nodes may run at slightly different speeds, the resulting parallelism should lead to faster exploration of the search space. Do you have any experience with this, or do you know of any technical limitation that would prevent it?
Best regards, Carsten