jharwell / sierra

Automation framework for the scientific method in AI research
MIT License
18 stars, 1 fork

feature/115-incremental-sim-progress-option #115

Closed jharwell closed 5 years ago

jharwell commented 5 years ago

It would be nice (and will probably be necessary on MSI in the future) to be able to resume a GNU parallel run after the scheduler has killed or stopped it before everything finished. This doesn't matter when the number of allocated nodes equals the number of simulations to run, but it does when there are fewer allocated nodes than simulations and GNU parallel steps through the commands file incrementally. In that case, being able to resume a previously "suspended"/killed simulation run will be very valuable.

And necessary if I'm only going to be getting tiny bits of time/nodes on the new 128-core cluster! Definitely not going to be able to request 50 nodes simultaneously on there...
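
A rough sketch of how this might look, relying on GNU parallel's `--joblog` and `--resume` options: with a job log on disk, a re-run with `--resume` skips the commands already recorded there and carries on with the rest of the commands file. The helper name `build_parallel_cmd`, the file names, and the job count are all made up for illustration; this is not how SIERRA actually does it, just one possible shape.

```python
import shlex


def build_parallel_cmd(commands_file, joblog, n_jobs, resume=True):
    """Build a GNU parallel invocation that can pick up where it left off.

    Hypothetical helper: the name and parameters are assumptions made for
    this sketch, not part of SIERRA.
    """
    # --joblog records each finished command, which is what --resume
    # consults to decide what can be skipped on the next invocation.
    cmd = ["parallel", "--jobs", str(n_jobs), "--joblog", str(joblog)]
    if resume:
        cmd.append("--resume")
    # -a reads one command per line from the commands file.
    cmd += ["-a", str(commands_file)]
    return cmd


# Example with made-up file names; the same command can be re-issued
# unchanged each time the scheduler grants another allocation.
cmd = build_parallel_cmd("commands.txt", "parallel.log", n_jobs=4)
print(shlex.join(cmd))
```

The job log would need to live somewhere that survives between scheduler allocations (e.g. alongside the commands file) for this to work. GNU parallel also has `--resume-failed`, which additionally retries commands that exited non-zero, if killed simulations should be re-run rather than skipped.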