prominence-eosc / imc

PROMINENCE infrastructure provisioner
Apache License 2.0

Handle unreliable clouds for multi-node MPI jobs #16

Closed alahiff closed 5 years ago

alahiff commented 5 years ago

The main OpenStack cloud we have access to can't reliably schedule multiple VMs at the same time: there is generally a ~100% chance that at least one will fail due to OpenStack's scheduling. We need to handle this situation and re-create the failed VMs.

alahiff commented 5 years ago

Done & from testing so far it seems to work. We now:

  1. Create the basic infrastructure (no Ansible recipes specified).
  2. If any VMs are in the failed state, remove them.
  3. Once the remaining infrastructure is in the configured state, add new VMs to replace the failed ones.
  4. Once the whole infrastructure is in the configured state, reconfigure it using the real, final RADL.
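
For illustration only, here is a rough Python sketch of the four steps above. It is not the actual imc code: `FakeCloud`, `deploy`, `max_rounds` and the state strings are hypothetical names invented for this example, and the fake client sets VM states immediately, whereas a real provisioner would have to poll until the remaining VMs reach the configured state.

```python
from dataclasses import dataclass, field


@dataclass
class FakeCloud:
    """Hypothetical in-memory stand-in for a provisioner client (not a real API)."""
    failures: list = field(default_factory=list)  # scripted outcome per VM: True = fails
    vms: dict = field(default_factory=dict)       # vm_id -> "configured" or "failed"
    recipes: str = ""                              # recipes applied by reconfigure()
    next_id: int = 0

    def add_vms(self, count):
        # Create `count` VMs; each one consumes the next scripted outcome,
        # and succeeds once the script is exhausted.
        for _ in range(count):
            failed = self.failures.pop(0) if self.failures else False
            self.vms[self.next_id] = "failed" if failed else "configured"
            self.next_id += 1

    def remove_vm(self, vm_id):
        del self.vms[vm_id]

    def failed_vms(self):
        return [vm_id for vm_id, state in self.vms.items() if state == "failed"]

    def reconfigure(self, recipes):
        self.recipes = recipes


def deploy(cloud, wanted, final_recipes, max_rounds=10):
    """Return True once `wanted` healthy VMs exist and the final recipes are applied."""
    # Step 1: create the basic infrastructure, with no recipes applied yet.
    cloud.add_vms(wanted)
    for _ in range(max_rounds):
        # Step 2: remove any VMs that ended up in the failed state.
        for vm_id in cloud.failed_vms():
            cloud.remove_vm(vm_id)
        missing = wanted - len(cloud.vms)
        if missing == 0:
            # Step 4: everything is configured, so apply the real configuration.
            cloud.reconfigure(final_recipes)
            return True
        # Step 3: add replacements for the VMs that were removed.
        cloud.add_vms(missing)
    return False  # gave up after max_rounds checks


if __name__ == "__main__":
    # The first and third VM creations fail; every later creation succeeds.
    cloud = FakeCloud(failures=[True, False, True])
    print(deploy(cloud, wanted=3, final_recipes="final"))  # True
    print(sorted(cloud.vms))                               # [1, 3, 4]
```

Capping the number of replacement rounds is an assumption of the sketch, so that a cloud that never manages to start a VM cannot loop forever.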