This change fully unblocks scale testing on both TPUs and CPUs. However, the mechanisms used to communicate the process IDs and the coordinator address (local files and distributed reads/writes via GCS) are not production-ready. Follow-up changes are required to make recovery fully automatic.