Currently, trying to train on multiple Cloud TPUs results in the training being stuck at 0%. It is highly likely that the issue is in the TFRecord dataset-building pipeline, but I couldn't definitively single out a root cause. Note: training works just fine on GPU.
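For reference, the kind of pipeline being debugged looks roughly like the minimal sketch below (TF 2.x with `tf.distribute.TPUStrategy`). The feature spec, file pattern, GCS bucket path, and toy model are hypothetical placeholders, not the actual code in this repo; the TPU-relevant details are `drop_remainder=True` (TPUs need static batch shapes) and reading the TFRecords from a `gs://` bucket, since the classic Colab TPU runtime cannot read local files.

```python
import tensorflow as tf

# Hypothetical feature spec; the real schema lives in the TFRecord-writing code.
FEATURE_SPEC = {
    "image": tf.io.FixedLenFeature([], tf.string),
    "label": tf.io.FixedLenFeature([], tf.int64),
}

def parse_example(serialized):
    example = tf.io.parse_single_example(serialized, FEATURE_SPEC)
    image = tf.io.decode_jpeg(example["image"], channels=3)
    image = tf.image.resize(image, [224, 224])  # static shape for the TPU
    return image, example["label"]

def build_dataset(file_pattern, batch_size):
    files = tf.data.Dataset.list_files(file_pattern, shuffle=True)
    ds = files.interleave(
        tf.data.TFRecordDataset,
        num_parallel_calls=tf.data.AUTOTUNE,
        deterministic=False,
    )
    ds = ds.map(parse_example, num_parallel_calls=tf.data.AUTOTUNE)
    ds = ds.shuffle(2048).repeat()
    # TPUs require fixed shapes, so drop the ragged final batch.
    ds = ds.batch(batch_size, drop_remainder=True)
    return ds.prefetch(tf.data.AUTOTUNE)

# Connect to the TPU; the resolver argument depends on the runtime
# (empty string picks up the Colab-provided TPU address).
resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu="")
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)
strategy = tf.distribute.TPUStrategy(resolver)

# Build and compile the model under the strategy scope (toy model for illustration).
with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(224, 224, 3)),
        tf.keras.layers.GlobalAveragePooling2D(),
        tf.keras.layers.Dense(10, activation="softmax"),
    ])
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

# TFRecords are assumed to live in a GCS bucket; the classic Colab TPU
# workers cannot read the local Colab filesystem.
model.fit(build_dataset("gs://my-bucket/train-*.tfrecord", batch_size=128),
          steps_per_epoch=100, epochs=1)
```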
It seems that the issue was with the internal Colaboratory TPU driver rather than the code itself. Since distributed training is working as expected, I am closing this issue for now.