pytorch / torchtune

A Native-PyTorch Library for LLM Fine-tuning
https://pytorch.org/torchtune/main/
BSD 3-Clause "New" or "Revised" License

Very High RAM Usage when resuming training from unsharded recipe_state #1318

Open bachvudinh opened 1 month ago

bachvudinh commented 1 month ago

I've encountered an issue while using torchtune for distributed training across multiple GPUs: when resuming training from an unsharded recipe_state, RAM usage becomes extremely high (over 900 GB).

Current behavior:

When resuming training, loading the unsharded recipe_state causes excessive RAM consumption, which makes it hard to resume training efficiently, especially on multi-node setups.

Temporary workaround:

I've added a time delay for each GPU based on its rank, so ranks load the checkpoint one after another, and each rank then waits for the others to finish before training resumes (see the sketch below). While this mitigates the immediate problem, it's not an optimal long-term solution.
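For illustration, here is a minimal sketch of that kind of rank-staggered load; the function name, the delay value, and the use of `torch.load` with `mmap` are my own assumptions, not the actual patch:

```python
import time

import torch
import torch.distributed as dist


def load_recipe_state_staggered(ckpt_path: str, delay_per_rank: float = 30.0):
    """Stagger the unsharded checkpoint load by rank so that not every
    process holds the full state dict in host RAM at the same moment."""
    rank = dist.get_rank()

    # Each rank waits rank * delay_per_rank seconds before loading, so peak
    # host memory is dominated by only one (or a few) ranks at a time.
    time.sleep(rank * delay_per_rank)

    # map_location="cpu" keeps the load off the GPU; mmap=True (PyTorch >= 2.1)
    # lets ranks share pages of the same file instead of each copying it.
    state = torch.load(ckpt_path, map_location="cpu", mmap=True)

    # All ranks wait here until everyone has finished loading.
    dist.barrier()
    return state
```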

So I propose:

joecummings commented 4 weeks ago

I think adding support for the Distributed Checkpointer (DCP) would address this - can someone from the DCP team confirm? @LucasLLC @pbontrager

DCP is definitely on our radar as something we want to integrate.
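For reference, a rough sketch of what resuming via `torch.distributed.checkpoint` could look like; the directory path and helper names are placeholders, and a real integration would go through torchtune's checkpointer (and, under FSDP, the `get_state_dict`/`set_state_dict` helpers) rather than plain `state_dict()` calls:

```python
import torch.distributed.checkpoint as dcp
from torch.distributed.checkpoint import FileSystemReader, FileSystemWriter

CKPT_DIR = "/tmp/recipe_state_dcp"  # placeholder path


def save_sharded(model, optimizer):
    # Each rank writes only the shards it owns, so no rank ever has to
    # materialize the full unsharded state in host RAM.
    state_dict = {"model": model.state_dict(), "optim": optimizer.state_dict()}
    dcp.save(state_dict, storage_writer=FileSystemWriter(CKPT_DIR))


def load_sharded(model, optimizer):
    # The pre-populated state_dict tells DCP which shards this rank needs;
    # tensors are loaded in place, shard by shard.
    state_dict = {"model": model.state_dict(), "optim": optimizer.state_dict()}
    dcp.load(state_dict, storage_reader=FileSystemReader(CKPT_DIR))
    model.load_state_dict(state_dict["model"])
    optimizer.load_state_dict(state_dict["optim"])
```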

As a side note, I will admit our testing of multi-node setups is very limited so far; we've mainly stuck to single-node training. It might be worth checking out how torchtitan handles checkpointing, as its charter explicitly covers massively parallel training.