imbue-ai / cluster-health

MIT License
249 stars 32 forks source link

Training Infrastructure Scripts

This repository contains various scripts developed at Imbue for managing a large cluster of H100s, detecting and fixing hardware issues, and generally ensuring smooth model training. You can read more about our process here

The code is organized as follows: