mosaicml / llm-foundry

LLM training code for Databricks foundation models
https://www.databricks.com/blog/introducing-dbrx-new-state-art-open-llm
Apache License 2.0

Conversion Sharded -> Monolithic checkpoint #1220

Open pretidav opened 1 month ago

pretidav commented 1 month ago

I was wondering whether there is a straightforward way to convert a sharded checkpoint to a monolithic one, for a subsequent conversion to HF format (not a direct sharded -> HF conversion). I've read that you can define a monolithic callback saver, but I would prefer an offline approach, outside of training, that simply reads the checkpoint and writes it in the new format.

Thanks for all the answers.

dakinggg commented 1 month ago

Unfortunately, we have not written a straightforward script for this. As a workaround, you can launch training with the HF checkpointer callback enabled, for 1 batch, with a very small learning rate. Alternatively, modify train.py to call the callback's save checkpoint function directly and skip training.
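
A minimal sketch of the first workaround, written as overrides applied to an existing training config. The key names (`load_path`, `max_duration`, `optimizer.lr`, `callbacks.hf_checkpointer`, `save_folder`, `save_interval`) are assumed to match the layout of llm-foundry's training YAMLs, so check them against your installed version. The helper `one_batch_export_config` and both paths are hypothetical placeholders, not part of the library.

```python
import copy


def one_batch_export_config(train_cfg, sharded_ckpt_path, hf_out_folder):
    """Return a copy of train_cfg that loads a sharded checkpoint, runs a
    single batch, and has the HF checkpointer callback save at that batch.

    Key names are assumptions based on llm-foundry's training YAML layout.
    """
    cfg = copy.deepcopy(train_cfg)  # leave the caller's config untouched

    # Resume from the existing sharded checkpoint (placeholder path).
    cfg["load_path"] = sharded_ckpt_path

    # Train for exactly one batch.
    cfg["max_duration"] = "1ba"

    # Very small learning rate so the single step barely changes the weights.
    cfg.setdefault("optimizer", {})["lr"] = 1e-12

    # Enable the HF checkpointer callback and have it save after that batch.
    callbacks = cfg.setdefault("callbacks", {})
    callbacks["hf_checkpointer"] = {
        "save_folder": hf_out_folder,
        "save_interval": "1ba",
    }
    return cfg


# Hypothetical base config, trimmed to the keys this sketch touches.
base = {
    "max_duration": "10000ba",
    "optimizer": {"name": "decoupled_adamw", "lr": 1e-4},
}
export_cfg = one_batch_export_config(
    base, "path/to/sharded/checkpoint", "path/to/hf_output"
)
```

The resulting dictionary can be written back out as YAML and passed to train.py in the usual way. Note that the weights still take one (tiny) optimizer step, so this is an approximation of a pure conversion, not a bit-exact one.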