Closed: tomatopuree closed this issue 1 year ago
All of the Galactica checkpoints on HuggingFace hub are in float16. The model was trained in bfloat16 with optimizer state in float32. From our experiments (not included in the paper) the difference between float16 checkpoints and float32 checkpoints was negligible. We didn't experiment with 8-bit mode at all.
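Not from the paper, but a small numpy sketch (my own, illustrative) of why the float16 cast costs so little: for weights in a typical range, round-to-nearest conversion from float32 to float16 bounds the relative error by roughly half of float16's machine epsilon (about 4.9e-4).

```python
import numpy as np

# Simulate casting float32 "weights" down to a float16 checkpoint
# and measure the rounding error this introduces. Values are kept
# away from zero so none fall into float16's subnormal range.
rng = np.random.default_rng(0)
w32 = rng.uniform(0.1, 1.0, 10_000).astype(np.float32)
w16 = w32.astype(np.float16)

# float16 has a 10-bit mantissa; round-to-nearest keeps the relative
# error of normal values below 2**-11 (~4.9e-4).
rel_err = np.abs(w16.astype(np.float32) - w32) / w32
print(rel_err.max())  # well under float16 epsilon (2**-10)
```

Per-weight error of that magnitude is tiny compared with the noise already present in trained weights, which is consistent with the negligible checkpoint-level difference mentioned above.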
For 30b inference, a g5.12xlarge works well with tensor parallelism in float16 (`load_model(..., dtype=torch.float16, parallelize=True)`). Another option might be to use offloading.
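For the offloading route, here is a sketch using plain `transformers`/`accelerate` rather than the `load_model` helper above (my own example, not an official recipe; Galactica uses the OPT architecture, and `device_map="auto"` with `offload_folder` are standard `from_pretrained` options that shard across available GPUs and spill overflow to CPU RAM and then disk):

```python
import torch
from transformers import AutoTokenizer, OPTForCausalLM

def load_galactica_30b(offload_dir: str = "offload"):
    """Load Galactica-30B in float16, sharded across available devices.

    device_map="auto" (via accelerate) places layers on GPUs first,
    then CPU RAM, then offloads the remainder to `offload_dir` on disk.
    Expect slower generation the more layers end up offloaded.
    """
    tokenizer = AutoTokenizer.from_pretrained("facebook/galactica-30b")
    model = OPTForCausalLM.from_pretrained(
        "facebook/galactica-30b",
        torch_dtype=torch.float16,
        device_map="auto",
        offload_folder=offload_dir,
    )
    return tokenizer, model
```

Offloading trades throughput for hardware cost: it lets the 30b (or even 120b) checkpoints run on machines with less aggregate GPU memory than the model needs, at the price of noticeably slower inference.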
Thank you very much for your elucidation.
Do the 16- and 8-bit checkpoints available on the Hugging Face hub load the full-precision model with truncated weights, or was the training itself done in those lower precisions?
If the models were trained in full precision but are loaded in half or quarter precision, what kind of hit does validation performance take? Is there a comparison of this alongside the graph in the paper that shows losses for different model sizes?
And finally, if we want to run the 120b or 30b version at a reasonable speed, is there any suggested method other than spinning up a p4dn instance on AWS?
Thank you.