Closed: tomatopuree closed this issue 1 year ago
All of the Galactica checkpoints on HuggingFace hub are in float16. The model was trained in bfloat16 with optimizer state in float32. From our experiments (not included in the paper) the difference between float16 checkpoints and float32 checkpoints was negligible. We didn't experiment with 8-bit mode at all.
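Not from the paper, but a small numpy sketch (my own, illustrative) of why the float16 cast costs so little: for weights in a typical range, round-to-nearest conversion from float32 to float16 bounds the relative error by roughly half of float16's machine epsilon (about 4.9e-4).

```python
import numpy as np

# Simulate casting float32 "weights" down to a float16 checkpoint
# and measure the rounding error this introduces. Values are kept
# away from zero so none fall into float16's subnormal range.
rng = np.random.default_rng(0)
w32 = rng.uniform(0.1, 1.0, 10_000).astype(np.float32)
w16 = w32.astype(np.float16)

# float16 has a 10-bit mantissa; round-to-nearest keeps the relative
# error of normal values below 2**-11 (~4.9e-4).
rel_err = np.abs(w16.astype(np.float32) - w32) / w32
print(rel_err.max())  # well under float16 epsilon (2**-10)
```

Per-weight error of that magnitude is tiny compared with the noise already present in trained weights, which is consistent with the negligible checkpoint-level difference mentioned above.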
For 30b inference, a g5.12xlarge works well with tensor parallelism in float16 (`load_model(..., dtype=torch.float16, parallelize=True)`). Another option might be to use offloading.
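For the offloading route, here is a sketch using plain `transformers`/`accelerate` rather than the `load_model` helper above (my own example, not an official recipe; Galactica uses the OPT architecture, and `device_map="auto"` with `offload_folder` are standard `from_pretrained` options that shard across available GPUs and spill overflow to CPU RAM and then disk):

```python
import torch
from transformers import AutoTokenizer, OPTForCausalLM

def load_galactica_30b(offload_dir: str = "offload"):
    """Load Galactica-30B in float16, sharded across available devices.

    device_map="auto" (via accelerate) places layers on GPUs first,
    then CPU RAM, then offloads the remainder to `offload_dir` on disk.
    Expect slower generation the more layers end up offloaded.
    """
    tokenizer = AutoTokenizer.from_pretrained("facebook/galactica-30b")
    model = OPTForCausalLM.from_pretrained(
        "facebook/galactica-30b",
        torch_dtype=torch.float16,
        device_map="auto",
        offload_folder=offload_dir,
    )
    return tokenizer, model
```

Offloading trades throughput for hardware cost: it lets the 30b (or even 120b) checkpoints run on machines with less aggregate GPU memory than the model needs, at the price of noticeably slower inference.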
Thank you very much for your elucidation.
Do the 16- and 8-bit checkpoints available on the Hugging Face hub load the full-precision model with truncated weights, or was the training itself done in those lower precisions?
If the models were trained in full precision but are loaded in half or quarter precision, what kind of hit does validation performance take? Is there a comparison of this alongside the graph in the paper that shows losses for different model sizes?
And finally, if we want to run the 120b or 30b version at a reasonable speed, is there any suggested method other than spinning up a p4dn instance on AWS?
Thank you.