Open weifengpy opened 1 month ago
cc @vkuzo
without debugging, my guess would be something like:
checkpoint is loaded with device cuda, but it does not have the extra buffers for delayed scaling
if running the repo for the first time, the torchtitan/output/checkpoint folder will be empty, so the model won't load any checkpoints, but the error is still there. We do meta init and call init_weights to move the model from meta to cuda. The buffers for delayed scaling might need some treatment there.
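For illustration, here is a minimal sketch of the meta-init flow described above, run on CPU so it works without a GPU. `ToyDelayedScalingLinear`, its buffer names, and `init_buffers` are hypothetical stand-ins and not torchao's actual API; the sketch only shows why buffers created on the meta device hold no usable values until something explicitly re-initializes them.

```python
import torch
import torch.nn as nn


class ToyDelayedScalingLinear(nn.Linear):
    """Hypothetical stand-in for a linear layer that keeps delayed-scaling state."""

    def __init__(self, in_features, out_features, history_len=4):
        super().__init__(in_features, out_features)
        # Hypothetical buffer names; a real implementation may differ.
        self.register_buffer("amax_history", torch.zeros(history_len))
        self.register_buffer("scale", torch.ones(()))

    def init_buffers(self):
        # Restore the values the constructor would have written on a real device.
        with torch.no_grad():
            self.amax_history.zero_()
            self.scale.fill_(1.0)


# 1. Build on the meta device: only shapes and dtypes exist, no storage.
with torch.device("meta"):
    layer = ToyDelayedScalingLinear(8, 8)
assert layer.amax_history.is_meta

# 2. Materialize on a real device. to_empty allocates uninitialized memory
#    for parameters AND buffers, so the constructor's values are not restored.
layer.to_empty(device="cpu")

# 3. Re-initialize explicitly. Parameters usually get this via an
#    init_weights-style hook; the extra buffers need the same treatment.
layer.reset_parameters()
layer.init_buffers()
```

If step 3 skips the buffers, they keep whatever happened to be in the freshly allocated memory, which is one way the extra delayed-scaling state could end up needing special handling.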
I can take a look next week, unless someone gets to it faster
thanks!
I see, then this line is relevant: https://github.com/pytorch/ao/blob/e85c1a318b06bbdb3b8c7f92f3257999864446b0/torchao/float8/float8_linear.py#L648
We'll have to think about whether we can do this automatically without introducing one more API. If not, we'll have to design such an API.
I see. it sounds plausible
I really don't love this solution, but we could do something like this: https://github.com/pytorch/ao/pull/1292. Thoughts?
thanks for the fix!
reopening, since the fix hasn't landed yet :)
repro: