Closed lessw2020 closed 2 months ago
Looking at the error stack:
[rank0]:[rank0]: File "/home/less/local/miniconda3/envs/newserver/lib/python3.10/site-packages/torch/distributed/_tensor/ops/_embedding_ops.py", line 119, in _reduce_value
[rank0]:[rank0]: assert self.mask_buffer.data is not None
[rank0]:[rank0]: AssertionError
This seems to come from the TP package. Regardless of the error itself, an assert like this feels a bit intrusive:
assert self.mask_buffer.data is not None
Is this assert complaining that the embedding weight is not loaded?
I've found that if the sample activations we pass in when creating the PipelineStage have the same dtype as the 'future' dtype we will run the model in, then I am able to run the model with real weights in the proper dtype. (In this case that is fp16, so we have to pass in fp16 sample activations and then things work as expected. Passing in fp32 activations and then running in a half dtype is where things break with this error.)
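A minimal sketch of the workaround (the shapes and the commented-out PipelineStage call are illustrative assumptions, not the project's actual code): the key point is that the example activations used to create the pipeline stage must already be in the dtype the model will run in.

import torch

run_dtype = torch.float16  # the dtype we will actually run the model in

# Example activation for stage creation: created directly in the run dtype.
# Passing an fp32 example here and then running in fp16/bf16 is what hits
# the mask_buffer assert above.
example_activation = torch.randn(2, 16, 64, dtype=run_dtype)

# stage = PipelineStage(stage_module, stage_index, num_stages, device,
#                       input_args=(example_activation,))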
I've further updated things to store the checkpoint dtype in the internal lookup table and adjust accordingly. Things now work as expected, and the user does not have to provide or even care about this dtype...it just works. This is ultimately a limitation of the built-in parallelism, but we now handle it with an automated solution, so I'm closing this out.
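For reference, a rough sketch of the kind of automation described above, assuming a simple registry keyed by model name (all names here are hypothetical stand-ins for the internal lookup table, not the actual implementation):

import torch

# Hypothetical registry standing in for the "internal lookup table" mentioned above.
_checkpoint_dtypes: dict = {}

def record_checkpoint_dtype(model_name, state_dict):
    """Remember the checkpoint's floating-point dtype when weights are loaded."""
    for tensor in state_dict.values():
        if torch.is_tensor(tensor) and tensor.is_floating_point():
            _checkpoint_dtypes[model_name] = tensor.dtype
            return

def example_inputs_for(model_name, shape):
    """Build example activations in the checkpoint dtype, so the user never
    has to specify it when the pipeline stage is created."""
    dtype = _checkpoint_dtypes.get(model_name, torch.float32)
    return torch.randn(*shape, dtype=dtype)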
🐛 Describe the bug
Using our prototype parallel blocks for built-in distributed, we can run TP + PP in fp32 successfully. However, moving the model to bfloat16 or fp16 results in an embedding assert:
This issue is to track the debugging and resolution.
Versions
N/A