pytorch / xla

Enabling PyTorch on XLA Devices (e.g. Google TPU)
https://pytorch.org/xla

check failed ShapeUtil::Compatible error for FSDP job #5089

Open anw90 opened 1 year ago

anw90 commented 1 year ago

I recently implemented my own model using PyTorch/XLA FSDP on GPU, but it fails with a "Check failed: ShapeUtil::Compatible" error:

2023-05-26 17:36:08.508196: F external/org_tensorflow/tensorflow/compiler/xla/service/layout_assignment.cc:157] Check failed: ShapeUtil::Compatible(shape_layout.shape(), instruction->operand(operand_no)->shape()) f32[16384512]{0} is not compatible with f32[131076096]{0} (for operand 0 of instruction %reduce-scatter.2477 = f32[16384512]{0} reduce-scatter(f32[131076096]{0} %add.2472), replica_groups={}, constrain_layout=true, dimensions={0}, to_apply=%AddComputation.67)
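For readers parsing the log line above, here is a small sketch that pulls the two element counts out of the message and compares them. It shows that the reduce-scatter operand (f32[131076096]) is exactly 8 times the size of its output (f32[16384512]). Reading that factor of 8 as the number of participating shards is an assumption; the thread does not state the device count. The helper name `parse_shape_sizes` is made up for this illustration.

```python
import re

# Excerpt of the fatal log message quoted in this issue.
LOG_LINE = (
    "Check failed: ShapeUtil::Compatible(shape_layout.shape(), "
    "instruction->operand(operand_no)->shape()) "
    "f32[16384512]{0} is not compatible with f32[131076096]{0}"
)


def parse_shape_sizes(message):
    """Return the element counts of every 1-D f32[...] shape in the message."""
    return [int(n) for n in re.findall(r"f32\[(\d+)\]", message)]


# The first shape matches the reduce-scatter output in the log,
# the second matches its operand (%add.2472).
output_elems, operand_elems = parse_shape_sizes(LOG_LINE)

# If the operand is a whole multiple of the output, the two sizes are
# consistent with a reduce-scatter that splits dimension 0 evenly.
ratio, remainder = divmod(operand_elems, output_elems)
print(ratio, remainder)  # prints "8 0"
```

The sizes themselves therefore look like an ordinary scatter over dimension 0: the check fails because the shape attached to the operand's layout constraint is the output-sized one, not because the two counts are unrelated.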

I suspect this is related to the recent update of the TF pin to 03/2023, as explained in https://github.com/pytorch/xla/pull/4840#issuecomment-1515345366. The same model runs successfully with an older version of PyTorch/XLA.

Are there any new findings on this issue?

Thanks.

@JackCaoG @wonjoolee95

JackCaoG commented 1 year ago

This is a known issue, XLA:GPU team is looking into it.

Seventeen17 commented 1 year ago

Quoting JackCaoG: "This is a known issue, XLA:GPU team is looking into it."

Is there any progress in solving this issue? Are there any quick solutions that I can try out?

JackCaoG commented 1 year ago

It is still WIP; let me try to get an ETA. FYI, the XLA:GPU team needs to look at the issue and fix it on the compiler side, and pytorch/xla then needs to wait for the next TF pin update to pick up the fix. That update is scheduled to start in mid-June but will complete around the end of the month.

Seventeen17 commented 1 year ago

May I ask whether this has been fixed? It would be great to know the progress and outcome of this issue. Thank you!

JackCaoG commented 1 year ago

AFAIK, the issue has only surfaced on H100 and P4 GPUs; we have not seen it on V100 or A100 yet. The fix is still WIP.

vanbasten23 commented 1 year ago

Update: we discovered that the issue also surfaces on V100, and the XLA:GPU team is investigating it internally.

mmseerrttt commented 9 months ago

Is there any progress on this issue?