This is to track/investigate the issue reported by Rich Zhu, where using permute to generate a transposed tensor for nn.Linear results in an incorrect aten.expand call.
I've found two potential issues:
a - The original DTensor placement spec is incorrect relative to the device mesh: the placement is [Shard(dim=0)], even though the device mesh spans 2 GPUs, i.e. has a [1, 2] shape.
(Note: this issue was not detected when Rich ran it, though my run was inside a testing environment.)
b - Correcting this by updating the placement to [Shard(dim=0), Shard(dim=0)] in view_ops' reshape_prop, to match the mesh, then results in target_schema.args_schema[1:] in dispatch.py requesting an expansion of a [2, 40, 10] tensor into [1, 40, 10], which fails when the operation is called:
RuntimeError: The expanded size of the tensor (1) must match the existing size (2) at non-singleton dimension 0. Target sizes: [1, 40, 10]. Tensor sizes: [2, 40, 10]
The permute operation and the weight tensor replication both seem to take place without issue.
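For reference, a minimal sketch of the kind of "simple model" this is about (the original repro code isn't included here; the class name and the Linear out_features below are placeholders, only the input/permuted shapes follow the ones quoted above):

```python
import torch
import torch.nn as nn


class PermuteLinear(nn.Module):
    """Hypothetical stand-in for the repro model: permute, then nn.Linear."""

    def __init__(self):
        super().__init__()
        # in_features=10 matches the last dim after the permute; out_features
        # is arbitrary here.
        self.linear = nn.Linear(10, 80)

    def forward(self, x):
        # [2, 10, 40] -> [2, 40, 10], then Linear over the last dimension
        return self.linear(x.permute(0, 2, 1))


model = PermuteLinear()
print(model(torch.randn(2, 10, 40)).shape)  # torch.Size([2, 40, 80])
```

Plain eager execution of this pattern is fine; the trouble only starts once the captured/parallelized graph brings in the aten.expand.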
1 - Running the repro on 2 GPUs results in the following error:
view ops 506:
in_shard=[Shard(dim=0)], mesh_sizes=(1, 2), local_in_shape=(2, 10, 40)
Traceback (most recent call last):
File "/home/ubuntu/feature_fusion/spmd/tensor/dispatch.py", line 222, in propagate_input_sharding
output_sharding = sharding_prop_func(op_schema)
File "/home/ubuntu/feature_fusion/spmd/tensor/ops/view_ops.py", line 665, in reshape_prop
) = propagate_shape_and_sharding(
File "/home/ubuntu/feature_fusion/spmd/tensor/ops/view_ops.py", line 508, in propagate_shape_and_sharding
assert len(in_shard) == len(mesh_sizes)
AssertionError
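In other words, the spec carries a single placement while the mesh is 2-D, so the length check trips. A tiny self-contained illustration of that invariant (Shard here is a stub, not the spmd class):

```python
from dataclasses import dataclass


@dataclass
class Shard:
    """Stub standing in for the real Shard placement."""
    dim: int


in_shard = [Shard(dim=0)]  # one placement on the input DTensor spec ...
mesh_sizes = (1, 2)        # ... but the device mesh has two dimensions

try:
    # propagate_shape_and_sharding expects one placement per mesh dim
    assert len(in_shard) == len(mesh_sizes)
except AssertionError:
    print(f"{len(in_shard)} placement(s) vs {len(mesh_sizes)} mesh dims")
```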
To resolve this, I added a patch that upgrades the placements sequence so the input DTensor spec becomes [Shard(dim=0), Shard(dim=0)].
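Roughly, the patch amounts to something like the helper below (a sketch, not the actual change in view_ops.py; the function name is made up):

```python
def pad_placements(in_shard, mesh_sizes):
    # Hypothetical helper: give the spec one placement per mesh dimension by
    # repeating the last placement, which reproduces the
    # [Shard(dim=0), Shard(dim=0)] spec mentioned above. Whether that is the
    # *correct* placement is exactly question 1 below.
    missing = len(mesh_sizes) - len(in_shard)
    if missing > 0:
        return list(in_shard) + [in_shard[-1]] * missing
    return list(in_shard)
```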
2 - With the placement updated (possibly incorrectly, since the root issue may be exactly why the proper placement isn't being accounted for), we move on to the expansion issue. The weights are replicated, and the input tensor is properly permuted from [2, 10, 40] to [2, 40, 10]. Ultimately, the expansion down to half the batch size comes from dispatch.py line 222, where the op_schema changes silently; aten.expand is then called with that target shape, resulting in the error above.
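The failing call is easy to reproduce in plain PyTorch, since expand cannot shrink a non-singleton dimension, which is exactly what the [1, 40, 10] target asks of the local [2, 40, 10] tensor:

```python
import torch

local = torch.randn(2, 40, 10)   # the local (sharded) activation after the permute
try:
    local.expand(1, 40, 10)      # the target shape dispatch ends up requesting
except RuntimeError as e:
    # "The expanded size of the tensor (1) must match the existing size (2)
    #  at non-singleton dimension 0. ..."
    print(e)
```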
This leaves several questions:
1 - Why is the placement (at least in my runs) incorrect?
2 - Once the placement is modified, why does the expansion go the wrong way, trying to expand the [2, 40, 10] tensor to a [1, 40, 10] target, hence the error?
Thanks @lessw2020 for the investigation. I looked into it further, and it turns out that the issue is:
As input to SPMD, we're passing a tensor that is supposed to be interpreted as "sharded". This input tensor is then passed to aot_module. aot_module captures a graph that bakes in some of the shapes. Specifically, expand(shape=[2, 10, 80]) is baked into the graph. But this graph is supposed to be the "global graph" -- these shapes are supposed to be global shapes that are going to be split again at parallelization.
So when we do shape propagation, we expect [2, 80, 10] to be the global shape, when in reality it's the sharded shape (because we're using from_local).
It's confusing and I will try to explain it better tomorrow, but I'm devising a hack with a fix.
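To make the shape bookkeeping concrete, the arithmetic behind the "half batch size" expansion looks like this (using the batch of 2 and the 2-way sharded mesh dim from the report; the variable names are just for illustration):

```python
num_shards = 2   # size of the sharded mesh dimension
local_batch = 2  # what each rank actually holds (we used from_local)

# from_local means 2 is the *local* shape, so the true global batch is 4 ...
true_global_batch = local_batch * num_shards    # 4
# ... but aot_module baked the 2 into the graph as if it were global ...
baked_in_batch = local_batch                    # 2
# ... so splitting that "global" shape again at parallelization gives 1,
# the bogus per-rank expand target that triggers the RuntimeError above.
resharded_batch = baked_in_batch // num_shards  # 1

print(true_global_batch, baked_in_batch, resharded_batch)  # 4 2 1
```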