The current implementation of `SeqParallelMultiHeadCrossAttention` produces different results across distributed settings.

For example, executing

```
python scripts/inference.py configs/opensora-v1-2/inference/sample.py --num-frames 1 --resolution 720p --aspect-ratio 9:16 --prompt "a beautiful waterfall" --verbose 2
```

produces the following image:

However, running with two GPUs via

```
torchrun --nproc_per_node 2 scripts/inference.py configs/opensora-v1-2/inference/sample.py --num-frames 1 --resolution 720p --aspect-ratio 9:16 --prompt "a beautiful waterfall" --verbose 2
```

will produce:

While both results look marvelous, it would be better to keep the results consistent across different distributed settings. They are inconsistent because the tensor Q is not reshaped correctly before the `all_to_all` among ranks.

If I understand correctly, Q has a shape of `[1, (B, SUB_N), NUM_HEADS, HEAD_DIM]` before `all_to_all`, after which we expect Q's shape to be `[1, (B, SP, SUB_N), SUB_NUM_HEADS, HEAD_DIM]` (where `SP` denotes the distributed world size). However, `all_to_all` simply concatenates along the gather dimension, so what we actually get is `[1, (SP, B, SUB_N), SUB_NUM_HEADS, HEAD_DIM]`. We can fix this either through the proposed changes in this pull request, or by conducting a transpose as follows, after which we observe a consistent result regardless of the distributed setting:
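The layout mismatch can be reproduced without any GPUs by simulating `all_to_all`'s gather step as a plain concatenation. This is only an illustrative sketch in NumPy, with hypothetical sizes; the variable names mirror the shape notation above, not the actual module's code:

```python
import numpy as np

# Hypothetical sizes for illustration (the bug only shows up when B > 1).
B, SP, SUB_N, NUM_HEADS, HEAD_DIM = 2, 2, 3, 4, 5
SUB_NUM_HEADS = NUM_HEADS // SP

# The full-sequence Q in the expected (B, SP, SUB_N) order: for each batch
# element, rank r holds sequence chunk r.
full_q = np.arange(B * SP * SUB_N * NUM_HEADS * HEAD_DIM, dtype=np.float32)
full_q = full_q.reshape(1, B * SP * SUB_N, NUM_HEADS, HEAD_DIM)

# Per-rank shards before all_to_all: [1, (B, SUB_N), NUM_HEADS, HEAD_DIM].
chunked = full_q.reshape(1, B, SP, SUB_N, NUM_HEADS, HEAD_DIM)
per_rank = [chunked[:, :, r].reshape(1, B * SUB_N, NUM_HEADS, HEAD_DIM)
            for r in range(SP)]

# all_to_all, viewed from rank 0: every rank sends rank 0 its first head
# chunk, and the received pieces are simply concatenated along the gather
# (sequence) dimension. The resulting layout is (SP, B, SUB_N), not the
# expected (B, SP, SUB_N).
gathered = np.concatenate(
    [p[:, :, :SUB_NUM_HEADS] for p in per_rank], axis=1
)  # [1, (SP, B, SUB_N), SUB_NUM_HEADS, HEAD_DIM]

# Fix: expose the SP and B axes, swap them, and flatten again.
fixed = (gathered
         .reshape(1, SP, B, SUB_N, SUB_NUM_HEADS, HEAD_DIM)
         .transpose(0, 2, 1, 3, 4, 5)
         .reshape(1, B * SP * SUB_N, SUB_NUM_HEADS, HEAD_DIM))

# rank 0's head slice of the full sequence, in the correct order
expected = full_q[:, :, :SUB_NUM_HEADS]
assert not np.array_equal(gathered, expected)  # raw all_to_all output is misordered
assert np.array_equal(fixed, expected)         # transpose restores the order
```

In the real module the same idea corresponds to a `view` into `(SP, B, SUB_N, ...)` followed by a `transpose` of the first two axes (or, equivalently, reshaping Q correctly before the `all_to_all`, as this PR does). With `B = 1` the two layouts coincide, which is why the bug only appears for batched inputs.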