hpcaitech / Open-Sora

Open-Sora: Democratizing Efficient Video Production for All
https://hpcaitech.github.io/Open-Sora/
Apache License 2.0

The network architecture of Open-Sora 1.2 has changed #569

Open liang-jinli opened 2 months ago

liang-jinli commented 2 months ago

First of all, we greatly appreciate such outstanding work. We are currently using OpenSora for some experiments but have encountered an issue:

The network architecture of Open-Sora 1.2 is significantly different from previous versions. According to the report for version 1.0, it used the STDiT (Sequential) structure. However, version 1.2 seems to have reverted to the Latte structure. Could you explain why this change was made? Do new experiments show that the Latte structure performs better?

henbucuoshanghai commented 2 months ago

You can just change it yourself, no?

liang-jinli commented 2 months ago

> You can just change it yourself, no?

I mainly want to know the specific reason for this change.

JThh commented 2 months ago

I did not see many references to using Latte in our latest report. Can you kindly refer me to those? As mentioned in report01.md:

> As shown in the figure, we insert a temporal attention right after each spatial attention in STDiT (ST stands for spatial-temporal). This is similar to variant 3 in Latte's paper. However, we do not control for a similar number of parameters across these variants. While Latte's paper claims their variant is better than variant 3, our experiments on 16x256x256 videos show that, with the same number of iterations, the performance ranks as: DiT (full) > STDiT (Sequential) > STDiT (Parallel) ≈ Latte. Thus, we choose STDiT (Sequential) for efficiency. A speed benchmark is provided [here](/docs/acceleration.md#efficient-stdit).
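For readers trying to picture the difference, here is a minimal sketch of an STDiT (Sequential) block, where the temporal attention sits directly after the spatial attention inside the same transformer block. The class name and dimensions are illustrative assumptions, not the repository's actual code, and the real blocks also carry cross-attention and adaLN conditioning, which are omitted here:

```python
import torch
import torch.nn as nn

class STDiTSequentialBlock(nn.Module):
    """Illustrative STDiT (Sequential) block: temporal attention
    immediately after spatial attention, inside a single block."""

    def __init__(self, dim: int, num_heads: int):
        super().__init__()
        self.norm_s = nn.LayerNorm(dim)
        self.norm_t = nn.LayerNorm(dim)
        self.norm_mlp = nn.LayerNorm(dim)
        self.spatial_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.temporal_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, x: torch.Tensor, t: int, s: int) -> torch.Tensor:
        # x: (B, T*S, C), tokens ordered frame-major (T frames, S tokens each).
        b, _, c = x.shape
        # Spatial attention: tokens attend within their own frame.
        xs = self.norm_s(x).view(b * t, s, c)
        x = x + self.spatial_attn(xs, xs, xs)[0].view(b, t * s, c)
        # Temporal attention: each spatial location attends across frames.
        xt = self.norm_t(x).view(b, t, s, c).transpose(1, 2).reshape(b * s, t, c)
        xt = self.temporal_attn(xt, xt, xt)[0]
        x = x + xt.view(b, s, t, c).transpose(1, 2).reshape(b, t * s, c)
        return x + self.mlp(self.norm_mlp(x))

block = STDiTSequentialBlock(dim=1152, num_heads=16)
x = torch.randn(2, 8 * 64, 1152)   # 2 videos, 8 frames, 64 tokens per frame
out = block(x, t=8, s=64)          # -> (2, 512, 1152)
```

Latte's variant 3, by contrast, alternates whole blocks: a complete spatial transformer block followed by a complete temporal one, each with its own MLP, which is the layout this thread observes in stdit3.py.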

liang-jinli commented 1 month ago

> I did not see many references to using Latte in our latest report. Can you kindly refer me to those? As mentioned in report01.md: [...]

It's precisely because the report doesn't mention it that it seems so strange. In the code implementation, Open-Sora 1.2 returns to the structure of Latte's variant 3, but the report doesn't explain the reason for this decision. Also, is there any experiment showing that the Open-Sora 1.2 structure (i.e., Latte) is better?

GallenShao commented 1 month ago

Maybe the performance ranking "DiT (full) > STDiT (Sequential) > STDiT (Parallel) ≈ Latte" mentioned in report01 only holds when these structures have the same number of blocks. In Open-Sora 1.2 (opensora/models/stdit/stdit3.py) and in Open-Sora 1.1 (opensora/models/stdit/stdit2.py), the depth is 28 for both versions. This means that Open-Sora 1.1 has 28 blocks, each containing two self-attention layers, while Open-Sora 1.2 has 28 spatial blocks and 28 temporal blocks. Consequently, Open-Sora 1.2 has significantly more learnable parameters than Open-Sora 1.1. Splitting the self-attention into separate blocks might also contribute to the model's performance and stability. This disparity in parameter count might be why they changed the structure to Latte. A rough sketch of the comparison follows.
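To make the parameter-count argument concrete, here is a back-of-the-envelope sketch under simplifying assumptions: the hidden size (1152), head count (16), and depth (28) are illustrative, and the cross-attention, adaLN modulation, and norm layers that the real stdit2/stdit3 blocks contain are omitted:

```python
import torch.nn as nn

def attn(dim: int, heads: int) -> nn.Module:
    # One multi-head self-attention layer (QKV + output projection).
    return nn.MultiheadAttention(dim, heads, batch_first=True)

def mlp(dim: int) -> nn.Module:
    # Standard transformer MLP with 4x expansion.
    return nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

def count(m: nn.Module) -> int:
    # Total learnable parameters.
    return sum(p.numel() for p in m.parameters())

dim, heads, depth = 1152, 16, 28

# 1.1-style: each of the 28 blocks holds BOTH attentions but a single MLP.
v11 = nn.ModuleList(
    nn.ModuleDict({"s_attn": attn(dim, heads), "t_attn": attn(dim, heads), "mlp": mlp(dim)})
    for _ in range(depth)
)

# 1.2-style (Latte variant 3): 28 spatial + 28 temporal blocks,
# each a full transformer block with its own MLP.
v12 = nn.ModuleList(
    nn.ModuleDict({"attn": attn(dim, heads), "mlp": mlp(dim)})
    for _ in range(2 * depth)
)

print(f"1.1-style: {count(v11) / 1e6:.0f}M params")  # ~595M in this toy setup
print(f"1.2-style: {count(v12) / 1e6:.0f}M params")  # ~892M, roughly 1.5x more
```

In this toy setup the gap comes mostly from the MLPs: both layouts contain 56 attention layers, but the 1.2-style stack has one MLP per block, so 56 MLPs instead of 28.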

henbucuoshanghai commented 1 month ago

Maybe this is the reason.

Edwardmark commented 1 month ago

I noticed that as well.