thu-nics / DiTFastAttn


Thoughts on full3d optimization #5

Closed xesdiny closed 3 months ago

xesdiny commented 3 months ago

Thank you for doing such relevant work!

You identify three types of redundancy: (1) redundancy in the spatial dimension, (2) similarity between neighboring steps in attention outputs, and (3) similarity between the conditional and unconditional inference in attention outputs. If the sequence in an mmDiT architecture is built as joint self-attention over text tokens plus T x H x W visual tokens, the premises of methods 1 and 2 (window attention and sharing across timesteps) seem less tenable, and method 3 seems unable to construct a CFG residual cached between the conditional and unconditional branches.
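For concreteness, here is a minimal, hypothetical sketch (not DiTFastAttn code or any specific mmDiT implementation) of the joint-sequence setting the question assumes: text tokens are concatenated with flattened T x H x W visual tokens before self-attention, so a window defined on the spatial axes no longer corresponds to a contiguous span of the attention sequence.

```python
# Hypothetical sketch of an mmDiT-style joint self-attention sequence.
# All shapes and names are illustrative, not taken from DiTFastAttn.
import torch

B, L_text, D = 2, 77, 1152        # batch, text tokens, hidden size (illustrative)
T, H, W = 16, 32, 32              # latent video shape (illustrative)

text_tokens = torch.randn(B, L_text, D)
video_tokens = torch.randn(B, T, H, W, D).reshape(B, T * H * W, D)

# Self-attention runs over text + flattened 3D video tokens jointly.
# A local window over this 1D sequence mixes text tokens and tokens from
# different frames, which is why the spatial-window and timestep-sharing
# assumptions are questioned above.
joint_seq = torch.cat([text_tokens, video_tokens], dim=1)  # (B, L_text + T*H*W, D)
print(joint_seq.shape)  # torch.Size([2, 16461, 1152])
```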

hahnyuan commented 3 months ago

Thank you for your thoughtful feedback. Regarding the application of our methods to mmDiT architecture:

  1. Window attention: You're correct that this may not be as effective in mmDiT due to its different attention structure. We think that preserving full attention for the text tokens is likely more appropriate in this case.

  2. Attention output sharing across timesteps: Some degree of sharing may still work, but we haven't conducted experiments on how the mmDiT structure affects it. This is certainly an interesting direction for future exploration.

  3. Sharing Residual: We believe this could still be effective in mmDiT.

  4. We didn't utilize CFG residuals. In our experiments with DiT, we found that direct replacement (sketched below) yielded good results while avoiding the increased memory consumption and slight inference slowdown that caching CFG residuals would introduce. That said, we haven't tested this on other networks, so exploring CFG residuals in different architectures, including mmDiT, could be a valuable avenue for further research.
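To make point 4 concrete, here is a minimal sketch of the direct-replacement idea, assuming the CFG batch is laid out as [conditional; unconditional] along the batch dimension. The function name and layout are assumptions for illustration; this is not the DiTFastAttn implementation.

```python
# Minimal sketch of "direct replacement" for CFG sharing (illustrative only).
# Assumption: the batch stacks the conditional half first, unconditional second.
import torch
import torch.nn.functional as F

def attention_with_cfg_sharing(q, k, v, share_cfg: bool):
    """q, k, v: (2*B, heads, seq, dim), conditional in the first B entries."""
    if not share_cfg:
        return F.scaled_dot_product_attention(q, k, v)
    b = q.shape[0] // 2
    # Compute attention only for the conditional half ...
    out_cond = F.scaled_dot_product_attention(q[:b], k[:b], v[:b])
    # ... and reuse it directly for the unconditional half, instead of caching
    # a residual: no extra memory and no extra attention compute for that half.
    return torch.cat([out_cond, out_cond], dim=0)
```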

We appreciate your insights into how our methods might apply to or be adapted for different model architectures. We plan to explore these directions in the next version.