lucidrains / recurrent-interface-network-pytorch

Implementation of Recurrent Interface Network (RIN), for highly efficient generation of images and video without cascading networks, in Pytorch
MIT License

follow-up paper #22

Open lucidrains opened 2 months ago

lucidrains commented 2 months ago

recently ran into a researcher who told me there is a follow-up paper to this work

does anyone know of it?

StevenLiuWen commented 2 months ago

It could be this one: https://arxiv.org/pdf/2405.20324 (Nicolas Dufour et al., CVPR 2024), which extends RIN to text conditioning.
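For intuition, here is a minimal sketch of how text conditioning could be attached to a RIN-style latent branch, assuming cross-attention from the latents to text embeddings. The module name, dims, and residual placement are my own illustration, not the paper's exact architecture:

```python
import torch
from torch import nn

class TextConditionedLatents(nn.Module):
    # hypothetical sketch: condition RIN latents on text tokens via cross-attention
    def __init__(self, dim = 512, heads = 8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first = True)

    def forward(self, latents, text_tokens):
        # latents:     (batch, num_latents, dim)
        # text_tokens: (batch, num_text_tokens, dim), e.g. from a frozen text encoder
        normed = self.norm(latents)
        attended, _ = self.attn(normed, text_tokens, text_tokens)
        return latents + attended  # residual, so the unconditional path is preserved

latents = torch.randn(2, 256, 512)
text = torch.randn(2, 77, 512)
conditioned = TextConditionedLatents()(latents, text)  # (2, 256, 512)
```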

lucidrains commented 2 months ago

@StevenLiuWen very cool! and not by the original author(s)!

StevenLiuWen commented 2 months ago

> @StevenLiuWen very cool! and not by the original author(s)!

Also, another work, PointInfinity (https://arxiv.org/pdf/2404.03566), applied it to 3D point cloud generation. RIN- or Perceiver IO-style architectures have a nice property for handling high-resolution data: compute scales with a fixed latent set rather than with the data resolution. Looking forward to more applications of this approach.
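To spell out that property: per-block compute is dominated by self-attention over a small, fixed latent set, while the (potentially huge) data token set is only touched by two cross-attentions. A rough sketch of the read-process-write pattern, with hypothetical dims and not this repo's exact implementation:

```python
import torch
from torch import nn

class RINBlock(nn.Module):
    # rough sketch of the read -> process -> write pattern (hypothetical dims)
    def __init__(self, dim = 512, heads = 8, depth = 4):
        super().__init__()
        self.read = nn.MultiheadAttention(dim, heads, batch_first = True)
        self.process = nn.ModuleList([
            nn.MultiheadAttention(dim, heads, batch_first = True) for _ in range(depth)
        ])
        self.write = nn.MultiheadAttention(dim, heads, batch_first = True)

    def forward(self, latents, data):
        # read: latents cross-attend to data tokens - O(num_latents * num_data)
        read, _ = self.read(latents, data, data)
        latents = latents + read
        # process: self-attention among the latents only - O(num_latents^2),
        # independent of data resolution, so this is where depth is spent
        for attn in self.process:
            out, _ = attn(latents, latents, latents)
            latents = latents + out
        # write: data tokens cross-attend back to the updated latents
        written, _ = self.write(data, latents, latents)
        return latents, data + written

latents = torch.randn(1, 256, 512)   # small, fixed-size interface
data = torch.randn(1, 64 * 64, 512)  # e.g. a 64x64 feature map flattened to tokens
latents, data = RINBlock()(latents, data)
```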

lucidrains commented 2 months ago

indeed, thank you!

justinlovelace commented 2 months ago

This paper is a direct extension by one of the original authors (Ting Chen):

FIT: Far-reaching Interleaved Transformers. Ting Chen, Lala Li. https://arxiv.org/abs/2305.12689

I only skimmed it, but it looks like they just add local self-attention layers to the data branch of RIN. Their diffusion results are a bit hard to interpret because they only report MSE, but it seems reasonable that local self-attention over the pixels would help.
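To make the idea concrete, here is a toy sketch of windowed (local) self-attention over flattened data tokens, as I read it from my skim; the window size and dims are made up, and this is not FIT's exact formulation:

```python
import torch
from torch import nn
from einops import rearrange

class LocalSelfAttention(nn.Module):
    # toy sketch: self-attention within non-overlapping windows of data tokens
    def __init__(self, dim = 512, heads = 8, window_size = 64):
        super().__init__()
        self.window_size = window_size
        self.attn = nn.MultiheadAttention(dim, heads, batch_first = True)

    def forward(self, x):
        # x: (batch, seq_len, dim), seq_len assumed divisible by window_size
        b = x.shape[0]
        w = self.window_size
        x = rearrange(x, 'b (n w) d -> (b n) w d', w = w)  # group tokens into windows
        out, _ = self.attn(x, x, x)                        # attend within each window
        return rearrange(out, '(b n) w d -> b (n w) d', b = b)

data = torch.randn(2, 4096, 512)  # e.g. 64x64 image tokens, flattened
out = LocalSelfAttention()(data)  # same shape; attention restricted to 64-token windows
```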

lucidrains commented 2 months ago

@justinlovelace that's an interesting paper too! 🙏

Xynonners commented 2 months ago

> This paper is a direct extension by one of the original authors (Ting Chen):
>
> FIT: Far-reaching Interleaved Transformers. Ting Chen, Lala Li. https://arxiv.org/abs/2305.12689
>
> I only skimmed it, but it looks like they just add local self-attention layers to the data branch of RIN. Their diffusion results are a bit hard to interpret because they only report MSE, but it seems reasonable that local self-attention over the pixels would help.

Using NATTEN (neighborhood attention) on the data branch would probably work even better.
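Something like this, assuming NATTEN's module-level API (`NeighborhoodAttention2D`), which operates on channels-last (B, H, W, C) feature maps; the kernel size here is just a guess:

```python
import torch
from natten import NeighborhoodAttention2D  # pip install natten

# hedged sketch: swap the data-branch local attention for neighborhood attention.
# assumes NATTEN's channels-last (B, H, W, C) module API; kernel size is arbitrary.
na = NeighborhoodAttention2D(dim = 512, num_heads = 8, kernel_size = 7)

tokens = torch.randn(1, 64, 64, 512)  # data branch reshaped back to a 2d grid
out = na(tokens)                      # each token attends to its 7x7 neighborhood
assert out.shape == tokens.shape
```

Unlike windowed attention, the neighborhoods are sliding rather than non-overlapping, so there are no hard window boundaries in the receptive field.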

kushalj001 commented 1 month ago

Has anyone tried this purely on text? I am currently working on adapting it for text, so it would vaguely be a "diffusion language model", but I wanted to know if there are any similar works or negative results from folks who have tried it already (cc @justinlovelace, I would be interested in your thoughts/opinions).