lucidrains / flamingo-pytorch

Implementation of 🦩 Flamingo, state-of-the-art few-shot visual question answering attention net out of DeepMind, in PyTorch
MIT License

wrong attention masks? #6

Closed · dhansmair closed this issue 2 years ago

dhansmair commented 2 years ago

https://github.com/lucidrains/flamingo-pytorch/blob/44920f4191ba3c280ff84c6ebc76025656d1dab5/flamingo_pytorch/flamingo_pytorch.py#L159

In the Flamingo paper, the language features in the gated cross-attention layers only attend to the visual features from the immediately preceding image. I believe your attention masks are created in such a way that they attend to the visual features from all preceding images. Can you confirm? If so, a fix would be simply to change the `>=` to `==`.
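For context, the difference dhansmair describes can be sketched with a small standalone example. The tensors `text_time` and `media_time` below are made up for illustration (a per-token index of the image that immediately precedes it, and a per-image index); they are not taken verbatim from the repository:

```python
import torch

# Hypothetical timing indices for 5 text tokens and 3 images.
# text_time[i]: index of the image immediately preceding text token i.
# media_time[j]: index of image j in the interleaved sequence.
text_time  = torch.tensor([1, 1, 2, 3, 3])
media_time = torch.tensor([1, 2, 3])

# Current behavior (`>=`): each token attends to ALL preceding images.
mask_all = text_time[:, None] >= media_time[None, :]
# tensor([[ True, False, False],
#         [ True, False, False],
#         [ True,  True, False],
#         [ True,  True,  True],
#         [ True,  True,  True]])

# Proposed fix (`==`): each token attends ONLY to its immediately
# preceding image, matching the paper's description.
mask_immediate = text_time[:, None] == media_time[None, :]
# tensor([[ True, False, False],
#         [ True, False, False],
#         [False,  True, False],
#         [False, False,  True],
#         [False, False,  True]])
```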

lucidrains commented 2 years ago

oh no way! can you screenshot and share the relevant passage? wouldn't it make sense to see all preceding images before answering a question?

lucidrains commented 2 years ago

@dhansmair when in doubt, just make it a hyperparameter

https://github.com/lucidrains/flamingo-pytorch/commit/2f3606a30fdf6e6449841b403b59d5e1649f416b

let me know if that works for you!
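For readers following along, a minimal sketch of what such a toggle might look like. The function name `media_attention_mask` and the flag name `only_attend_immediate_media` are assumptions for illustration, not verified against the linked commit:

```python
import torch

def media_attention_mask(text_time, media_time, only_attend_immediate_media=True):
    # text_time: (text_len,) index of the image immediately preceding each token
    # media_time: (media_len,) index of each image
    if only_attend_immediate_media:
        # paper behavior: attend only to the immediately preceding image
        return text_time[:, None] == media_time[None, :]
    # alternative: attend to all preceding images
    return text_time[:, None] >= media_time[None, :]
```

Exposing it as a flag keeps both behaviors available, so users can still experiment with attending over all preceding images if that helps their task.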

sharifza commented 2 years ago

@lucidrains I think @dhansmair is right. The previous images are all observed later in the language model, but for conditioning only the immediately preceding image is attended to. Here is the relevant passage from the paper:

[screenshot of the relevant passage from the Flamingo paper]

lucidrains commented 2 years ago

@dhansmair @sharifza indeed! i've defaulted it to only attending to the last media item then! thank you both :pray: