Closed dhansmair closed 2 years ago

In the Flamingo paper, the language features in the gated cross-attention layers attend only to the visual features of the immediately preceding image. I believe your attention masks are created in such a way that they attend to the visual features of all preceding images. Can you confirm? If so, a fix would be to simply change the `>=` to `==`.

https://github.com/lucidrains/flamingo-pytorch/blob/44920f4191ba3c280ff84c6ebc76025656d1dab5/flamingo_pytorch/flamingo_pytorch.py#L159

oh no way! can you screenshot and share the relevant passage? wouldn't it make sense to see all preceding images before answering a question?

@dhansmair when in doubt, just make it a hyperparameter

https://github.com/lucidrains/flamingo-pytorch/commit/2f3606a30fdf6e6449841b403b59d5e1649f416b

let me know if that works for you!

@lucidrains I think @dhansmair is right. All previous images are still observed later in the language model, but for conditioning, only the immediately preceding image is attended to. Here is the relevant passage from the paper:

@dhansmair @sharifza indeed! i've defaulted it to only attending to the last media item then! thank you both :pray:
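For anyone following along, here is a minimal sketch of the masking being discussed (not the library's actual code; the variable names are illustrative). Assume each text token is tagged with the index of the image that most recently preceded it, and each image carries its own index; broadcasting the two against each other yields the cross-attention mask. The only difference between the two behaviors is the comparison operator, which is why a boolean hyperparameter can toggle between them:

```python
import torch

# Illustrative indices (not from the repo):
# text_time[i] = index of the image immediately preceding text token i
# media_time[j] = index of image j
text_time = torch.tensor([1, 1, 2, 2, 3])
media_time = torch.tensor([1, 2, 3])

# Reported behavior: each text token attends to ALL preceding images.
mask_all = text_time[:, None] >= media_time[None, :]

# Proposed fix ('>=' -> '=='): each text token attends ONLY to the
# immediately preceding image, matching the paper's conditioning scheme.
mask_immediate = text_time[:, None] == media_time[None, :]

print(mask_all.int())
print(mask_immediate.int())
```

With the sample indices above, `mask_all` lets the token tagged `3` see all three images, while `mask_immediate` restricts it to the third image only.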