lucidrains / flamingo-pytorch

Implementation of 🦩 Flamingo, state-of-the-art few-shot visual question answering attention network from DeepMind, in Pytorch
MIT License

Applying parallel attn with ff to existing pretrained model? #12

Open huu4ontocord opened 1 year ago

huu4ontocord commented 1 year ago

Hi - awesome work! I am trying to understand the parallel attention + feedforward block. I couldn't find a paper - only a reference to https://github.com/kingoflolz/mesh-transformer-jax. Is this right? Am I understanding correctly that it basically applies the qkv and ff operations at once? Is it possible to use this trick to modify an existing pretrained model?

https://github.com/lucidrains/flamingo-pytorch/blob/749f8244794002371913d2fc4e7411afd5eddc67/flamingo_pytorch/flamingo_palm.py#L90

Many thanks in advance!

Huu

lucidrains commented 1 year ago

@ontocord yup that's correct, it was invented by Ben Wang for GPT-J, then subsequently adopted by PaLM
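For reference, the parallel formulation can be sketched roughly as below: instead of the usual sequential `x = x + attn(norm(x)); x = x + ff(norm(x))`, the attention and feedforward branches share a single pre-LayerNorm and their outputs are summed into the residual in one step. This is a minimal illustrative sketch (the class name `ParallelBlock` and the use of `nn.MultiheadAttention` are my own choices, not the repo's actual implementation):

```python
import torch
from torch import nn

class ParallelBlock(nn.Module):
    """Transformer block applying attention and feedforward in parallel
    (GPT-J / PaLM style): one shared pre-norm feeds both branches, and
    both outputs are added to the residual together."""

    def __init__(self, dim, heads=8, ff_mult=4):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ff = nn.Sequential(
            nn.Linear(dim, dim * ff_mult),
            nn.GELU(),
            nn.Linear(dim * ff_mult, dim),
        )

    def forward(self, x):
        # single shared norm, then both branches computed from it
        normed = self.norm(x)
        attn_out, _ = self.attn(normed, normed, normed)
        # residual + attention + feedforward, summed in one step
        return x + attn_out + self.ff(normed)

x = torch.randn(2, 16, 64)          # (batch, seq, dim)
out = ParallelBlock(dim=64)(x)
print(out.shape)                    # same shape as input
```

Because the two branches no longer depend on each other, their projections can be fused or run concurrently, which is the throughput motivation in GPT-J and PaLM; note that a model pretrained with sequential blocks would generally need finetuning after such a structural change.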