lucidrains / meshgpt-pytorch

Implementation of MeshGPT, SOTA Mesh generation using Attention, in Pytorch
MIT License

New/replace attention mechanism #79

Closed MarcusLoppe closed 4 months ago

MarcusLoppe commented 4 months ago

Hi,

I was wondering whether the gateloop or decoder layers could be replaced by either MoE or ring attention.

Using MoE would allow the token-generator part to dedicate experts to different parts of the mesh. I've noticed that some tokens are used predominantly at the start of the sequence; with separate experts, each one can take more or less responsibility for those tokens, which should reduce the effective complexity of the model. The idea is roughly the same with ring attention. A rough sketch of what I mean by the MoE part is below.
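
To make the MoE idea concrete, here's a minimal toy sketch (my own code, not taken from this repo or any particular MoE library) of a token-level mixture-of-experts feedforward that could stand in for the dense feedforward inside a decoder block. The class name `MoEFeedForward` and parameters like `num_experts` are hypothetical, just to illustrate the routing idea:

```python
import torch
from torch import nn

class MoEFeedForward(nn.Module):
    def __init__(self, dim, num_experts = 4, mult = 4):
        super().__init__()
        # gating network scores each token against every expert
        self.gate = nn.Linear(dim, num_experts, bias = False)
        self.experts = nn.ModuleList([
            nn.Sequential(
                nn.Linear(dim, dim * mult),
                nn.GELU(),
                nn.Linear(dim * mult, dim)
            ) for _ in range(num_experts)
        ])

    def forward(self, x):
        # x: (batch, seq, dim)
        # top-1 routing: each token goes to the expert with the highest gate score
        gate_probs = self.gate(x).softmax(dim = -1)        # (b, n, e)
        weights, expert_idx = gate_probs.max(dim = -1)     # (b, n)

        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            mask = expert_idx == i                         # tokens routed to expert i
            if mask.any():
                out[mask] = expert(x[mask])

        # scale each token's output by its gate weight
        return out * weights.unsqueeze(-1)

# quick smoke test
tokens = torch.randn(2, 16, 512)
moe_ff = MoEFeedForward(dim = 512)
print(moe_ff(tokens).shape)   # torch.Size([2, 16, 512])
```

The hope is that tokens that dominate the start of the sequence get routed to their own experts, while the rest of the mesh sequence is handled elsewhere. A real implementation would of course also need load-balancing / auxiliary losses, which I've left out here.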

I've tried implementing both, but they are throwing CUDA errors; it might be a machine issue or incompetence on my part.

I could implement these changes and run tests to see if there are any improvements, although I'll need some guidance. No copy-pasting required, just an idea of how to implement these so I don't go wandering down a rabbit hole.