Add support for bias-based attention methods, including T5 and ALiBi, and update the serialization lib
This branch builds on the ALiBi branch from @wenshuoliu and abstracts it further, adding support for T5 bucketed relative attention (RA).
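For reviewers unfamiliar with ALiBi, here is a rough pure-Python sketch of the kind of bias it adds to the attention logits. It is not the code from this branch: the names `alibi_slopes` and `alibi_bias` are hypothetical, the slope formula assumes a power-of-two head count, and the symmetric `abs()` distance is an assumption for the non-causal case.

```python
def alibi_slopes(num_heads):
    """Per-head slopes 2^(-8/n), 2^(-16/n), ..., 2^(-8).

    Assumes num_heads is a power of two; other head counts need
    extra handling that is not shown here.
    """
    return [2.0 ** (-8.0 * (h + 1) / num_heads) for h in range(num_heads)]


def alibi_bias(num_heads, q_len, k_len):
    """Nested list of shape [num_heads][q_len][k_len].

    Each entry is -slope * |k - q|, so keys further from the query
    receive a larger penalty. The result is added to the attention
    logits before the softmax.
    """
    return [
        [[-m * abs(k - q) for k in range(k_len)] for q in range(q_len)]
        for m in alibi_slopes(num_heads)
    ]
```

The bias has no learned parameters and depends only on head index and distance, which is what makes it easy to share an interface with other additive-bias schemes.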
The T5 impl was compared against the flaxformer impl and the Mesh TensorFlow impl.
Note that the T5 impl is currently only callable with bidirectional on (and with the defaults from the paper). This should be fixed in a future PR.
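As a reference point for the bidirectional case, the bucketing can be sketched in scalar pure Python as below. This is an illustration modeled on the publicly available reference implementations, not the code in this PR: the function name is hypothetical, and the defaults `num_buckets=32` and `max_distance=128` are assumed to be the paper defaults mentioned above.

```python
import math


def t5_relative_position_bucket(relative_position, num_buckets=32, max_distance=128):
    """Map relative_position (key index minus query index) to a bucket id.

    Bidirectional variant: half of the buckets cover keys at or before
    the query, the other half cover keys after it.
    """
    half = num_buckets // 2
    # Keys after the query use the upper half of the bucket range.
    bucket = half if relative_position > 0 else 0
    n = abs(relative_position)

    # Small distances each get their own bucket.
    max_exact = half // 2
    if n < max_exact:
        return bucket + n

    # Larger distances share logarithmically sized buckets up to
    # max_distance; anything beyond that lands in the last bucket.
    log_ratio = math.log(n / max_exact) / math.log(max_distance / max_exact)
    large = max_exact + int(log_ratio * (half - max_exact))
    return bucket + min(large, half - 1)
```

The bucket id would then index a learned per-head embedding table whose values are added to the attention logits. Float precision can shift results by one bucket exactly at bucket boundaries, which is worth keeping in mind when comparing against other implementations.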