Open zzhhjjj opened 4 months ago
Ring attention for training on long sequences, similar to Megatron's context parallelism. The idea comes from https://github.com/zhuzilin/ring-flash-attention
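For reviewers unfamiliar with the technique, here is a minimal single-process sketch of the ring attention idea. It is not the code in this PR: it is non-causal, uses plain Python lists instead of flash-attention kernels, and replaces the real send/recv between ranks with a list rotation. The names `ring_attention`, `full_attention`, and `world_size` are hypothetical and chosen only for illustration. Each simulated rank keeps its own query shard while the key/value shards travel around the ring, and partial results are merged with a running (online) softmax.

```python
import math

def full_attention(q, k, v):
    """Reference: softmax(q k^T / sqrt(d)) v computed over the whole sequence."""
    d = len(q[0])
    out = []
    for qi in q:
        scores = [sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d) for kj in k]
        m = max(scores)
        w = [math.exp(s - m) for s in scores]
        total = sum(w)
        out.append([sum(wj * vj[c] for wj, vj in zip(w, v)) / total
                    for c in range(len(v[0]))])
    return out

def ring_attention(q, k, v, world_size):
    """Simulate non-causal ring attention across `world_size` ranks in one process."""
    n, d, dv = len(q), len(q[0]), len(v[0])
    assert n % world_size == 0, "sequence length must divide evenly across ranks"
    blk = n // world_size
    shard = lambda x, r: x[r * blk:(r + 1) * blk]

    # Every rank starts with the K/V shard for its own slice of the sequence.
    kv = [(shard(k, r), shard(v, r)) for r in range(world_size)]
    # Per-rank online-softmax state: running max, denominator, and numerator.
    m = [[-math.inf] * blk for _ in range(world_size)]
    l = [[0.0] * blk for _ in range(world_size)]
    o = [[[0.0] * dv for _ in range(blk)] for _ in range(world_size)]

    for _ in range(world_size):
        for r in range(world_size):
            k_blk, v_blk = kv[r]
            for i, qi in enumerate(shard(q, r)):
                scores = [sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d)
                          for kj in k_blk]
                new_m = max(m[r][i], max(scores))
                # Rescale what was accumulated under the previous running max.
                scale = math.exp(m[r][i] - new_m)
                w = [math.exp(s - new_m) for s in scores]
                l[r][i] = l[r][i] * scale + sum(w)
                o[r][i] = [o[r][i][c] * scale
                           + sum(wj * vj[c] for wj, vj in zip(w, v_blk))
                           for c in range(dv)]
                m[r][i] = new_m
        # Stand-in for the ring send/recv: rank r receives the shard of rank r-1.
        kv = [kv[(r - 1) % world_size] for r in range(world_size)]

    # Normalize and concatenate the per-rank outputs back into sequence order.
    return [[x / l[r][i] for x in o[r][i]]
            for r in range(world_size) for i in range(blk)]
```

After `world_size` steps every rank has seen every K/V shard exactly once, so the result should match full attention up to floating-point error while each rank only ever holds one K/V shard at a time. A causal variant would additionally need masking and, typically, some load balancing across ranks, which this sketch leaves out.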