I found that when the context length is 512k, training is much slower than your experimental results suggest. Training one batch of 512k tokens takes 585.85 seconds, i.e. 512000 / 585.85 ≈ 873.94 tokens/s.
I used 8× A100-80G GPUs with NVLink.
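For reference, a minimal sketch of how the throughput figure above is computed; the variable names are illustrative and the values are the ones reported in this issue:

```python
batch_tokens = 512_000   # one batch at 512k context length
step_time_s = 585.85     # measured wall-clock time per training step (seconds)

# Throughput = tokens processed per second of training
throughput = batch_tokens / step_time_s
print(f"{throughput:.2f} tokens/s")  # -> 873.94 tokens/s
```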