huggingface / nanotron

Minimalistic large language model 3D-parallelism training
Apache License 2.0

Fix _RowLinearAsyncCommunication #172

Closed C-TC closed 2 months ago

C-TC commented 4 months ago

As discussed in #46, the implementation of the row-parallel linear backward pass appears to be wrong when the TP mode is reduce-scatter and async communication is enabled. To be more specific, the gradient of the input tensor is produced by:

  1. compute the local slice of the input gradient
  2. all-gather that slice of the input gradient

This produces the same input gradient on all TP shards, which is not the correct behavior of a row-parallel linear layer. The check on the input gradient for row-parallel is also missing from the test cases. After adding the test, the result for test_tensor_parallel.py::test_row_linear[True-TensorParallelLinearMode.REDUCE_SCATTER-4-1-1] is as follows: [screenshot of failing test output] Only 1/4 of the input gradient values are correct, because that is the locally computed part. A simplified sketch of the buggy vs. corrected computation is shown below.
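For clarity, here is a simplified, hypothetical sketch of the difference (illustrative names and shapes only, not the actual nanotron code; it assumes the forward pass reduce-scatters the partial outputs along the sequence dimension, so grad_output arrives sharded along that dimension):

```python
import torch
import torch.distributed as dist

def buggy_backward(grad_output_local, weight_local, tp_group):
    # grad_output_local: [seq/tp, batch, out_features] (sharded along sequence)
    # weight_local:      [out_features, in_features/tp] (row-parallel shard)
    # Bug: the locally computed slice of grad_input is all-gathered, so every
    # rank ends up with the same grad_input even though each rank owns a
    # different weight shard. Only the chunk computed locally is correct.
    grad_input_slice = grad_output_local @ weight_local  # [seq/tp, batch, in/tp]
    tp_size = dist.get_world_size(tp_group)
    gathered = [torch.empty_like(grad_input_slice) for _ in range(tp_size)]
    dist.all_gather(gathered, grad_input_slice, group=tp_group)
    return torch.cat(gathered, dim=0)  # identical on every rank -> wrong

def fixed_backward(grad_output_local, weight_local, tp_group):
    # Correct order: first all-gather grad_output along the sequence dimension
    # (the backward of the forward reduce-scatter), then multiply by the local
    # weight shard, so each rank gets its own slice of grad_input.
    tp_size = dist.get_world_size(tp_group)
    gathered = [torch.empty_like(grad_output_local) for _ in range(tp_size)]
    dist.all_gather(gathered, grad_output_local, group=tp_group)
    grad_output_full = torch.cat(gathered, dim=0)  # [seq, batch, out_features]
    return grad_output_full @ weight_local         # [seq, batch, in_features/tp]
```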

This PR does the following:

  1. Fixes the backward pass of _RowLinearAsyncCommunication, in a way similar to the forward pass of _ColumnLinearAsyncCommunication, overlapping communication with part of the computation (a rough sketch of the overlap idea is given after the next paragraph).
  2. Adds the missing test that checks the correctness of the input gradient in _RowLinearAsyncCommunication.

This bug affects convergence and is triggered when the TP mode is reduce-scatter and async communication is enabled (which is a common setup for users).
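For reference, a rough sketch of the overlap idea (hypothetical helper, not the PR's exact code; it assumes grad_output is sharded along the sequence dimension and uses PyTorch's async collectives):

```python
import torch
import torch.distributed as dist

def backward_with_overlap(grad_output_local, weight_local, tp_group):
    tp_size = dist.get_world_size(tp_group)
    rank = dist.get_rank(tp_group)
    seq_chunk = grad_output_local.shape[0]

    # Buffer for the full grad_output, filled by an async all-gather.
    gathered = torch.empty(
        (seq_chunk * tp_size, *grad_output_local.shape[1:]),
        dtype=grad_output_local.dtype,
        device=grad_output_local.device,
    )
    handle = dist.all_gather_into_tensor(
        gathered, grad_output_local, group=tp_group, async_op=True
    )

    # Overlap: the chunk of grad_input that corresponds to the local
    # grad_output shard can be computed before the all-gather completes.
    grad_input = torch.empty(
        (seq_chunk * tp_size, *grad_output_local.shape[1:-1], weight_local.shape[-1]),
        dtype=grad_output_local.dtype,
        device=grad_output_local.device,
    )
    local_slice = slice(rank * seq_chunk, (rank + 1) * seq_chunk)
    grad_input[local_slice] = grad_output_local @ weight_local

    # Wait for the remaining grad_output chunks, then finish grad_input.
    handle.wait()
    for r in range(tp_size):
        if r == rank:
            continue
        sl = slice(r * seq_chunk, (r + 1) * seq_chunk)
        grad_input[sl] = gathered[sl] @ weight_local
    return grad_input
```

The point of the overlap is that the locally owned chunk of grad_input does not depend on the all-gather at all, so its matmul can hide part of the communication latency, mirroring what the column-linear forward already does.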

Please fix this, thanks!

3outeille commented 2 months ago

Thank you for your PR. There was indeed a bug in the way we compute the backward pass of the async RowLinear (for every rank, the local shard of grad_output was multiplied by that rank's local weight shard, which is not correct).

Just waiting for the CI to run, then I will merge this.

However, I would have expected a bigger difference before/after the PR in terms of loss (cf. https://api.wandb.ai/links/bouteille/2pe2otwy). I trained a 1B Llama on fineweb-edu for 500 steps as a sanity check.