tenstorrent / tt-metal

:metal: TT-NN operator library, and TT-Metalium low level kernel programming model.
Apache License 2.0

DistilBERT: Issue with Multi-Device Tensor Handling in Data Parallel Implementation #12640

Closed Sudharsan-V closed 6 hours ago

Sudharsan-V commented 6 days ago

We are working on a data parallel implementation of the DistilBERT model, taking reference from #9507. We encountered the issue: `Cannot mix single and multi-device tensors when calling launch op!`. We found a workaround with the following changes in `ttnn/cpp/ttnn/operations/matmul/matmul.hpp`. Changed:

constexpr auto matmul = ttnn::register_operation<"ttnn::matmul", operations::matmul::MatmulOperation>();
constexpr auto linear = ttnn::register_operation<"ttnn::linear", operations::matmul::LinearOperation>();

To:

constexpr auto matmul = ttnn::register_operation_with_auto_launch_op<"ttnn::matmul", operations::matmul::MatmulOperation>();
constexpr auto linear = ttnn::register_operation_with_auto_launch_op<"ttnn::linear", operations::matmul::LinearOperation>();
1. Is this change valid? These changes allowed us to run the data parallel implementation of the DistilBERT model on n300.

However, the implementation might be incomplete, as we couldn't get the complete output. With a batch size of 8, the output shape from the ttnn_distilbert model was (4, 384) instead of the expected (8, 384). We checked the PCC between ttnn_output and torch_output[:4, :] and found it to be greater than 0.99, indicating that the output from the second device is missing.

By analysing the perf sheet attached below, we conclude that not all ops are running in data parallel; most of the ops aren't running on both devices.

We are working to resolve these issues, and any suggestions or guidance would help us with the data parallel implementation. The current branch is here.

Distilbert_data_parallel_n300_2024_09_13_13_30_25.csv

Sudharsan-V commented 4 days ago

cc @boris-drazic , @jvasilje

Sudharsan-V commented 2 days ago

@bbradelTT, The changes mentioned above were made in the ttnn/cpp/ttnn/operations/matmul/matmul.hpp file to address the Cannot mix single and multi-device tensors when calling launch op! issue. Could you please confirm if these changes are appropriate and expected on the main branch?

bbradelTT commented 2 days ago

@Sudharsan-V we explicitly removed register_operation_with_auto_launch_op and do not want to go back to it. Also, what you are trying to do is not supported directly.

You would need to use CCL to run matmul across multiple devices.

@SeanNijjar would be able to provide more information about CCL.

cc @TT-BrianLiu

bbradelTT commented 2 days ago

@Sudharsan-V For CCL you can also see https://github.com/tenstorrent/tt-metal/blob/main/tech_reports/CCL/CclDeveloperGuide.md

SeanNijjar commented 2 days ago

I want to confirm the issue here.

If the model is purely data parallel across some number of devices, then we shouldn't need any explicit CCLs on the model or op side: for data parallel we don't need to move data between the different model instances running across the chips. They should all be able to run independently (except perhaps for splitting and merging inputs/outputs, respectively).
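The pure data-parallel pattern described above can be sketched in plain Python (hypothetical helper names, placeholder computation — not the ttnn API): inputs are split across devices, each device runs the model independently, and outputs are merged, with no inter-device communication in between.

```python
# Hypothetical sketch of pure data parallelism: no cross-device
# communication is required between the split and merge steps.

def split_batch(batch, num_devices):
    """Slice a list of samples into one shard per device."""
    per_device = len(batch) // num_devices
    return [batch[i * per_device:(i + 1) * per_device]
            for i in range(num_devices)]

def model_forward(shard):
    """Stand-in for a full model forward pass on one device."""
    return [x * 2 for x in shard]  # placeholder computation

def data_parallel_forward(batch, num_devices=2):
    shards = split_batch(batch, num_devices)
    # Each device computes independently on its own shard.
    outputs = [model_forward(s) for s in shards]
    # Merge per-device outputs back into one batch.
    return [y for out in outputs for y in out]
```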

To support this use case, @cfjchu has been working on the multi-device tensor infrastructure. I'd recommend checking out the doc found here: https://github.com/tenstorrent/tt-metal/blob/main/tech_reports/Programming%20Mesh%20of%20Devices/Programming%20Mesh%20of%20Devices%20with%20TT-NN.md

More specifically, from a quick glance it looks like https://github.com/tenstorrent/tt-metal/blob/main/tech_reports/Programming%20Mesh%20of%20Devices/Programming%20Mesh%20of%20Devices%20with%20TT-NN.md#3-distributing-tensor-to-meshdevice has the relevant information.

I am not an expert in the usage of this API for data parallel use cases, but I suspect something is wrong with how the input is being fed to the model (does the tokenizer output map to a mesh tensor?).

cfjchu commented 1 day ago

@SeanNijjar is right. There shouldn't be any need for explicit CCL operations for data parallel. @Sudharsan-V Please review the relevant docs on the multi-device tensor infra. Make sure all inputs to the model use a mesh_mapper, i.e. shard activations and replicate any model parameters.
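The shard-activations / replicate-parameters rule above can be illustrated with a plain-Python mock of the mesh-mapper semantics. This is not the real ttnn API (in ttnn this is done via the mesh mappers described in the linked tech report, e.g. `ShardTensorToMesh` / `ReplicateTensorToMesh`); the helper names here are illustrative only.

```python
# Plain-Python mock of shard vs. replicate mesh mapping
# (illustrative only; not the actual ttnn API).

def shard_to_mesh(tensor, num_devices, dim=0):
    """Split a nested-list 'tensor' along axis 0, one shard per device."""
    assert len(tensor) % num_devices == 0, "batch must divide evenly"
    per_device = len(tensor) // num_devices
    return [tensor[i * per_device:(i + 1) * per_device]
            for i in range(num_devices)]

def replicate_to_mesh(tensor, num_devices):
    """Give every device a full copy (used for model parameters)."""
    return [tensor for _ in range(num_devices)]

# Batch of 8 activation rows, sharded across 2 devices -> 4 rows each.
activations = [[float(i)] for i in range(8)]
act_shards = shard_to_mesh(activations, num_devices=2)

# Weights replicated: every device sees the same parameters.
weights = [[1.0]]
weight_copies = replicate_to_mesh(weights, num_devices=2)
```

With this mapping, each device holds a quarter-size activation shard plus a full weight copy, so per-device ops line up without any cross-device movement.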

Sudharsan-V commented 6 hours ago

The issue has been resolved. The problem was that one of the input tensors, position_ids, had a batch size of 1 regardless of the input batch size. Since we were slicing along axis 0, this tensor wasn't sharded across devices, causing the issue. The position_ids tensor has now been updated to have a batch size greater than 1, which resolves the problem. Closing the ticket.
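The root cause generalizes: any input whose batch dimension is 1 cannot be sliced along axis 0 into per-device shards. A plain-Python sketch of the failure mode and the fix (expanding position_ids to the full batch before sharding); names and shapes are illustrative:

```python
def shard_dim0(tensor, num_devices):
    """Split a nested list along axis 0, one shard per device."""
    per_device = len(tensor) // num_devices
    return [tensor[i * per_device:(i + 1) * per_device]
            for i in range(num_devices)]

# position_ids with batch size 1: integer division gives 0 rows per
# device, so every device ends up with an empty shard.
position_ids = [[0, 1, 2, 3]]             # batch = 1
bad_shards = shard_dim0(position_ids, 2)  # both shards empty

# The fix: expand position_ids to the full batch before sharding.
batch_size = 8
position_ids_full = [position_ids[0][:] for _ in range(batch_size)]
good_shards = shard_dim0(position_ids_full, 2)  # 4 rows per device
```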

Thank you @bbradelTT , @SeanNijjar , @cfjchu