Closed: Sudharsan-V closed this issue 6 hours ago
cc @boris-drazic, @jvasilje
@bbradelTT, the changes mentioned above were made in `ttnn/cpp/ttnn/operations/matmul/matmul.hpp` to address the `Cannot mix single and multi-device tensors when calling launch op!` issue.
Could you please confirm whether these changes are appropriate and expected on the main branch?
@Sudharsan-V we explicitly removed register_operation_with_auto_launch_op and do not want to go back to it. Also, what you are trying to do is not supported directly.
You would need to use CCL to run matmul across multiple devices.
@SeanNijjar would be able to provide more information about CCL.
cc @TT-BrianLiu
@Sudharsan-V For CCL you can also see https://github.com/tenstorrent/tt-metal/blob/main/tech_reports/CCL/CclDeveloperGuide.md
I want to confirm the issue here.
If the model is purely data parallel across some number of devices, then we shouldn't need any explicit CCLs on the model or op side: with data parallel we don't need to move data between the different model instances running across the chips, so they should all be able to run independently (except perhaps for splitting the inputs and merging the outputs).
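The independence described above can be sketched without any hardware: shard the batch across N model replicas, run each shard through the same (replicated) weights, and concatenate the per-replica outputs. No communication happens between replicas, which is why no CCL op is needed. All names and shapes below are illustrative, not the actual DistilBERT model.

```python
# Hardware-free sketch of pure data parallelism: shard the batch on dim 0,
# run each shard independently against replicated weights, concat the outputs.

def matmul(a, b):
    # Minimal matrix multiply over nested lists: a is (m, k), b is (k, n).
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def shard_dim0(tensor, num_devices):
    # Split along the batch dimension, mirroring sharding a mesh tensor on dim 0.
    step = len(tensor) // num_devices
    return [tensor[i * step:(i + 1) * step] for i in range(num_devices)]

# Batch of 4 activations, 2 "devices"; weights are replicated, not sharded.
activations = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
weights = [[1.0, 0.0], [0.0, 1.0]]  # identity matrix, same copy on every device

shards = shard_dim0(activations, 2)
per_device = [matmul(shard, weights) for shard in shards]  # independent runs
merged = [row for out in per_device for row in out]        # concat on dim 0

# The merged result matches a single-device run: replicas never communicated.
assert merged == matmul(activations, weights)
```

The only "collective" steps here are the initial split and the final concat, which is exactly the splitting/merging of inputs and outputs mentioned above.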
To support this use case, @cfjchu has been working on the multi-device tensor infrastructure. I'd recommend checking out the doc found here: https://github.com/tenstorrent/tt-metal/blob/main/tech_reports/Programming%20Mesh%20of%20Devices/Programming%20Mesh%20of%20Devices%20with%20TT-NN.md
More specifically, from a quick glance it looks like https://github.com/tenstorrent/tt-metal/blob/main/tech_reports/Programming%20Mesh%20of%20Devices/Programming%20Mesh%20of%20Devices%20with%20TT-NN.md#3-distributing-tensor-to-meshdevice has the relevant information.
I am not an expert in the usage of this API for data parallel use cases, but I suspect something is wrong with how the input is being fed to the model (does the tokenizer map to a mesh tensor?).
@SeanNijjar is right. There shouldn't be any need for explicit CCL operations for data parallel. @Sudharsan-V please review the relevant docs on the multi-device tensor infrastructure, and make sure all inputs to the model use a mesh_mapper, i.e. shard the activations and replicate any model parameters.
The issue has been resolved.
The problem was that one of the input tensors, `position_ids`, had a batch size of 1 (regardless of the input batch size). Since we were slicing along axis 0, this tensor wasn't sharded across devices, causing the issue.
The `position_ids` tensor has now been updated to have a batch size greater than 1, which resolves the problem.
As the issue has been resolved, closing the ticket.
Thank you @bbradelTT, @SeanNijjar, @cfjchu
We are working on a data parallel implementation of the DistilBERT model, taking #9507 as a reference. We encountered the issue:
`Cannot mix single and multi-device tensors when calling launch op!`
We could find a workaround with the following changes in `ttnn/cpp/ttnn/operations/matmul/matmul.hpp`.
Changed:
To:
However, the implementation might be incomplete, as we couldn't get the complete output. With a batch size of 8, the output shape from the `ttnn_distilbert` model was `(4, 384)` instead of the expected `(8, 384)`. We checked the PCC between `ttnn_output` and `torch_output[:4, :]` and found it to be greater than 0.99, indicating that the output from the second device is missing. By analysing the perf sheet here, we conclude that not all ops are in data parallel; most of the ops aren't running on both devices.
We are working to resolve these issues, and any suggestions or guidance would help us with the data parallel implementation. The current branch is here:
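The observed `(4, 384)` output is consistent with reading only one device's shard of an `(8, 384)` result: with batch 8 sharded across 2 devices, each device holds a `(4, 384)` piece, and the full output only appears after the per-device shards are gathered and concatenated along dim 0. A shape-only sketch (all values and names illustrative):

```python
# Shape-only sketch: two devices each hold a (4, 384) shard of the batch-8
# output; the full (8, 384) tensor is the concatenation of both shards.

num_devices, shard_batch, hidden = 2, 4, 384

# Stand-ins for the per-device output shards.
device_outputs = [[[0.0] * hidden for _ in range(shard_batch)]
                  for _ in range(num_devices)]

# Concatenate shards along dim 0 to recover the full batch.
full = [row for out in device_outputs for row in out]

assert (len(device_outputs[0]), len(device_outputs[0][0])) == (4, 384)
assert (len(full), len(full[0])) == (8, 384)
```

If the reported shape matches one shard, the likely culprit is that the result of only one device is being read back, rather than the per-device outputs being gathered first.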
Distilbert_data_parallel_n300_2024_09_13_13_30_25.csv