esmalTT opened this issue 2 months ago
Copying a thread from Slack. Evan was testing remote vs. MMIO readback performance:
```python
import time

import pytest
import torch

import ttnn


@pytest.mark.parametrize("enable_async_mode", (True, False), indirect=True)
@pytest.mark.parametrize("device_params", [{"l1_small_size": 16384}], indirect=True)
@pytest.mark.parametrize("layout", [ttnn.TILE_LAYOUT])
@pytest.mark.parametrize("device_id", [0, 1])
def test_transfer(mesh_device, layout, device_id, use_program_cache, enable_async_mode):
    device = mesh_device.get_devices()[device_id]

    sharded = True
    H = 337920
    expected = torch.rand([1, 1, H, 1])

    input_tensor = ttnn.from_torch(expected, dtype=ttnn.bfloat16)
    input_tensor = ttnn.to_layout(input_tensor, layout)

    # Height-shard the (tile-padded) H x 32 tensor across the full 8x8 core grid
    sharded_memory_config = ttnn.create_sharded_memory_config(
        [1, 1, H, 32], ttnn.CoreGrid(x=8, y=8), ttnn.ShardStrategy.HEIGHT
    )
    input_tensor = ttnn.to_device(input_tensor, device, sharded_memory_config if sharded else ttnn.L1_MEMORY_CONFIG)

    # Warmup
    x = input_tensor
    x = x.cpu(blocking=False)
    ttnn.synchronize_devices(device)

    # Enqueue non-blocking readbacks, then block once on synchronize_devices
    outputs = []
    iterations = 32 * 32
    start = time.time()
    for _ in range(iterations):
        x = input_tensor
        outputs.append(x.cpu(blocking=False))
    ttnn.synchronize_devices(device)
    end = time.time()

    total_time = end - start
    print(f"time: {1000.0 * total_time:.2f} ms")
    print(f"avg time: {1000.0 * total_time / iterations:.2f} ms")

    # Effective device-to-host bandwidth (bfloat16 = 2 bytes per element)
    elem_size = 2
    num_elem = H * 32
    num_devices = 1
    num_bytes = num_elem * elem_size * num_devices * iterations
    transfer_speed = num_bytes / total_time
    print(f"transfer speed: {transfer_speed * 1e-9:.2f} GB/s")
```
Latency:

For example, when reading from device in a loop:

- For a tensor of shape (2, 2048, 32):
  - Device 0 takes an average of 0.70 ms per transfer
  - Device 1 takes an average of 0.84 ms per transfer
- For a tensor of shape (2, 337920, 32):
  - Device 0 takes an average of 3.7 ms per transfer
  - Device 1 takes an average of 11.3 ms per transfer
Bandwidth:

- Almost 6 GB/s end-to-end on chip 0, but only 1.9 GB/s on chip 1
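The bandwidth figures follow directly from the per-transfer latencies above; a minimal sketch of the arithmetic for the large tensor (H = 337920, bfloat16 = 2 bytes per element):

```python
# Each readback moves the tile-padded H x 32 tensor.
bytes_per_transfer = 337920 * 32 * 2        # ~21.6 MB
print(bytes_per_transfer / 3.7e-3 * 1e-9)   # device 0: ~5.8 GB/s
print(bytes_per_transfer / 11.3e-3 * 1e-9)  # device 1: ~1.9 GB/s
```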
Update: @tt-asaigal has provided a fix (similar to this) that significantly improves end-to-end scaling for multiple devices. The fix removes a bottleneck on the host by ensuring that the worker and completion-queue threads run on completely independent cores.
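As a conceptual illustration only (not the actual tt-metal change, which lives in the C++ host runtime), pinning two host threads to disjoint cores looks roughly like this on Linux; the thread roles and core numbers below are hypothetical:

```python
import os
import threading
import time

def pin_current_thread(core: int) -> None:
    # On Linux, pid=0 applies the affinity mask to the calling thread.
    os.sched_setaffinity(0, {core})

def worker_loop() -> None:
    pin_current_thread(0)   # hypothetical: dispatch/worker thread on core 0
    time.sleep(0.1)         # placeholder for real work

def completion_queue_loop() -> None:
    pin_current_thread(1)   # hypothetical: completion-queue reader on core 1
    time.sleep(0.1)         # placeholder for real work

for target in (worker_loop, completion_queue_loop):
    threading.Thread(target=target).start()
```

With the two threads on separate cores, the completion-queue reader no longer competes with the worker thread for CPU time.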
Measuring the new end-to-end perf with this fix shows that performance scales much better when only using MMIO chips:
| System | MMIO Devices | Remote Devices | FPS | FPS per device | FPS per MMIO device |
|---|---|---|---|---|---|
| T3K (MMIO-only) | 4 | 0 | 1530 | 382.5 | 382.5 |
| T3K | 4 | 4 | 1366 | 170.75 | 341.5 |
| N300 | 1 | 1 | 377 | 188.5 | 377 |
Poor scaling on the remote devices hints that the main bottleneck is now likely in reads/writes to the remote chips.
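For reference, the per-device columns in the table are just the aggregate FPS divided by the corresponding device counts:

```python
print(1530 / 4)            # 382.5 fps per device, T3K MMIO-only (4 devices)
print(1366 / 8, 1366 / 4)  # 170.75 per device, 341.5 per MMIO device (T3K, 8 devices)
print(377 / 2, 377 / 1)    # 188.5 per device, 377 per MMIO device (N300, 2 devices)
```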
## Summary
On the current `main` (commit 861fb7ef87bf9c20ee7a4c1632e3852681cc8ef4), the single-chip performance of UNet is approx. 329 fps. Running the same test data parallel on N300 measures 246 fps, and the performance does not change if we disable async. Similarly, on T3K we are only getting 443 fps end-to-end. We should investigate why this is the case.
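To put those numbers in perspective, here is a rough scaling check, under the assumption that the multi-device figures above are aggregate fps across all devices rather than per-device:

```python
single_chip = 329
print(2 * single_chip, 246)  # ideal vs. measured aggregate fps on N300 (2 devices)
print(8 * single_chip, 443)  # ideal vs. measured aggregate fps on T3K (8 devices)
```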
## Steps to reproduce

### Using UNet Shallow

Build the latest `main` and enable ethernet dispatch cores: `export WH_ARCH_YAML=wormhole_b0_80_arch_eth_dispatch.yaml`. Run the following steps:
### Using isolated test case
The behaviour is also present in the following test:
Viewing the profiler in Tracy shows device 1 being slower than device 0:
## Investigation