tenstorrent / tt-metal

:metal: TT-NN operator library, and TT-Metalium low level kernel programming model.
Apache License 2.0
471 stars 75 forks source link

[Bug Report] copy_tile compute api doesn't work properly when idst is from 7 to 15. #3268

Closed dongjin-na closed 11 months ago

dongjin-na commented 1 year ago

Describe the bug Even though the API document states that idst argument in copy_tile can be ess than the size of the DST register (16), it doesn't seem to copy tile ​​to index properly from 7 to 15.

To Reproduce You can reproduce the problem by slightly modifying the test code running on WH b0. This patch make the compute kernel to use dst register from 8 - 11.

diff --git a/tests/tt_metal/tt_metal/test_matmul_multi_core_multi_dram.cpp b/tests/tt_metal/tt_metal/test_matmul_multi_core_multi_dram.cpp
index ca14dd194..3a69e51f2 100644
--- a/tests/tt_metal/tt_metal/test_matmul_multi_core_multi_dram.cpp
+++ b/tests/tt_metal/tt_metal/test_matmul_multi_core_multi_dram.cpp
@@ -331,12 +331,12 @@ int main(int argc, char **argv) {
         int device_id = 0;
         tt_metal::Device *device = tt_metal::CreateDevice(device_id);

-        int num_cores_r = device->logical_grid_size().y - 1;
+        int num_cores_r = device->logical_grid_size().y - 2;
         int num_cores_c = device->logical_grid_size().x;
         uint32_t M = 16 * num_cores_r;
         uint32_t K = 16 * 12;
         uint32_t N = 16 * num_cores_c;
-        int out_subblock_h = 4;
+        int out_subblock_h = 2;
         int out_subblock_w = 2;
         int in0_block_w = 2;
         int per_core_M = M / num_cores_r;
diff --git a/tt_metal/kernels/compute/matmul_large_block_zm.cpp b/tt_metal/kernels/compute/matmul_large_block_zm.cpp
index c1fd470c2..c60897812 100644
--- a/tt_metal/kernels/compute/matmul_large_block_zm.cpp
+++ b/tt_metal/kernels/compute/matmul_large_block_zm.cpp
@@ -35,6 +35,7 @@ void MAIN {
         cb_wait_front(tt::CB::c_in0, in0_block_num_tiles);
         cb_wait_front(tt::CB::c_in1, in1_block_num_tiles);
         int in0_index_subblock_offset = 0;
+        int dst_offset = 8; // if you set it from 0 to 4, then test code will be passed.
         for (uint32_t in0_subblock = 0; in0_subblock < in0_num_subblocks; in0_subblock++) {
             // kernel_profiler::mark_time(6);
             int in1_index_subblock_offset = 0;
@@ -46,14 +47,14 @@ void MAIN {
                     copy_tile_to_dst_init_short();
                     cb_wait_front(tt::CB::c_intermed0, out_subblock_num_tiles);
                     for (uint32_t i = 0; i < out_subblock_num_tiles; i++) {
-                        copy_tile(tt::CB::c_intermed0, i, i);
+                        copy_tile(tt::CB::c_intermed0, i, i + dst_offset);
                     }
                     cb_pop_front(tt::CB::c_intermed0, out_subblock_num_tiles);
                     mm_init_short();
                 }

                 // Compute output sub-block from in0_subblock x in1_subblock
-                int dst_index = 0;
+                int dst_index = dst_offset;
                 int in0_index_h_offset = 0;
                 for (uint32_t h = 0; h < out_subblock_h; h++) {
                     for (uint32_t w = 0; w < out_subblock_w; w++) {
@@ -73,14 +74,14 @@ void MAIN {
                     // Pack out to output buffer
                     cb_reserve_back(tt::CB::c_out0, out_subblock_num_tiles);
                     for (uint32_t i = 0; i < out_subblock_num_tiles; i++) {
-                        pack_tile(i, tt::CB::c_out0);
+                        pack_tile(i + dst_offset, tt::CB::c_out0);
                     }
                     cb_push_back(tt::CB::c_out0, out_subblock_num_tiles);
                 } else {
                     // Move partial result to interm buffer
                     cb_reserve_back(tt::CB::c_intermed0, out_subblock_num_tiles);
                     for (uint32_t i = 0; i < out_subblock_num_tiles; i++) {
-                        pack_tile(i, tt::CB::c_intermed0);
+                        pack_tile(i + dst_offset, tt::CB::c_intermed0);
                     }
                     cb_push_back(tt::CB::c_intermed0, out_subblock_num_tiles);

Expected behavior

Screenshots

ubuntu@tt-metal-dev-moreh-wh-9:~/tt-metal/build/test/tt_metal$ ./test_matmul_multi_core_multi_dram
                  Metal | INFO     | Initializing device 0
                 Device | INFO     | Opening device driver
CHECKING: [0, 0, 0, 0]
CHECKING: [1, 0, 0, 0]
2023-10-19 17:47:47.375 | INFO     | SiliconDriver   - Detected 1 PCI device
2023-10-19 17:47:47.399 | WARNING  | SiliconDriver   - init_detect_tt_device_numanodes(): Could not determine NumaNodeSet for TT device (physical_device_id: 0 pci_bus_id: 0000:00:08.0)
2023-10-19 17:47:47.399 | WARNING  | SiliconDriver   - Could not find NumaNodeSet for TT Device (physical_device_id: 0 pci_bus_id: 0000:00:08.0)
2023-10-19 17:47:47.399 | INFO     | SiliconDriver   - Using 1 Hugepages/NumHostMemChannels for TTDevice (pci_interface_id: 0 device_id: 0x401e revision: 1)
2023-10-19 17:47:47.400 | WARNING  | SiliconDriver   - bind_area_memory_nodeset(): Unable to determine TT Device to NumaNode mapping for physical_device_id: 0. Skipping membind.
---- ttSiliconDevice::init_hugepage: bind_area_to_memory_nodeset() failed (physical_device_id: 0 ch: 0). Hugepage allocation is not on NumaNode matching TT Device. Side-Effect is decreased Device->Host perf (Issue #893).
2023-10-19 17:47:47.431 | INFO     | SiliconDriver   - Disable PCIE DMA
                  Metal | INFO     | AI CLK for device 0 is:   1000 MHz
              LLRuntime | INFO     | Watcher attached device 0
                   Test | INFO     | M = 96, N = 128, K = 192
                   Test | INFO     | Activation = 3072x6144
                   Test | INFO     | Weights = 6144x4096
                   Test | INFO     | Activation block = 16x2, #blocks = 96, #sub-blocks = 8
                   Test | INFO     | Weights block = 2x16, #blocks = 96, #sub-blocks = 8
                  Verif | INFO     | Created identity matrix of size 6144x4096: 25165824
                   Test | INFO     | Scattering inputs (activation & weights) to dram channels using tiled layout
                   Test | INFO     | Copying inputs to dram complete
                   Test | INFO     | Writing kernel runtime args to device
                   Test | INFO     | Writing kernel runtime args to device complete
                   Test | INFO     | Running Matmul 48 core test
                   Test | INFO     | Matmul test done
                   Test | INFO     | Gathering data back from dram and checking against golden
                   Test | INFO     | Golden check complete
                  Metal | INFO     | Closing device 0
                   Test | FATAL    | Test Failed

Please complete the following environment information:

Additional context Considering that the existing code such bmm op only uses dst registers up to 8 - SUBBLOCK_HW_CHOICES array in bmm_op.hpp - , it seems like this issue is already known. Or, if there are any documents or guides I missed, I would appreciate it if you could let me know.

jvasilje commented 1 year ago

fyi @davorchap

chekangliang commented 12 months ago

@davorchap, can you please triage/prioritize this issue?

davorchap commented 11 months ago

We talked to about this a while ago with Moreh during a f2f. Moreh said to come back on the exact use case. If 8 tiles registers are used (half dst mode, valid indicies are 0-7), which means we can't index into 8-15. We would need full dst mode for this.

fyi @chekangliang

jliangTT commented 11 months ago

thanks for the reminder. Right - will assign this @razorback3 to await Moreh feedback.

jliangTT commented 11 months ago

Also, this might not be a bug as this is outside of supported behavior. Will remove it the bug tag and replace with feature request tag.

razorback3 commented 11 months ago

Closing the issue because it is not actually needed right now.