marty1885 opened this issue 1 month ago
Hey @marty1885,
this is a little strange.
A couple of notes: we recently uplifted unpad, and it is now ttnn::slice(<input tensor>, start, end). If the tensor is a host tensor it does the same as the host-side unpad we had previously (if it is a device tensor it does a device slice/unpad).
I have a Python example here and verified that it works: https://github.com/tenstorrent/tt-metal/blob/eb1d9a9f2f1e10d811a27719486c8a30d5f792d4/tests/ttnn/unit_tests/operations/test_host_slice.py
This shows the underlying unpad/slice API works on host. However, if you're still having problems with the latest codebase and ttnn::slice, then it's a C++ API issue and I can look at that. Do you mind giving ttnn::slice a shot first and letting me know? Then I can proceed to debug this with you :)
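For concreteness, a minimal sketch of the uplifted call (illustrative only; host_tensor and device_tensor are placeholder names of mine, and start/end follow the inclusive-end convention used later in this thread):

Shape start(std::vector<uint32_t>{0, 0, 0, 0});
Shape end(std::vector<uint32_t>{0, 0, 5, 5});
auto host_out = ttnn::slice(host_tensor, start, end);     // host tensor -> host-side unpad
auto device_out = ttnn::slice(device_tensor, start, end); // device tensor -> device slice/unpad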
Hi @tarafdarTT,
I am aware of the API. Unfortunately I cannot always use ttnn::slice. GGML's tests often ask for a non tile-aligned view into a tensor, mostly because test tensors are small, but slice requires both coordinates to be tile aligned.
I suspect there's more to this bug. A lot of OPs are failing with non tile-aligned tensors on my side, but I can't be sure since unpad is used a lot in testing. So far MatMul, hardswish, and transpose look likely to be malfunctioning too. This is my current code:
if (dst_size[0] % tt::constants::TILE_WIDTH == 0 && dst_size[1] % tt::constants::TILE_HEIGHT == 0 &&
    start[2] % tt::constants::TILE_WIDTH == 0 && start[3] % tt::constants::TILE_HEIGHT == 0) {
    res = ttnn::slice(*parent, start, end);
}
else {
    // THIS is EXTREMELY SLOW, but it works: round-trip through the host,
    // unpad in row major, then re-tilize on device.
    tt::tt_metal::Tensor tmp = parent->cpu().to(tt::tt_metal::Layout::ROW_MAJOR).unpad(start, end);
    res = ttnn::tilize_with_zero_padding(tmp.to(bufctx->device));
}
If possible, I strongly prefer getting unpad working again. Otherwise I lose a lot of tests.
=====
Edit: Sorry, I misread your comment. ttnn::slice is working normally for me. Only unpad is broken.
Hi, I was messing around and noticed tensor.volume() no longer returns the padded volume; instead it returns the non-padded one. Could this be related to this issue? It would mean non tile-aligned strides and sizes are now off.
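For illustration (my numbers, assuming 32x32 tiles): a [1, 1, 10, 10] tensor in TILE layout is padded to [1, 1, 32, 32], so the two volumes differ:

uint32_t logical_volume = 1 * 1 * 10 * 10; // 100: what volume() returns now
uint32_t padded_volume = 1 * 1 * 32 * 32;  // 1024: what it used to return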
Ahh thanks @marty1885, I'll have a look at it today and let you know!
Hey @marty1885, I have a fix. You're correct that the tilization was fishy: when doing unpad (slice) on host, it used the non-tilized shape to allocate the buffer. This commit is a fix: https://github.com/tenstorrent/tt-metal/commit/381ee8c7a6568b5308d8deb1bd6c0037c2f458f8
I'm in the process of adding your test as a unit test to avoid a regression on this, and once I have that I can merge the above commit to main.
@tarafdarTT Thanks! The commit removes a few errors for me (but now some of the view tests fail with an incorrect shape). Do you know why the change to reporting tilized sizes was made? I need some time to debug and isolate the remaining operator failures; they seem to more or less relate to shapes and strides.
Edit: Now I'm very confused. Converting to row major used to always leave the last 2 dimensions padded to multiples of 32x32, so I'm expecting 0s at the end of each row, but now the padding is not present anymore. Is this expected?
Device | INFO | Opening user mode device driver
2024-08-07 01:12:54.546 | INFO | SiliconDriver - Detected 1 PCI device : [0]
Metal | INFO | Initializing device 0. Program cache is NOT enabled
Metal | INFO | Running with 1 cqs
Metal | INFO | AI CLK for device 0 is: 1202 MHz
Metal | INFO | Enabling program cache on device 0
Verif | INFO | Created a random vector of size 100
A:
-0.25 0.589844 0.898438 -0.632812 0.462891 0.558594 0.197266 0.193359 -0.6875 -0.10791 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
-0.6875 -0.796875 -0.882812 -0.0810547 0.730469 -0.332031 0.202148 -0.710938 0.416016 0.300781 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
-0.957031 -0.886719 0.9375 0.443359 0.664062 0.875 -0.574219 -0.996094 -0.632812 0.984375 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
-0.632812 0.234375 -0.390625 0.222656 0.0493164 -0.984375 -0.135742 -0.953125 -0.416016 0.0493164 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0.223633 -0.200195 -0.71875 -0.90625 -0.414062 0.945312 -0.265625 -0.53125 -0.0874023 -0.816406 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0.570312 0.236328 -0.597656 -0.234375 0.0284424 0.964844 0.18457 -0.0664062 -0.90625 0.71875 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0.214844 0.359375 -0.65625 -0.0986328 -0.867188 -0.972656 0.894531 0.882812 0.929688 0.125977 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0.613281 -0.228516 -0.390625 -0.964844 -0.800781 -0.535156 0.367188 -0.515625 -0.119629 0.365234 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
-0.753906 0.219727 -0.00964355 0.664062 -0.929688 -0.652344 0.816406 -0.217773 -0.482422 -0.632812 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0.324219 0.507812 -0.375 -0.149414 0.0400391 -0.582031 0.0932617 0.134766 -0.628906 -0.933594 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
B:
-0.25 0.589844 0.898438 -0.632812 0.462891 0.558594 -0.6875 -0.796875 -0.882812 -0.0810547 0.730469 -0.332031 -0.957031 -0.886719 0.9375 0.443359 0.664062 0.875 -0.632812 0.234375 -0.390625 0.222656 0.0493164 -0.984375 0.223633 -0.200195 -0.71875 -0.90625 -0.414062 0.945312 0.570312 0.236328
-0.597656 -0.234375 0.0284424 0.964844 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Metal | INFO | Closing device 0
Metal | INFO | Disabling and clearing program cache on device 0
Device | INFO | Closing user mode device drivers
Quick update. I found the updated unpad is giving out the wrong shape. I've updated my test code and got the following output. Unpad should produce a tensor of shape [10, 10, 10, 128], but I got [10, 10, 32, 128]. Also it randomly hangs my e150; likely something lower level is also wrong.
int main()
{
    device = &ttnn::device::open_device(0);
    AutoFormat::SetDefaultDevice(device);
    ttnn::enable_program_cache(*device);
    tt::tt_metal::detail::EnablePersistentKernelCache();

    auto a = make_random_tensor({10, 10, 10, 384});
    std::cout << "A shape: " << a.shape() << std::endl;

    // unpad end coordinates are inclusive.
    Shape start(std::vector<uint32_t>{0, 0, 0, 0});
    Shape end(std::vector<uint32_t>{9, 9, 9, 127});
    auto b = a.cpu().to(tt::tt_metal::Layout::ROW_MAJOR).unpad(start, end);
    std::cout << "Expecting B to be: [10, 10, 10, 128]\n";
    std::cout << "B shape: " << b.shape() << std::endl;

    auto c = ttnn::tilize_with_zero_padding(b.to(device));
    std::cout << "Expecting C to be [10, 10, 10[32], 128]\n"
              << "Got: " << c.shape() << "\n";

    device->close();
}
Device | INFO | Opening user mode device driver
2024-08-07 01:44:04.494 | INFO | SiliconDriver - Detected 1 PCI device : [0]
Metal | INFO | Initializing device 0. Program cache is NOT enabled
Metal | INFO | Running with 1 cqs
Metal | INFO | AI CLK for device 0 is: 1202 MHz
Metal | INFO | Enabling program cache on device 0
Verif | INFO | Created a random vector of size 384000
A shape: ttnn.Shape([10, 10, 10[32], 384])
Expecting B to be: [10, 10, 10, 128]
B shape: ttnn.Shape([10, 10, 32, 128])
Expecting C to be [10, 10, 10[32], 128]
Got: ttnn.Shape([10, 10, 32, 128])
Metal | INFO | Closing device 0
Metal | INFO | Disabling and clearing program cache on device 0
Device | INFO | Closing user mode device drivers
Hmmm, this is strange! I will have a look at this further today.
@marty1885 I solved it.
My commit is incorrect and we don't need it. The code is actually working; the only thing funky is your print function dump_first_tile_of_tensor.
The volume being allocated is not that of a full tile; it is the untilized volume of the unpadded tensor. The size of a tile is the same no matter the shape of the tensor (32x32 = 1024). What we want is:

uint32_t volume = 1024; // volume of a single TILE
std::vector<bfloat16> buf(volume);

This then dumps out with your function:
A:
-0.25 0.589844 0.898438 -0.632812 0.462891 0.558594 0.197266 0.193359 -0.6875 -0.10791 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
-0.6875 -0.796875 -0.882812 -0.0810547 0.730469 -0.332031 0.202148 -0.710938 0.416016 0.300781 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
-0.957031 -0.886719 0.9375 0.443359 0.664062 0.875 -0.574219 -0.996094 -0.632812 0.984375 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
-0.632812 0.234375 -0.390625 0.222656 0.0493164 -0.984375 -0.135742 -0.953125 -0.416016 0.0493164 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0.223633 -0.200195 -0.71875 -0.90625 -0.414062 0.945312 -0.265625 -0.53125 -0.0874023 -0.816406 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0.570312 0.236328 -0.597656 -0.234375 0.0284424 0.964844 0.18457 -0.0664062 -0.90625 0.71875 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0.214844 0.359375 -0.65625 -0.0986328 -0.867188 -0.972656 0.894531 0.882812 0.929688 0.125977 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0.613281 -0.228516 -0.390625 -0.964844 -0.800781 -0.535156 0.367188 -0.515625 -0.119629 0.365234 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
-0.753906 0.219727 -0.00964355 0.664062 -0.929688 -0.652344 0.816406 -0.217773 -0.482422 -0.632812 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0.324219 0.507812 -0.375 -0.149414 0.0400391 -0.582031 0.0932617 0.134766 -0.628906 -0.933594 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
B:
-0.25 0.589844 0.898438 -0.632812 0.462891 0.558594 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
-0.597656 -0.234375 0.0284424 0.964844 0.730469 -0.332031 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
-0.957031 -0.886719 0.9375 0.443359 0.664062 0.875 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
-0.632812 0.234375 -0.390625 0.222656 0.0493164 -0.984375 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0.223633 -0.200195 -0.71875 -0.90625 -0.414062 0.945312 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0.570312 0.236328 -0.597656 -0.234375 0.0284424 0.964844 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Previously the buffer was too small: it wasn't allocating enough space for the tensor to include the padding.
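To make the sizing concrete (illustrative arithmetic on my part): for the [1, 1, 10, 10] input above unpadded to [1, 1, 6, 6] in row major, the untilized volume is far smaller than one tile, so indexing the buffer as a 32x32 tile reads past its end:

uint32_t untilized_volume = 1 * 1 * 6 * 6; // 36: what the old buffer held
uint32_t tile_volume = 32 * 32;            // 1024: what dumping a full tile reads
std::vector<bfloat16> buf(tile_volume);    // large enough to dump one full tile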
Here is another example that I did in Python of the same thing:
@pytest.mark.parametrize("n", [1])
@pytest.mark.parametrize("c", [1])
@pytest.mark.parametrize("h", [10])
@pytest.mark.parametrize("w", [10])
def test_tensor_unpad_tiled_input(device, n, c, h, w):
    torch_input_tensor = torch.rand((n, c, h, w), dtype=torch.bfloat16)
    torch_output_tensor = torch_input_tensor[:, :, :6, :6]
    activation_pyt_padded_device = ttnn.from_torch(
        torch_input_tensor,
        dtype=ttnn.DataType.BFLOAT16,
        layout=ttnn.ROW_MAJOR_LAYOUT,
        device=device,
    )
    activation_pyt_padded_device_tiled = ttnn.tilize_with_zero_padding(activation_pyt_padded_device)
    activation_pyt_padded_host_tiled = activation_pyt_padded_device_tiled.cpu()
    activation_pyt_padded_host_row_major = activation_pyt_padded_host_tiled.to(ttnn.ROW_MAJOR_LAYOUT)
    # unpad end coordinates are inclusive: (5, 5) keeps the first 6 rows/columns
    activation_pyt_out_unpadded = activation_pyt_padded_host_row_major.unpad((0, 0, 0, 0), (n - 1, c - 1, 5, 5))
    activation_pyt_padded_out = ttnn.to_torch(activation_pyt_out_unpadded)
    assert_with_pcc(torch_output_tensor, activation_pyt_padded_out, 0.9999)
Under the hood, our to_torch function takes padding into consideration. You can have a look at that function if you need some of the intricacies around padding.
@tarafdarTT Huh... something feels wrong. I'll get back to you soon. I think I messed up my environment; executing stuff on my e150 hangs, and I need to fix that first.
@tarafdarTT Sorry for the delay. The issue arose when I upgraded to a newer version of TTNN; I can finally get back to this.
Unpad is not acting correctly even in your example. We can see the 2nd row does not match up properly.
A:
-0.25 0.589844 0.898438 -0.632812 0.462891 0.558594 0.197266 0.193359 -0.6875 -0.10791 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
-0.6875 -0.796875 -0.882812 -0.0810547 0.730469 -0.332031 0.202148 -0.710938 0.416016 0.300781 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
-0.957031 -0.886719 0.9375 0.443359 0.664062 0.875 -0.574219 -0.996094 -0.632812 0.984375 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
B:
-0.25 0.589844 0.898438 -0.632812 0.462891 0.558594 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
-0.597656 -0.234375 0.0284424 0.964844 0.730469 -0.332031 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
-0.957031 -0.886719 0.9375 0.443359 0.664062 0.875 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
It is clearer if we look at column 0.
A:
-0.25
-0.6875
-0.957031
B:
-0.25
-0.597656 <- differs from tensor A
-0.957031
I checked Python and it works correctly:
import ttnn
import torch

device = ttnn.open_device(0)

shape = (1, 1, 10, 10)
tensor = torch.rand(*shape, dtype=torch.float32)
a = ttnn.from_torch(tensor, dtype=ttnn.bfloat16, layout=ttnn.TILE_LAYOUT, device=device)
b = a.cpu().to(ttnn.ROW_MAJOR_LAYOUT).unpad((0, 0, 0, 0), (0, 0, 5, 5))
print(a.cpu().to(ttnn.ROW_MAJOR_LAYOUT).to_torch())
print(b.to_torch())
Which outputs:
Device | INFO | Opening user mode device driver
2024-08-20 03:58:13.478 | INFO | SiliconDriver - Detected 1 PCI device : [0]
Metal | INFO | Initializing device 0. Program cache is NOT enabled
Metal | INFO | AI CLK for device 0 is: 1202 MHz
tensor([[[[0.3184, 0.4648, 0.9023, ..., 0.0000, 0.0000, 0.0000],
[0.2559, 0.8008, 0.3164, ..., 0.0000, 0.0000, 0.0000],
[0.8867, 0.6758, 0.3867, ..., 0.0000, 0.0000, 0.0000],
...,
[0.0000, 0.0000, 0.0000, ..., 0.0000, 0.0000, 0.0000],
[0.0000, 0.0000, 0.0000, ..., 0.0000, 0.0000, 0.0000],
[0.0000, 0.0000, 0.0000, ..., 0.0000, 0.0000, 0.0000]]]],
dtype=torch.bfloat16)
tensor([[[[3.1836e-01, 4.6484e-01, 9.0234e-01, 5.5469e-01, 5.0391e-01,
2.7344e-01],
[2.5586e-01, 8.0078e-01, 3.1641e-01, 6.5234e-01, 2.9102e-01,
6.0938e-01],
[8.8672e-01, 6.7578e-01, 3.8672e-01, 1.0107e-01, 1.3550e-02,
7.6953e-01],
[6.6797e-01, 3.7598e-02, 1.8457e-01, 3.2031e-01, 8.2422e-01,
2.1172e-04],
[2.1387e-01, 1.8457e-01, 4.2969e-02, 8.0469e-01, 3.0078e-01,
2.8125e-01],
[5.3125e-01, 6.2109e-01, 5.4297e-01, 1.9629e-01, 4.0039e-01,
8.5547e-01]]]], dtype=torch.bfloat16)
Metal | INFO | Closing device 0
Metal | INFO | Disabling and clearing program cache on device 0
Device | INFO | Closing user mode device drivers
With some more digging, I found that it might be related to the to(device) + memcpy combo, or one of the conversion ops. It works correctly if I read from the storage buffer directly.
#include <cstddef>
#include <ttnn/operations/eltwise/unary/unary.hpp>
#include <ttnn/operations/eltwise/ternary/where.hpp>
#include <ttnn/device.hpp>
#include <ttnn/operations/data_movement/tilize_with_val_padding/tilize_with_val_padding.hpp>
#include "common/bfloat16.hpp"
#include "tt_dnn/op_library/auto_format.hpp"
#include "ttnn/operations/eltwise/unary/unary_composite.hpp"
#include "ttnn/tensor/tensor.hpp"
#include <tt_metal/detail/persistent_kernel_cache.hpp>
#include "ttnn/tensor/types.hpp"
#include <vector>
#include <iostream>

ttnn::device::Device* device = nullptr;

static tt::tt_metal::Tensor make_random_tensor(tt::tt_metal::Shape s)
{
    static int seed = 42;
    // * 2 because the size argument is in bytes (bfloat16 is 2 bytes per element).
    auto b = tt::tt_metal::owned_buffer::create(
        create_random_vector_of_bfloat16_native(
            s[0] * s[1] * s[2] * s[3] * 2, 2, seed++, -1));
    tt::tt_metal::Tensor t(OwnedStorage{std::move(b)}, s,
        tt::tt_metal::DataType::BFLOAT16, tt::tt_metal::Layout::ROW_MAJOR);
    return ttnn::tilize_with_zero_padding(t.to(AutoFormat::GetDefaultDevice()));
}
void dump_first_tile_of_tensor(tt::tt_metal::Tensor tensor)
{
    std::cout << "dump_first_tile_of_tensor" << std::endl;
    assert(tensor.dtype() == tt::tt_metal::DataType::BFLOAT16);
    auto t = tensor;
    if (t.storage_type() == tt::tt_metal::StorageType::DEVICE) {
        std::cout << "To CPU " << std::endl;
        t = t.cpu();
    }
    if (t.layout() != tt::tt_metal::Layout::ROW_MAJOR) {
        std::cout << "To ROW" << std::endl;
        t = t.to(tt::tt_metal::Layout::ROW_MAJOR);
    }

    // This fails. Having issues on the 2nd row
    // std::cout << "Copy to device" << std::endl;
    // t = t.to(AutoFormat::GetDefaultDevice());
    // std::vector<bfloat16> buf(1024);
    // memcpy(buf.data(), t);
    // for(int y = 0; y < 32; y++) {
    //     for(int x = 0; x < 32; x++) {
    //         std::cout << buf[y*32+x].to_float() << " ";
    //     }
    //     std::cout << "\n";
    // }
    // std::cout << "\n";

    // This works, however
    auto storage = std::get<tt::tt_metal::OwnedStorage>(t.storage());
    auto buf = std::get<tt::tt_metal::owned_buffer::Buffer<bfloat16>>(storage.get_buffer());
    auto ps = t.shape().with_tile_padding();
    for (int y = 0; y < ps[2]; y++) {
        for (int x = 0; x < ps[3]; x++) {
            std::cout << buf[y * ps[3] + x].to_float() << " ";
        }
        std::cout << "\n";
    }
    std::cout << "\n";
}
int main()
{
    device = &ttnn::device::open_device(0);
    AutoFormat::SetDefaultDevice(device);
    ttnn::enable_program_cache(*device);
    tt::tt_metal::detail::EnablePersistentKernelCache();

    auto a = make_random_tensor({1, 1, 10, 10});
    Shape start(std::vector<uint32_t>{0, 0, 0, 0});
    Shape end(std::vector<uint32_t>{0, 0, 5, 5});
    auto b = a.cpu().to(tt::tt_metal::Layout::ROW_MAJOR).unpad(start, end);

    std::cout << "A:\n";
    dump_first_tile_of_tensor(a);
    std::cout << "B:\n";
    dump_first_tile_of_tensor(b);

    device->close();
}
With output:
Device | INFO | Opening user mode device driver
2024-08-20 04:19:19.493 | INFO | SiliconDriver - Detected 1 PCI device : [0]
Metal | INFO | Initializing device 0. Program cache is NOT enabled
Metal | INFO | AI CLK for device 0 is: 1202 MHz
Metal | INFO | Enabling program cache on device 0
A:
dump_first_tile_of_tensor
To CPU
To ROW
-0.25 0.589844 0.898438 -0.632812 0.462891 0.558594 0.197266 0.193359 -0.6875 -0.10791 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
-0.6875 -0.796875 -0.882812 -0.0810547 0.730469 -0.332031 0.202148 -0.710938 0.416016 0.300781 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
-0.957031 -0.886719 0.9375 0.443359 0.664062 0.875 -0.574219 -0.996094 -0.632812 0.984375 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
-0.632812 0.234375 -0.390625 0.222656 0.0493164 -0.984375 -0.135742 -0.953125 -0.416016 0.0493164 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0.223633 -0.200195 -0.71875 -0.90625 -0.414062 0.945312 -0.265625 -0.53125 -0.0874023 -0.816406 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0.570312 0.236328 -0.597656 -0.234375 0.0284424 0.964844 0.18457 -0.0664062 -0.90625 0.71875 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0.214844 0.359375 -0.65625 -0.0986328 -0.867188 -0.972656 0.894531 0.882812 0.929688 0.125977 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0.613281 -0.228516 -0.390625 -0.964844 -0.800781 -0.535156 0.367188 -0.515625 -0.119629 0.365234 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
-0.753906 0.219727 -0.00964355 0.664062 -0.929688 -0.652344 0.816406 -0.217773 -0.482422 -0.632812 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0.324219 0.507812 -0.375 -0.149414 0.0400391 -0.582031 0.0932617 0.134766 -0.628906 -0.933594 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
B:
dump_first_tile_of_tensor
-0.25 0.589844 0.898438 -0.632812 0.462891 0.558594
-0.6875 -0.796875 -0.882812 -0.0810547 0.730469 -0.332031
-0.957031 -0.886719 0.9375 0.443359 0.664062 0.875
-0.632812 0.234375 -0.390625 0.222656 0.0493164 -0.984375
0.223633 -0.200195 -0.71875 -0.90625 -0.414062 0.945312
0.570312 0.236328 -0.597656 -0.234375 0.0284424 0.964844
Metal | INFO | Closing device 0
Metal | INFO | Disabling and clearing program cache on device 0
Device | INFO | Closing user mode device drivers
I'm on commit 046237fd9c24f51fefd05f66c270c78e606eae85.
> What we want is uint32_t volume = 1024; // for volume of a single TILE
> std::vector<bfloat16> buf(volume);

I see, thanks! I assumed the padded volume is always a multiple of 1024, since that's the tile size, but the unpadded row-major tensor's volume isn't (e.g. 36 for the [1, 1, 6, 6] tensor above). Thanks for pointing it out.
@marty1885, I do not believe that we have a bug here. The memcpy seems to work as expected.
Take a look specifically at the unit test I added on the ttnn-11082-add-test branch, called test_unpad.cpp.
After the memcpy, notice the use of device_width, so that when the tensor is in row major we are not incorrectly using 32 but rather 6:
const auto shape = t.get_shape();
const auto dim = shape.rank();
const auto width = shape[-1];
const auto height = shape[-2];
const auto device_width = t.get_legacy_shape()[-1];
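A hedged sketch of how those values would be used (my reconstruction of the idea, not the actual test on that branch): iterate the logical height/width, but stride rows by device_width, the row length of the underlying buffer:

std::vector<bfloat16> buf(t.volume());
memcpy(buf.data(), t);
for (int y = 0; y < height; y++) {
    for (int x = 0; x < width; x++) {
        // Stride by device_width (6 for the row-major unpadded tensor here),
        // not a hard-coded tile width of 32.
        std::cout << buf[y * device_width + x].to_float() << " ";
    }
    std::cout << "\n";
}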
Describe the bug
tt::tt_metal::Tensor::unpad should unpad the tensor on CPU. However, I found that it results in partially garbage data as of commit 5aa33ef359453c04e519c9ffa29c4ac5815c4fc9. I am using this function as a fallback in my GGML backend to view into tensors.

To Reproduce
Steps to reproduce the behavior:
Output
Expected behavior
The result tensor should be correct and contain no garbage.
Additional context
I did another run without manually masking out the padded part of the tensor, thinking maybe that's part of the bug. I got the following