tenstorrent / tt-mlir

Tenstorrent MLIR compiler
https://tenstorrent.github.io/tt-mlir/
Apache License 2.0
72 stars 9 forks source link

grid_size.x gets corrupted in system descriptor serialization to disk #43

Closed ttssokorac closed 4 months ago

ttssokorac commented 4 months ago

On a wormhole system, ttrt query --system-desc returns correct info (truncated):

  "system_desc": {
    "chip_descs": [
      {
        "arch": "Wormhole_b0",
        "grid_size": {
          "y": 9,
          "x": 8
        }
      }
    ],

But, when the system desc is saved to a file, and re-read, the grid_size.x is corrupted:

> ttrt query --save-system-desc n150.sys
  Detecting chips (found 1)                                                                                                                                                                               
system desc saved to: n150.sys

> ttrt read n150.sys 
{
 ....
  "system_desc": {
    "chip_descs": [
      {
        "arch": "Wormhole_b0",
        "grid_size": {
          "y": 9,
          "x": 32727
        }
      }
    ],

The rest of the fields seem correct.

nsmithtt commented 4 months ago

Was just chatting with @mtopalovicTT and this issue came up, this is almost certainly because of Flatbuffer::store in binary.cpp which uses the prefix size to store the flatbuffer, but this doesn't include the size of the prefix itself. I think it's missing a + sizeof(flatbuffers::uoffset_t).

This would correspond to the last bytes just being truncated, which I guess is where grid.x happens to live.

mtopalovicTT commented 4 months ago

We should use GetSizePrefixedBufferLength instead of GetPrefixedSize during store. GetPrefixedSize returns size of the flat buffer which doesn't include size prefix

nsmithtt commented 4 months ago

Great find! Yes, let's use that API

sdjordjevicTT commented 4 months ago

Yes, I verified that using GetSizePrefixedBufferLength instead of GetPrefixedSize fixes the thing. I will send out PR shortly to fix this.