tenstorrent / tt-mlir

Tenstorrent MLIR compiler
https://tenstorrent.github.io/tt-mlir/
Apache License 2.0
51 stars 7 forks source link

Upgrade metal 516a8917fac17649abbc9052b680894b432b4de6 #456

Closed nsmithtt closed 2 weeks ago

nsmithtt commented 3 weeks ago

Regressions:

nsmithtt commented 3 weeks ago

@pjanevskiTT, with this upgrade we now are tracking metal main

pjanevskiTT commented 3 weeks ago

This looks great, thanks Nick. Few things I wanted to ask/mention

It looks like metal builds with this change as well, so maybe we can change it in order to support building it inside tt-mlir as well. If it makes sense, I can make the change on tt-metal side

Otherwise, this PR looks great. I think development and integration is going to be very smooth this way, thanks once again

nsmithtt commented 3 weeks ago

This looks great, thanks Nick. Few things I wanted to ask/mention

  • When I want to build standalone metal inside tt-mlir, I need to set ENABLE_LIBCXX to OFF in order to use libstd. Is there some smart way to do this? Only way I found is basically doing set(ENABLE_LIBCXX OFF) inside CMakeLists.txt in tt-metal repo.

When you invoke cmake from the cmake command line you can do -DENABLE_LIBCXX=OFF.

  • Even when I force ENABLE_LIBCXX to off, build fails. Required change to make the build work is changing this line to using type = std::nullptr_t;

Hmm, this feels like a compiler version issue. Our CI is able to build with it disabled, I think we are using clang-17 though. If there is a simple fix on the metal side to support your compiler version I'd go ahead and create a PR against metal with commit message #0: Build error supports clang-.....

It looks like metal builds with this change as well, so maybe we can change it in order to support building it inside tt-mlir as well. If it makes sense, I can make the change on tt-metal side

Yes so metal builds with this change, left default enabled. But if you look at this PR, we set -DENABLE_LIBCXX=OFF.

pilkicTT commented 3 weeks ago

@nsmithtt can you confirm if everything works on silicon? I've been trying something out on forge side (on this branch of mlir), and am hitting some issues when creating sys desc (n300 machine)... concretely, when deviceMesh.close_devices() is called in getCurrentSystemDesc(), everything hangs...

nsmithtt commented 3 weeks ago

@nsmithtt can you confirm if everything works on silicon? I've been trying something out on forge side (on this branch of mlir), and am hitting some issues when creating sys desc (n300 machine)... concretely, when deviceMesh.close_devices() is called in getCurrentSystemDesc(), everything hangs...

Hmm, so mine isn't hanging, but the simple_sum test does error out with OOM. I'm going to hold off on merging this until we have silicon CI in place which I think is close to landing. I don't want to destabilize.

nsmithtt commented 2 weeks ago

FYI all, going to try and uplift now that we have silicon CI.