tenstorrent / tt-metal

:metal: TT-NN operator library, and TT-Metalium low level kernel programming model.
Apache License 2.0

Hugepage perf warning: Hugepage allocation is not on NumaNode matching TT Device #2614

Closed: abhullar-tt closed this 1 day ago

abhullar-tt commented 1 year ago

UMD warns us if the hugepage is not allocated on the same NUMA node as the TT device, which results in decreased Device->Host perf.
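As a rough way to check for this condition by hand, the sketch below compares the NUMA node the Linux kernel reports for a PCI device with the per-node hugepage counts in sysfs. It is not UMD's own check. The function names are made up for illustration, and the 1 GiB hugepage size (`hugepages-1048576kB`) is an assumption that may need changing for a given setup.

```python
from pathlib import Path


def device_numa_node(bdf: str, sysfs: Path = Path("/sys")) -> int:
    """Return the NUMA node sysfs reports for a PCI device (-1 if unknown)."""
    node_file = sysfs / "bus/pci/devices" / bdf / "numa_node"
    return int(node_file.read_text().strip())


def hugepages_per_node(size_kb: int = 1048576,
                       sysfs: Path = Path("/sys")) -> dict[int, int]:
    """Map NUMA node id -> number of hugepages of size_kb reserved on it.

    size_kb defaults to 1 GiB pages; that default is an assumption.
    """
    counts = {}
    for node_dir in (sysfs / "devices/system/node").glob("node[0-9]*"):
        nr = node_dir / "hugepages" / f"hugepages-{size_kb}kB" / "nr_hugepages"
        if nr.exists():
            counts[int(node_dir.name[len("node"):])] = int(nr.read_text().strip())
    return counts


def hugepages_on_device_node(device_node: int, counts: dict[int, int]) -> bool:
    """True if the device's node has at least one hugepage reserved.

    A node of -1 means the kernel has no NUMA information for the device
    (common in VMs); this sketch treats that as "nothing to mismatch".
    """
    if device_node < 0:
        return True
    return counts.get(device_node, 0) > 0
```

For example, `hugepages_on_device_node(device_numa_node("0000:01:00.0"), hugepages_per_node())` returns `False` when the device's node has no hugepages reserved; the PCI address here is a placeholder, not a real TT device address.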

See relevant GitLab issue: https://yyz-gitlab.local.tenstorrent.com/devops/devops/-/issues/209

FYI @DrJessop @TT-billteng @davorchap @tt-rkim @pgkeller

tt-rkim commented 7 months ago

@TT-billteng this is a VM perf warning

tapspatel commented 7 months ago

fyi previous cloud hugepage issue: https://github.com/tenstorrent/cloud/issues/1725

tt-rkim commented 2 months ago

@tt-asaigal @TT-billteng isn't this a UMD issue?

TT-billteng commented 2 months ago

yes, I hope we get rid of hugepages altogether in the refactor

tt-rkim commented 2 months ago

Should we close this on our side, or leave it open?

TT-billteng commented 2 months ago

Is the fix to install tenstorrent tools, at least until we remove hugepages altogether (if we ever get there)?

tt-rkim commented 2 months ago

I guess so, i.e. encourage syseng to actually release that as a public package.

TT-billteng commented 2 months ago

it's not a public package yet?

tt-rkim commented 2 months ago

@warthog9 has tenstorrent-tools made its way into the public aether yet?

warthog9 commented 2 months ago

You mean https://github.com/tenstorrent/tt-system-tools, which has a Debian package?

That's been a public repo for a month or more

tt-rkim commented 2 months ago

@ttmchiou We should look into whether this Debian package can fit all our use cases, including for galaxy. When it first came out, that was my initial concern.

cc: @abhullar-tt - we may have a better way to bring up hugepages (HPs), and hugepage settings concern you

ttmchiou commented 2 months ago

I've tested it locally on some GS VMs and TGG machines, and it has resolved hugepage installation issues. I haven't seen this perf printout in the TGG workflows yet, but I also haven't been looking for it in detail.

We also have an MIW issue to integrate it into provisioning: https://github.com/tenstorrent-metal/metal-internal-workflows/issues/278

pgkeller commented 1 day ago

is there further action needed to close this at this point?

tt-rkim commented 1 day ago

No, we can close. This is a UMD thing.

We will likely continue seeing this on CI: with the amount of RAM we have there we hit OOM errors, and we can't allocate enough per NUMA node on the CI machines.
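To get a feel for the per-node limit mentioned above, a minimal sketch like the following reads a node's free memory from sysfs and computes an upper bound on how many hugepages could still be placed there. The function names are hypothetical, and the 1 GiB page size is again an assumption. This is a back-of-the-envelope estimate, not how the CI machines are actually provisioned.

```python
from pathlib import Path


def node_mem_free_kb(node: int, sysfs: Path = Path("/sys")) -> int:
    """Return MemFree (in kB) for one NUMA node from its sysfs meminfo file."""
    text = (sysfs / f"devices/system/node/node{node}/meminfo").read_text()
    for line in text.splitlines():
        fields = line.split()
        # Lines look like: "Node 0 MemFree:   1234567 kB"
        if len(fields) >= 4 and fields[2] == "MemFree:":
            return int(fields[3])
    raise ValueError(f"MemFree not found for node {node}")


def max_hugepages_from_free(free_kb: int, page_kb: int = 1048576) -> int:
    """Upper bound on hugepages that fit in free_kb of memory.

    This is only an upper bound: huge pages need contiguous memory, so
    fragmentation can make the real number lower.
    """
    return free_kb // page_kb
```

If `max_hugepages_from_free(node_mem_free_kb(n))` is below the number of hugepages needed on the device's node `n`, the pages end up on another node or the reservation falls short, which is the situation the warning reports.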