abhullar-tt closed this 1 day ago
@TT-billteng this is a VM perf warning
fyi previous cloud hugepage issue: https://github.com/tenstorrent/cloud/issues/1725
@tt-asaigal @TT-billteng isn't this a UMD issue?
yes, I hope we get rid of hugepages altogether in the refactor
Should we close this on our side? Or leave open
is the fix to install tenstorrent tools? Until we remove hugepages altogether (if we ever get there)
I guess so, i.e. encourage syseng to actually release that as a public package
it's not a public package yet?
@warthog9 has tenstorrent-tools made its way into the public aether yet?
You mean https://github.com/tenstorrent/tt-system-tools that has a debian package?
That's been a public repo for a month or more
@ttmchiou We should look into whether this debian package can fit all our use cases, including for galaxy. When it first came out, that was my initial concern.
cc: @abhullar-tt - we may have a better way to bring up HPs, but hugepages settings concern you
I've tested it running locally on some GS VMs and TGG machines and it has resolved hugepage installation issues. I haven't seen this perf printout in the TGG workflows yet but I also haven't been looking in detail for those.
We have a MIW issue to integrate it into provisioning too: https://github.com/tenstorrent-metal/metal-internal-workflows/issues/278
is there further action needed to close this at this point?
No, we can close. This is a UMD thing.
We will likely continue seeing this on CI, since we hit OOM errors with the amount of RAM we have in CI and can't allocate enough hugepages per NUMA node on the CI machines.
UMD warns us if a hugepage is not allocated on the same NUMA node as the TT device (this results in decreased Device->Host perf).
See relevant Gitlab issue: https://yyz-gitlab.local.tenstorrent.com/devops/devops/-/issues/209
FYI @DrJessop @TT-billteng @davorchap @tt-rkim @pgkeller
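For anyone who wants to check a machine by hand, here is a rough sketch of the kind of NUMA locality check described above. It is not UMD's actual logic, just an illustration built on the standard Linux sysfs files (`numa_node` under the PCI device and the per-node `nr_hugepages` counters). The 1 GiB hugepage size, the function names, and the example BDF are assumptions for illustration only.

```python
from pathlib import Path

# Assumption: 1 GiB hugepages (size expressed in kB, as sysfs names it).
DEFAULT_HUGEPAGE_KB = 1048576


def device_numa_node(bdf: str, sysfs: Path = Path("/sys")) -> int:
    """Return the NUMA node the kernel reports for a PCI device (-1 if none)."""
    path = sysfs / "bus/pci/devices" / bdf / "numa_node"
    return int(path.read_text().strip())


def hugepages_on_node(node: int, size_kb: int = DEFAULT_HUGEPAGE_KB,
                      sysfs: Path = Path("/sys")) -> int:
    """Return how many hugepages of the given size are reserved on a NUMA node."""
    path = (sysfs / f"devices/system/node/node{node}"
            / f"hugepages/hugepages-{size_kb}kB/nr_hugepages")
    # A missing file means the node or the page size is not present: count as 0.
    return int(path.read_text().strip()) if path.exists() else 0


def hugepages_local_to_device(bdf: str, size_kb: int = DEFAULT_HUGEPAGE_KB,
                              sysfs: Path = Path("/sys")) -> bool:
    """True if at least one hugepage is reserved on the device's own NUMA node."""
    node = device_numa_node(bdf, sysfs)
    if node < 0:
        # The kernel reports no NUMA affinity for this device; nothing to compare.
        return True
    return hugepages_on_node(node, size_kb, sysfs) > 0


# Hypothetical usage (the BDF below is a placeholder, not a real device):
#   if not hugepages_local_to_device("0000:01:00.0"):
#       print("no hugepages on the device's NUMA node; expect lower Device->Host perf")
```

The `sysfs` parameter exists only so the check can be pointed at a fake directory tree for testing. On the CI machines this would presumably return False whenever there isn't enough RAM to reserve pages on the device's node.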