Closed davorchap closed 11 months ago
@pgkeller and @jvasilje , looks like we're corrupting ARC and/or PCIe -- most likely from the device kernel by writing illegally to PCIe / ARC core(s). Can watcher prevent kernels from reading/writing to those cores entirely (in slow dispatch)? Perhaps we need to apply the same to ETH cores for now (no read/writes to ETH cores for now)
Other than corrupting ARC / PCIe not sure what else can get the board into this state.
I'm escalating this one P0 (show stopper) as this basically means slow dispatch post commit is "bricking" the board (ie board can't be reset), and we can't even run the slow dispatch. Hopefully we can isolate which test is doing this and hopefully the watcher can identify the offending transaction.
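A rough sketch of the kind of check being asked for, in Python for brevity: reject any kernel transaction whose destination is an ARC, PCIe, or ETH core. The coordinates, `CORE_MAP`, and `is_allowed_destination` are hypothetical placeholders, not the real Wormhole grid or the watcher's actual API.

```python
from enum import Enum


class CoreType(Enum):
    TENSIX = "tensix"
    DRAM = "dram"
    ARC = "arc"
    PCIE = "pcie"
    ETH = "eth"


# Hypothetical coordinate -> core type map (assumption for illustration only);
# a real table would be derived from the chip's SoC description.
CORE_MAP = {
    (0, 0): CoreType.DRAM,
    (0, 3): CoreType.PCIE,
    (0, 10): CoreType.ARC,
    (1, 0): CoreType.ETH,
    (1, 1): CoreType.TENSIX,
}

# Core types that kernels may not touch at all in this sketch.
BLOCKED = {CoreType.ARC, CoreType.PCIE, CoreType.ETH}


def is_allowed_destination(x, y):
    """Return True only for known cores that are not on the blocked list."""
    core = CORE_MAP.get((x, y))
    return core is not None and core not in BLOCKED
```

Unknown coordinates are rejected too, so a stray address that does not map to any core is flagged rather than silently passed through.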
fyi @abhullar-tt @TT-billteng @tt-rkim
Do we freeze main for this?
TT-smi log on 172.27.28.130
TT-SMI - Version 2023-08-22-492ad2b9ef82a243
Sep 15 2023 02:28:00.531354 PM
Host Information
* OS : Linux
* Distro : Ubuntu 20.04.4 LTS
* Kernel : 5.4.0-162-generic
* Hostname : e04cs05
* Platform : x86_64
* Python : 3.7.16
* Memory : 503.74 GB
* Driver : TTKMD 1.23
Board Information
* Device 0:
- Bus ID : e1:00.0
- Family : NEBULA_X2
- Board ID : 0100014211703009 L
- Chip Coords : (0, 0, 0, 0)
- Chip Revision : B0
- DRAM Speed : 12G
- DRAM Trained : Y
- AICLK PPM En : Y
- Link Speed : Gen4 / Gen4
- Link Width : x16 / x16
* Device 1:
- Bus ID : e1:00.0
- Family : NEBULA_X2
- Board ID : 0100014211703009 R
- Chip Coords : (1, 0, 0, 0)
- Chip Revision : B0
- DRAM Speed : 12G
- DRAM Trained : Y
- AICLK PPM En : Y
- Link Speed : Gen4 / Gen4
- Link Width : x16 / x16
FW Information
* Device 0:
- FW Version : 15.0.0
- FW Date : 2023-08-08
- ETH FW Version : 6.3.0
- M3 BL Version : 129.2.0.0
- M3 App Version : 5.6.0.0
- TT-Flash Version : 7.13.0.0
* Device 1:
- FW Version : 15.0.0
- FW Date : 2023-08-08
- ETH FW Version : 6.3.0
- M3 BL Version : 129.2.0.0
- M3 App Version : 5.6.0.0
- TT-Flash Version : 7.13.0.0
Device Telemetry
* Device 0:
- Core Voltage (V) : 0.72 / 0.95
- AICLK (MHz) : 500 / 1000
- ARCCLK (MHz) : 540
- AXICLK (MHz) : 900
- Core Current (A) : 10.0 / 160
- Core Power (W) : 6.0 / 85
- Core Temp (°C) : 43.2 / 75
- VREG Temp (°C) : 39.0
- Inlet Temp (°C) : 37.0
- Outlet Temp 1 (°C) : 36.0
- Outlet Temp 2 (°C) : 36.0
* Device 1:
- Core Voltage (V) : 0.72 / 0.95
- AICLK (MHz) : 500 / 1000
- ARCCLK (MHz) : 540
- AXICLK (MHz) : 900
- Core Current (A) : 10.0 / 160
- Core Power (W) : 7.0 / 85
- Core Temp (°C) : 36.3 / 75
- VREG Temp (°C) : 38.0
- Inlet Temp (°C) : 37.0
- Outlet Temp 1 (°C) : 36.0
- Outlet Temp 2 (°C) : 36.0
Checklist
* Host OS : Pass
* Host Driver : Pass
* Host Memory : Pass
* Devices : Pass
* Device PCIE : Pass
* Device DRAM : Pass
looks like we're corrupting ARC and/or PCIe -- most likely from the device kernel by writing illegally to PCIe / ARC core(s).
What constitutes an illegal write to PCIe/ARC? I don't know how we even write to ARC. For PCIe, watcher checks for a valid address, though right now that is set to 4G which means all PCIe transactions are passed through. We need an API to query the hugepage size, then I can write that to the generated header to narrow the check.
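To illustrate the narrowing described here, a minimal Python sketch: take the hugepage size from `/proc/meminfo`-style text and use it as the upper bound instead of 4G. `parse_hugepage_size` and `pcie_access_ok` are hypothetical names; the real check runs in device code and would read its bound from the generated header. Note that `/proc/meminfo` only reports the default hugepage size, which may not be the size actually mapped for the device.

```python
import re

FOUR_GB = 4 * 1024**3  # current pass-everything limit mentioned above


def parse_hugepage_size(meminfo_text):
    """Return the default hugepage size in bytes, or None if not present."""
    match = re.search(r"^Hugepagesize:\s+(\d+)\s+kB$", meminfo_text, re.MULTILINE)
    return int(match.group(1)) * 1024 if match else None


def pcie_access_ok(addr, length, limit=FOUR_GB):
    """True if [addr, addr + length) lies entirely below the limit."""
    return addr >= 0 and length > 0 and addr + length <= limit
```

With a 1 GB hugepage, an access at 0x50000000 passes the 4G check but fails the narrowed one, which is the class of transaction the current setting lets through.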
Is the next action item to re-run the WH post commit with watcher enabled?
Watcher seems to work great running a single test, I don't have confidence it'll work running a suite of tests. I plan to debug this as next on my list, but I'll be out most of next week so not sure if I'll get it resolved before then
I wasn't able to repro this.
@muthutt , @kkwong10 , @abhullar-tt , @DrJessop and @tt-rkim have any of you repro'd this w/ new UMD + reset at init -- in the last couple of days?
if not we should close the case.
Just did a fresh clean build off of latest on main and ran the slow-dispatch test after a full hard reboot and still see a hang on the first device access running:
TT_METAL_SLOW_DISPATCH_MODE=1 ./tests/scripts/run_pre_post_commit_regressions_slow_dispatch.sh
tt-smi reset is successful after the hang.
@mo-tenstorrent @davorchap Can confirm this machine has been upgraded to latest driver + 7.D required FW for UMD.
Could you copy paste your env vars and command history for this test? Please delete anything sensitive in the env vars.
Yeah, here are my env and history.
history:
392 sudo reboot
393 tmux
394 w
395 cd tt-metal/
396 git fetch
397 git reset --hard origin/main
398 git submodule update --init --recursive
399 git clean -dfx
400 export TT_METAL_HOME=$(pwd); export TT_METAL_ENV=dev; export PYTHONPATH=$(pwd); export ARCH_NAME=wormhole_b0
401 make build
402 TT_METAL_SLOW_DISPATCH_MODE=1 ./tests/scripts/run_pre_post_commit_regressions_slow_dispatch.sh
403 source build/python_env/bin/activate
404 TT_METAL_SLOW_DISPATCH_MODE=1 ./tests/scripts/run_pre_post_commit_regressions_slow_dispatch.sh
env:
$ env
SHELL=/bin/bash
__GIT_PROMPT_SHOW_UNTRACKED_FILES=
PWD=/home/mo/tt-metal
LOGNAME=mo
XDG_SESSION_TYPE=tty
ARCH_NAME=wormhole_b0
MOTD_SHOWN=pam
__GIT_PROMPT_IGNORE_SUBMODULES=1
HOME=/home/mo
LANG=en_US.UTF-8
LS_COLORS=rs=0:di=01;34:ln=01;36:mh=00:pi=40;33:so=01;35:do=01;35:bd=40;33;01:cd=40;33;01:or=40;31;01:mi=00:su=37;41:sg=30;43:ca=30;41:tw=30;42:ow=34;42:st=37;44:ex=01;32:*.tar=01;31:*.tgz=01;31:*.arc=01;31:*.arj=01;31:*.taz=01;31:*.lha=01;31:*.lz4=01;31:*.lzh=01;31:*.lzma=01;31:*.tlz=01;31:*.txz=01;31:*.tzo=01;31:*.t7z=01;31:*.zip=01;31:*.z=01;31:*.dz=01;31:*.gz=01;31:*.lrz=01;31:*.lz=01;31:*.lzo=01;31:*.xz=01;31:*.zst=01;31:*.tzst=01;31:*.bz2=01;31:*.bz=01;31:*.tbz=01;31:*.tbz2=01;31:*.tz=01;31:*.deb=01;31:*.rpm=01;31:*.jar=01;31:*.war=01;31:*.ear=01;31:*.sar=01;31:*.rar=01;31:*.alz=01;31:*.ace=01;31:*.zoo=01;31:*.cpio=01;31:*.7z=01;31:*.rz=01;31:*.cab=01;31:*.wim=01;31:*.swm=01;31:*.dwm=01;31:*.esd=01;31:*.jpg=01;35:*.jpeg=01;35:*.mjpg=01;35:*.mjpeg=01;35:*.gif=01;35:*.bmp=01;35:*.pbm=01;35:*.pgm=01;35:*.ppm=01;35:*.tga=01;35:*.xbm=01;35:*.xpm=01;35:*.tif=01;35:*.tiff=01;35:*.png=01;35:*.svg=01;35:*.svgz=01;35:*.mng=01;35:*.pcx=01;35:*.mov=01;35:*.mpg=01;35:*.mpeg=01;35:*.m2v=01;35:*.mkv=01;35:*.webm=01;35:*.ogm=01;35:*.mp4=01;35:*.m4v=01;35:*.mp4v=01;35:*.vob=01;35:*.qt=01;35:*.nuv=01;35:*.wmv=01;35:*.asf=01;35:*.rm=01;35:*.rmvb=01;35:*.flc=01;35:*.avi=01;35:*.fli=01;35:*.flv=01;35:*.gl=01;35:*.dl=01;35:*.xcf=01;35:*.xwd=01;35:*.yuv=01;35:*.cgm=01;35:*.emf=01;35:*.ogv=01;35:*.ogx=01;35:*.aac=00;36:*.au=00;36:*.flac=00;36:*.m4a=00;36:*.mid=00;36:*.midi=00;36:*.mka=00;36:*.mp3=00;36:*.mpc=00;36:*.ogg=00;36:*.ra=00;36:*.wav=00;36:*.oga=00;36:*.opus=00;36:*.spx=00;36:*.xspf=00;36:
VIRTUAL_ENV=/home/mo/tt-metal/build/python_env
__GIT_PROMPT_IGNORE_STASH=0
LC_TERMINAL=iTerm2
TT_METAL_ENV=dev
SSH_CONNECTION=10.212.27.52 53615 172.27.28.130 22
LESSCLOSE=/usr/bin/lesspipe %s %s
XDG_SESSION_CLASS=user
PYTHONPATH=/home/mo/tt-metal
TERM=xterm-256color
LESSOPEN=| /usr/bin/lesspipe %s
USER=mo
LC_TERMINAL_VERSION=3.4.20
SHLVL=1
XDG_SESSION_ID=26
__GIT_PROMPT_SHOW_TRACKING=1
XDG_RUNTIME_DIR=/run/user/1016
__GIT_PROMPT_SHOW_UPSTREAM=0
PS1=(\[\033[0;34m\]python_env\[\033[0;0m\]) \[\033[0;31m\]✘-TERM\[\033[0;0m\] \[\033[0;33m\]\w\[\033[0;0m\] [\[\033[0;35m\]${GIT_BRANCH}\[\033[0;0m\]|\[\033[1;32m\]✔\[\033[0;0m\]\[\033[0;0m\]] \n\[\033[0;37m\]$(date +%H:%M)\[\033[0;0m\] $
SSH_CLIENT=10.212.27.52 53615 22
__GIT_PROMPT_SHOW_CHANGED_FILES_COUNT=1
XDG_DATA_DIRS=/usr/local/share:/usr/share:/var/lib/snapd/desktop
PATH=/home/mo/tt-metal/build/python_env/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin:/home/mo/.fzf/bin
TT_METAL_HOME=/home/mo/tt-metal
DBUS_SESSION_BUS_ADDRESS=unix:path=/run/user/1016/bus
SSH_TTY=/dev/pts/0
GIT_BRANCH=main
__GIT_PROMPT_WITH_USERNAME_AND_REPO=0
OLDPWD=/home/mo
_=/usr/bin/env
Line 400 has all the env vars I set.
Just as a quick sanity check, is it worth having cloud run a pybuda test on this machine? Maybe this is a HW issue
AFAIK, ~this~ hanging has not been an issue before. I've only seen PCC issues. I have run WH tests on this machine before but was getting pad PCC errors.
Did you want to try re-flashing the board and rebooting?
Is this just this specific machine? Hanging right away hasn't been observed on other machines.
Am I missing a step? Can someone with more WH experience give this a shot as well?
I did the exact same steps on the .73 WH machine yesterday and it worked with no issues.
Side note, which machine are you using? I got an IRD machine today which uses an X2 and has firmware from Aug-29 and that seems to be hanging right away for me. @tt-rkim is aug-29 the 7.E firmware?
This is the same machine for the ticket 172.27.28.130
Hm then your configuration matches the working machine configuration I see...
Driver 1.23 and FW 7.D is the one I've been running on SC lab machine
Unpad multi core was hanging on the 4th iteration of post commit. Haven't seen a hang right away on SC lab machines. Also tried on DirtBox with 1.23 and 7.D, and it was fine a week ago.
I will try a force flash and reboot and see if that helps. Then re-run the tests
Can confirm hanging tests as Mo has said on this machine.
I believe there are some wormhole tests in UMD, could we try running those on this machine?
@tt-rkim that's a great idea, we should run UMD tests as a first job in the post commit workflow.
@tt-rkim we should use our CI runners for GH UMD mirror as well; gitlab runners don't have hardware attached so it would be good to at least have one side regress on actual hardware
Do you think our CI runners will be sufficient for UMD? They should have everything they need.
If so, we can open up an issue and start working on GitHub workflows for UMD right away.
@tt-rkim @TT-billteng @kkwong10 @jvasilje -- we haven't seen this again / repro'd? So it was a machine one-time thing?
I was not able to repro this error.
However, I don't believe I ever was able to properly run post commit for WH on this machine.
somewhat related, I noticed that pytest has extra plugins; one plugin I tried is a random shuffle plugin
we should run stress tests (gtest, pytest) with shuffle commands/plug-ins
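A minimal sketch of the shuffle idea, assuming we drive the ordering ourselves instead of relying on a plugin: shuffle the test list with a recorded seed so that a failing order can be replayed. `shuffled_order` is a hypothetical helper. For gtest, the built-in `--gtest_shuffle` and `--gtest_random_seed` flags serve the same purpose.

```python
import random


def shuffled_order(test_ids, seed):
    """Return a new list with test_ids shuffled deterministically by seed."""
    order = list(test_ids)  # copy so the caller's list is left untouched
    random.Random(seed).shuffle(order)
    return order
```

Logging the seed next to each stress run matters more than the shuffle itself: an order-dependent hang is only debuggable if the same order can be reproduced.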
@muthutt observed this so he can provide more details on: machine, branch, hash commit. (see below)
The test reports this error:
Read 0xffffffff from ARC scratch[6]: you should reset the board.
-- However, the board can't be reset (see below).
machine: ssh muthu@172.27.28.130
branch: main at SHA 632910be197253ac8f48d47b042b4e6a22b1ea0b (https://github.com/tenstorrent-metal/tt-metal/commit/632910be197253ac8f48d47b042b4e6a22b1ea0b)
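For context on the error above: a register read that comes back as all ones usually means the device did not respond to the read at all. A small Python sketch of polling for that condition follows; `read_fn` stands in for whatever performs the register read, and both helper names are hypothetical, not part of UMD or tt-metal.

```python
ALL_ONES_32 = 0xFFFFFFFF


def looks_unresponsive(value):
    """True if a 32-bit register read returned all ones."""
    return (value & ALL_ONES_32) == ALL_ONES_32


def read_with_retries(read_fn, attempts=3):
    """Call read_fn up to `attempts` times; return the first value that is
    not all ones, or None if every attempt looked unresponsive."""
    for _ in range(attempts):
        value = read_fn()
        if not looks_unresponsive(value):
            return value
    return None
```

A `None` result after several attempts corresponds to the state reported here, where the next step would be a board reset.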