tenstorrent / tt-metal

:metal: TT-NN operator library, and TT-Metalium low level kernel programming model.
Apache License 2.0

WH: slow dispatch post commit corrupts board FW (PCIe / ARC) #2665

Closed davorchap closed 11 months ago

davorchap commented 1 year ago

@muthutt observed this, so he can provide more details on the machine, branch, and commit hash (see below).

The test reports this error: Read 0xffffffff from ARC scratch[6]: you should reset the board. -- However, the board can't be reset (see below).

uthu@e04cs05:~/tt-metal$ tt-smi -wr wait all
Caught Exception Read 0xffffffff from ARC scratch[6]: you should reset the board. when trying to initialize device on pci:0; Continuing without device...
⠦ Detecting Tenstorrent devices...
⠋ Failed to initialize device on pci:0
No chips detected, exiting

machine: ssh muthu@172.27.28.130
branch: main at commit SHA 632910be197253ac8f48d47b042b4e6a22b1ea0b (https://github.com/tenstorrent-metal/tt-metal/commit/632910be197253ac8f48d47b042b4e6a22b1ea0b)
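For readers unfamiliar with the error text: a register read that the device never answers commonly comes back over PCIe as all ones, which is why a value of 0xffffffff is treated as "device not responding" rather than as real data. The sketch below illustrates that kind of sentinel check. It is not tt-smi's actual code; `looks_like_dead_read`, `check_scratch`, and the `read_reg` callback are hypothetical names used only for illustration.

```python
ALL_ONES_32 = 0xFFFFFFFF  # pattern typically seen when a PCIe read gets no response


def looks_like_dead_read(value: int) -> bool:
    """Return True if a 32-bit register value is the all-ones pattern."""
    return (value & ALL_ONES_32) == ALL_ONES_32


def check_scratch(read_reg, index: int) -> int:
    """Read scratch register `index` via `read_reg` and reject all-ones reads.

    `read_reg` is a hypothetical callable standing in for the real register
    access path, which is not shown in this thread.
    """
    value = read_reg(index)
    if looks_like_dead_read(value):
        raise RuntimeError(
            f"Read {value:#010x} from ARC scratch[{index}]: "
            "you should reset the board."
        )
    return value
```

A check like this cannot tell a hung ARC from a broken PCIe link; it only says the read did not return usable data.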

davorchap commented 1 year ago

@pgkeller and @jvasilje , looks like we're corrupting ARC and/or PCIe -- most likely from the device kernel by writing illegally to PCIe / ARC core(s). Can watcher prevent kernels from reading/writing to those cores entirely (in slow dispatch)? Perhaps we need to apply the same to ETH cores for now (no read/writes to ETH cores for now)

Other than corrupting ARC / PCIe not sure what else can get the board into this state.

davorchap commented 1 year ago

I'm escalating this one to P0 (show stopper), as this basically means the slow dispatch post commit is "bricking" the board (i.e. the board can't be reset), and we can't even run slow dispatch. Hopefully we can isolate which test is doing this, and hopefully the watcher can identify the offending transaction.

fyi @abhullar-tt @TT-billteng @tt-rkim

tt-rkim commented 1 year ago

Do we freeze main for this?

mo-tenstorrent commented 1 year ago

TT-smi log on 172.27.28.130

TT-SMI - Version 2023-08-22-492ad2b9ef82a243
Sep 15 2023 02:28:00.531354 PM

Host Information
  * OS                       : Linux
  * Distro                   : Ubuntu 20.04.4 LTS
  * Kernel                   : 5.4.0-162-generic
  * Hostname                 : e04cs05
  * Platform                 : x86_64
  * Python                   : 3.7.16
  * Memory                   : 503.74 GB
  * Driver                   : TTKMD 1.23

Board Information
  * Device 0:
    - Bus ID                 : e1:00.0
    - Family                 : NEBULA_X2
    - Board ID               : 0100014211703009 L
    - Chip Coords            : (0, 0, 0, 0)
    - Chip Revision          : B0
    - DRAM Speed             : 12G
    - DRAM Trained           : Y
    - AICLK PPM En           : Y
    - Link Speed             : Gen4 / Gen4
    - Link Width             : x16 / x16

  * Device 1:
    - Bus ID                 : e1:00.0
    - Family                 : NEBULA_X2
    - Board ID               : 0100014211703009 R
    - Chip Coords            : (1, 0, 0, 0)
    - Chip Revision          : B0
    - DRAM Speed             : 12G
    - DRAM Trained           : Y
    - AICLK PPM En           : Y
    - Link Speed             : Gen4 / Gen4
    - Link Width             : x16 / x16

FW Information
  * Device 0:
    - FW Version             : 15.0.0
    - FW Date                : 2023-08-08
    - ETH FW Version         : 6.3.0
    - M3 BL Version          : 129.2.0.0
    - M3 App Version         : 5.6.0.0
    - TT-Flash Version       : 7.13.0.0

  * Device 1:
    - FW Version             : 15.0.0
    - FW Date                : 2023-08-08
    - ETH FW Version         : 6.3.0
    - M3 BL Version          : 129.2.0.0
    - M3 App Version         : 5.6.0.0
    - TT-Flash Version       : 7.13.0.0

Device Telemetry
  * Device 0:
    - Core Voltage (V)       : 0.72 / 0.95
    - AICLK (MHz)            : 500 / 1000
    - ARCCLK (MHz)           : 540
    - AXICLK (MHz)           : 900
    - Core Current (A)       : 10.0 / 160
    - Core Power (W)         : 6.0 /  85
    - Core Temp (°C)         : 43.2 / 75
    - VREG Temp (°C)         : 39.0
    - Inlet Temp (°C)        : 37.0
    - Outlet Temp 1 (°C)     : 36.0
    - Outlet Temp 2 (°C)     : 36.0

  * Device 1:
    - Core Voltage (V)       : 0.72 / 0.95
    - AICLK (MHz)            : 500 / 1000
    - ARCCLK (MHz)           : 540
    - AXICLK (MHz)           : 900
    - Core Current (A)       : 10.0 / 160
    - Core Power (W)         : 7.0 /  85
    - Core Temp (°C)         : 36.3 / 75
    - VREG Temp (°C)         : 38.0
    - Inlet Temp (°C)        : 37.0
    - Outlet Temp 1 (°C)     : 36.0
    - Outlet Temp 2 (°C)     : 36.0

Checklist
  * Host OS                  : Pass
  * Host Driver              : Pass
  * Host Memory              : Pass
  * Devices                  : Pass
  * Device PCIE              : Pass
  * Device DRAM              : Pass

pgkeller commented 1 year ago

looks like we're corrupting ARC and/or PCIe -- most likely from the device kernel by writing illegally to PCIe / ARC core(s).

What constitutes an illegal write to PCIe/ARC? I don't know how we even write to ARC. For PCIe, watcher checks for a valid address, though right now the limit is set to 4G, which means all PCIe transactions are passed through. We need an API to query the hugepage size, then I can write that to the generated header to narrow the check.
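To make the point about the 4G limit concrete, here is a minimal sketch of a bounds check of the kind described. It is not the watcher's code: the function name is hypothetical, and the 1 GiB hugepage size is only a placeholder assumption, since the thread notes there is no API yet to query the real value.

```python
GIB = 1024 ** 3

PASS_THROUGH_LIMIT = 4 * GIB     # the "4G" setting described above
ASSUMED_HUGEPAGE_SIZE = 1 * GIB  # assumption: placeholder, real size must be queried


def pcie_access_in_bounds(addr: int, length: int, limit: int) -> bool:
    """Return True if [addr, addr + length) lies entirely inside [0, limit)."""
    return length > 0 and addr >= 0 and addr + length <= limit


# A 32-byte access at offset 2 GiB:
stray_addr, stray_len = 2 * GIB, 32

# With the 4G limit, any access that fits in a 32-bit range is accepted.
accepted_today = pcie_access_in_bounds(stray_addr, stray_len, PASS_THROUGH_LIMIT)

# With the limit narrowed to the (assumed) hugepage size, it is rejected.
accepted_narrowed = pcie_access_in_bounds(stray_addr, stray_len, ASSUMED_HUGEPAGE_SIZE)
```

Under these assumptions the same access passes the current check and fails the narrowed one, which is the behavior the hugepage-size query is meant to enable.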

abhullar-tt commented 1 year ago

Is the next action item to re-run the WH post commit with watcher enabled?

pgkeller commented 1 year ago

Watcher seems to work great when running a single test, but I don't have confidence it'll work when running a suite of tests. I plan to debug this next, but I'll be out most of next week, so I'm not sure I'll get it resolved before then.

davorchap commented 1 year ago

I wasn't able to repro this.

@muthutt , @kkwong10 , @abhullar-tt , @DrJessop and @tt-rkim have any of you repro'd this w/ new UMD + reset at init -- in the last couple of days?

if not we should close the case.

mo-tenstorrent commented 1 year ago

Just did a fresh clean build off of latest on main and ran the slow-dispatch test after a full hard reboot and still see a hang on the first device access running:

TT_METAL_SLOW_DISPATCH_MODE=1 ./tests/scripts/run_pre_post_commit_regressions_slow_dispatch.sh

tt-smi reset is successful after the hang.

tt-rkim commented 1 year ago

@mo-tenstorrent @davorchap Can confirm this machine has been upgraded to the latest driver + the 7.D FW required for UMD.

Could you copy paste your env vars and command history for this test? Please delete anything sensitive in the env vars.

mo-tenstorrent commented 1 year ago

Yeah here are my env and history

history:

  392  sudo reboot
  393  tmux
  394  w
  395  cd tt-metal/
  396  git fetch
  397  git reset --hard origin/main
  398  git submodule update --init --recursive
  399  git clean -dfx
  400  export TT_METAL_HOME=$(pwd); export TT_METAL_ENV=dev; export PYTHONPATH=$(pwd); export ARCH_NAME=wormhole_b0
  401  make build
  402  TT_METAL_SLOW_DISPATCH_MODE=1 ./tests/scripts/run_pre_post_commit_regressions_slow_dispatch.sh
  403  source build/python_env/bin/activate
  404  TT_METAL_SLOW_DISPATCH_MODE=1 ./tests/scripts/run_pre_post_commit_regressions_slow_dispatch.sh

env:

$ env
SHELL=/bin/bash
__GIT_PROMPT_SHOW_UNTRACKED_FILES=
PWD=/home/mo/tt-metal
LOGNAME=mo
XDG_SESSION_TYPE=tty
ARCH_NAME=wormhole_b0
MOTD_SHOWN=pam
__GIT_PROMPT_IGNORE_SUBMODULES=1
HOME=/home/mo
LANG=en_US.UTF-8
LS_COLORS=rs=0:di=01;34:ln=01;36:mh=00:pi=40;33:so=01;35:do=01;35:bd=40;33;01:cd=40;33;01:or=40;31;01:mi=00:su=37;41:sg=30;43:ca=30;41:tw=30;42:ow=34;42:st=37;44:ex=01;32:*.tar=01;31:*.tgz=01;31:*.arc=01;31:*.arj=01;31:*.taz=01;31:*.lha=01;31:*.lz4=01;31:*.lzh=01;31:*.lzma=01;31:*.tlz=01;31:*.txz=01;31:*.tzo=01;31:*.t7z=01;31:*.zip=01;31:*.z=01;31:*.dz=01;31:*.gz=01;31:*.lrz=01;31:*.lz=01;31:*.lzo=01;31:*.xz=01;31:*.zst=01;31:*.tzst=01;31:*.bz2=01;31:*.bz=01;31:*.tbz=01;31:*.tbz2=01;31:*.tz=01;31:*.deb=01;31:*.rpm=01;31:*.jar=01;31:*.war=01;31:*.ear=01;31:*.sar=01;31:*.rar=01;31:*.alz=01;31:*.ace=01;31:*.zoo=01;31:*.cpio=01;31:*.7z=01;31:*.rz=01;31:*.cab=01;31:*.wim=01;31:*.swm=01;31:*.dwm=01;31:*.esd=01;31:*.jpg=01;35:*.jpeg=01;35:*.mjpg=01;35:*.mjpeg=01;35:*.gif=01;35:*.bmp=01;35:*.pbm=01;35:*.pgm=01;35:*.ppm=01;35:*.tga=01;35:*.xbm=01;35:*.xpm=01;35:*.tif=01;35:*.tiff=01;35:*.png=01;35:*.svg=01;35:*.svgz=01;35:*.mng=01;35:*.pcx=01;35:*.mov=01;35:*.mpg=01;35:*.mpeg=01;35:*.m2v=01;35:*.mkv=01;35:*.webm=01;35:*.ogm=01;35:*.mp4=01;35:*.m4v=01;35:*.mp4v=01;35:*.vob=01;35:*.qt=01;35:*.nuv=01;35:*.wmv=01;35:*.asf=01;35:*.rm=01;35:*.rmvb=01;35:*.flc=01;35:*.avi=01;35:*.fli=01;35:*.flv=01;35:*.gl=01;35:*.dl=01;35:*.xcf=01;35:*.xwd=01;35:*.yuv=01;35:*.cgm=01;35:*.emf=01;35:*.ogv=01;35:*.ogx=01;35:*.aac=00;36:*.au=00;36:*.flac=00;36:*.m4a=00;36:*.mid=00;36:*.midi=00;36:*.mka=00;36:*.mp3=00;36:*.mpc=00;36:*.ogg=00;36:*.ra=00;36:*.wav=00;36:*.oga=00;36:*.opus=00;36:*.spx=00;36:*.xspf=00;36:
VIRTUAL_ENV=/home/mo/tt-metal/build/python_env
__GIT_PROMPT_IGNORE_STASH=0
LC_TERMINAL=iTerm2
TT_METAL_ENV=dev
SSH_CONNECTION=10.212.27.52 53615 172.27.28.130 22
LESSCLOSE=/usr/bin/lesspipe %s %s
XDG_SESSION_CLASS=user
PYTHONPATH=/home/mo/tt-metal
TERM=xterm-256color
LESSOPEN=| /usr/bin/lesspipe %s
USER=mo
LC_TERMINAL_VERSION=3.4.20
SHLVL=1
XDG_SESSION_ID=26
__GIT_PROMPT_SHOW_TRACKING=1
XDG_RUNTIME_DIR=/run/user/1016
__GIT_PROMPT_SHOW_UPSTREAM=0
PS1=(\[\033[0;34m\]python_env\[\033[0;0m\]) \[\033[0;31m\]✘-TERM\[\033[0;0m\] \[\033[0;33m\]\w\[\033[0;0m\] [\[\033[0;35m\]${GIT_BRANCH}\[\033[0;0m\]|\[\033[1;32m\]✔\[\033[0;0m\]\[\033[0;0m\]] \n\[\033[0;37m\]$(date +%H:%M)\[\033[0;0m\] $
SSH_CLIENT=10.212.27.52 53615 22
__GIT_PROMPT_SHOW_CHANGED_FILES_COUNT=1
XDG_DATA_DIRS=/usr/local/share:/usr/share:/var/lib/snapd/desktop
PATH=/home/mo/tt-metal/build/python_env/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin:/home/mo/.fzf/bin
TT_METAL_HOME=/home/mo/tt-metal
DBUS_SESSION_BUS_ADDRESS=unix:path=/run/user/1016/bus
SSH_TTY=/dev/pts/0
GIT_BRANCH=main
__GIT_PROMPT_WITH_USERNAME_AND_REPO=0
OLDPWD=/home/mo
_=/usr/bin/env

mo-tenstorrent commented 1 year ago

Line 400 has all the envs I set
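For anyone trying to reproduce this on another machine, the environment from history line 400 and the command from line 402 can be captured in a small script. This is a sketch only: the helper names are hypothetical, it assumes the repository is already cloned and built, and it does not activate the Python virtualenv the way history line 403 does.

```python
import os
import subprocess

# Script path taken from the command history above.
SCRIPT = "tests/scripts/run_pre_post_commit_regressions_slow_dispatch.sh"


def slow_dispatch_env(repo_root, base_env=None):
    """Return a copy of the environment with the variables from history line 400."""
    env = dict(os.environ if base_env is None else base_env)
    env.update(
        {
            "TT_METAL_HOME": repo_root,
            "TT_METAL_ENV": "dev",
            "PYTHONPATH": repo_root,
            "ARCH_NAME": "wormhole_b0",
            "TT_METAL_SLOW_DISPATCH_MODE": "1",
        }
    )
    return env


def run_repro(repo_root):
    """Run the slow dispatch regression script and return its exit code."""
    completed = subprocess.run(
        [os.path.join(repo_root, SCRIPT)],
        cwd=repo_root,
        env=slow_dispatch_env(repo_root),
        check=False,  # a hang or failure is the thing being investigated
    )
    return completed.returncode
```

Calling `run_repro("/home/mo/tt-metal")` would correspond to the run on this machine; pasting the resulting environment dict alongside a report makes it easier to compare setups across machines.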

mo-tenstorrent commented 1 year ago

Just as a quick sanity check, is it worth having cloud run a pybuda test on this machine? Maybe this is a HW issue

tt-rkim commented 1 year ago

AFAIK, hanging has not been an issue before. I've only seen PCC issues. I have run WH tests on this machine before but was getting pad PCC errors.

Did you want to try re-flashing the board and rebooting?

davorchap commented 1 year ago

Is this just this specific machine? Hanging right away hasn't been observed on other machines.

mo-tenstorrent commented 1 year ago

Am I missing a step? Can someone with more WH experience give this a shot as well?

I did the exact same steps on .73 WH machine yesterday and it worked with no issues.

kkwong10 commented 1 year ago

Side note, which machine are you using? I got an IRD machine today which uses an X2 and has firmware from Aug-29 and that seems to be hanging right away for me. @tt-rkim is aug-29 the 7.E firmware?

mo-tenstorrent commented 1 year ago

This is the same machine for the ticket 172.27.28.130

kkwong10 commented 1 year ago

Hm then your configuration matches the working machine configuration I see...

davorchap commented 1 year ago

Driver 1.23 and FW 7.D is the one I've been running on SC lab machine

Unpad multi core was hanging on the 4th iteration of post commit. Haven't seen a hang right away on SC lab machines. Also tried on DirtBox with 1.23 and 7.D; it was fine a week ago.

tt-rkim commented 1 year ago

I will try a force flash and reboot and see if that helps. Then re-run the tests

Can confirm hanging tests as Mo has said on this machine.

abhullar-tt commented 1 year ago

I believe there are some wormhole tests in UMD, could we try running those on this machine?

davorchap commented 1 year ago

I believe there are some wormhole tests in UMD, could we try running those on this machine?

@tt-rkim that's a great idea, we should run UMD tests as a first job in the post commit workflow.
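The idea of running UMD tests as the first job amounts to a fail-fast gate: if the low-level driver tests fail, the heavier post commit stages never touch the board. A minimal sketch of that ordering follows. The stage names and the `run_stages` helper are hypothetical, and a real workflow would express this in CI configuration rather than in Python.

```python
def run_stages(stages):
    """Run (name, callable) stages in order and stop at the first failure.

    Each callable returns True on success. The return value lists every
    stage that was actually run together with its result.
    """
    results = []
    for name, stage in stages:
        ok = bool(stage())
        results.append((name, ok))
        if not ok:
            break  # later stages are skipped once an earlier one fails
    return results


# Hypothetical example: the UMD stage fails, so post commit is never started.
example = run_stages(
    [
        ("umd_tests", lambda: False),
        ("slow_dispatch_post_commit", lambda: True),
    ]
)
```

With this ordering, a machine-level problem shows up as a UMD failure instead of as a hang in the middle of the post commit suite.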

TT-billteng commented 1 year ago

@tt-rkim we should use our CI runners for GH UMD mirror as well; gitlab runners don't have hardware attached so it would be good to at least have one side regress on actual hardware

tt-rkim commented 1 year ago

Do you think our CI runners will be sufficient for UMD? They should have everything they need.

If so, we can open up an issue and start working on GitHub workflows for UMD right away.

davorchap commented 11 months ago

@tt-rkim @TT-billteng @kkwong10 @jvasilje -- we haven't seen this again / repro'd? So it was a machine one-time thing?

tt-rkim commented 11 months ago

I was not able to repro this error.

However, I don't believe I ever was able to properly run post commit for WH on this machine.

TT-billteng commented 11 months ago

somewhat related, I noticed that pytest has extra plugins; one plugin I tried is a random shuffle plugin

we should run stress tests (gtest, pytest) with shuffle commands/plug-ins
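One thing to keep in mind with shuffled stress runs is reproducibility: a failure that only appears in a particular order is hard to debug unless that order can be replayed. The sketch below shows the underlying idea of a seeded shuffle. It is not any specific pytest plugin's implementation; `shuffled_order` and the test IDs are hypothetical. On the gtest side, GoogleTest provides the `--gtest_shuffle` and `--gtest_random_seed` flags for the same purpose.

```python
import random


def shuffled_order(test_ids, seed):
    """Return a new list with `test_ids` shuffled deterministically by `seed`."""
    order = list(test_ids)  # copy so the caller's list is left untouched
    random.Random(seed).shuffle(order)
    return order


# Hypothetical test IDs, for illustration only.
tests = ["test_a", "test_b", "test_c", "test_d", "test_e"]

first_run = shuffled_order(tests, seed=1234)
replay = shuffled_order(tests, seed=1234)  # same seed reproduces the same order
```

Logging the seed with each stress run lets a failing order be rerun exactly, which matters for order-dependent problems like a test that leaves the board in a bad state for the next one.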