tenstorrent / tt-kmd

Tenstorrent Kernel Module
GNU General Public License v2.0

hwmon bug with N150, aarch64 #25

Open joelsmithTT opened 1 week ago

joelsmithTT commented 1 week ago

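It's possible to get the system into a state where hwmon information for a single N150 shows up twice: sensors lists the same wormhole-pci-20400 adapter two times, and the device's sysfs hwmon directory contains two entries (hwmon2 and hwmon3).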

$ sensors
wormhole-pci-20400
Adapter: PCI adapter
vcore1:      850.00 mV (max =  +0.95 V)
asic1_temp:   +33.8°C  (high = +75.0°C)
power1:       22.00 W  (max = 100.00 W)
current1:     25.00 A  (max = +240.00 A)

nvme-pci-20100
Adapter: PCI adapter
Composite:    +45.9°C  (low  =  -5.2°C, high = +89.8°C)
                       (crit = +93.8°C)

wormhole-pci-20400
Adapter: PCI adapter
vcore1:      850.00 mV (max =  +0.95 V)
asic1_temp:   +33.8°C  (high = +75.0°C)
power1:       22.00 W  (max = 100.00 W)
current1:     25.00 A  (max = +240.00 A)
$ lspci -d 1e52:401e
0002:04:00.0 Processing accelerators: Tenstorrent Inc Wormhole (rev 01)
$ ls /sys/bus/pci/devices/0002\:04\:00.0/hwmon/
hwmon2  hwmon3

This is on an ARM (aarch64) system. I suspect it has something to do with resetting the card. I will update the ticket with additional information once I've debugged it.
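
To check for this state without reading the sensors output by eye, something like the following sketch could be used. It walks the sysfs layout seen in the ls output above (a hwmon directory under each PCI device) and reports any device that has more than one hwmonN entry. The function name find_duplicate_hwmon and its pci_root parameter are hypothetical, chosen only for this illustration. It only detects the symptom and says nothing about the cause.

```python
from pathlib import Path


def find_duplicate_hwmon(pci_root="/sys/bus/pci/devices"):
    """Return {pci_address: [hwmon entry names]} for every PCI device
    whose hwmon directory holds more than one hwmonN entry.

    Hypothetical helper for illustration; pci_root is a parameter so the
    function can be pointed at a test directory instead of real sysfs.
    """
    duplicates = {}
    root = Path(pci_root)
    if not root.is_dir():
        # No sysfs PCI tree here (for example, not running on Linux).
        return duplicates

    for dev in sorted(root.iterdir()):
        hwmon_dir = dev / "hwmon"
        if not hwmon_dir.is_dir():
            # Device does not expose any hwmon interface.
            continue
        entries = sorted(
            p.name for p in hwmon_dir.iterdir() if p.name.startswith("hwmon")
        )
        if len(entries) > 1:
            duplicates[dev.name] = entries
    return duplicates


if __name__ == "__main__":
    for addr, entries in find_duplicate_hwmon().items():
        print(f"{addr}: {', '.join(entries)}")
```

On the system described in this issue, this would be expected to report 0002:04:00.0 with hwmon2 and hwmon3. Running it before and after a card reset may help confirm whether the reset is what leaves the extra entry behind.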