rbonghi / jetson_stats

📊 Simple package for monitoring and controlling your NVIDIA Jetson [Orin, Xavier, Nano, TX] series
https://rnext.it/jetson_stats
GNU Affero General Public License v3.0

jtop becomes unstable when interval is small. #414

Closed by leimao 10 months ago

leimao commented 1 year ago

Describe the bug

When the interval is small, jetson.ok() is very likely to return False.

To Reproduce

from jtop import jtop

with jtop(interval=0.1) as jetson:
    # jetson.ok() will provide the proper update frequency
    while jetson.ok():
        # Read tegra stats
        print(jetson.stats)

Expected behavior

jetson.ok() should return True in most scenarios.

Board

$ jetson_release -v
Software part of jetson-stats 4.2.1 - (c) 2023, Raffaello Bonghi
Model: Jetson-AGX - Jetpack 5.1.1 [L4T 35.3.1]
NV Power Mode[0]: MAXN
Serial Number: [XXX Show with: jetson_release -s XXX]
Hardware:
 - 699-level Part Number: 699-82888-0004-400 L.0
 - P-Number: p2888-0004
 - Module: NVIDIA Jetson AGX Xavier (32 GB ram)
 - SoC: tegra194
 - CUDA Arch BIN: 7.2
 - Codename: Galen
Platform:
 - Machine: aarch64
 - System: Linux
 - Distribution: Ubuntu 20.04 focal
 - Release: 5.10.104-tegra
 - Python: 3.8.10
jtop:
 - Version: 4.2.1
 - Service: Active
Libraries:
 - CUDA: 11.4.315
 - cuDNN: 8.6.0.166
 - TensorRT: 8.5.2.2
 - VPI: 2.2.6
 - Vulkan: 1.3.204
 - OpenCV: 4.5.4 - with CUDA: NO

Log from jtop.service

$ journalctl -u jtop.service -n 100 --no-pager
-- Logs begin at Thu 2023-03-02 04:58:02 PST, end at Tue 2023-05-23 21:43:01 PDT. --
May 23 21:23:51 jetson-agx-xavier systemd[1]: Started jtop service.
May 23 21:23:52 jetson-agx-xavier jtop[3018]: [INFO] jtop.service - jetson_stats 4.2.1 - server loaded
May 23 21:23:52 jetson-agx-xavier jtop[3018]: [INFO] jtop.core.hardware - Hardware detected aarch64
May 23 21:23:52 jetson-agx-xavier jtop[3018]: [INFO] jtop.core.hardware - NVIDIA Jetson detected L4T=35.3.1
May 23 21:23:52 jetson-agx-xavier jtop[3018]: [INFO] jtop.service - Running on Python: 3.8.10
May 23 21:23:52 jetson-agx-xavier jtop[3018]: [INFO] jtop.core.cpu - Found 8 CPU
May 23 21:23:52 jetson-agx-xavier jtop[3018]: [INFO] jtop.core.gpu - GPU "gv11b" status in /sys/devices/platform/17000000.gv11b
May 23 21:23:52 jetson-agx-xavier jtop[3018]: [INFO] jtop.core.gpu - GPU "gv11b" frq in /sys/devices/platform/17000000.gv11b/devfreq/17000000.gv11b
May 23 21:23:52 jetson-agx-xavier jtop[3018]: [INFO] jtop.core.processes - Process service started
May 23 21:23:52 jetson-agx-xavier jtop[3018]: [INFO] jtop.core.memory - Found EMC!
May 23 21:23:52 jetson-agx-xavier jtop[3018]: [INFO] jtop.core.memory - Memory service started
May 23 21:23:52 jetson-agx-xavier jtop[3018]: [INFO] jtop.core.engine - Special Engine group found: [dlaX]
May 23 21:23:52 jetson-agx-xavier jtop[3018]: [INFO] jtop.core.engine - Special Engine group found: [pvaX]
May 23 21:23:52 jetson-agx-xavier jtop[3018]: [INFO] jtop.core.engine - Engines found: [APE CVNAS DLA0 DLA1 NVDEC NVENC NVJPG PVA0 PVA1 SE VIC]
May 23 21:23:52 jetson-agx-xavier jtop[3018]: [INFO] jtop.core.temperature - Found thermal "AUX" in thermal_zone2
May 23 21:23:52 jetson-agx-xavier jtop[3018]: [INFO] jtop.core.temperature - Found thermal "CPU" in thermal_zone0
May 23 21:23:52 jetson-agx-xavier jtop[3018]: [INFO] jtop.core.temperature - Found thermal "thermal" in thermal_zone7
May 23 21:23:52 jetson-agx-xavier jtop[3018]: [INFO] jtop.core.temperature - Found thermal "Tboard" in thermal_zone5
May 23 21:23:52 jetson-agx-xavier jtop[3018]: [INFO] jtop.core.temperature - Found thermal "AO" in thermal_zone3
May 23 21:23:52 jetson-agx-xavier jtop[3018]: [INFO] jtop.core.temperature - Found thermal "GPU" in thermal_zone1
May 23 21:23:52 jetson-agx-xavier jtop[3018]: [INFO] jtop.core.temperature - Found thermal "Tdiode" in thermal_zone6
May 23 21:23:52 jetson-agx-xavier jtop[3018]: [WARNING] jtop.core.temperature - Skipped PMIC
May 23 21:23:52 jetson-agx-xavier jtop[3018]: [INFO] jtop.core.power - Alarms CV - {'crit_alarm': 0, 'max_alarm': 0}
May 23 21:23:52 jetson-agx-xavier jtop[3018]: [INFO] jtop.core.power - Alarms VDDRQ - {'crit_alarm': 0, 'max_alarm': 0}
May 23 21:23:52 jetson-agx-xavier jtop[3018]: [INFO] jtop.core.power - Alarms SYS5V - {'crit_alarm': 0, 'max_alarm': 0}
May 23 21:23:52 jetson-agx-xavier jtop[3018]: [WARNING] jtop.core.power - Skipped "sum of shunt voltages" /sys/bus/i2c/devices/1-0041/hwmon/hwmon5/in7_label
May 23 21:23:52 jetson-agx-xavier jtop[3018]: [INFO] jtop.core.power - Alarms GPU - {'crit_alarm': 0, 'max_alarm': 0}
May 23 21:23:52 jetson-agx-xavier jtop[3018]: [INFO] jtop.core.power - Alarms CPU - {'crit_alarm': 0, 'max_alarm': 0}
May 23 21:23:52 jetson-agx-xavier jtop[3018]: [INFO] jtop.core.power - Alarms SOC - {'crit_alarm': 0, 'max_alarm': 0}
May 23 21:23:52 jetson-agx-xavier jtop[3018]: [WARNING] jtop.core.power - Skipped "sum of shunt voltages" /sys/bus/i2c/devices/1-0040/hwmon/hwmon4/in7_label
May 23 21:23:52 jetson-agx-xavier jtop[3018]: [INFO] jtop.core.power - Found I2C power monitor
May 23 21:23:52 jetson-agx-xavier jtop[3018]: [INFO] jtop.core.power - Found name=1-00081 type=USB model=<EMPTY> in ucsi-source-psy-1-00081
May 23 21:23:52 jetson-agx-xavier jtop[3018]: [INFO] jtop.core.power - Found name=1-00082 type=USB model=<EMPTY> in ucsi-source-psy-1-00082
May 23 21:23:52 jetson-agx-xavier jtop[3018]: [INFO] jtop.core.power - Found SYSTEM power monitor
May 23 21:23:52 jetson-agx-xavier jtop[3018]: [INFO] jtop.core.fan - Fan pwmfan(1) found in /sys/class/hwmon/hwmon3
May 23 21:23:52 jetson-agx-xavier jtop[3018]: [INFO] jtop.core.fan - RPM pwm_tach found in /sys/class/hwmon/hwmon2
May 23 21:23:52 jetson-agx-xavier jtop[3018]: [INFO] jtop.core.fan - Found nvfancontrol.service
May 23 21:23:52 jetson-agx-xavier jtop[3018]: [INFO] jtop.core.jetson_clocks - jetson_clocks found in /usr/bin/jetson_clocks
May 23 21:23:52 jetson-agx-xavier jtop[3018]: [INFO] jtop.core.nvpmodel - nvpmodel running in [0]MAXN - Default: 7
May 23 21:23:52 jetson-agx-xavier jtop[3176]: [INFO] jtop.service - Initialization service
May 23 21:23:54 jetson-agx-xavier jtop[3176]: [INFO] jtop.service - service ready
May 23 21:25:02 jetson-agx-xavier jtop[3176]: [INFO] jtop.service - jtop timer thread started 1000ms
May 23 21:25:34 jetson-agx-xavier jtop[3176]: [INFO] jtop.service - jtop timer thread close
May 23 21:27:43 jetson-agx-xavier jtop[3176]: [INFO] jtop.service - jtop timer thread started 1000ms
May 23 21:27:48 jetson-agx-xavier jtop[3176]: [INFO] jtop.service - jtop timer thread close
May 23 21:27:48 jetson-agx-xavier jtop[3176]: [INFO] jtop.service - jtop timer thread started 1000ms
May 23 21:27:55 jetson-agx-xavier jtop[3176]: [INFO] jtop.service - jtop timer thread close
May 23 21:31:02 jetson-agx-xavier jtop[3176]: [INFO] jtop.service - jtop timer thread started 1000ms
May 23 21:31:19 jetson-agx-xavier jtop[3176]: [INFO] jtop.service - jtop timer thread close
May 23 21:39:32 jetson-agx-xavier jtop[3176]: [INFO] jtop.service - jtop timer thread started 1000ms
May 23 21:39:37 jetson-agx-xavier jtop[3176]: [INFO] jtop.service - jtop timer thread close
May 23 21:39:40 jetson-agx-xavier jtop[3176]: [INFO] jtop.service - jtop timer thread started 1000ms
May 23 21:39:48 jetson-agx-xavier jtop[3176]: [INFO] jtop.service - jtop timer thread close
May 23 21:39:49 jetson-agx-xavier jtop[3176]: [INFO] jtop.service - jtop timer thread started 1000ms
May 23 21:39:55 jetson-agx-xavier jtop[3176]: [INFO] jtop.service - jtop timer thread close
May 23 21:40:37 jetson-agx-xavier jtop[3176]: [INFO] jtop.service - jtop timer thread started 1000ms
May 23 21:40:47 jetson-agx-xavier jtop[3176]: [INFO] jtop.service - jtop timer thread close
May 23 21:40:51 jetson-agx-xavier jtop[3176]: [INFO] jtop.service - jtop timer thread started 100ms
May 23 21:40:55 jetson-agx-xavier jtop[3176]: [INFO] jtop.service - jtop timer thread close
May 23 21:41:02 jetson-agx-xavier jtop[3176]: [INFO] jtop.service - jtop timer thread started 100ms
May 23 21:41:05 jetson-agx-xavier jtop[3176]: [INFO] jtop.service - jtop timer thread close
May 23 21:41:09 jetson-agx-xavier jtop[3176]: [INFO] jtop.service - jtop timer thread started 100ms
May 23 21:41:13 jetson-agx-xavier jtop[3176]: [INFO] jtop.service - jtop timer thread close
May 23 21:41:41 jetson-agx-xavier jtop[3176]: [INFO] jtop.service - jtop timer thread started 1000ms
May 23 21:41:50 jetson-agx-xavier jtop[3176]: [INFO] jtop.service - jtop timer thread close
May 23 21:42:03 jetson-agx-xavier jtop[3176]: [INFO] jtop.service - jtop timer thread started 100ms
May 23 21:42:07 jetson-agx-xavier jtop[3176]: [INFO] jtop.service - jtop timer thread close

Log from jetson-stats installation

$ sudo -H pip3 install --no-cache-dir -U jetson-stats
Requirement already up-to-date: jetson-stats in /usr/local/lib/python3.8/dist-packages (4.2.1)
Requirement already satisfied, skipping upgrade: smbus2 in /usr/local/lib/python3.8/dist-packages (from jetson-stats) (0.4.2)
Requirement already satisfied, skipping upgrade: distro in /usr/lib/python3/dist-packages (from jetson-stats) (1.4.0)

rbonghi commented 10 months ago

Hi @leimao

Thank you for your message, and I apologize for my delayed reply. This is a known issue, but not really a bug.

jtop needs time to decode the raw data and make it readable. If the interval is too small, jtop cannot finish decoding before the monitoring loop restarts.

I suggest increasing the interval to 500 ms (interval=0.5) to avoid issues. I will add a limit in the next release of jtop.
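
Until such a limit exists, a caller could apply the suggested floor on its own side. The following is a minimal sketch, not part of jtop: `MIN_INTERVAL_S`, `safe_interval`, and `print_stats` are hypothetical names, and the 0.5 s floor is only an assumption taken from the 500 ms suggestion above. It uses only the calls from the reproduction script (`jtop(interval=...)`, `jetson.ok()`, `jetson.stats`).

```python
# Assumption: 0.5 s comes from the 500 ms suggestion in this thread,
# it is not a limit enforced by jtop 4.2.1 itself.
MIN_INTERVAL_S = 0.5


def safe_interval(requested_s, minimum_s=MIN_INTERVAL_S):
    """Return the requested interval, raised to the floor if it is too small."""
    if requested_s <= 0:
        raise ValueError("interval must be positive")
    return max(requested_s, minimum_s)


def print_stats(requested_s=0.1, max_samples=10):
    """Print up to max_samples readings using a clamped interval."""
    # Imported lazily so safe_interval() can be used without jtop installed.
    from jtop import jtop

    with jtop(interval=safe_interval(requested_s)) as jetson:
        samples = 0
        # The loop ends when ok() returns False or enough samples were read.
        while jetson.ok() and samples < max_samples:
            print(jetson.stats)
            samples += 1
```

On a Jetson with the jtop service running, `print_stats(0.1)` would open the connection with `interval=0.5` instead of `0.1`. Whether 0.5 s is enough on a given board is not confirmed here; the value may need tuning.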