rbonghi / jetson_stats

📊 Simple package for monitoring and control your NVIDIA Jetson [Orin, Xavier, Nano, TX] series
https://rnext.it/jetson_stats
GNU Affero General Public License v3.0

jtop.service becomes inactive #525

Open colingoodman opened 1 month ago

Describe the bug

After a period of time, the "jtop" command begins to fail.

$ jtop
I can't access jtop.service.
Please logout or reboot this board.
$ sudo jtop
The jtop.service is not active. Please run:
sudo systemctl restart jtop.service
$ sudo systemctl restart jtop.service
$ jtop
...

To Reproduce

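Restart the board or jtop.service, then wait several minutes. In the log below, the service stopped about five and a half minutes after the first timer-thread start shown.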
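To pin down how long the service survives after a restart, a small poller could help. This is only a sketch: it assumes `systemctl is-active` is available on the board, and the helper names (`systemctl_is_active`, `wait_until_not_active`) are made up for illustration and are not part of jetson-stats.

```python
import subprocess
import time


def systemctl_is_active(unit="jtop.service"):
    """Return the state printed by `systemctl is-active` (e.g. 'active', 'inactive')."""
    proc = subprocess.run(
        ["systemctl", "is-active", unit],
        capture_output=True,
        text=True,
    )
    return proc.stdout.strip()


def wait_until_not_active(check=systemctl_is_active, interval=5.0,
                          max_polls=None, sleep=time.sleep):
    """Poll `check` until it reports something other than 'active'.

    Returns (number_of_polls, last_state). `check` and `sleep` are
    injectable so the loop can be exercised without systemd.
    """
    polls = 0
    while max_polls is None or polls < max_polls:
        state = check()
        polls += 1
        if state != "active":
            return polls, state
        sleep(interval)
    return polls, "active"


# Hypothetical usage on the board, right after restarting the service:
#   start = time.time()
#   polls, state = wait_until_not_active(interval=5.0)
#   print(f"jtop.service became {state} after {time.time() - start:.0f}s")
```

Running this right after `sudo systemctl restart jtop.service` would give a rough time-to-failure that can be matched against the journalctl timestamps.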

Board

$ jetson_release -v
Software part of jetson-stats 4.2.7 - (c) 2024, Raffaello Bonghi
Model: NVIDIA Jetson Xavier NX Developer Kit - Jetpack 5.1.1 [L4T 35.3.1]
NV Power Mode[8]: MODE_20W_6CORE
Serial Number: [XXX Show with: jetson_release -s XXX]
Hardware:
 - 699-level Part Number: 699-13668-0000-300 B.0
 - P-Number: p3668-0000
 - Module: NVIDIA Jetson Xavier NX (Developer kit)
 - SoC: tegra194
 - CUDA Arch BIN: 7.2
 - Codename: Jakku
Platform:
 - Machine: aarch64
 - System: Linux
 - Distribution: Ubuntu 20.04 focal
 - Release: 5.10.104-tegra
 - Python: 3.8.10
jtop:
 - Version: 4.2.7
 - Service: Active
Libraries:
 - CUDA: 11.4.315
 - cuDNN: 8.6.0.166
 - TensorRT: 8.5.2.2
 - VPI: 2.2.6
 - Vulkan: 1.3.204
 - OpenCV: 4.5.4 - with CUDA: NO

Log from jtop.service

The key to my problem appears to be this exception, raised while reading /proc/<pid>/statm: ProcessLookupError: [Errno 3] No such process

$ journalctl -u jtop.service -n 100 --no-pager
-- Logs begin at Mon 2023-03-27 11:54:08 MDT, end at Mon 2024-05-27 20:54:18 MDT. --
May 27 20:01:35 edge jtop[3105]: [INFO] jtop.service - jtop timer thread started 1000ms
May 27 20:02:09 edge jtop[3105]: [INFO] jtop.service - jtop timer thread close
May 27 20:02:35 edge jtop[3105]: [INFO] jtop.service - jtop timer thread started 1000ms
May 27 20:03:09 edge jtop[3105]: [INFO] jtop.service - jtop timer thread close
May 27 20:03:49 edge jtop[3105]: [INFO] jtop.service - jtop timer thread started 1000ms
May 27 20:04:14 edge jtop[3105]: [INFO] jtop.service - jtop timer thread close
May 27 20:04:39 edge jtop[3105]: [INFO] jtop.service - jtop timer thread started 1000ms
May 27 20:05:14 edge jtop[3105]: [INFO] jtop.service - jtop timer thread close
May 27 20:05:40 edge jtop[3105]: [INFO] jtop.service - jtop timer thread started 1000ms
May 27 20:06:16 edge jtop[3105]: [INFO] jtop.service - jtop timer thread close
May 27 20:07:03 edge jtop[3105]: [INFO] jtop.service - jtop timer thread started 1000ms
May 27 20:07:08 edge jtop[3105]: [CRITICAL] jtop.core.timer_reader - Exception in 'timer_reader thread': [Errno 3] No such process
May 27 20:07:10 edge jtop[3105]: [ERROR] jtop.service - Error subprocess [Errno 3] No such process
May 27 20:07:10 edge jtop[3105]: Traceback (most recent call last):
May 27 20:07:10 edge jtop[3105]:   File "/usr/local/lib/python3.8/dist-packages/jtop/service.py", line 414, in run
May 27 20:07:10 edge jtop[3105]:     if self._timer_reader.open(interval=interval):
May 27 20:07:10 edge jtop[3105]:   File "/usr/local/lib/python3.8/dist-packages/jtop/core/timer_reader.py", line 62, in open
May 27 20:07:10 edge jtop[3105]:     self._error_status()
May 27 20:07:10 edge jtop[3105]:   File "/usr/local/lib/python3.8/dist-packages/jtop/core/timer_reader.py", line 90, in _error_status
May 27 20:07:10 edge jtop[3105]:     raise ex_value
May 27 20:07:10 edge jtop[3105]:   File "/usr/local/lib/python3.8/dist-packages/jtop/core/timer_reader.py", line 46, in _timer_callback
May 27 20:07:10 edge jtop[3105]:     self._callback()
May 27 20:07:10 edge jtop[3105]:   File "/usr/local/lib/python3.8/dist-packages/jtop/service.py", line 605, in jtop_stats
May 27 20:07:10 edge jtop[3105]:     data = self.jtop_decode()
May 27 20:07:10 edge jtop[3105]:   File "/usr/local/lib/python3.8/dist-packages/jtop/service.py", line 569, in jtop_decode
May 27 20:07:10 edge jtop[3105]:     total, table = self.processes.get_status()
May 27 20:07:10 edge jtop[3105]:   File "/usr/local/lib/python3.8/dist-packages/jtop/core/processes.py", line 136, in get_status
May 27 20:07:10 edge jtop[3105]:     table = [self.get_process_info(prc[0], prc[3], prc[2], uptime) for prc in table]
May 27 20:07:10 edge jtop[3105]:   File "/usr/local/lib/python3.8/dist-packages/jtop/core/processes.py", line 136, in <listcomp>
May 27 20:07:10 edge jtop[3105]:     table = [self.get_process_info(prc[0], prc[3], prc[2], uptime) for prc in table]
May 27 20:07:10 edge jtop[3105]:   File "/usr/local/lib/python3.8/dist-packages/jtop/core/processes.py", line 101, in get_process_info
May 27 20:07:10 edge jtop[3105]:     mem_raw = cat(os.path.join('/proc', pid, 'statm')).split()
May 27 20:07:10 edge jtop[3105]:   File "/usr/local/lib/python3.8/dist-packages/jtop/core/common.py", line 110, in cat
May 27 20:07:10 edge jtop[3105]:     return f.readline().rstrip('\x00')
May 27 20:07:10 edge jtop[3105]: ProcessLookupError: [Errno 3] No such process
May 27 20:07:10 edge jtop[3105]: Process JtopServer-1:
May 27 20:07:10 edge jtop[3105]: Traceback (most recent call last):
May 27 20:07:10 edge jtop[3105]:   File "/usr/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
May 27 20:07:10 edge jtop[3105]:     self.run()
May 27 20:07:10 edge jtop[3105]:   File "/usr/local/lib/python3.8/dist-packages/jtop/service.py", line 466, in run
May 27 20:07:10 edge jtop[3105]:     if self._timer_reader.close(timeout=TIMEOUT_SWITCHOFF):
May 27 20:07:10 edge jtop[3105]:   File "/usr/local/lib/python3.8/dist-packages/jtop/core/timer_reader.py", line 75, in close
May 27 20:07:10 edge jtop[3105]:     self._error_status()
May 27 20:07:10 edge jtop[3105]:   File "/usr/local/lib/python3.8/dist-packages/jtop/core/timer_reader.py", line 90, in _error_status
May 27 20:07:10 edge jtop[3105]:     raise ex_value
May 27 20:07:10 edge jtop[3105]:   File "/usr/local/lib/python3.8/dist-packages/jtop/core/timer_reader.py", line 46, in _timer_callback
May 27 20:07:10 edge jtop[3105]:     self._callback()
May 27 20:07:10 edge jtop[3105]:   File "/usr/local/lib/python3.8/dist-packages/jtop/service.py", line 605, in jtop_stats
May 27 20:07:10 edge jtop[3105]:     data = self.jtop_decode()
May 27 20:07:10 edge jtop[3105]:   File "/usr/local/lib/python3.8/dist-packages/jtop/service.py", line 569, in jtop_decode
May 27 20:07:10 edge jtop[3105]:     total, table = self.processes.get_status()
May 27 20:07:10 edge jtop[3105]:   File "/usr/local/lib/python3.8/dist-packages/jtop/core/processes.py", line 136, in get_status
May 27 20:07:10 edge jtop[3105]:     table = [self.get_process_info(prc[0], prc[3], prc[2], uptime) for prc in table]
May 27 20:07:10 edge jtop[3105]:   File "/usr/local/lib/python3.8/dist-packages/jtop/core/processes.py", line 136, in <listcomp>
May 27 20:07:10 edge jtop[3105]:     table = [self.get_process_info(prc[0], prc[3], prc[2], uptime) for prc in table]
May 27 20:07:10 edge jtop[3105]:   File "/usr/local/lib/python3.8/dist-packages/jtop/core/processes.py", line 101, in get_process_info
May 27 20:07:10 edge jtop[3105]:     mem_raw = cat(os.path.join('/proc', pid, 'statm')).split()
May 27 20:07:10 edge jtop[3105]:   File "/usr/local/lib/python3.8/dist-packages/jtop/core/common.py", line 110, in cat
May 27 20:07:10 edge jtop[3105]:     return f.readline().rstrip('\x00')
May 27 20:07:10 edge jtop[3105]: ProcessLookupError: [Errno 3] No such process
May 27 20:07:10 edge jtop[2963]: [INFO] jtop.service - Service closed
May 27 20:07:11 edge systemd[1]: jtop.service: Succeeded.
May 27 20:54:01 edge systemd[1]: Started jtop service.
May 27 20:54:02 edge jtop[57871]: [INFO] jtop.service - jetson_stats 4.2.7 - server loaded
May 27 20:54:03 edge jtop[57871]: [INFO] jtop.service - Running on Python: 3.8.10
May 27 20:54:03 edge jtop[57871]: [INFO] jtop.core.hardware - Hardware detected aarch64
May 27 20:54:03 edge jtop[57871]: [INFO] jtop.core.hardware - NVIDIA Jetson 699-level Part Number=699-13668-0000-300 B.0
May 27 20:54:03 edge jtop[57871]: [INFO] jtop.core.hardware - NVIDIA Jetson Module=NVIDIA Jetson Xavier NX (Developer kit)
May 27 20:54:03 edge jtop[57871]: [INFO] jtop.core.hardware - NVIDIA Jetson detected L4T=35.3.1
May 27 20:54:03 edge jtop[57871]: [INFO] jtop.core.cpu - Found 6 CPU
May 27 20:54:03 edge jtop[57871]: [INFO] jtop.core.gpu - GPU "gv11b" status in /sys/devices/platform/17000000.gv11b
May 27 20:54:03 edge jtop[57871]: [INFO] jtop.core.gpu - GPU "gv11b" frq in /sys/devices/platform/17000000.gv11b/devfreq/17000000.gv11b
May 27 20:54:03 edge jtop[57871]: [INFO] jtop.core.processes - Process service started
May 27 20:54:03 edge jtop[57871]: [INFO] jtop.core.memory - Found EMC!
May 27 20:54:03 edge jtop[57871]: [INFO] jtop.core.memory - Memory service started
May 27 20:54:03 edge jtop[57871]: [INFO] jtop.core.engine - Special Engine group found: [dlaX]
May 27 20:54:03 edge jtop[57871]: [INFO] jtop.core.engine - Special Engine group found: [pvaX]
May 27 20:54:03 edge jtop[57871]: [INFO] jtop.core.engine - Engines found: [APE CVNAS DLA0 DLA1 NVDEC NVENC NVJPG PVA0 PVA1 SE VIC]
May 27 20:54:03 edge jtop[57871]: [INFO] jtop.core.temperature - Found thermal "AUX" in thermal_zone2
May 27 20:54:03 edge jtop[57871]: [INFO] jtop.core.temperature - Found thermal "CPU" in thermal_zone0
May 27 20:54:03 edge jtop[57871]: [INFO] jtop.core.temperature - Found thermal "thermal" in thermal_zone5
May 27 20:54:03 edge jtop[57871]: [INFO] jtop.core.temperature - Found thermal "AO" in thermal_zone3
May 27 20:54:03 edge jtop[57871]: [INFO] jtop.core.temperature - Found thermal "GPU" in thermal_zone1
May 27 20:54:03 edge jtop[57871]: [WARNING] jtop.core.temperature - Skipped PMIC
May 27 20:54:03 edge jtop[57871]: [INFO] jtop.core.power - Alarms VDD_IN - {'crit_alarm': 0, 'max_alarm': 0}
May 27 20:54:03 edge jtop[57871]: [INFO] jtop.core.power - Alarms VDD_CPU_GPU_CV - {'crit_alarm': 0, 'max_alarm': 0}
May 27 20:54:03 edge jtop[57871]: [INFO] jtop.core.power - Alarms VDD_SOC - {'crit_alarm': 0, 'max_alarm': 0}
May 27 20:54:03 edge jtop[57871]: [WARNING] jtop.core.power - Skipped "sum of shunt voltages" /sys/bus/i2c/devices/7-0040/hwmon/hwmon5/in7_label
May 27 20:54:03 edge jtop[57871]: [INFO] jtop.core.power - Found I2C power monitor
May 27 20:54:03 edge jtop[57871]: [WARNING] jtop.core.power - Skipped usb-charger type=USB in=usb-charger
May 27 20:54:03 edge jtop[57871]: [INFO] jtop.core.fan - Fan pwmfan(1) found in /sys/class/hwmon/hwmon4
May 27 20:54:03 edge jtop[57871]: [INFO] jtop.core.fan - RPM pwm_tach found in /sys/class/hwmon/hwmon2
May 27 20:54:03 edge jtop[57871]: [INFO] jtop.core.fan - Found nvfancontrol.service
May 27 20:54:03 edge jtop[57871]: [INFO] jtop.core.jetson_clocks - jetson_clocks found in /usr/bin/jetson_clocks
May 27 20:54:03 edge jtop[57871]: [INFO] jtop.core.nvpmodel - nvpmodel running in [8]MODE_20W_6CORE - Default: 5
May 27 20:54:03 edge jtop[57871]: [INFO] jtop.service - Remove folder /run/jtop.sock
May 27 20:54:04 edge jtop[57894]: [INFO] jtop.service - Initialization service
May 27 20:54:06 edge jtop[57894]: [INFO] jtop.service - service ready
May 27 20:54:07 edge jtop[57894]: [INFO] jtop.service - jtop timer thread started 1000ms
May 27 20:54:18 edge jtop[57894]: [INFO] jtop.service - jtop timer thread close
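My reading of the traceback is that a process exits between being listed and its /proc/<pid>/statm being read, and the unhandled ProcessLookupError then takes down the whole timer thread. Below is a minimal sketch of a read that tolerates this race. It is not the actual jtop code or an official fix: `read_statm` and `collect_statm` are hypothetical names, and the `proc_root` parameter exists only so the sketch can be tried against a fake directory.

```python
import os


def read_statm(pid, proc_root="/proc"):
    """Return the whitespace-split fields of <proc_root>/<pid>/statm.

    Returns None when the process has vanished, instead of raising.
    """
    path = os.path.join(proc_root, str(pid), "statm")
    try:
        with open(path, "r") as f:
            return f.readline().rstrip("\x00").split()
    except (FileNotFoundError, ProcessLookupError):
        # The process exited before open() (FileNotFoundError) or
        # between open() and read (ProcessLookupError, errno 3).
        return None


def collect_statm(pids, proc_root="/proc"):
    """Read statm for each pid, silently skipping processes that are gone."""
    result = {}
    for pid in pids:
        fields = read_statm(pid, proc_root)
        if fields is not None:
            result[str(pid)] = fields
    return result
```

With something along these lines, a short-lived process would simply drop out of the table for that sample rather than stopping the service.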

Log from jetson-stats installation

https://pastebin.com/vDWHzKhx

RAW Data

--------------------- PLATFORM -------------------------
Machine: aarch64
System: Linux
Distribution: Ubuntu 20.04 focal
Release: 5.10.104-tegra
Python: 3.8.10
-------------------- JETSON RAW OUTPUT -----------------
------------------
Path: /etc/nv_tegra_release
# R35 (release), REVISION: 3.1, GCID: 32827747, BOARD: t186ref, EABI: aarch64, DATE: Sun Mar 19 15:19:21 UTC 2023
------------------
Path: /sys/firmware/devicetree/base/model
NVIDIA Jetson Xavier NX Developer Kit
------------------
Path: /proc/device-tree/nvidia,boardids
No such file or directory
------------------
Path: /proc/device-tree/compatible
nvidia,p3449-0000+p3668-0000nvidia,p3509-0000+p3668-0000nvidia,tegra194
------------------
Path: /proc/device-tree/nvidia,dtsfilename
/dvs/git/dirty/git-master_linux/kernel/kernel-5.10/arch/arm64/boot/dts/../../../../../../hardware/nvidia/platform/t19x/jakku/kernel-dts/tegra194-p3668-0000-p3509-0000.dts
------------------
Path: I2C-0-0x50
01 00 FC 00 54 0E 00 00 03 42 00 00 00 00 00 00    ..ü.T....B......
00 00 00 00 36 39 39 2D 31 33 36 36 38 2D 30 30    ....699-13668-00
30 30 2D 33 30 30 20 42 2E 30 00 00 00 00 00 00    00-300 B.0......
00 00 FF FF FF FF FF FF FF FF FF FF FF FF FF FF    ..ÿÿÿÿÿÿÿÿÿÿÿÿÿÿ
FF FF FF FF 4A 7E 3D 2D B0 48 31 34 32 30 38 32    ÿÿÿÿJ~=-°H142082
31 30 36 32 35 37 34 00 00 00 00 00 00 00 00 00    1062574.........
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00    ................
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00    ................
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00    ................
00 00 00 00 00 00 4E 56 43 42 1C 00 4D 31 00 00    ......NVCB..M1..
FF FF FF FF FF FF FF FF FF FF FF FF 4A 7E 3D 2D    ÿÿÿÿÿÿÿÿÿÿÿÿJ~=-
B0 48 00 00 00 00 00 00 00 00 00 00 00 00 00 00    °H..............
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00    ................
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00    ................
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00    ................
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 B1    ...............±

------------------
Path: I2C-0
FAIL
------------------
Path: I2C-1
FAIL
------------------
Path: I2C-2
FAIL
------------------
Path: I2C-7
FAIL

-------------------- IGPU OUTPUT ---------------------
------------------
Path: /sys/class/devfreq/15a80000.nvenc1/device/of_node/name
nvenc1
------------------
Path: /sys/class/devfreq/15480000.nvdec/device/of_node/name
nvdec
------------------
Path: /sys/class/devfreq/154c0000.nvenc/device/of_node/name
nvenc
------------------
Path: /sys/class/devfreq/17000000.gv11b/device/of_node/name
gv11b
------------------
Path: /sys/class/devfreq/15140000.nvdec1/device/of_node/name
nvdec1
------------------
Path: /sys/class/devfreq/15340000.vic/device/of_node/name
vic

Log from jtop 4.2.7