rbonghi / jetson_stats

📊 Simple package for monitoring and control your NVIDIA Jetson [Orin, Xavier, Nano, TX] series
https://rnext.it/jetson_stats
GNU Affero General Public License v3.0

jtop crashes after running for several minutes #408

Closed jagtonomy closed 9 months ago

jagtonomy commented 1 year ago

Describe the bug

jtop crashes after running for a while

To Reproduce

jtop runs fine for a while; the time until failure is not deterministic. It crashed twice after running for just a few minutes, but it can also run for longer. The traceback is always the same, and jtop can be restarted after the crash.

Screenshots

ag@g050-0506:~$ jtop
Traceback (most recent call last):
  File "/usr/local/bin/jtop", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.8/dist-packages/jtop/__main__.py", line 159, in main
    curses.wrapper(JTOPGUI, jetson, pages, init_page=args.page,
  File "/usr/lib/python3.8/curses/__init__.py", line 105, in wrapper
    return func(stdscr, *args, **kwds)
  File "/usr/local/lib/python3.8/dist-packages/jtop/gui/jtopgui.py", line 100, in __init__
    self.run(loop, seconds)
  File "/usr/local/lib/python3.8/dist-packages/jtop/gui/jtopgui.py", line 129, in run
    self.draw()
  File "/usr/local/lib/python3.8/dist-packages/jtop/gui/jtopgui.py", line 143, in draw
    page.draw(self.key, self.mouse)
  File "/usr/local/lib/python3.8/dist-packages/jtop/gui/pgpu.py", line 209, in draw
    self.process_table.draw(first + 2 + gpu_height, 0, width, height_table, key, mouse)
  File "/usr/local/lib/python3.8/dist-packages/jtop/gui/lib/process_table.py", line 70, in draw
    sorted_processes = sorted(processes, key=lambda x: x[self.line_sort], reverse=self.type_reverse)
  File "/usr/local/lib/python3.8/dist-packages/jtop/gui/lib/process_table.py", line 70, in <lambda>
    sorted_processes = sorted(processes, key=lambda x: x[self.line_sort], reverse=self.type_reverse)
IndexError: list index out of range
ag@g050-0506:~$
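
The traceback ends in the sort key `lambda x: x[self.line_sort]`, so at least one entry in `processes` had no element at index `self.line_sort` when the table was drawn. Below is a minimal sketch of how that error arises, assuming a row with fewer columns than the sort index. The row layout, the sample values, and the helper `sort_by_column` are hypothetical; the thread does not show the real table contents or confirm the root cause.

```python
# Hypothetical rows: the real jtop process table layout is not shown in this thread.
processes = [
    [1201, "user", "python3", 120.5],
    [1342, "root", "Xorg"],  # this row is missing its last column
]
line_sort = 3  # stand-in for self.line_sort


def sort_by_column(rows, column, reverse=False):
    """Same sorting pattern as the line reported in the traceback."""
    return sorted(rows, key=lambda x: x[column], reverse=reverse)


try:
    sort_by_column(processes, line_sort)
    outcome = "sorted"
except IndexError as err:
    # The key function fails on the short row, which aborts the whole sort.
    outcome = f"IndexError: {err}"

print(outcome)
```

A single short row is enough to abort the sort for the whole table, which would be consistent with a crash that only shows up now and then.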

Board

Output from jetson_release -v:

Software part of jetson-stats 4.2.1 - (c) 2023, Raffaello Bonghi
Model: Jetson AGX Orin - Jetpack 5.1 [L4T 35.2.1]
NV Power Mode[0]: MAXN
Serial Number: [XXX Show with: jetson_release -s XXX]
Hardware:
 - 699-level Part Number: 699-13701-0004-500 G.0
 - P-Number: p3701-0004
 - Module: NVIDIA Jetson AGX Orin (32GB ram)
 - SoC: tegra23x
 - CUDA Arch BIN: 8.7
Platform:
 - Machine: aarch64
 - System: Linux
 - Distribution: Ubuntu 20.04 focal
 - Release: 5.10.104-tegra
 - Python: 3.8.10
jtop:
 - Version: 4.2.1
 - Service: Active
Libraries:
 - CUDA: Not installed
 - cuDNN: Not installed
 - TensorRT: Not installed
 - VPI: Not installed
 - OpenCV: Not installed

Log from jtop.service

Attach here the output from: journalctl -u jtop.service -n 100 --no-pager

May 07 19:01:46 g050-0506 systemd[1]: Started jtop service.
May 07 19:01:47 g050-0506 jtop[7658]: [INFO] jtop.service - jetson_stats 4.2.1 - server loaded
May 07 19:01:47 g050-0506 jtop[7658]: [INFO] jtop.core.hardware - Hardware detected aarch64
May 07 19:01:47 g050-0506 jtop[7658]: [INFO] jtop.core.hardware - NVIDIA Jetson detected L4T=35.2.1
May 07 19:01:47 g050-0506 jtop[7658]: [INFO] jtop.service - Running on Python: 3.8.10
May 07 19:01:47 g050-0506 jtop[7658]: [INFO] jtop.core.cpu - Found 8 CPU
May 07 19:01:47 g050-0506 jtop[7658]: [INFO] jtop.core.gpu - GPU "ga10b" status in /sys/devices/platform/17000000.ga10b
May 07 19:01:47 g050-0506 jtop[7658]: [INFO] jtop.core.gpu - GPU "ga10b" frq in /sys/devices/platform/17000000.ga10b/devfreq/17000000.ga10b
May 07 19:01:47 g050-0506 jtop[7658]: [INFO] jtop.core.processes - Process service started
May 07 19:01:47 g050-0506 jtop[7658]: [INFO] jtop.core.memory - Found EMC!
May 07 19:01:47 g050-0506 jtop[7658]: [INFO] jtop.core.memory - Memory service started
May 07 19:01:47 g050-0506 jtop[7658]: [INFO] jtop.core.engine - Special Engine group found: [dlaX]
May 07 19:01:47 g050-0506 jtop[7658]: [INFO] jtop.core.engine - Special Engine group found: [pvaX]
May 07 19:01:47 g050-0506 jtop[7658]: [INFO] jtop.core.engine - Engines found: [APE DLA0 DLA1 NVDEC NVENC NVJPG PVA0 SE VIC]
May 07 19:01:47 g050-0506 jtop[7658]: [INFO] jtop.core.temperature - Found thermal "CV0" in thermal_zone2
May 07 19:01:47 g050-0506 jtop[7658]: [INFO] jtop.core.temperature - Found thermal "CPU" in thermal_zone0
May 07 19:01:47 g050-0506 jtop[7658]: [INFO] jtop.core.temperature - Found thermal "Tboard" in thermal_zone9
May 07 19:01:47 g050-0506 jtop[7658]: [INFO] jtop.core.temperature - Found thermal "SOC2" in thermal_zone7
May 07 19:01:47 g050-0506 jtop[7658]: [INFO] jtop.core.temperature - Found thermal "Tdiode" in thermal_zone10
May 07 19:01:47 g050-0506 jtop[7658]: [INFO] jtop.core.temperature - Found thermal "SOC0" in thermal_zone5
May 07 19:01:47 g050-0506 jtop[7658]: [INFO] jtop.core.temperature - Found thermal "CV1" in thermal_zone3
May 07 19:01:47 g050-0506 jtop[7658]: [INFO] jtop.core.temperature - Found thermal "GPU" in thermal_zone1
May 07 19:01:47 g050-0506 jtop[7658]: [INFO] jtop.core.temperature - Found thermal "tj" in thermal_zone8
May 07 19:01:47 g050-0506 jtop[7658]: [INFO] jtop.core.temperature - Found thermal "iwlwifi" in thermal_zone11
May 07 19:01:47 g050-0506 jtop[7658]: [INFO] jtop.core.temperature - Found thermal "SOC1" in thermal_zone6
May 07 19:01:47 g050-0506 jtop[7658]: [INFO] jtop.core.temperature - Found thermal "CV2" in thermal_zone4
May 07 19:01:47 g050-0506 jtop[7658]: [WARNING] jtop.core.power - Skipped NC /sys/bus/i2c/devices/1-0041/hwmon/hwmon3/in1_label
May 07 19:01:47 g050-0506 jtop[7658]: [INFO] jtop.core.power - Alarms VDDQ_VDD2_1V8AO - {'crit_alarm': 0, 'max_alarm': 0}
May 07 19:01:47 g050-0506 jtop[7658]: [WARNING] jtop.core.power - Skipped NC /sys/bus/i2c/devices/1-0041/hwmon/hwmon3/in3_label
May 07 19:01:47 g050-0506 jtop[7658]: [WARNING] jtop.core.power - Skipped "sum of shunt voltages" /sys/bus/i2c/devices/1-0041/hwmon/hwmon3/in7_label
May 07 19:01:47 g050-0506 jtop[7658]: [INFO] jtop.core.power - Alarms VDD_GPU_SOC - {'crit_alarm': 0, 'max_alarm': 0}
May 07 19:01:47 g050-0506 jtop[7658]: [INFO] jtop.core.power - Alarms VDD_CPU_CV - {'crit_alarm': 0, 'max_alarm': 0}
May 07 19:01:47 g050-0506 jtop[7658]: [INFO] jtop.core.power - Alarms VIN_SYS_5V0 - {'crit_alarm': 0, 'max_alarm': 0}
May 07 19:01:47 g050-0506 jtop[7658]: [WARNING] jtop.core.power - Skipped "sum of shunt voltages" /sys/bus/i2c/devices/1-0040/hwmon/hwmon2/in7_label
May 07 19:01:47 g050-0506 jtop[7658]: [INFO] jtop.core.power - Found I2C power monitor
May 07 19:01:47 g050-0506 jtop[7658]: [INFO] jtop.core.fan - Fan pwmfan(1) found in /sys/class/hwmon/hwmon4
May 07 19:01:47 g050-0506 jtop[7658]: [INFO] jtop.core.fan - RPM pwm_tach found in /sys/class/hwmon/hwmon0
May 07 19:01:47 g050-0506 jtop[7658]: [INFO] jtop.core.fan - Found nvfancontrol.service
May 07 19:01:47 g050-0506 jtop[7658]: [INFO] jtop.core.jetson_clocks - jetson_clocks found in /usr/bin/jetson_clocks
May 07 19:01:47 g050-0506 jtop[7658]: [INFO] jtop.core.nvpmodel - nvpmodel running in [0]MAXN - Default: 0
May 07 19:01:47 g050-0506 jtop[7917]: [INFO] jtop.service - Initialization service
May 07 19:01:48 g050-0506 jtop[7917]: [WARNING] jtop.core.jetson_clocks - I can't store jetson_clocks configuration is already running!
May 07 19:01:48 g050-0506 jtop[7917]: [INFO] jtop.service - service ready
May 07 22:43:12 g050-0506 jtop[7917]: [INFO] jtop.service - jtop timer thread started 1000ms
May 07 22:45:29 g050-0506 jtop[7917]: [INFO] jtop.service - jtop timer thread close

sheldonmaschmeyer commented 1 year ago

I can confirm this issue on the latest version, 4.2.1. It does not occur on version 4.1.5 (which I will be reverting to for now). 4.2.1 crashes every 6.x minutes (I can time it). The service itself runs without incident unless it is called, i.e. whenever jtop is running (command-line strings or the GUI front end), it crashes after about 6 minutes.

jtop
Traceback (most recent call last):
  File "/usr/local/bin/jtop", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.6/dist-packages/jtop/__main__.py", line 160, in main
    loop=args.loop, seconds=LOOP_SECONDS, color_filter=color_filter)
  File "/usr/lib/python3.6/curses/__init__.py", line 94, in wrapper
    return func(stdscr, *args, **kwds)
  File "/usr/local/lib/python3.6/dist-packages/jtop/gui/jtopgui.py", line 100, in __init__
    self.run(loop, seconds)
  File "/usr/local/lib/python3.6/dist-packages/jtop/gui/jtopgui.py", line 129, in run
    self.draw()
  File "/usr/local/lib/python3.6/dist-packages/jtop/gui/jtopgui.py", line 143, in draw
    page.draw(self.key, self.mouse)
  File "/usr/local/lib/python3.6/dist-packages/jtop/gui/pall.py", line 147, in draw
    line_counter += self.process_table.draw(line_counter, 0, width, height_free_area, key, mouse)
  File "/usr/local/lib/python3.6/dist-packages/jtop/gui/lib/process_table.py", line 70, in draw
    sorted_processes = sorted(processes, key=lambda x: x[self.line_sort], reverse=self.type_reverse)
  File "/usr/local/lib/python3.6/dist-packages/jtop/gui/lib/process_table.py", line 70, in <lambda>
    sorted_processes = sorted(processes, key=lambda x: x[self.line_sort], reverse=self.type_reverse)
IndexError: list index out of range

jetson_release -v
Software part of jetson-stats 4.2.1 - (c) 2023, Raffaello Bonghi
Model: Jetson-AGX - Jetpack 4.6 [L4T 32.6.1]
NV Power Mode[0]: MAXN
Hardware:

sheldonmaschmeyer commented 1 year ago

I seem to have been able to work around this issue (note: I am using Docker). Instead of the code below, which worked prior to 4.2.1 and was called every 5 seconds:

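from jtop import jtop

# Previous approach: open a new jtop connection on every call
jetson = jtop()
jetson.start()
stats = jetson.stats
stats['time'] = stats['time'].isoformat() + 'Z'
stats['uptime'] = iso8601(stats['uptime'])  # iso8601 is my own helper, not shown here
print(stats)
jetson.close()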

I am now using:

import sys

from jtop import jtop

with jtop() as jetson:
    # jetson.ok() will provide the proper update frequency
    while jetson.ok():
        # Read tegra stats
        print(jetson.json())
        sys.stdout.flush()

The above does not seem to crash.

The images (host and Docker) have been updated to Ubuntu 20.04, Jetpack 5.1.1 and L4T 35.3.1.

ratsputin commented 1 year ago

Also seeing this on 4.2.1

Traceback (most recent call last):
  File "/usr/local/bin/jtop", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.8/dist-packages/jtop/__main__.py", line 159, in main
    curses.wrapper(JTOPGUI, jetson, pages, init_page=args.page,
  File "/usr/lib/python3.8/curses/__init__.py", line 105, in wrapper
    return func(stdscr, *args, **kwds)
  File "/usr/local/lib/python3.8/dist-packages/jtop/gui/jtopgui.py", line 100, in __init__
    self.run(loop, seconds)
  File "/usr/local/lib/python3.8/dist-packages/jtop/gui/jtopgui.py", line 129, in run
    self.draw()
  File "/usr/local/lib/python3.8/dist-packages/jtop/gui/jtopgui.py", line 143, in draw
    page.draw(self.key, self.mouse)
  File "/usr/local/lib/python3.8/dist-packages/jtop/gui/pall.py", line 147, in draw
    line_counter += self.process_table.draw(line_counter, 0, width, height_free_area, key, mouse)
  File "/usr/local/lib/python3.8/dist-packages/jtop/gui/lib/process_table.py", line 70, in draw
    sorted_processes = sorted(processes, key=lambda x: x[self.line_sort], reverse=self.type_reverse)
  File "/usr/local/lib/python3.8/dist-packages/jtop/gui/lib/process_table.py", line 70, in <lambda>
    sorted_processes = sorted(processes, key=lambda x: x[self.line_sort], reverse=self.type_reverse)
IndexError: list index out of range

jetson_release -v
Software part of jetson-stats 4.2.1 - (c) 2023, Raffaello Bonghi
Model: NVIDIA Orin NX Developer Kit - Jetpack 5.1 [L4T 35.2.1]
NV Power Mode[0]: MAXN
Serial Number: [XXX Show with: jetson_release -s XXX]
Hardware:
 - 699-level Part Number: 699-13767-0000-300 H.2
 - P-Number: p3767-0000
 - Module: NVIDIA Jetson Orin NX (16GB ram)
 - SoC: tegra23x
 - CUDA Arch BIN: 8.7
 - Codename: P3768
Platform:
 - Machine: aarch64
 - System: Linux
 - Distribution: Ubuntu 20.04 focal
 - Release: 5.10.104-tegra
 - Python: 3.8.10
jtop:
 - Version: 4.2.1
 - Service: Active
Libraries:
 - CUDA: 11.4.315
 - cuDNN: 8.6.0.166
 - TensorRT: 5.1
 - VPI: 2.2.4
 - Vulkan: 1.3.204
 - OpenCV: 4.5.4 - with CUDA: NO

ksaye commented 11 months ago

Seeing this with 4.2.3

  ksaye@xavier:~$ jtop
  Traceback (most recent call last):
    File "/usr/local/bin/jtop", line 8, in <module>
      sys.exit(main())
    File "/usr/local/lib/python3.8/dist-packages/jtop/__main__.py", line 159, in main
      curses.wrapper(JTOPGUI, jetson, pages, init_page=args.page,
    File "/usr/lib/python3.8/curses/__init__.py", line 105, in wrapper
      return func(stdscr, *args, **kwds)
    File "/usr/local/lib/python3.8/dist-packages/jtop/gui/jtopgui.py", line 100, in __init__
      self.run(loop, seconds)
    File "/usr/local/lib/python3.8/dist-packages/jtop/gui/jtopgui.py", line 129, in run
      self.draw()
    File "/usr/local/lib/python3.8/dist-packages/jtop/gui/jtopgui.py", line 143, in draw
      page.draw(self.key, self.mouse)
    File "/usr/local/lib/python3.8/dist-packages/jtop/gui/pall.py", line 147, in draw
      line_counter += self.process_table.draw(line_counter, 0, width, height_free_area, key, mouse)
    File "/usr/local/lib/python3.8/dist-packages/jtop/gui/lib/process_table.py", line 70, in draw
      sorted_processes = sorted(processes, key=lambda x: x[self.line_sort], reverse=self.type_reverse)
    File "/usr/local/lib/python3.8/dist-packages/jtop/gui/lib/process_table.py", line 70, in <lambda>
      sorted_processes = sorted(processes, key=lambda x: x[self.line_sort], reverse=self.type_reverse)
  IndexError: list index out of range

  ksaye@xavier:~$ jetson_release -v
  Software part of jetson-stats 4.2.3 - (c) 2023, Raffaello Bonghi
  Model: Jetson-AGX - Jetpack 5.1.2 [L4T 35.4.1]
  NV Power Mode[3]: MODE_30W_ALL
  Serial Number: [XXX Show with: jetson_release -s XXX]
  Hardware:
   - 699-level Part Number: 699-82888-0004-400 L.0
   - P-Number: p2888-0004
   - Module: NVIDIA Jetson AGX Xavier (32 GB ram)
   - SoC: tegra194
   - CUDA Arch BIN: 7.2
   - Codename: Galen
  Platform:
   - Machine: aarch64
   - System: Linux
   - Distribution: Ubuntu 20.04 focal
   - Release: 5.10.120-tegra
   - Python: 3.8.10
  jtop:
   - Version: 4.2.3
   - Service: Active
  Libraries:
   - CUDA: 11.4.315
   - cuDNN: 8.6.0.166
   - TensorRT: 8.5.2.2
   - VPI: 2.3.9
   - Vulkan: 1.3.204
   - OpenCV: 4.5.4 - with CUDA: NO

As a workaround, I modified line 70 in '/usr/local/lib/python3.8/dist-packages/jtop/gui/lib/process_table.py' to be:

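    # Sort table for selected line
    # Default to the unsorted list so a failed sort cannot crash the GUI
    sorted_processes = processes
    try:
        sorted_processes = sorted(processes, key=lambda x: x[self.line_sort], reverse=self.type_reverse)
    except IndexError:
        # A row is shorter than the selected sort column; keep the unsorted order
        pass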
    # Draw all processes
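
Catching the exception keeps jtop running, but the table is drawn unsorted whenever the sort fails. As an alternative, here is a minimal sketch that still sorts the rows it can, assuming the crash comes from rows that are shorter than the selected sort column (this thread does not confirm the root cause). The function `sort_rows` and the sample data are hypothetical and not part of jtop.

```python
def sort_rows(rows, column, reverse=False):
    """Sort rows by one column; rows too short to have that column go last."""
    complete = [row for row in rows if len(row) > column]
    short = [row for row in rows if len(row) <= column]
    # Only complete rows reach the key function, so it cannot raise IndexError.
    return sorted(complete, key=lambda row: row[column], reverse=reverse) + short


rows = [[3, "c"], [1], [2, "b"]]
result = sort_rows(rows, 1)
print(result)
```

Putting incomplete rows at the end is one possible choice; dropping them for that frame would work equally well for display purposes.
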

rbonghi commented 10 months ago

I apologize for my really late reply, but looking at this thread, I know how to fix it! Thank you in advance for this support!

The next jtop release will fix this bug!

rbonghi commented 10 months ago

Thank you, @ksaye; I added your workaround on top. It will ship with the next jetson-stats release.