rbonghi / jetson_stats

📊 Simple package for monitoring and control your NVIDIA Jetson [Orin, Xavier, Nano, TX] series
https://rnext.it/jetson_stats
GNU Affero General Public License v3.0

Wrong display when GPU memory is just a little bit over 1G #413

Open clogwog opened 1 year ago

clogwog commented 1 year ago

Describe the bug

When GPU memory reaches about 1G, it is briefly shown incorrectly as 101M. Once it gets higher, I think I have seen it display 1G properly, so the problem must be around the change-over from M to G. This confused me because I thought I was running light on memory, until I saw the startup sequence shown in the video below and confirmed it with the graph on page 2(GPU).

To Reproduce

When starting a program that takes a lot of GPU memory, the reading goes up to 900M and then suddenly shows 101M, even though memory use is still rising.

It is shown wrongly both on the 1(All) screen, next to the process, and in the total GPU graph on the 2(GPU) page.

See this example https://www.youtube.com/watch?v=R_rWsqLXMfw

Expected behavior

Show either 1000M or 1G.

Board

Output from jetson_release -v:

$ sudo jetson_release -v
Software part of jetson-stats 4.2.1 - (c) 2023, Raffaello Bonghi
Model: lanai-3636 - Jetpack 4.6.3 [L4T 32.7.3]
NV Power Mode[3]: MAXP_CORE_ARM
Serial Number: [XXX Show with: jetson_release -s XXX]
Hardware:
 - BoardIDs: p3636-0001-A0
 - Module: Not available
 - SoC: tegra186
 - CUDA Arch BIN: 6.2
 - Codename: Lanai
Platform:
 - Machine: aarch64
 - System: Linux
 - Distribution: Ubuntu 18.04 Bionic Beaver
 - Release: 4.9.299-tegra
 - Python: 3.6.9
jtop:
 - Version: 4.2.1
 - Service: Active
Libraries:
 - CUDA: 10.2.300
 - cuDNN: 8.2.1.32
 - TensorRT: 8.2
 - VPI: 1.2.3
 - Vulkan: 1.2.70
 - OpenCV: 4.1.1 - with CUDA: NO

Log from jtop.service

Attach here the output from: journalctl -u jtop.service -n 100 --no-pager

$ journalctl -u jtop.service -n 100 --no-pager
-- Logs begin at Tue 2023-05-23 22:21:48 UTC, end at Tue 2023-05-23 22:36:26 UTC. --
May 23 22:22:05 jeteye systemd[1]: Started jtop service.
May 23 22:22:05 jeteye jtop[7633]: [INFO] jtop.service - jetson_stats 4.2.1 - server loaded
May 23 22:22:05 jeteye jtop[7633]: [INFO] jtop.core.config - Load config from /usr/local/jtop/config.json
May 23 22:22:05 jeteye jtop[7633]: [INFO] jtop.core.hardware - Hardware detected aarch64
May 23 22:22:05 jeteye jtop[7633]: [INFO] jtop.core.hardware - NVIDIA Jetson detected L4T=32.7.3
May 23 22:22:06 jeteye jtop[7633]: [INFO] jtop.service - Running on Python: 3.6.9
May 23 22:22:06 jeteye jtop[7633]: [INFO] jtop.core.cpu - Found 6 CPU
May 23 22:22:06 jeteye jtop[7633]: [INFO] jtop.core.gpu - GPU "gp10b" status in /sys/devices/17000000.gp10b
May 23 22:22:06 jeteye jtop[7633]: [INFO] jtop.core.gpu - GPU "gp10b" frq in /sys/devices/17000000.gp10b/devfreq/17000000.gp10b
May 23 22:22:06 jeteye jtop[7633]: [INFO] jtop.core.processes - Process service started
May 23 22:22:06 jeteye jtop[7633]: [INFO] jtop.core.memory - Found EMC!
May 23 22:22:06 jeteye jtop[7633]: [INFO] jtop.core.memory - Memory service started
May 23 22:22:06 jeteye jtop[7633]: [INFO] jtop.core.engine - Engines found: [APE NVDEC NVENC NVJPG SE VIC]
May 23 22:22:06 jeteye jtop[7633]: [INFO] jtop.core.temperature - Found thermal "PLL" in thermal_zone3
May 23 22:22:06 jeteye jtop[7633]: [INFO] jtop.core.temperature - Found thermal "MCPU" in thermal_zone1
May 23 22:22:06 jeteye jtop[7633]: [WARNING] jtop.core.temperature - Skipped PMIC
May 23 22:22:06 jeteye jtop[7633]: [INFO] jtop.core.temperature - Found thermal "GPU" in thermal_zone2
May 23 22:22:06 jeteye jtop[7633]: [INFO] jtop.core.temperature - Found thermal "BCPU" in thermal_zone0
May 23 22:22:06 jeteye jtop[7633]: [INFO] jtop.core.temperature - Found thermal "thermal" in thermal_zone5
May 23 22:22:06 jeteye jtop[7633]: [INFO] jtop.core.power - Found I2C power monitor
May 23 22:22:06 jeteye jtop[7633]: [INFO] jtop.core.fan - Fan tegra_pwmfan(1) found in /sys/class/hwmon/hwmon2
May 23 22:22:06 jeteye jtop[7633]: [INFO] jtop.core.fan - RPM tegra_pwmfan(1) found in /sys/class/hwmon/hwmon2
May 23 22:22:06 jeteye jtop[7633]: [INFO] jtop.core.fan - RPM pwm_tach found in /sys/class/hwmon/hwmon1
May 23 22:22:06 jeteye jtop[7633]: [INFO] jtop.core.fan - Fan temp controller tegra_pwmfan found in /sys/class/hwmon/hwmon2/temp_control
May 23 22:22:06 jeteye jtop[7633]: [INFO] jtop.core.jetson_clocks - jetson_clocks found in /usr/bin/jetson_clocks
May 23 22:22:06 jeteye jtop[7633]: [INFO] jtop.core.nvpmodel - nvpmodel running in [3]MAXP_CORE_ARM - Default: 3
May 23 22:22:06 jeteye jtop[7633]: [INFO] jtop.service - Initialization service
May 23 22:22:06 jeteye jtop[7633]: [INFO] jtop.core.fan - Initialization tegra_pwmfan
May 23 22:22:06 jeteye jtop[7633]: [WARNING] jtop.core.fan - Fan tegra_pwmfan profile temp_control already active
May 23 22:22:06 jeteye jtop[7633]: [INFO] jtop.core.jetson_clocks - Starting jetson_clocks on boot
May 23 22:22:06 jeteye jtop[7633]: [INFO] jtop.service - service ready
May 23 22:22:06 jeteye jtop[7633]: [INFO] jtop.core.jetson_clocks - Starting jetson_clocks in: 39.42s
May 23 22:22:06 jeteye jtop[7633]: [INFO] jtop.core.jetson_clocks - Start jetson_clocks with booting
May 23 22:22:18 jeteye systemd[1]: Stopping jtop service...
May 23 22:22:18 jeteye jtop[7633]: [INFO] jtop.__main__ - Close service by signal 15
May 23 22:22:18 jeteye jtop[7633]: [WARNING] jtop.service - KeyboardInterrupt, SystemExit interrupt
May 23 22:22:18 jeteye jtop[7633]: [INFO] jtop.__main__ - Close service by signal 15
May 23 22:22:18 jeteye jtop[7633]: [INFO] jtop.service - FORCE jtop timer thread close
May 23 22:22:18 jeteye jtop[7633]: [INFO] jtop.__main__ - Close service by signal 15
May 23 22:22:18 jeteye jtop[7633]: [INFO] jtop.service - Terminate subprocess
May 23 22:22:18 jeteye jtop[7633]: [INFO] jtop.service - Wait shutdown subprocess
May 23 22:22:19 jeteye jtop[7633]: [INFO] jtop.service - Service closed
May 23 22:22:19 jeteye systemd[1]: Stopped jtop service.
May 23 22:22:19 jeteye systemd[1]: Started jtop service.
May 23 22:22:20 jeteye jtop[9508]: [INFO] jtop.service - jetson_stats 4.2.1 - server loaded
May 23 22:22:20 jeteye jtop[9508]: [INFO] jtop.core.config - Load config from /usr/local/jtop/config.json
May 23 22:22:20 jeteye jtop[9508]: [INFO] jtop.core.hardware - Hardware detected aarch64
May 23 22:22:20 jeteye jtop[9508]: [INFO] jtop.core.hardware - NVIDIA Jetson detected L4T=32.7.3
May 23 22:22:20 jeteye jtop[9508]: [INFO] jtop.service - Running on Python: 3.6.9
May 23 22:22:20 jeteye jtop[9508]: [INFO] jtop.core.cpu - Found 6 CPU
May 23 22:22:20 jeteye jtop[9508]: [INFO] jtop.core.gpu - GPU "gp10b" status in /sys/devices/17000000.gp10b
May 23 22:22:20 jeteye jtop[9508]: [INFO] jtop.core.gpu - GPU "gp10b" frq in /sys/devices/17000000.gp10b/devfreq/17000000.gp10b
May 23 22:22:20 jeteye jtop[9508]: [INFO] jtop.core.processes - Process service started
May 23 22:22:20 jeteye jtop[9508]: [INFO] jtop.core.memory - Found EMC!
May 23 22:22:20 jeteye jtop[9508]: [INFO] jtop.core.memory - Memory service started
May 23 22:22:20 jeteye jtop[9508]: [INFO] jtop.core.engine - Engines found: [APE NVDEC NVENC NVJPG SE VIC]
May 23 22:22:20 jeteye jtop[9508]: [INFO] jtop.core.temperature - Found thermal "PLL" in thermal_zone3
May 23 22:22:20 jeteye jtop[9508]: [INFO] jtop.core.temperature - Found thermal "MCPU" in thermal_zone1
May 23 22:22:20 jeteye jtop[9508]: [WARNING] jtop.core.temperature - Skipped PMIC
May 23 22:22:20 jeteye jtop[9508]: [INFO] jtop.core.temperature - Found thermal "GPU" in thermal_zone2
May 23 22:22:20 jeteye jtop[9508]: [INFO] jtop.core.temperature - Found thermal "BCPU" in thermal_zone0
May 23 22:22:20 jeteye jtop[9508]: [INFO] jtop.core.temperature - Found thermal "thermal" in thermal_zone5
May 23 22:22:20 jeteye jtop[9508]: [INFO] jtop.core.power - Found I2C power monitor
May 23 22:22:20 jeteye jtop[9508]: [INFO] jtop.core.fan - Fan tegra_pwmfan(1) found in /sys/class/hwmon/hwmon2
May 23 22:22:20 jeteye jtop[9508]: [INFO] jtop.core.fan - RPM tegra_pwmfan(1) found in /sys/class/hwmon/hwmon2
May 23 22:22:20 jeteye jtop[9508]: [INFO] jtop.core.fan - RPM pwm_tach found in /sys/class/hwmon/hwmon1
May 23 22:22:20 jeteye jtop[9508]: [INFO] jtop.core.fan - Fan temp controller tegra_pwmfan found in /sys/class/hwmon/hwmon2/temp_control
May 23 22:22:20 jeteye jtop[9508]: [INFO] jtop.core.jetson_clocks - jetson_clocks found in /usr/bin/jetson_clocks
May 23 22:22:20 jeteye jtop[9508]: [INFO] jtop.core.nvpmodel - nvpmodel running in [3]MAXP_CORE_ARM - Default: 3
May 23 22:22:20 jeteye jtop[9508]: [INFO] jtop.service - Initialization service
May 23 22:22:20 jeteye jtop[9508]: [INFO] jtop.core.fan - Initialization tegra_pwmfan
May 23 22:22:20 jeteye jtop[9508]: [WARNING] jtop.core.fan - Fan tegra_pwmfan profile temp_control already active
May 23 22:22:21 jeteye jtop[9508]: [INFO] jtop.core.jetson_clocks - Starting jetson_clocks on boot
May 23 22:22:21 jeteye jtop[9508]: [INFO] jtop.service - service ready
May 23 22:22:21 jeteye jtop[9508]: [INFO] jtop.core.jetson_clocks - Starting jetson_clocks in: 24.78s
May 23 22:22:21 jeteye jtop[9508]: [INFO] jtop.core.jetson_clocks - Start jetson_clocks with booting
May 23 22:22:32 jeteye jtop[9508]: [INFO] jtop.service - jtop timer thread started 1000ms
May 23 22:22:46 jeteye jtop[9508]: [WARNING] jtop.core.fan - Fan tegra_pwmfan profile temp_control already active
May 23 22:22:46 jeteye jtop[9508]: [INFO] jtop.core.jetson_clocks - jetson_clocks started

Log from jetson-stats installation

Attach here the output from: sudo -H pip3 install --no-cache-dir -U jetson-stats

$ sudo -H pip3 install --no-cache-dir -U jetson-stats
/usr/lib/python3/dist-packages/secretstorage/dhcrypto.py:15: CryptographyDeprecationWarning: Python 3.6 is no longer supported by the Python core team. Therefore, support for it is deprecated in cryptography. The next release of cryptography will remove support for Python 3.6.
  from cryptography.utils import int_from_bytes
Requirement already satisfied: jetson-stats in /usr/local/lib/python3.6/dist-packages (4.2.1)
Requirement already satisfied: smbus2 in /usr/local/lib/python3.6/dist-packages (from jetson-stats) (0.4.2)
Requirement already satisfied: distro in /usr/local/lib/python3.6/dist-packages (from jetson-stats) (1.8.0)
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
$ jtop -v
jtop 4.2.1

clogwog commented 1 year ago

Just noticed that page 4MEM shows the same thing: GPU Sh: 101M

Model: lanai-3636 - Jetpack 4.6.3 [L4T 32.7.3]
 RAM 2.9G/3.7GB - (lfb 0x4MB)                                         RAM
                                                         ├ 3.7G  Used:    2.9G
 ▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁├ 3.5G  GPU Sh:  101M
 ████████████████████████████████████████████████████████├ 3.3G  Buffers: 26.8M
 ████████████████████████████████████████████████████████├ 3.1G  Cached:  553M
 ▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁├ 2.9G  Free:    382M
 ████████████████████████████████████████████████████████├ 2.7G  TOT:     3.7G
 ████████████████████████████████████████████████████████├ 2.5G
 ████████████████████████████████████████████████████████├ 2.3G
 ████████████████████████████████████████████████████████├ 2.1G
 ▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁├ 1.9G
 ████████████████████████████████████████████████████████├ 1.8G
 ████████████████████████████████████████████████████████├ 1.6G
 ████████████████████████████████████████████████████████├ 1.4G
 ████████████████████████████████████████████████████████├ 1.2G
 ████████████████████████████████████████████████████████├ 1.0G
 ████████████████████████████████████████████████████████├ 0.8G
 ████████████████████████████████████████████████████████├ 0.6G
 ████████████████████████████████████████████████████████├ 0.4G
 ████████████████████████████████████████████████████████├ 0.2G
         └ -8s       └ -6s       └ -4s       └ -2s       0 time
 Emc [204MHz::::::::::::::::::::::::::::1.6GHz] 1.6GHz   0%
 SWAP 594M/9.9G (Cached 45.5M)                              [c| clear cache]
 zram0 [P5|||     148M/478M]  zram2 [P5|||     148M/478M]
 zram1 [P5|||     148M/478M]  zram3 [P5|||     148M/478M]   [Select swap]
 swapfile [P1                               0k/8.0G] Boot
                                                            [s| Create new]
                                                            [b| on boot]
                                                            [-]  1  GB [+]
                                                            New: /swfile

 1ALL  2GPU  3CPU  4MEM  5ENG  6CTRL  7INFO  Quit                  (c) 2023, RB

rbonghi commented 10 months ago

Hi @clogwog (also reply for @artur-ag and @vesselofgod #469 )

I am aware of an issue where the output plotting in the new version of jtop generates errors. I have worked to reduce the occurrence of this bug. The issue stems from the use of ASCII characters to plot the output: in some cases there is no small or big ASCII block available to plot the output perfectly.

I'm not sure if you have any experience with ASCII coding, but I was wondering if you could help me fix it. It would be really helpful for me! I wrote and updated this code some time ago, and the plot object is available at https://github.com/rbonghi/jetson_stats/blob/master/jtop/gui/lib/chart.py

artur-ag commented 7 months ago

From what I understand, the issue is that there's only space for 4 characters in most places of the UI, but all values between 1000M and 1023M take 5 characters and are truncated (for example, to 101M). Values of 1024M and above are correctly converted to G (for example, 1.0G). This is not an issue just with the ASCII plot, but also in the text output.
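To make the truncation described above concrete, here is a minimal sketch of how a formatter could turn 1010M into 101M. It is not the actual jtop code: the function name `clipped_format` and the rule "keep at most 3 characters of the number, then append the unit" are assumptions chosen only to reproduce the reported symptom.

```python
def clipped_format(value_m, digits=3):
    """Hypothetical formatter (NOT jtop's real code).

    Assumption: the unit only changes from M to G at 1024, and the
    numeric part is clipped to `digits` characters before the unit
    letter is appended, so the result fits in 4 characters.
    """
    if value_m >= 1024:
        number, unit = "{:.1f}".format(value_m / 1024.0), "G"
    else:
        number, unit = str(int(value_m)), "M"
    # Clipping "1010" to 3 characters silently drops the last digit.
    return number[:digits].rstrip(".") + unit


if __name__ == "__main__":
    for value in (900, 999, 1010, 1024):
        print(value, "->", clipped_format(value))
```

Under these assumptions, 900 and 999 print as `900M` and `999M`, 1010 prints as `101M`, and 1024 prints as `1.0G`, which matches the sequence seen in the video.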

One way of fixing it would be to switch to the larger unit whenever the value goes above 999, even if the value is still less than 1 when expressed in the larger unit.
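A rough sketch of that idea, assuming a standalone helper rather than the project's real conversion code: the name `compact_format`, the `UNITS` list and the one-decimal rounding are all hypothetical and would need to be adapted to whatever common.py actually does.

```python
UNITS = ["k", "M", "G", "T"]  # hypothetical unit ladder, 1024 per step


def compact_format(value, unit="M"):
    """Format `value` (given in `unit`) in at most 4 characters.

    Sketch only: promote to the next unit as soon as the rounded
    value would need four digits (>= 1000), not at 1024.
    """
    index = UNITS.index(unit)
    number = float(value)
    while round(number) >= 1000 and index < len(UNITS) - 1:
        number /= 1024.0
        index += 1
    # Prefer one decimal; fall back to an integer if that is too wide.
    text = "{:.1f}".format(number)
    if len(text) > 3:
        text = "{:.0f}".format(number)
    return text + UNITS[index]


if __name__ == "__main__":
    for value in (900, 999, 1000, 1010, 1024, 1536):
        print(value, "->", compact_format(value))
```

With this rule, 999 still prints as `999M`, while 1000, 1010 and 1024 all print as `1.0G`, so the string never grows past 4 characters. The trade-off is that values between 1000M and 1023M are rounded up to 1.0G; a two-decimal form such as 0.98G would be more precise but needs 5 characters.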

@rbonghi I'm not familiar with the code, but from what I can tell, it's not chart.py making these conversions between units, but some other code before this, right? I see common.py has some unit conversion code. Maybe that's where this new behavior needs to be implemented (if you agree that this solution is adequate, that is).