pydantic / logfire

Uncomplicated Observability for Python and beyond! 🪵🔥
https://logfire.pydantic.dev/docs/
MIT License

System Instrumentation: How to use properly #599

Open satwikkansal opened 1 week ago

satwikkansal commented 1 week ago

Question

I want to use logfire to push some system as well as process metrics, however it feels like the documentation could be more complete.

Based on the documentation, I put together this code

import logfire
from dotenv import load_dotenv

import time

load_dotenv()

# System-wide metrics (monitors entire system)
system_metrics = {
    # CPU metrics for whole system
    'system.cpu.simple_utilization': None,
    # System memory usage
    'system.memory.utilization': ['available', 'used'],
    # Disk I/O for all processes
    'system.disk.io': ['read', 'write'],
    # Network I/O for all processes
    'system.network.io': ['transmit', 'receive'],
    # System swap usage
    'system.swap.utilization': ['used']
}

# Process-specific metrics (only for your Python application)
process_metrics = {
    # CPU usage of this Python process
    'process.runtime.cpu.utilization': None,
    # Memory usage of this Python process
    'process.runtime.memory': ['rss', 'vms'],
    # Thread count of this Python process
    'process.runtime.thread_count': None,
    # File descriptors opened by this process
    'process.open_file_descriptor.count': None
}

logfire.configure()

while True:
    logfire.instrument_system_metrics(system_metrics, base=None)
    logfire.instrument_system_metrics(process_metrics, base=None)
    time.sleep(60)

My goal was

Is the above way the right way to do so?

While running this code, I get a couple of issues

  1. First of all, a warning of sorts
Attempting to instrument while already instrumented
An instrument with name process.runtime.cpython.cpu.utilization, type ObservableGauge, unit 1 and description Runtime CPU utilization has been created already.

I believe this occurs because of the while loop, but without the loop the script starts, exits immediately, and all I see on my dashboard is a single data point.

  2. And a bunch of errors
Callback failed for instrument system.swap.utilization.
Traceback (most recent call last):
  File "/Users/satwik/code/ongoing/cq/venv/lib/python3.11/site-packages/opentelemetry/sdk/metrics/_internal/instrument.py", line 136, in callback
    for api_measurement in callback(callback_options):
  File "/Users/satwik/code/ongoing/cq/venv/lib/python3.11/site-packages/opentelemetry/instrumentation/system_metrics/__init__.py", line 500, in _get_system_swap_utilization
    for metric in self._config["system.swap.utilization"]:
                  ~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^
KeyError: 'system.swap.utilization'
Callback failed for instrument system.disk.io.
Traceback (most recent call last):
  File "/Users/satwik/code/ongoing/cq/venv/lib/python3.11/site-packages/opentelemetry/sdk/metrics/_internal/instrument.py", line 136, in callback
    for api_measurement in callback(callback_options):
  File "/Users/satwik/code/ongoing/cq/venv/lib/python3.11/site-packages/opentelemetry/instrumentation/system_metrics/__init__.py", line 517, in _get_system_disk_io
    for metric in self._config["system.disk.io"]:
                  ~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^
KeyError: 'system.disk.io'
Callback failed for instrument system.network.io.
Traceback (most recent call last):
  File "/Users/satwik/code/freelance/cq/venv/lib/python3.11/site-packages/opentelemetry/sdk/metrics/_internal/instrument.py", line 136, in callback
    for api_measurement in callback(callback_options):
  File "/Users/satwik/code/ongoing/cq/venv/lib/python3.11/site-packages/opentelemetry/instrumentation/system_metrics/__init__.py", line 629, in _get_system_network_io
    for metric in self._config["system.network.dropped.packets"]:
                  ~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
KeyError: 'system.network.dropped.packets'
^CTraceback (most recent call last):
  File "/Users/satwik/code/ongoing/cq/system_metrics.py", line 40, in <module>
    time.sleep(60)
KeyboardInterrupt
Callback failed for instrument system.memory.utilization.
Traceback (most recent call last):
  File "/Users/satwik/code/ongoing/cq/venv/lib/python3.11/site-packages/opentelemetry/sdk/metrics/_internal/instrument.py", line 136, in callback
    for api_measurement in callback(callback_options):
  File "/Users/satwik/code/ongoing/cq/venv/lib/python3.11/site-packages/opentelemetry/instrumentation/system_metrics/__init__.py", line 472, in _get_system_memory_utilization
    for metric in self._config["system.memory.utilization"]:
                  ~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
KeyError: 'system.memory.utilization'

Am I missing any step or doing something incorrectly?

alexmojaki commented 1 week ago

logfire.instrument_system_metrics must only be called once. It sets up a loop in a background thread which exports metrics every 60 seconds, and once at the end of the process. The only reason to use a loop is to keep the process alive if it's doing nothing else, e.g.:

logfire.instrument_system_metrics()

while True:
    time.sleep(60)
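
If the process exists only to keep the exporter running, the loop above can be wrapped so that Ctrl-C ends it quietly instead of with the KeyboardInterrupt traceback seen in the original output. This is a minimal sketch: `keep_alive` and its `sleep` parameter are hypothetical names used for illustration, not part of logfire.

```python
import time
from typing import Callable


def keep_alive(interval: float = 60.0, sleep: Callable[[float], None] = time.sleep) -> int:
    """Block until Ctrl-C and return how many full sleeps completed.

    Hypothetical helper: call it once, after logfire.configure() and a single
    logfire.instrument_system_metrics(...) call, so the process stays alive.
    """
    completed = 0
    try:
        while True:
            sleep(interval)
            completed += 1
    except KeyboardInterrupt:
        # Return normally instead of letting the interrupt print a traceback.
        return completed
```

The `sleep` argument exists only so the loop can be exercised without waiting; in a real script `keep_alive()` with the defaults would be the last statement.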

> I want to have a separate process altogether to monitor system-wide metrics

I don't know if you really need this as opposed to just also exporting system-wide metrics from your main application processes. But if you do, then the two calls to logfire.instrument_system_metrics will be in separate processes so there won't be a problem. If you have a process whose only job is to report system-wide metrics then it's not really useful to measure its own process metrics.

If you want to instrument both process and system metrics within a single process, then call instrument_system_metrics once with a single dict combining both.
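
A sketch of that single-call setup, reusing the metric names from the original post. The `combine` and `instrument_once` helpers are hypothetical names introduced here; the only logfire calls used are `logfire.configure()` and `logfire.instrument_system_metrics(config, base=None)`, as in the question.

```python
from typing import Dict, List, Optional

MetricsConfig = Dict[str, Optional[List[str]]]

# System-wide metrics, copied from the original post.
SYSTEM_METRICS: MetricsConfig = {
    'system.cpu.simple_utilization': None,
    'system.memory.utilization': ['available', 'used'],
    'system.disk.io': ['read', 'write'],
    'system.network.io': ['transmit', 'receive'],
    'system.swap.utilization': ['used'],
}

# Process-specific metrics, copied from the original post.
PROCESS_METRICS: MetricsConfig = {
    'process.runtime.cpu.utilization': None,
    'process.runtime.memory': ['rss', 'vms'],
    'process.runtime.thread_count': None,
    'process.open_file_descriptor.count': None,
}


def combine(*configs: MetricsConfig) -> MetricsConfig:
    """Merge metric configs into one dict, rejecting duplicate metric names."""
    combined: MetricsConfig = {}
    for config in configs:
        overlap = combined.keys() & config.keys()
        if overlap:
            raise ValueError(f'metric configured twice: {sorted(overlap)}')
        combined.update(config)
    return combined


def instrument_once() -> None:
    """Configure logfire and register every metric with one call (hypothetical helper)."""
    import logfire  # imported lazily so the helpers above work without logfire installed

    logfire.configure()
    logfire.instrument_system_metrics(combine(SYSTEM_METRICS, PROCESS_METRICS), base=None)
```

Calling `instrument_once()` a single time at startup, followed by whatever keeps the process alive, replaces the two separate `instrument_system_metrics` calls from the question.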

satwikkansal commented 1 week ago

Thanks!

Any ideas about the errors below

Traceback (most recent call last):
  File "/Users/satwik/code/freelance/ongoing/cq/loee/venv/lib/python3.11/site-packages/opentelemetry/sdk/metrics/_internal/instrument.py", line 136, in callback
    for api_measurement in callback(callback_options):
  File "/Users/satwik/code/freelance/ongoing/cq/loee/venv/lib/python3.11/site-packages/opentelemetry/instrumentation/system_metrics/__init__.py", line 629, in _get_system_network_io
    for metric in self._config["system.network.dropped.packets"]:
                  ~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
KeyError: 'system.network.dropped.packets'
Callback failed for instrument system.swap.utilization.
Traceback (most recent call last):
  File "/Users/satwik/code/freelance/ongoing/cq/loee/venv/lib/python3.11/site-packages/opentelemetry/sdk/metrics/_internal/instrument.py", line 136, in callback
    for api_measurement in callback(callback_options):
  File "/Users/satwik/code/freelance/ongoing/cq/loee/venv/lib/python3.11/site-packages/opentelemetry/instrumentation/system_metrics/__init__.py", line 500, in _get_system_swap_utilization
    for metric in self._config["system.swap.utilization"]:
                  ~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^
KeyError: 'system.swap.utilization'
Callback failed for instrument system.disk.io.
Traceback (most recent call last):
  File "/Users/satwik/code/freelance/ongoing/cq/loee/venv/lib/python3.11/site-packages/opentelemetry/sdk/metrics/_internal/instrument.py", line 136, in callback
    for api_measurement in callback(callback_options):
  File "/Users/satwik/code/freelance/ongoing/cq/loee/venv/lib/python3.11/site-packages/opentelemetry/instrumentation/system_metrics/__init__.py", line 517, in _get_system_disk_io
    for metric in self._config["system.disk.io"]:
                  ~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^
KeyError: 'system.disk.io'
Callback failed for instrument system.network.io.

I still get them

alexmojaki commented 1 week ago

> Thanks!
>
> Any ideas about the errors below
>
> Traceback (most recent call last):
>   File "/Users/satwik/code/freelance/ongoing/cq/loee/venv/lib/python3.11/site-packages/opentelemetry/sdk/metrics/_internal/instrument.py", line 136, in callback
>     for api_measurement in callback(callback_options):
>   File "/Users/satwik/code/freelance/ongoing/cq/loee/venv/lib/python3.11/site-packages/opentelemetry/instrumentation/system_metrics/__init__.py", line 629, in _get_system_network_io
>     for metric in self._config["system.network.dropped.packets"]:
>                   ~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> KeyError: 'system.network.dropped.packets'

Reported https://github.com/open-telemetry/opentelemetry-python-contrib/issues/3005

> Callback failed for instrument system.swap.utilization.
> Traceback (most recent call last):
>   File "/Users/satwik/code/freelance/ongoing/cq/loee/venv/lib/python3.11/site-packages/opentelemetry/sdk/metrics/_internal/instrument.py", line 136, in callback
>     for api_measurement in callback(callback_options):
>   File "/Users/satwik/code/freelance/ongoing/cq/loee/venv/lib/python3.11/site-packages/opentelemetry/instrumentation/system_metrics/__init__.py", line 500, in _get_system_swap_utilization
>     for metric in self._config["system.swap.utilization"]:
>                   ~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^
> KeyError: 'system.swap.utilization'
> Callback failed for instrument system.disk.io.
> Traceback (most recent call last):
>   File "/Users/satwik/code/freelance/ongoing/cq/loee/venv/lib/python3.11/site-packages/opentelemetry/sdk/metrics/_internal/instrument.py", line 136, in callback
>     for api_measurement in callback(callback_options):
>   File "/Users/satwik/code/freelance/ongoing/cq/loee/venv/lib/python3.11/site-packages/opentelemetry/instrumentation/system_metrics/__init__.py", line 517, in _get_system_disk_io
>     for metric in self._config["system.disk.io"]:
>                   ~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^
> KeyError: 'system.disk.io'
> Callback failed for instrument system.network.io.

These are not the same kind of mismatch. I can't reproduce these errors if I only call instrument_system_metrics once. What code did you run?

alexmojaki commented 1 week ago

Added a docs label for us to make it clearer that instrument_system_metrics should only be called once.

satwikkansal commented 1 week ago

import logfire
from dotenv import load_dotenv

import time

load_dotenv()

# System-wide metrics (monitors entire system)
system_metrics = {
    # CPU metrics for whole system
    'system.cpu.simple_utilization': None,
    # System memory usage
    'system.memory.utilization': ['available', 'used'],
    # Disk I/O for all processes
    'system.disk.io': ['read', 'write'],
    # Network I/O for all processes
    'system.network.io': ['transmit', 'receive'],
    # System swap usage
    'system.swap.utilization': ['used']
}

# Process-specific metrics (only for your Python application)
process_metrics = {
    # CPU usage of this Python process
    'process.runtime.cpu.utilization': None,
    # Memory usage of this Python process
    'process.runtime.memory': ['rss', 'vms'],
    # Thread count of this Python process
    'process.runtime.thread_count': None,
    # File descriptors opened by this process
    'process.open_file_descriptor.count': None
}

logfire.configure()
logfire.instrument_system_metrics(system_metrics, base=None)
# logfire.instrument_system_metrics(process_metrics, base=None)

while True:
    # needed to keep the process alive
    time.sleep(60)

This is my code. You probably have to wait a couple of minutes for the errors to start showing up.

Operating system: macOS 14.1.1, M1 chipset
Logfire version: tried on both 1.01 and 2.3.0

alexmojaki commented 1 week ago

That only gives me KeyError: 'system.network.dropped.packets'

satwikkansal commented 5 days ago

Yes, you're correct. I might have been instrumenting both system_metrics and process_metrics, thinking they were mutually exclusive. If I call instrument_system_metrics only once, the only error left is KeyError: 'system.network.dropped.packets'.
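
For readers landing here: the traceback shows the `system.network.io` callback (`_get_system_network_io`) indexing the config with a different key, `system.network.dropped.packets`, which would explain why the error appears even though the config only lists `system.network.io`. The pattern can be reproduced in isolation with a toy stand-in. `ToyInstrumentor` below is hypothetical and only mimics the lookup visible in the traceback; it is not the OpenTelemetry implementation.

```python
from typing import Dict, List, Optional


class ToyInstrumentor:
    """Hypothetical stand-in that mimics only the lookup shown in the traceback."""

    def __init__(self, config: Dict[str, Optional[List[str]]]) -> None:
        self._config = config

    def get_network_io(self) -> List[str]:
        # Looks up the fields of a different metric than the one being observed,
        # so this raises KeyError unless that other metric is also configured.
        return list(self._config['system.network.dropped.packets'])


def reproduce() -> str:
    """Return the missing key reported when only 'system.network.io' is configured."""
    toy = ToyInstrumentor({'system.network.io': ['transmit', 'receive']})
    try:
        toy.get_network_io()
    except KeyError as exc:
        return str(exc.args[0])
    return ''
```

Under this reading, any config that enables `system.network.io` without `system.network.dropped.packets` would hit the KeyError on every collection cycle; the upstream issue linked earlier in the thread tracks the actual fix.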