Feature Request: Log Memory-Related Metrics (CPU, GPU, RAM) During ML-Agents Training

Description

When training agents with the mlagents-learn command, there is currently no built-in way to log system resource metrics (CPU usage, GPU usage, RAM usage). Tracking these metrics is important for spotting performance bottlenecks and tuning the training setup.

Proposed Solution

Since we don’t have direct access to the training loop while running mlagents-learn, I propose the following workaround to log these metrics in real time:

Steps

1. Install Required Libraries

Install the necessary libraries for monitoring system resources:

pip install psutil GPUtil

2. Create a Separate Monitoring Script

This script runs in parallel with the ML-Agents training process and logs CPU, RAM, and GPU usage to a CSV file. Save it as log_memory_metrics.py.

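import time

import GPUtil
import psutil


def log_system_metrics(log_path="memory_metrics.log"):
    log_interval = 5  # Seconds between samples
    with open(log_path, "w") as log_file:
        log_file.write("Time,CPU_Usage(%),RAM_Usage(%),GPU_Usage(%)\n")
        while True:
            # CPU usage averaged over a 1-second window (this call blocks for 1 s)
            cpu_usage = psutil.cpu_percent(interval=1)
            ram_usage = psutil.virtual_memory().percent

            # Load of the first GPU only; GPUtil reports NVIDIA GPUs via nvidia-smi
            # and returns an empty list when none is found
            gpus = GPUtil.getGPUs()
            gpu_usage = gpus[0].load * 100 if gpus else 0

            # Log metrics and flush so the file can be read while training runs
            log_file.write(f"{time.time()},{cpu_usage},{ram_usage},{gpu_usage}\n")
            log_file.flush()
            print(f"CPU: {cpu_usage}% | RAM: {ram_usage}% | GPU: {gpu_usage}%")

            # The CPU sample already took 1 s, so sleep for the remainder
            time.sleep(max(0, log_interval - 1))


if __name__ == "__main__":
    try:
        log_system_metrics()
    except KeyboardInterrupt:
        print("Monitoring stopped.")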
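
Once a run has finished, the log can be condensed into a few numbers. The following is a minimal sketch, not part of ML-Agents: the function name summarize_metrics is hypothetical, and it assumes only the CSV header written by the monitoring script (a Time column followed by numeric metric columns).

```python
import csv
from statistics import mean


def summarize_metrics(log_path="memory_metrics.log"):
    """Return {column: (mean, max)} for every metric column in the CSV log."""
    with open(log_path, newline="") as f:
        rows = list(csv.DictReader(f))
    if not rows:
        return {}  # Header only, no samples yet

    summary = {}
    for column in rows[0]:
        if column == "Time":
            continue  # Timestamps are not a usage metric
        values = [float(row[column]) for row in rows]
        summary[column] = (mean(values), max(values))
    return summary
```

Calling summarize_metrics() in the directory that contains memory_metrics.log returns a dictionary keyed by column name, which makes it easy to compare average and peak usage across runs.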

3. Run the Monitoring Script Alongside ML-Agents Training

• Start the ML-Agents training:

mlagents-learn config/ppo/Custom_SoccerTwos.yaml --run-id=Soccer_twos_ppo_1 --no-graphics --env=builds/soccer_twos_v1/SoccerTwos.app

• Run the monitoring script in a separate terminal:

python log_memory_metrics.py
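
If two terminals are inconvenient, both steps could be driven from a single launcher. The sketch below makes several assumptions: run_with_monitor and its parameters are hypothetical names, and the sampling logic is passed in as a plain callable so that the launcher itself needs only the standard library.

```python
import subprocess


def run_with_monitor(command, sample, interval=5.0):
    """Run `command` and call `sample()` about every `interval` seconds until it exits.

    Returns the exit code of the command.
    """
    process = subprocess.Popen(command)
    try:
        while process.poll() is None:
            sample()
            try:
                # Wait out the interval, but return early if training ends
                process.wait(timeout=interval)
            except subprocess.TimeoutExpired:
                pass  # Still training; take the next sample
    except KeyboardInterrupt:
        # When run from a terminal, Ctrl+C normally reaches the child as well;
        # wait so it can shut down on its own
        process.wait()
    return process.returncode
```

Here command would be the mlagents-learn invocation from the first bullet, split into a list of arguments, and sample would be a function that records one row of CPU, RAM, and GPU usage. Monitoring then stops automatically when training exits.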

Optional: TensorBoard Logging

For users who prefer visualizing metrics in TensorBoard, the monitoring script can write the same metrics with SummaryWriter instead of a CSV file:

import time

import GPUtil
import psutil
from torch.utils.tensorboard import SummaryWriter


def log_system_metrics():
    writer = SummaryWriter(log_dir="results/memory_metrics")
    step = 0
    try:
        while True:
            cpu_usage = psutil.cpu_percent(interval=1)
            ram_usage = psutil.virtual_memory().percent
            gpus = GPUtil.getGPUs()
            gpu_usage = gpus[0].load * 100 if gpus else 0

            # Log to TensorBoard; step counts samples, not training steps
            writer.add_scalar("CPU Usage", cpu_usage, step)
            writer.add_scalar("RAM Usage", ram_usage, step)
            writer.add_scalar("GPU Usage", gpu_usage, step)

            print(f"CPU: {cpu_usage}% | RAM: {ram_usage}% | GPU: {gpu_usage}%")
            time.sleep(5)
            step += 1
    except KeyboardInterrupt:
        print("Monitoring stopped.")
    finally:
        # Flush pending events and close the writer on exit
        writer.close()


if __name__ == "__main__":
    log_system_metrics()

Conclusion

This feature would allow real-time logging and tracking of CPU, GPU, and RAM usage without interfering with the ML-Agents training process. Monitoring resource usage during training helps identify bottlenecks and shows whether the hardware is being used effectively.

Additional Context

This workaround avoids modifying the training loop within ML-Agents and provides flexibility in how users can monitor their hardware resource usage during training. It’s particularly useful for larger-scale projects where system resource optimization is critical.

huypham37 / AIML-UM-14