Feature Request: Log Memory-Related Metrics (CPU, GPU, RAM) During ML-Agents Training
Description
When using the mlagents-learn command to train agents, there is currently no built-in way to log system resource metrics (CPU usage, GPU usage, RAM usage) during training. Tracking these metrics shows whether a run is limited by CPU, GPU, or memory, which helps when tuning the training setup.
Proposed Solution
Since we don’t have direct access to the training loop while running mlagents-learn, I propose the following workaround to log memory-related metrics in real time:
Steps
1. Install Required Libraries
Install the necessary libraries for monitoring system resources:
pip install psutil GPUtil
2. Create a Separate Monitoring Script
This script will run in parallel with the ML-Agents training process and log system memory metrics.
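import time

import GPUtil
import psutil


def log_system_metrics(log_path="memory_metrics.log", log_interval=5):
    """Write one CSV row of CPU, RAM, and GPU usage every `log_interval` seconds."""
    with open(log_path, "w") as log_file:
        log_file.write("Time,CPU_Usage(%),RAM_Usage(%),GPU_Usage(%)\n")
        while True:
            # CPU usage averaged over a 1-second window, and RAM currently in use
            cpu_usage = psutil.cpu_percent(interval=1)
            ram_usage = psutil.virtual_memory().percent
            # Load of the first GPU; 0 if GPUtil finds no NVIDIA GPU
            gpus = GPUtil.getGPUs()
            gpu_usage = gpus[0].load * 100 if gpus else 0
            # Write the row and flush so the file can be followed while training runs
            log_file.write(f"{time.time()},{cpu_usage},{ram_usage},{gpu_usage}\n")
            log_file.flush()
            print(f"CPU: {cpu_usage}% | RAM: {ram_usage}% | GPU: {gpu_usage}%")
            # cpu_percent already blocked for 1 second, so sleep only the remainder
            time.sleep(max(log_interval - 1, 0))


if __name__ == "__main__":
    try:
        log_system_metrics()
    except KeyboardInterrupt:
        pass  # Ctrl+C stops the monitor once training has finished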
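Once the monitor has written some rows, the log can be reduced to a few numbers per run. The following is a minimal sketch, assuming the CSV header written by the script above; `summarize_metrics` is a hypothetical helper name, not part of ML-Agents, and it uses only the standard library.

```python
import csv
from statistics import mean


def summarize_metrics(log_path="memory_metrics.log"):
    """Return the mean and peak of each usage column in the CSV log."""
    columns = ["CPU_Usage(%)", "RAM_Usage(%)", "GPU_Usage(%)"]
    with open(log_path, newline="") as log_file:
        rows = list(csv.DictReader(log_file))
    if not rows:
        return {}  # header only: the monitor has not written a sample yet
    summary = {}
    for column in columns:
        values = [float(row[column]) for row in rows]
        summary[column] = {"mean": mean(values), "peak": max(values)}
    return summary
```

Peak RAM is probably the most useful figure here, since it indicates how close a run came to exhausting memory.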
3. Run the Monitoring Script Alongside ML-Agents Training
• Start the ML-Agents training:
• Run the monitoring script in a separate terminal:
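As an alternative to two terminals, both steps can be driven from one Python process. This is a rough sketch, not something ML-Agents provides: `monitor_while_running` and its `sample` argument are hypothetical names, and `sample` stands for any zero-argument function that returns one reading (for example one built on psutil).

```python
import subprocess
import time


def monitor_while_running(command, sample, interval=5.0):
    """Run `command` as a child process and call `sample()` until it exits."""
    readings = []
    process = subprocess.Popen(command)
    try:
        while process.poll() is None:  # None means the child is still running
            readings.append((time.time(), sample()))
            try:
                # Returns early if the child exits before the interval elapses
                process.wait(timeout=interval)
            except subprocess.TimeoutExpired:
                pass  # still running: take the next sample
    finally:
        if process.poll() is None:
            # An error or Ctrl+C in the monitor also stops the child process
            process.terminate()
            process.wait()
    return process.returncode, readings
```

For training, `command` would be a list such as `["mlagents-learn", "config.yaml", "--run-id=my_run"]`, where `config.yaml` and `my_run` are placeholders for your own trainer config and run ID.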
Optional: TensorBoard Logging
For users who prefer visualizing metrics in TensorBoard, modify the script to log these metrics there as well.
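One possible shape for that modification is sketched below. It assumes a writer object with `add_scalar` and `flush` methods, which `SummaryWriter` from `torch.utils.tensorboard` provides. `log_to_tensorboard`, the `sample` callable, and the `System/` tag prefix are illustrative choices. The step value is the sample index, not the ML-Agents training step.

```python
import time


def log_to_tensorboard(writer, sample, steps, interval=5.0):
    """Write `steps` samples to TensorBoard through `writer`.

    `sample` is a zero-argument callable returning a dict such as
    {"CPU_Usage": 12.5, "RAM_Usage": 40.1, "GPU_Usage": 0.0}.
    """
    for step in range(steps):
        for name, value in sample().items():
            # A shared "System/" prefix groups the charts in one TensorBoard section
            writer.add_scalar(f"System/{name}", value, step)
        writer.flush()  # make the new points visible without waiting for close()
        if step < steps - 1:
            time.sleep(interval)
```

If PyTorch is installed, `SummaryWriter(log_dir=...)` can serve as the writer. Pointing `log_dir` at a subdirectory of the folder passed to `tensorboard --logdir` should place these charts next to the training curves.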
Conclusion
This feature would allow real-time logging and tracking of memory metrics without interfering with the ML-Agents training process. Having the ability to monitor system resource usage can provide valuable insights into system performance during model training.
Additional Context
This workaround avoids modifying the training loop within ML-Agents and provides flexibility in how users can monitor their hardware resource usage during training. It’s particularly useful for larger-scale projects where system resource optimization is critical.