thediymaker / slurm-node-dashboard

Slurm HPC node status page
GNU General Public License v3.0

HPC Dashboard


Powerful monitoring for your SLURM-based HPC cluster

The HPC Dashboard is a Next.js application for monitoring the nodes of a Slurm cluster. It is built with a focus on performance and usability and shows the state of your HPC resources in real time.

Dashboard Screenshot

Key Features

Core Functionality

- Real-time monitoring of CPU and GPU node utilization
- Detailed individual node status
- Comprehensive Slurm job details and history
- Dynamic data updates with refresh countdown

Advanced Integrations

Enable these features by configuring your environment file:

- LMOD module display and details
- Prometheus metrics integration
- OpenAI-powered insights

Quick Start

```bash
git clone https://github.com/thediymaker/slurm-node-dashboard.git
cd slurm-node-dashboard
npm install
# Set up your .env file (see Configuration section)
npm run dev
```

Visit http://localhost:3000 to see your dashboard in action.

Detailed Setup

Prerequisites

- Node.js (v18 or later)
- npm or Yarn
- PM2 (for production deployment)
- Slurm API (enabled and configured)
- Slurm API token

Enabling the Slurm API

To use this dashboard, the Slurm REST API must be enabled on your HPC cluster. Follow these steps to set it up:

1. Start by reviewing the [SchedMD quickstart guide](https://slurm.schedmd.com/rest_quickstart.html).
2. Ensure that `slurmrestd` is running on your cluster.
3. Once the Slurm API is running, generate an API key (a token) for authentication.

### Generating an API Key

The API key needs permissions to read all data. Here's an example of generating a key for the `slurm` user with a lifespan of 1 year (31536000 seconds):

```bash
scontrol token username=slurm lifespan=31536000
```

Note: This generates a JWT. You can read the expiration date from the token and set up a reminder to renew it, or automate the renewal process (optionally with a shorter lifespan). Displaying the token's expiration is planned for a future admin section of the dashboard.

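The note above mentions reading the expiration date from the token. The following is a minimal sketch of how that could be done from the shell, together with a manual connectivity check against `slurmrestd`. The function names `jwt_exp` and `slurm_ping` are hypothetical helpers, not part of the dashboard, and the sketch assumes GNU `base64` and `curl` are installed. The `X-SLURM-USER-NAME` and `X-SLURM-USER-TOKEN` headers are the ones `slurmrestd` accepts for JWT authentication.

```shell
#!/bin/bash

# Hypothetical helper: print the "exp" claim (expiry, Unix seconds) from a JWT.
jwt_exp() {
  local payload
  # The payload is the second dot-separated field, base64url-encoded.
  payload=$(printf '%s' "$1" | cut -d. -f2 | tr '_-' '/+')
  # base64url drops the trailing '=' padding; restore it before decoding.
  while [ $(( ${#payload} % 4 )) -ne 0 ]; do payload="${payload}="; done
  printf '%s' "$payload" | base64 -d |
    sed -n 's/.*"exp"[[:space:]]*:[[:space:]]*\([0-9][0-9]*\).*/\1/p'
}

# Hypothetical helper: ping slurmrestd with token authentication.
# Arguments: base URL, API version, user name, token.
slurm_ping() {
  curl -sf --max-time 10 \
    -H "X-SLURM-USER-NAME: $3" \
    -H "X-SLURM-USER-TOKEN: $4" \
    "$1/slurm/$2/ping"
}
```

With GNU `date`, `date -d "@$(jwt_exp "$SLURM_API_TOKEN")"` prints the expiry in readable form, and `slurm_ping "http://your-slurm-server:port" v0.0.40 slurm "$SLURM_API_TOKEN"` should return a JSON response if the token is accepted. `scontrol token` prints its result as `SLURM_JWT=<token>`; `jwt_exp` works with or without that prefix, but only the part after `=` should be sent to the API.
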
Configuration

Create a `.env` file in the root directory:

```env
COMPANY_NAME="Your Company"
CLUSTER_NAME="Your Cluster"
CLUSTER_LOGO="/path/to/logo.png"
NEXT_PUBLIC_BASE_URL="http://your-domain.com"

# Optional integrations
PROMETHEUS_URL=""
OPENAI_API_KEY=""

# Slurm configuration
SLURM_API_VERSION="v0.0.40"
SLURM_SERVER="http://your-slurm-server:port"
SLURM_API_TOKEN="your-slurm-api-token"

# Development settings
NODE_ENV="production"
REACT_EDITOR="code"
```

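Before starting the dashboard it can help to confirm that the Slurm settings are actually filled in. The sketch below reports keys that are missing or empty in an env file. `missing_env_keys` is a hypothetical helper, and treating the three `SLURM_*` variables as required is an assumption based on the "Optional integrations" label in the example above, not something the dashboard enforces in this way.

```shell
#!/bin/bash

# Hypothetical helper: print each KEY that is missing or empty in FILE.
# Usage: missing_env_keys FILE KEY...
missing_env_keys() {
  local file=$1 key value
  shift
  for key in "$@"; do
    # Take the last assignment of KEY, then strip surrounding double quotes.
    value=$(sed -n "s/^${key}=//p" "$file" | tail -n 1 | sed 's/^"\(.*\)"$/\1/')
    [ -n "$value" ] || printf '%s\n' "$key"
  done
}
```

For example, `missing_env_keys .env SLURM_API_VERSION SLURM_SERVER SLURM_API_TOKEN` prints nothing when all three values are set, and otherwise lists the keys that still need a value.
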
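Production Deployment

For production environments, we recommend using PM2:

```bash
npm install -g pm2
# Build the Next.js app first; "npm start" serves the production build
npm run build
pm2 start npm --name "hpc-dashboard" -- start
# Prints a command to run once so that PM2 itself starts at boot
pm2 startup
pm2 save
```

PM2 keeps the dashboard running and restarts it if it crashes. With `pm2 startup` configured and the process list saved with `pm2 save`, the dashboard also comes back automatically after a server reboot.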

Advanced Usage

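Custom Data Collection

### Historical Node Data

Collect historical node data with this script (run hourly via cron):

```bash
#!/bin/bash
set -o pipefail
SAVE_DIR="/path/to/data/directory"
mkdir -p "$SAVE_DIR"
# Use UTC so the timestamp matches the "Z" suffix in the filename
FILENAME=$(date -u +"%Y-%m-%dT%H-%M-%S.000Z.json.gz")
# -f makes curl fail on HTTP errors; discard the snapshot if the request fails
curl -sf "http://localhost:3000/api/slurm/nodes" | gzip > "$SAVE_DIR/$FILENAME" || rm -f "$SAVE_DIR/$FILENAME"
# Delete snapshots older than 30 days
find "$SAVE_DIR" -name "*.json.gz" -type f -mtime +30 -delete
```

### Module Data

Collect module data with this script (run daily via cron):

```bash
#!/bin/bash
json_dir="/path/to/public/directory"
json_output="${json_dir}/modules.json"
mkdir -p "$json_dir"

export MODULESHOME="/usr/share/lmod/lmod"
export MODULEPATH="/your/module/path"
# cron does not load the Lmod shell init, so LMOD_DIR may be unset;
# this default assumes a standard Lmod layout
export LMOD_DIR="${LMOD_DIR:-$MODULESHOME/libexec}"

"$LMOD_DIR/spider" -o jsonSoftwarePage "$MODULEPATH" | python3 -m json.tool > "$json_output"
```
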
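To inspect what the hourly collection has saved, something like the following sketch could be used. `latest_snapshot` is a hypothetical helper, not part of the dashboard. It relies on the timestamped filenames produced by the historical node data script, which sort chronologically when sorted as text.

```shell
#!/bin/bash

# Hypothetical helper: print the path of the newest snapshot in a directory.
latest_snapshot() {
  # Filenames start with an ISO-like timestamp, so lexical order is chronological.
  find "$1" -maxdepth 1 -type f -name '*.json.gz' | sort | tail -n 1
}
```

For example, `gzip -dc "$(latest_snapshot /path/to/data/directory)" | python3 -m json.tool | head` shows the first lines of the most recent snapshot.
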
Open OnDemand Integration

To integrate this dashboard with Open OnDemand:

1. Clone the generic Ruby app template:

   ```
   git clone https://github.com/thediymaker/ood-status-iframe.git
   ```

2. Navigate to the cloned repository:

   ```
   cd ood-status-iframe
   ```

3. Open the `views/layout.erb` file in your preferred text editor.
4. Update the URL in the `views/layout.erb` file to point to your deployed HPC Dashboard.