grondo / sqlog

SLURM job completion log database and query tool
GNU General Public License v2.0
9 stars 5 forks source link

The sqlog package contains a set of scripts useful for creating, populating, and issuing queries to a SLURM job log database.

COMPONENTS

sqlog The "SLURM Query Log" utility. Provides a single interface to query jobs from the SLURM job log database and/or current queue of running jobs.

slurm-joblog Logs completed jobs using SLURM's jobcomp/script interface to the SLURM job log database and an optional text file.

sqlog-db-util Administrative utility used to create SLURM job log database and its corresponding users. Also provides an interface to "backfill" the database using existing SLURM joblog files created by the jobcomp/filetxt plugin.

sqlog.conf World-readable config file. Contains local configuration for SQL host, read-only user, and read-only password.

slurm-joblog.conf Private configuration for slurm-joblog script (also used by by sqlog-db-util). Contains SQL read-write user and password, root user passwd (for sqlog-db-util) and a list of hosts that should have RW access to DB.

CONFIGURATION

For fully-automated operation, both the /etc/slurm/sqlog.conf and /etc/slurm/slurm-joblog.conf must exist. These files are read using perl's do() function, so the files can and must be valid perl. This allows a bit of scripting to get the values if necessary. (See the sqlog doc directory for examples).

The available variables in each config file include:

sqlog.conf:

SQLHOST SQL server hostname (default = sqlhost) SQLUSER Read-only user (default = slurm_read) SQLPASS Read-only password (default = none) SQLDB DB name (default = slurm) TRACKNODES Set to 0 to disable per-job node tracking (default = 1) %FORMATS Hash of format aliases (e.g. "f1" => "jid,name,user,state")

slurm-joblog.conf:

SQLUSER Read-write user (default = slurm) SQLPASS Read-write password (not set) SQLROOTPASS DB root password (not set) @SQLRWHOSTS Read-write hosts (array of hosts to give rw access) JOBLOGFILE txt joblog location (set if you want to log to a file too) AUTOCREATE Attempt to create DB if it doesn't yet exist the first time slurm-joblog is run (default = no).

CREATING JOB LOG DATABASE

Once the config files exist, the following command will create the SLURM job log database:

sqlog-db-util --create

If you have existing text joblog files you'd like to seed the new DB with, use

sqlog-db-util --backfill [FILE]...

e.g.

sqlog-db-util --backfill /var/log/slurm/joblog*

If AUTOCREATE is set in slurm-joblog.conf, then sqlog-db-util --create will be automatically run the first time the database is accessed.

CONVERTING JOB LOG DATABASE

The database schema changed from v0.12 to v0.13 of the sqlog package. The highest schema version currently running on a system can be determined from the --info output.

To create tables for the new schema, run:

sqlog-db-util --create

Once created, the slurm-joblog.pl script will detect the new schema and automatically switch to insert records to the new tables. The sqlog command will query both schemas for records.

To copy existing data from the old schema to the new schema, use the --convert option.

Speeding up the conversion: The new schema tracks the nodes that each job uses so that sqlog queries involving nodes names return much faster. The data and indicies associated with this node tracking can significantly slow down the conversion operation when converting a large number of records. There are two options to speed this up:

1) Disable node-tracking for all converted jobs via the --notrack option.
2) Delay indexing of converted data via the --delay-index option.

With the --notrack option, no node-tracking data will be stored for jobs inserted via conversion. As such, if node-tracking is enabled on the system, such jobs will not return in queries involving node names. Newly inserted jobs will still have node-tracking data.

With the --delay-index option, node tracking indicies are removed before data is converted, and they are restored when the conversion completes. Queries involving node names while there are no indicies will take a very long time to return on a large database.

For a database on Atlas, which had 580,000 jobs spanning two years the conversion took:

13 minutes for:  sqlog-db-util --convert --notrack
33 minutes for:  sqlog-db-util --convert --delay-index
85 minutes for:  sqlog-db-util --convert

The recommended method is to use --delay-index.

It's also possible to disable node-tracking in the new schema completely. To do this, add the following line to the sqlog.conf file.

$TRACKNODES=0;

Number of allocated cores: The new schema adds a new field to record the number of cores allocated to a job. This data was not captured in the version 1 schema. However, on many systems, this core count can be computed. On systems that have the same number of cores per node and allocate whole nodes to a job, one may use the --cores-per-node option to specify the number of cores per node. This --cores-per-node value is multiplied with the node count recorded in the version 1 schema to determine the number of cores allocated to the job. For example, to convert from schema version 1 to version 2 on a machine that has 8 cores per node and allocates whole nodes to jobs, run the following command:

sqlog-db-util --convert --cores-per-node=8

For all other systems, do not specify --cores-per-node. In this case, the number of cores allocated will be set to 0. The conversion command on these systems is simply:

sqlog-db-util --convert

If a mistake is made during conversion, you can drop the version 2 tables and start from scratch (be very careful to specify '2' and not '1' here):

sqlog-db-util --drop=2

You may issue the --convert command on a live system, however, be careful to specify the command correctly in this case. The slurm-joblog.pl script will insert records to the new schema as soon as it is created. If a mistake is made during conversion, and the version 2 tables must be dropped and recreated, any records inserted by slurm-joblog.pl will be lost.

After conversion, sqlog may report duplicate records as it finds matches from both the version 1 and version 2 tables. Once converted, it's recommended that the version 1 tables be dropped by running the following command (be very careful to specify '1' and not '2' here):

sqlog-db-util --drop=1

Finally, here is a full example set of commands to create the new schema and convert records to it:

sqlog-db-util -v --create sqlog-db-util -v --backup=all schema1_jobs.log sqlog-db-util -v --convert --delay-index --cores-per-node=8 sqlog-db-util -v --drop=1

BACKING UP AND PRUNING THE DATABASE

It is possible to dump records from the job log database into a text file, which can then be read in via --backfill. This is useful to capture a text file backup of the logs. One must specify the time period as either "all", "DATE", or "DATE..DATE", to dump all jobs, jobs before a given date, and jobs that started between two dates, respectively. DATE should be specified with the 'YYYY-MM-DD HH:MM:SS' format, e.g.,

sqlog-db-util -v --backup='2009-01-01 00:00:00'..'2009-02-01 00:00:00'\ logs.txt

One utility of this backup option is to share job log records with others potentially outside of the organization. Typically, one would like to protect user and job names when sharing such information. For this, an --obfuscate option is available which dumps records and modifies user names to be of the form "user_X", userids to match "X", and job names to be of the form "job_Y", where X and Y are numbers.

Finally, over a long period of time, the database may gather so many records that is slows down significantly. A --prune option is available to remove old records. One specifies a date, and all jobs which started before that date will be removed from the database and written to a file name specified by the user, e.g.,

sqlog-db-util -v --prune='2007-01-01 00:00:00' pre2007.log

ENABLE JOB LOGGING

To enable the SLURM job log database, the following configuration options must be set in slurm.conf:

JobCompType = jobcomp/script JobCompLoc = /usr/libexec/sqlog/slurm-joblog

Adjust the path if the sqlog RPM was installed with a different PREFIX. This has only been tested on SLURM 1.2.10 or greater.

Restart slurmctld and slurm-joblog will begin logging jobs as they complete.

$Id$