kasmtech / workspaces-issues

Kasm database is unable to perform cleanup under heavy load #102

Open rmichak opened 2 years ago

rmichak commented 2 years ago

We have noticed that our Kasm database fills up under heavy load. This seems to be directly related to the amount of traffic/activity on the manager/server.

mmcclaskey commented 2 years ago

@rmichak, we have confirmed an issue with larger installations in the hundreds of users. Kasm uses the Postgres database for log storage and analysis, with a default retention of 4 hours for debug logs and 7 days for all other logs. A cleanup routine runs once per hour and deletes logs older than the configured retention policy. At a certain scale, deadlocks occur during the cleanup, which causes an exception to be thrown. The cleanup routine then stops, which avoids the outages or slow performance that a never-ending cleanup could cause. Unfortunately, that means logs are never cleared and your database server's volume eventually fills up.

Ultimately, large-scale enterprise deployments should completely disable the built-in logging and instead enable the Splunk HEC forwarder or install other SIEM agents on the systems to forward the logs. Unfortunately, Kasm 1.10.0 and below do not allow you to completely disable the built-in logging. The Kasm 1.11.0 release will allow you to disable debug logs and all other logs independently by setting the corresponding retention to 0. Additionally, the cleanup routine has been modified to avoid deadlocks and tuned to support 2.8 million logs per hour. For installations with hundreds of users or more, we would still recommend disabling debug logs.
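To see which retention values an installation is currently using, the settings can be read directly from the database. This is only a sketch: the settings table, its name and value columns, and the log_retention and debug_retention names are taken from the cleanup script later in this comment, and the function name show_retention is made up for illustration.

```shell
#!/bin/bash

# Hypothetical helper: print the two retention settings used by the cleanup.
# Judging by the intervals in the cleanup query, log_retention is in days
# and debug_retention is in hours.
show_retention() {
    docker exec kasm_db psql kasm kasmapp -c \
        "SELECT name, value FROM settings WHERE name IN ('log_retention', 'debug_retention');"
}

# Usage, from the database server (may need sudo, as with the truncate command):
#   show_retention
```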

For now, larger installations can create a cron job to execute the following script every 15 minutes. This script has been developed to mimic the cleanup routine that will be in 1.11.0, allowing older versions to scale in the meantime. Before implementing this, you must truncate the logs table, which will delete all logs and return the space to the OS. The cleanup script writes its own log to /opt/kasm/current/log/db_clean.log in the Kasm logging format.

# From the database server, truncate the logs table
sudo docker exec kasm_db psql kasm kasmapp -c 'TRUNCATE logs;'
# Create a cron job that runs the new script every 15 minutes (crontab entry)
*/15 * * * * /opt/kasm/current/bin/utils/db_clean
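To confirm that the truncate actually returned space, the size of the logs table can be compared before and after. A minimal sketch, assuming the same container, database and user as the truncate command above; the function name logs_table_size is hypothetical, while pg_total_relation_size and pg_size_pretty are standard Postgres functions.

```shell
#!/bin/bash

# Hypothetical helper: print the on-disk size of the logs table,
# including its indexes, in a human-readable form.
# -t and -A make psql print only the bare value.
logs_table_size() {
    docker exec kasm_db psql kasm kasmapp -t -A -c \
        "SELECT pg_size_pretty(pg_total_relation_size('logs'));"
}

# Usage, from the database server (may need sudo):
#   logs_table_size    # run once before and once after the TRUNCATE
```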

Create the new script /opt/kasm/current/bin/utils/db_clean with the following contents, and make it executable (chmod +x) so that cron can run it:

#!/bin/bash

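# rotate logs once db_clean.log exceeds 17520 lines
# (stderr is discarded so the first run, before the log exists, stays quiet)
RW_CNT=$(wc -l /opt/kasm/current/log/db_clean.log 2>/dev/null | grep -Po '^\d+')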
if (( RW_CNT > 17520 )); then
    mv /opt/kasm/current/log/db_clean.log /opt/kasm/current/log/db_clean.log.1
    echo "$(date '+%F %T,%3N') [DEBUG] db_clean: Logs rotated." | tee -a /opt/kasm/current/log/db_clean.log
fi

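# Delete expired logs in batches of up to 700000 rows. The outer condition
# only lets the delete run once the logs span more than the retention period
# plus a margin (1 hour for all logs, 15 minutes for DEBUG logs).
# FOR UPDATE SKIP LOCKED skips rows locked by other transactions instead of waiting on them.
read -r -d '' CLEANUP_QUERY << EOM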
DELETE FROM logs
WHERE
    (
        (select extract(epoch from max(ingest_date) - min(ingest_date))/3600 FROM logs)>(((SELECT value::integer FROM settings WHERE name = 'log_retention' limit 1) * 24) + 1)
        OR
        (select extract(epoch from max(ingest_date) - min(ingest_date))/60 FROM logs WHERE levelname = 'DEBUG')>((SELECT value::integer FROM settings WHERE name = 'debug_retention' limit 1) * 60 + 15)
    )
    AND
    log_id IN
    (
        SELECT
            log_id
        FROM logs
        WHERE
            ingest_date < (SELECT now() - value::integer * INTERVAL '1 DAY' FROM settings WHERE name = 'log_retention') OR
            (
                ingest_date < (SELECT now() - value::integer * INTERVAL '1 HOUR' FROM settings WHERE name = 'debug_retention') AND
                levelname = 'DEBUG'
            )
        LIMIT 700000
        FOR UPDATE SKIP LOCKED
    );
EOM
echo "$(date '+%F %T,%3N') [INFO] db_clean: Database cleanup routine started." | tee -a /opt/kasm/current/log/db_clean.log
# psql reports the result as "DELETE <count>"; capture that count
output=$(docker exec kasm_db psql kasm kasmapp -c "$CLEANUP_QUERY")
ROWS_DELETED=$(echo "$output" | grep -Po '\d+')
# Fall back to 0 if psql failed and printed no count
ROWS_DELETED=${ROWS_DELETED:-0}

# A run that deletes close to the 700000-row batch limit suggests ingest is outpacing cleanup
if (( ROWS_DELETED >= 600000 )); then
    echo "$(date '+%F %T,%3N') [ERROR] db_clean: Log ingest volume is exceeding cleanup capacity." | tee -a /opt/kasm/current/log/db_clean.log
fi

echo "$(date '+%F %T,%3N') [INFO] db_clean: Database cleanup routine completed, $ROWS_DELETED rows deleted." | tee -a /opt/kasm/current/log/db_clean.log
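The numbers in the script line up with the 2.8 million logs per hour figure mentioned earlier, which the arithmetic below illustrates. The variable and function names are made up for this sketch, and the rotation estimate assumes each run writes exactly two lines to db_clean.log (started and completed), with no error or rotation lines.

```shell
#!/bin/bash

# Values taken from the script above.
BATCH_LIMIT=700000     # LIMIT in the DELETE subquery
RUN_INTERVAL_MIN=15    # cron interval in minutes
ROTATE_LINES=17520     # line count that triggers rotation of db_clean.log

# Maximum rows the job can delete per hour at this batch size and interval.
max_rows_per_hour() {
    echo $(( BATCH_LIMIT * 60 / RUN_INTERVAL_MIN ))
}

# Whole days of db_clean.log history kept before rotation,
# assuming two log lines per run.
days_until_rotation() {
    local lines_per_day=$(( 2 * 24 * 60 / RUN_INTERVAL_MIN ))
    echo $(( ROTATE_LINES / lines_per_day ))
}

max_rows_per_hour      # prints 2800000
days_until_rotation    # prints 91
```

If the cron interval or the LIMIT is changed, the 600000-row warning threshold would presumably need to be adjusted to stay just below the new batch size.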