timescale / timescaledb

An open-source time-series SQL database optimized for fast ingest and complex queries. Packaged as a PostgreSQL extension.
https://www.timescale.com/
Other
17.56k stars 881 forks source link

[Bug]: Crashed jobs' next_start time is not updated in job_stats table #5531

Open lkshminarayanan opened 1 year ago

lkshminarayanan commented 1 year ago

What type of bug is this?

Other

What subsystems and features are affected?

User-Defined Action (UDA)

What happened?

When a UDA crashes after receiving terminate command from postmaster, its next_start time is not updated in the job_stats table. Although the value is not updated in the job_stats table, the scheduler internally calculates the next_start time (which includes an additional 5 minute backoff as the UDA crashed) and runs the job at that next_start time.

The next_start time is reported as -infinity after a crash and this might be confusing to users :

postgres=# select job_id, job_status, last_run_status, next_start, total_runs, total_successes, total_failures from timescaledb_information.job_stats order by job_id;
 job_id | job_status | last_run_status |            next_start            | total_runs | total_successes | total_failures 
--------+------------+-----------------+----------------------------------+------------+-----------------+----------------
      1 | Scheduled  | Success         | 2023-04-06 17:46:40.441404+05:30 |          1 |               1 |              0
      2 | Scheduled  | Success         | 2023-05-01 05:30:00+05:30        |          1 |               1 |              0
   1000 | Scheduled  |                 | -infinity                        |          1 |               0 |              0
(3 rows)

TimescaleDB version affected

main (517dee9f6bf9c)

PostgreSQL version used

15.2

What operating system did you use?

Ubuntu 22.04

What installation method did you use?

Source

What platform did you run on?

On prem/Self-hosted

How can we reproduce the bug?

One way to crash the job is to shutdown the postmaster and restart it when the job is running.

Run the following SQL setup to create a job :

CREATE TABLE test (a int);
INSERT INTO test VALUES (1);

CREATE OR REPLACE PROCEDURE update_test(job_id INT, config JSONB)
LANGUAGE PLPGSQL AS
$$
BEGIN
  RAISE LOG 'Executing job %', job_id;
  UPDATE test SET a = a + 1;
  RAISE LOG 'Job % succeeded', job_id;
END
$$;
SELECT add_job('update_test', '5s');

-- The following transaction will make the UDA to wait for the locks
-- during which we can stop the postmaster to reproduce the issue
BEGIN;
UPDATE test SET a = a + 1;

Verify that the UDA is in running state by looking at the job_stats table or the postmaster logs

SELECT job_id, job_status, last_run_status,
       next_start, total_runs, total_successes, total_failures
  FROM timescaledb_information.job_stats ORDER BY job_id;

Now stop and restart the postgres server :

pg_ctl -D $DATA_DIR stop

pg_ctl -D $DATA_DIR -o "-h 127.0.0.1 -p $PGPORT -cshared_preload_libraries='timescaledb'"  start

Observe the wrong next_start time in job_stats table :

SELECT job_id, job_status, last_run_status,
       next_start, total_runs, total_successes, total_failures
  FROM timescaledb_information.job_stats ORDER BY job_id;

Relevant log output and stack trace

Note : This log is related to the example steps provided in the previous section.

When the job crashes due to postmaster shutting down, it emits the following log :

2023-04-06 16:46:53.147 IST [21715] LOG:  Executing job 1000
2023-04-06 16:46:53.147 IST [21715] CONTEXT:  PL/pgSQL function update_test(integer,jsonb) line 3 at RAISE
2023-04-06 16:47:15.583 IST [21697] LOG:  received fast shutdown request
2023-04-06 16:47:15.592 IST [21697] LOG:  aborting any active transactions
2023-04-06 16:47:15.592 IST [21703] FATAL:  terminating background worker "TimescaleDB Background Worker Launcher" due to administrator command
2023-04-06 16:47:15.592 IST [21710] FATAL:  terminating connection due to administrator command
2023-04-06 16:47:15.592 IST [21708] FATAL:  terminating background worker "TimescaleDB Background Worker Scheduler" due to administrator command
2023-04-06 16:47:15.592 IST [21715] FATAL:  terminating background worker "User-Defined Action [1000]" due to administrator command
2023-04-06 16:47:15.592 IST [21715] CONTEXT:  while updating tuple (0,1) in relation "test"
    SQL statement "UPDATE test SET a = a + 1"
    PL/pgSQL function update_test(integer,jsonb) line 4 at SQL statement
2023-04-06 16:47:15.616 IST [21697] LOG:  background worker "logical replication launcher" (PID 21704) exited with exit code 1
2023-04-06 16:47:15.632 IST [21697] LOG:  background worker "TimescaleDB Background Worker Launcher" (PID 21703) exited with exit code 1
2023-04-06 16:47:15.644 IST [21697] LOG:  background worker "User-Defined Action [1000]" (PID 21715) exited with exit code 1
2023-04-06 16:47:15.647 IST [21697] LOG:  background worker "TimescaleDB Background Worker Scheduler" (PID 21708) exited with exit code 1
2023-04-06 16:47:15.647 IST [21698] LOG:  shutting down

Log during restart :

2023-04-06 16:47:25.781 IST [21752] LOG:  starting PostgreSQL 15.2 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 12.2.1 20230111, 64-bit
2023-04-06 16:47:25.781 IST [21752] LOG:  listening on IPv4 address "127.0.0.1", port 5432
2023-04-06 16:47:25.783 IST [21752] LOG:  listening on Unix socket "/tmp/.s.PGSQL.5432"
2023-04-06 16:47:25.788 IST [21755] LOG:  database system was shut down at 2023-04-06 16:47:15 IST
2023-04-06 16:47:25.813 IST [21752] LOG:  database system is ready to accept connections
2023-04-06 16:47:25.816 IST [21758] LOG:  TimescaleDB background worker launcher connected to shared catalogs

Note that the job is not run yet, if run it will emit the Executing job message.

Once restarted, look at the output of job_stats table - observe the invalid next_start time :

postgres=# select job_id, job_status, last_run_status, next_start, total_runs, total_successes, total_failures from timescaledb_information.job_stats order by job_id;
 job_id | job_status | last_run_status |            next_start            | total_runs | total_successes | total_failures 
--------+------------+-----------------+----------------------------------+------------+-----------------+----------------
      1 | Scheduled  | Success         | 2023-04-06 17:46:40.441404+05:30 |          1 |               1 |              0
      2 | Scheduled  | Success         | 2023-05-01 05:30:00+05:30        |          1 |               1 |              0
   1000 | Scheduled  |                 | -infinity                        |          1 |               0 |              0
(3 rows)

Despite this invalid next_start time in job_stats, you can observe that the job is run exactly after 5 mins after the restart by inspecting the logs :

2023-04-06 16:52:25.858 IST [21935] LOG:  Executing job 1000
2023-04-06 16:52:25.858 IST [21935] CONTEXT:  PL/pgSQL function update_test(integer,jsonb) line 3 at RAISE
2023-04-06 16:52:25.859 IST [21935] LOG:  Job 1000 succeeded
2023-04-06 16:52:25.859 IST [21935] CONTEXT:  PL/pgSQL function update_test(integer,jsonb) line 5 at RAISE

(5 mins as the crashed jobs have to backoff for 5 minutes)

lkshminarayanan commented 1 year ago

Issue #5537 has a reproducible case that doesn't involve restarting the server.

fabriziomello commented 1 year ago

Issue #5537 fixed it.

lkshminarayanan commented 1 year ago

Hi @fabriziomello, Issue #5537 only fixes the crash that was happening due to a problem with the job sorting logic. The problem mentioned in this issue, i.e. the next_start is not being reflected properly in the _timescaledb_internal.bgw_job_stat table when a job crashes due to a server shutdown, still exists.