Open lkshminarayanan opened 1 year ago
Issue #5537 has a reproducible case that doesn't involve restarting the server.
Issue #5537 fixed it.
Hi @fabriziomello, Issue #5537 only fixes the crash that was happening due to a problem with the job sorting logic. The problem mentioned in this issue, i.e. the `next_start` is not being reflected properly in the `_timescaledb_internal.bgw_job_stat` table when a job crashes due to a server shutdown, still exists.
What type of bug is this?
Other
What subsystems and features are affected?
User-Defined Action (UDA)
What happened?
When a UDA crashes after receiving a terminate command from the postmaster, its `next_start` time is not updated in the `job_stats` table. Although the value is not updated in the `job_stats` table, the scheduler internally calculates the `next_start` time (which includes an additional 5 minute backoff because the UDA crashed) and runs the job at that `next_start` time.
The `next_start` time is reported as `-infinity` after a crash and this might be confusing to users :
TimescaleDB version affected
main (517dee9f6bf9c)
PostgreSQL version used
15.2
What operating system did you use?
Ubuntu 22.04
What installation method did you use?
Source
What platform did you run on?
On prem/Self-hosted
How can we reproduce the bug?
One way to crash the job is to shut down the postmaster and restart it while the job is running.
Run the following SQL setup to create a job :
Verify that the UDA is in running state by looking at the job_stats table or the postmaster logs
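The reporter's original setup script is not reproduced here. As a rough sketch of a comparable setup, assuming a hypothetical procedure name `long_running_uda`, an arbitrary sleep duration, and an arbitrary schedule interval, something along these lines should leave a job running long enough to stop the server underneath it:

```sql
-- Hypothetical UDA (name, log message, and sleep duration are illustrative assumptions).
-- It sleeps long enough to still be running when the server is stopped.
CREATE OR REPLACE PROCEDURE long_running_uda(job_id int, config jsonb)
LANGUAGE plpgsql AS
$$
BEGIN
  RAISE LOG 'Executing job %', job_id;
  PERFORM pg_sleep(600);
END
$$;

-- Register the procedure as a background job; the 1 hour interval is an arbitrary choice.
SELECT add_job('long_running_uda', '1h');

-- Check the job's current state before stopping the server.
SELECT job_id, job_status, last_run_status, next_start
FROM timescaledb_information.job_stats;
```

The sleep only needs to outlast the time it takes to issue the shutdown, so any duration comfortably longer than that would do.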
Now stop and restart the postgres server :
Observe the wrong `next_start` time in the `job_stats` table :
Relevant log output and stack trace
Note : This log is related to the example steps provided in the previous section.
When the job crashes due to postmaster shutting down, it emits the following log :
Log during restart :
Note that the job is not run yet; if run, it will emit the `Executing job` message.
Once restarted, look at the output of the `job_stats` table and observe the invalid `next_start` time :
Despite this invalid `next_start` time in `job_stats`, you can observe by inspecting the logs that the job runs exactly 5 minutes after the restart :
(5 minutes, because crashed jobs have to back off for 5 minutes)
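To compare the stored value with what the scheduler actually does, the catalog table and the informational view can be queried side by side after the restart. This is a minimal sketch, assuming the column names below match the installed TimescaleDB version (they reflect the catalog layout around the affected version and may differ in other releases):

```sql
-- Raw statistics kept by the scheduler. Per this report, next_start is
-- expected to show -infinity here after the crash.
SELECT job_id, last_start, last_finish, next_start,
       total_crashes, consecutive_crashes
FROM _timescaledb_internal.bgw_job_stat
ORDER BY job_id;

-- User-facing view over the same statistics.
SELECT job_id, job_status, last_run_status, next_start, total_failures
FROM timescaledb_information.job_stats
ORDER BY job_id;
```

Running the same queries again once the job has started (roughly 5 minutes after the restart, per the logs) should make it visible whether `next_start` is ever written back.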