timescale / timescaledb

An open-source time-series SQL database optimized for fast ingest and complex queries. Packaged as a PostgreSQL extension.
https://www.timescale.com/
Other
17.88k stars 882 forks source link

[Bug]: job crash detected, see server logs #7085

Open pgloader opened 4 months ago

pgloader commented 4 months ago

What type of bug is this?

Crash

What subsystems and features are affected?

Background worker

What happened?

job crash detected, see server logs but there was no information in the PostrgeSQL statement log TimescaleDB: 2.15.2 PostgreSQL: 16.3 log_min_error_statement: log

Besides, the message in the job_history would disappear

it was 5 minutes ago =# select * from job_errors; job_id | proc_schema | proc_name | pid | start_time | finish_time | sqlerrcode | err_message --------+------------------------+-------------------------------------+---------+-------------------------------+-------------------------------+------------+------------------------------------- 1003 | _timescaledb_functions | policy_refresh_continuous_aggregate | 1116242 | 2024-06-28 09:06:39.752447-04 | 2024-06-28 09:06:39.752542-04 | | job crash detected, see server logs 1002 | _timescaledb_functions | policy_refresh_continuous_aggregate | 1116242 | 2024-06-28 10:46:50.699781-04 | 2024-06-28 10:46:50.699857-04 | | job crash detected, see server logs 1025 | _timescaledb_functions | policy_refresh_continuous_aggregate | 2128427 | 2024-07-01 06:30:00.006238-04 | 2024-07-01 06:30:00.006365-04 | | job crash detected, see server logs 1023 | _timescaledb_functions | policy_refresh_continuous_aggregate | 2128427 | 2024-07-01 07:00:00.00073-04 | 2024-07-01 07:00:00.000763-04 | | job crash detected, see server logs (4 rows)

Now

select * from job_errors;

job_id | proc_schema | proc_name | pid | start_time | finish_time | sqlerrcode | err_message --------+------------------------+-------------------------------------+---------+-------------------------------+-------------------------------+------------+------------------------------------- 1003 | _timescaledb_functions | policy_refresh_continuous_aggregate | 1116242 | 2024-06-28 09:06:39.752447-04 | 2024-06-28 09:06:39.752542-04 | | job crash detected, see server logs 1002 | _timescaledb_functions | policy_refresh_continuous_aggregate | 1116242 | 2024-06-28 10:46:50.699781-04 | 2024-06-28 10:46:50.699857-04 | | job crash detected, see server logs 1025 | _timescaledb_functions | policy_refresh_continuous_aggregate | 2128427 | 2024-07-01 06:30:00.006238-04 | 2024-07-01 06:30:00.006365-04 | | job crash detected, see server logs 1023 | _timescaledb_functions | policy_refresh_continuous_aggregate | 2128427 | 2024-07-01 09:00:00.008187-04 | 2024-07-01 09:00:00.008324-04 | | job crash detected, see server logs (4 rows)

The message 2024-07-01 06:30:00.006238-04 was gone

TimescaleDB version affected

2.15.2

PostgreSQL version used

16.3

What operating system did you use?

RHEL8.6

What installation method did you use?

Source

What platform did you run on?

On prem/Self-hosted

Relevant log output and stack trace

#                                     select * from job_errors;
 job_id |      proc_schema       |              proc_name              |   pid   |          start_time           |          finish_time          | sqlerrcode |             err_message
--------+------------------------+-------------------------------------+---------+-------------------------------+-------------------------------+------------+-------------------------------------
   1003 | _timescaledb_functions | policy_refresh_continuous_aggregate | 1116242 | 2024-06-28 09:06:39.752447-04 | 2024-06-28 09:06:39.752542-04 |            | job crash detected, see server logs
   1002 | _timescaledb_functions | policy_refresh_continuous_aggregate | 1116242 | 2024-06-28 10:46:50.699781-04 | 2024-06-28 10:46:50.699857-04 |            | job crash detected, see server logs
   1025 | _timescaledb_functions | policy_refresh_continuous_aggregate | 2128427 | 2024-07-01 06:30:00.006238-04 | 2024-07-01 06:30:00.006365-04 |            | job crash detected, see server logs
   1023 | _timescaledb_functions | policy_refresh_continuous_aggregate | 2128427 | 2024-07-01 07:00:00.00073-04  | 2024-07-01 07:00:00.000763-04 |            | job crash detected, see server logs
(4 rows)

How can we reproduce the bug?

# select * from job_errors;
 job_id |      proc_schema       |              proc_name              |   pid   |          start_time           |          finish_time          | sqlerrcode |             err_message
--------+------------------------+-------------------------------------+---------+-------------------------------+-------------------------------+------------+-------------------------------------
   1003 | _timescaledb_functions | policy_refresh_continuous_aggregate | 1116242 | 2024-06-28 09:06:39.752447-04 | 2024-06-28 09:06:39.752542-04 |            | job crash detected, see server logs
   1002 | _timescaledb_functions | policy_refresh_continuous_aggregate | 1116242 | 2024-06-28 10:46:50.699781-04 | 2024-06-28 10:46:50.699857-04 |            | job crash detected, see server logs
   1025 | _timescaledb_functions | policy_refresh_continuous_aggregate | 2128427 | 2024-07-01 06:30:00.006238-04 | 2024-07-01 06:30:00.006365-04 |            | job crash detected, see server logs
   1023 | _timescaledb_functions | policy_refresh_continuous_aggregate | 2128427 | 2024-07-01 09:00:00.008187-04 | 2024-07-01 09:00:00.008324-04 |            | job crash detected, see server logs
(4 rows)
fabriziomello commented 3 months ago

@pgloader can u please copy/paste the result of the following query?

SELECT * FROM timescaledb_information.job_history ORDER BY start_time;
cosimomeli commented 3 months ago

Hi, I think I could have the same issues with background jobs. In my case, I have this issue with continuous aggregations:

2024-07-31 14:16:48 UTC [51]: [15-1] 66aa4740.33 0     LOG:  background worker "Refresh Continuous Aggregate Policy [1613]" (PID 222) was terminated by signal 11: Segmentation fault
2024-07-31 14:16:48 UTC [51]: [16-1] 66aa4740.33 0     DETAIL:  Failed process was running: CALL _timescaledb_functions.policy_refresh_continuous_aggregate()
2024-07-31 14:16:48 UTC [51]: [17-1] 66aa4740.33 0     LOG:  terminating any other active server processes

After the segfault, the whole Postgres restarts and enters recovery mode for a while. I'm using Timescale 2.15.3

This is the result of the query you asked.

id job_id succeeded proc_schema proc_name pid start_time finish_time config sqlerrcode err_message
1 1001 false _timescaledb_functions policy_compression 2672843 2000-01-01 01:00:00.000 +0100 2024-07-24 09:09:00.356 +0200 {"hypertable_id": 2, "compress_after": "01:00:00"} job crash detected, see server logs
2 1004 false _timescaledb_functions policy_compression 2672843 2000-01-01 01:00:00.000 +0100 2024-07-24 09:09:00.362 +0200 {"hypertable_id": 5, "compress_after": "01:00:00"} job crash detected, see server logs
3 1005 false _timescaledb_functions policy_compression 2672843 2000-01-01 01:00:00.000 +0100 2024-07-24 09:09:00.362 +0200 {"hypertable_id": 1, "compress_after": "01:00:00"} job crash detected, see server logs
4 1010 false _timescaledb_functions policy_retention 2672843 2000-01-01 01:00:00.000 +0100 2024-07-24 09:09:00.362 +0200 {"drop_after": "30 days", "hypertable_id": 5} job crash detected, see server logs
5 1025 false _timescaledb_functions policy_refresh_continuous_aggregate 2672843 2000-01-01 01:00:00.000 +0100 2024-07-24 09:09:00.362 +0200 {"end_offset": "00:00:00", "start_offset": "28 days", "mat_hypertable_id": 18} job crash detected, see server logs
6 1028 false _timescaledb_functions policy_refresh_continuous_aggregate 2672843 2000-01-01 01:00:00.000 +0100 2024-07-24 09:09:00.362 +0200 {"end_offset": "00:00:00", "start_offset": "28 days", "mat_hypertable_id": 16} job crash detected, see server logs
7 1029 false _timescaledb_functions policy_refresh_continuous_aggregate 2672843 2000-01-01 01:00:00.000 +0100 2024-07-24 09:09:00.362 +0200 {"end_offset": "00:00:00", "start_offset": "28 days", "mat_hypertable_id": 24} job crash detected, see server logs
34 1070 false _timescaledb_functions policy_retention 2673025 2000-01-01 01:00:00.000 +0100 2024-07-24 09:09:32.354 +0200 {"drop_after": "30 days", "hypertable_id": 5} job crash detected, see server logs
35 1085 false _timescaledb_functions policy_refresh_continuous_aggregate 2673025 2000-01-01 01:00:00.000 +0100 2024-07-24 09:09:32.355 +0200 {"end_offset": "00:00:00", "start_offset": "28 days", "mat_hypertable_id": 18} job crash detected, see server logs
36 1088 false _timescaledb_functions policy_refresh_continuous_aggregate 2673025 2000-01-01 01:00:00.000 +0100 2024-07-24 09:09:32.355 +0200 {"end_offset": "00:00:00", "start_offset": "28 days", "mat_hypertable_id": 16} job crash detected, see server logs
37 1089 false _timescaledb_functions policy_refresh_continuous_aggregate 2673025 2000-01-01 01:00:00.000 +0100 2024-07-24 09:09:32.355 +0200 {"end_offset": "00:00:00", "start_offset": "28 days", "mat_hypertable_id": 24} job crash detected, see server logs
38 1004 false _timescaledb_functions policy_compression 2698766 2000-01-01 01:00:00.000 +0100 2024-07-24 11:17:20.310 +0200 {"hypertable_id": 5, "compress_after": "01:00:00"} job crash detected, see server logs
39 1005 false _timescaledb_functions policy_compression 2698766 2000-01-01 01:00:00.000 +0100 2024-07-24 11:17:20.313 +0200 {"hypertable_id": 1, "compress_after": "01:00:00"} job crash detected, see server logs
40 1118 false _timescaledb_functions policy_refresh_continuous_aggregate 2698766 2000-01-01 01:00:00.000 +0100 2024-07-24 11:17:20.314 +0200 {"end_offset": "00:00:00", "start_offset": "28 days", "mat_hypertable_id": 18} job crash detected, see server logs
41 1121 false _timescaledb_functions policy_refresh_continuous_aggregate 2698766 2000-01-01 01:00:00.000 +0100 2024-07-24 11:17:20.314 +0200 {"end_offset": "00:00:00", "start_offset": "28 days", "mat_hypertable_id": 16} job crash detected, see server logs
42 1122 false _timescaledb_functions policy_refresh_continuous_aggregate 2698766 2000-01-01 01:00:00.000 +0100 2024-07-24 11:17:20.314 +0200 {"end_offset": "00:00:00", "start_offset": "28 days", "mat_hypertable_id": 24} job crash detected, see server logs
43 1123 false _timescaledb_functions policy_refresh_continuous_aggregate 2698766 2000-01-01 01:00:00.000 +0100 2024-07-24 11:17:20.314 +0200 {"end_offset": "00:00:00", "start_offset": "28 days", "mat_hypertable_id": 28} job crash detected, see server logs
71 1151 false _timescaledb_functions policy_refresh_continuous_aggregate 2698852 2000-01-01 01:00:00.000 +0100 2024-07-24 11:17:31.238 +0200 {"end_offset": "00:00:00", "start_offset": "28 days", "mat_hypertable_id": 18} job crash detected, see server logs
72 1154 false _timescaledb_functions policy_refresh_continuous_aggregate 2698852 2000-01-01 01:00:00.000 +0100 2024-07-24 11:17:31.239 +0200 {"end_offset": "00:00:00", "start_offset": "28 days", "mat_hypertable_id": 16} job crash detected, see server logs
73 1155 false _timescaledb_functions policy_refresh_continuous_aggregate 2698852 2000-01-01 01:00:00.000 +0100 2024-07-24 11:17:31.239 +0200 {"end_offset": "00:00:00", "start_offset": "28 days", "mat_hypertable_id": 24} job crash detected, see server logs
74 1156 false _timescaledb_functions policy_refresh_continuous_aggregate 2698852 2000-01-01 01:00:00.000 +0100 2024-07-24 11:17:31.239 +0200 {"end_offset": "00:00:00", "start_offset": "28 days", "mat_hypertable_id": 28} job crash detected, see server logs
104 1184 false _timescaledb_functions policy_refresh_continuous_aggregate 2698927 2000-01-01 01:00:00.000 +0100 2024-07-24 11:17:40.968 +0200 {"end_offset": "00:00:00", "start_offset": "28 days", "mat_hypertable_id": 18} job crash detected, see server logs
105 1187 false _timescaledb_functions policy_refresh_continuous_aggregate 2698927 2000-01-01 01:00:00.000 +0100 2024-07-24 11:17:40.969 +0200 {"end_offset": "00:00:00", "start_offset": "28 days", "mat_hypertable_id": 16} job crash detected, see server logs
106 1188 false _timescaledb_functions policy_refresh_continuous_aggregate 2698927 2000-01-01 01:00:00.000 +0100 2024-07-24 11:17:40.969 +0200 {"end_offset": "00:00:00", "start_offset": "28 days", "mat_hypertable_id": 24} job crash detected, see server logs
107 1189 false _timescaledb_functions policy_refresh_continuous_aggregate 2698927 2000-01-01 01:00:00.000 +0100 2024-07-24 11:17:40.969 +0200 {"end_offset": "00:00:00", "start_offset": "28 days", "mat_hypertable_id": 28} job crash detected, see server logs
108 1004 false _timescaledb_functions policy_compression 2771978 2000-01-01 01:00:00.000 +0100 2024-07-24 19:47:40.848 +0200 {"hypertable_id": 5, "compress_after": "01:00:00"} job crash detected, see server logs
109 1217 false _timescaledb_functions policy_refresh_continuous_aggregate 2771978 2000-01-01 01:00:00.000 +0100 2024-07-24 19:47:40.855 +0200 {"end_offset": "00:00:00", "start_offset": "28 days", "mat_hypertable_id": 18} job crash detected, see server logs
110 1220 false _timescaledb_functions policy_refresh_continuous_aggregate 2771978 2000-01-01 01:00:00.000 +0100 2024-07-24 19:47:40.855 +0200 {"end_offset": "00:00:00", "start_offset": "28 days", "mat_hypertable_id": 16} job crash detected, see server logs
111 1221 false _timescaledb_functions policy_refresh_continuous_aggregate 2771978 2000-01-01 01:00:00.000 +0100 2024-07-24 19:47:40.855 +0200 {"end_offset": "00:00:00", "start_offset": "28 days", "mat_hypertable_id": 24} job crash detected, see server logs
112 1222 false _timescaledb_functions policy_refresh_continuous_aggregate 2771978 2000-01-01 01:00:00.000 +0100 2024-07-24 19:47:40.855 +0200 {"end_offset": "00:00:00", "start_offset": "28 days", "mat_hypertable_id": 28} job crash detected, see server logs
141 1250 false _timescaledb_functions policy_refresh_continuous_aggregate 2772125 2000-01-01 01:00:00.000 +0100 2024-07-24 19:48:16.886 +0200 {"end_offset": "00:00:00", "start_offset": "28 days", "mat_hypertable_id": 18} job crash detected, see server logs
142 1253 false _timescaledb_functions policy_refresh_continuous_aggregate 2772125 2000-01-01 01:00:00.000 +0100 2024-07-24 19:48:16.886 +0200 {"end_offset": "00:00:00", "start_offset": "28 days", "mat_hypertable_id": 16} job crash detected, see server logs
143 1254 false _timescaledb_functions policy_refresh_continuous_aggregate 2772125 2000-01-01 01:00:00.000 +0100 2024-07-24 19:48:16.886 +0200 {"end_offset": "00:00:00", "start_offset": "28 days", "mat_hypertable_id": 24} job crash detected, see server logs
144 1255 false _timescaledb_functions policy_refresh_continuous_aggregate 2772125 2000-01-01 01:00:00.000 +0100 2024-07-24 19:48:16.886 +0200 {"end_offset": "00:00:00", "start_offset": "28 days", "mat_hypertable_id": 28} job crash detected, see server logs
145 1253 false _timescaledb_functions policy_refresh_continuous_aggregate 2957163 2000-01-01 01:00:00.000 +0100 2024-07-25 17:56:17.357 +0200 {"end_offset": "00:00:00", "start_offset": "28 days", "mat_hypertable_id": 16} job crash detected, see server logs
178 1283 false _timescaledb_functions policy_refresh_continuous_aggregate 2957254 2000-01-01 01:00:00.000 +0100 2024-07-25 17:56:28.657 +0200 {"end_offset": "00:00:00", "start_offset": "28 days", "mat_hypertable_id": 18} job crash detected, see server logs
179 1286 false _timescaledb_functions policy_refresh_continuous_aggregate 2957254 2000-01-01 01:00:00.000 +0100 2024-07-25 17:56:28.658 +0200 {"end_offset": "00:00:00", "start_offset": "28 days", "mat_hypertable_id": 16} job crash detected, see server logs
180 1287 false _timescaledb_functions policy_refresh_continuous_aggregate 2957254 2000-01-01 01:00:00.000 +0100 2024-07-25 17:56:28.658 +0200 {"end_offset": "00:00:00", "start_offset": "28 days", "mat_hypertable_id": 24} job crash detected, see server logs
181 1288 false _timescaledb_functions policy_refresh_continuous_aggregate 2957254 2000-01-01 01:00:00.000 +0100 2024-07-25 17:56:28.659 +0200 {"end_offset": "00:00:00", "start_offset": "28 days", "mat_hypertable_id": 28} job crash detected, see server logs
182 1286 false _timescaledb_functions policy_refresh_continuous_aggregate 3275070 2000-01-01 01:00:00.000 +0100 2024-07-27 17:11:53.277 +0200 {"end_offset": "00:00:00", "start_offset": "28 days", "mat_hypertable_id": 16} job crash detected, see server logs
183 1287 false _timescaledb_functions policy_refresh_continuous_aggregate 3427181 2000-01-01 01:00:00.000 +0100 2024-07-28 15:41:31.390 +0200 {"end_offset": "00:00:00", "start_offset": "28 days", "mat_hypertable_id": 24} job crash detected, see server logs
216 1316 false _timescaledb_functions policy_refresh_continuous_aggregate 3427265 2000-01-01 01:00:00.000 +0100 2024-07-28 15:41:41.907 +0200 {"end_offset": "00:00:00", "start_offset": "28 days", "mat_hypertable_id": 18} job crash detected, see server logs
217 1319 false _timescaledb_functions policy_refresh_continuous_aggregate 3427265 2000-01-01 01:00:00.000 +0100 2024-07-28 15:41:41.907 +0200 {"end_offset": "00:00:00", "start_offset": "28 days", "mat_hypertable_id": 16} job crash detected, see server logs
218 1320 false _timescaledb_functions policy_refresh_continuous_aggregate 3427265 2000-01-01 01:00:00.000 +0100 2024-07-28 15:41:41.907 +0200 {"end_offset": "00:00:00", "start_offset": "28 days", "mat_hypertable_id": 24} job crash detected, see server logs
219 1321 false _timescaledb_functions policy_refresh_continuous_aggregate 3427265 2000-01-01 01:00:00.000 +0100 2024-07-28 15:41:41.907 +0200 {"end_offset": "00:00:00", "start_offset": "28 days", "mat_hypertable_id": 28} job crash detected, see server logs
249 1349 false _timescaledb_functions policy_refresh_continuous_aggregate 3427339 2000-01-01 01:00:00.000 +0100 2024-07-28 15:41:51.902 +0200 {"end_offset": "00:00:00", "start_offset": "28 days", "mat_hypertable_id": 18} job crash detected, see server logs
250 1352 false _timescaledb_functions policy_refresh_continuous_aggregate 3427339 2000-01-01 01:00:00.000 +0100 2024-07-28 15:41:51.902 +0200 {"end_offset": "00:00:00", "start_offset": "28 days", "mat_hypertable_id": 16} job crash detected, see server logs
251 1353 false _timescaledb_functions policy_refresh_continuous_aggregate 3427339 2000-01-01 01:00:00.000 +0100 2024-07-28 15:41:51.902 +0200 {"end_offset": "00:00:00", "start_offset": "28 days", "mat_hypertable_id": 24} job crash detected, see server logs
252 1354 false _timescaledb_functions policy_refresh_continuous_aggregate 3427339 2000-01-01 01:00:00.000 +0100 2024-07-28 15:41:51.902 +0200 {"end_offset": "00:00:00", "start_offset": "28 days", "mat_hypertable_id": 28} job crash detected, see server logs
253 1349 false _timescaledb_functions policy_refresh_continuous_aggregate 3571752 2000-01-01 01:00:00.000 +0100 2024-07-29 13:04:27.233 +0200 {"end_offset": "00:00:00", "start_offset": "28 days", "mat_hypertable_id": 18} job crash detected, see server logs
254 1352 false _timescaledb_functions policy_refresh_continuous_aggregate 3746649 2000-01-01 01:00:00.000 +0100 2024-07-30 14:57:14.466 +0200 {"end_offset": "00:00:00", "start_offset": "28 days", "mat_hypertable_id": 16} job crash detected, see server logs
287 1382 false _timescaledb_functions policy_refresh_continuous_aggregate 3746753 2000-01-01 01:00:00.000 +0100 2024-07-30 14:57:30.255 +0200 {"end_offset": "00:00:00", "start_offset": "28 days", "mat_hypertable_id": 18} job crash detected, see server logs
288 1385 false _timescaledb_functions policy_refresh_continuous_aggregate 3746753 2000-01-01 01:00:00.000 +0100 2024-07-30 14:57:30.255 +0200 {"end_offset": "00:00:00", "start_offset": "28 days", "mat_hypertable_id": 16} job crash detected, see server logs
289 1386 false _timescaledb_functions policy_refresh_continuous_aggregate 3746753 2000-01-01 01:00:00.000 +0100 2024-07-30 14:57:30.255 +0200 {"end_offset": "00:00:00", "start_offset": "28 days", "mat_hypertable_id": 24} job crash detected, see server logs
290 1382 false _timescaledb_functions policy_refresh_continuous_aggregate 3889928 2000-01-01 01:00:00.000 +0100 2024-07-31 12:08:50.999 +0200 {"end_offset": "00:00:00", "start_offset": "28 days", "mat_hypertable_id": 18} job crash detected, see server logs
291 1386 false _timescaledb_functions policy_refresh_continuous_aggregate 3889928 2000-01-01 01:00:00.000 +0100 2024-07-31 12:08:51.001 +0200 {"end_offset": "00:00:00", "start_offset": "28 days", "mat_hypertable_id": 24} job crash detected, see server logs
292 1415 false _timescaledb_functions policy_refresh_continuous_aggregate 3894415 2000-01-01 01:00:00.000 +0100 2024-07-31 12:46:04.729 +0200 {"end_offset": "00:00:00", "start_offset": "28 days", "mat_hypertable_id": 18} job crash detected, see server logs
293 1418 false _timescaledb_functions policy_refresh_continuous_aggregate 3894415 2000-01-01 01:00:00.000 +0100 2024-07-31 12:46:04.730 +0200 {"end_offset": "00:00:00", "start_offset": "28 days", "mat_hypertable_id": 16} job crash detected, see server logs
294 1419 false _timescaledb_functions policy_refresh_continuous_aggregate 3894415 2000-01-01 01:00:00.000 +0100 2024-07-31 12:46:04.730 +0200 {"end_offset": "00:00:00", "start_offset": "28 days", "mat_hypertable_id": 24} job crash detected, see server logs
295 1420 false _timescaledb_functions policy_refresh_continuous_aggregate 3894415 2000-01-01 01:00:00.000 +0100 2024-07-31 12:46:04.730 +0200 {"end_offset": "00:00:00", "start_offset": "28 days", "mat_hypertable_id": 28} job crash detected, see server logs
296 1448 false _timescaledb_functions policy_refresh_continuous_aggregate 3907195 2000-01-01 01:00:00.000 +0100 2024-07-31 14:34:21.743 +0200 {"end_offset": "00:00:00", "start_offset": "28 days", "mat_hypertable_id": 18} job crash detected, see server logs
297 1451 false _timescaledb_functions policy_refresh_continuous_aggregate 3907195 2000-01-01 01:00:00.000 +0100 2024-07-31 14:34:21.743 +0200 {"end_offset": "00:00:00", "start_offset": "28 days", "mat_hypertable_id": 16} job crash detected, see server logs
298 1452 false _timescaledb_functions policy_refresh_continuous_aggregate 3907195 2000-01-01 01:00:00.000 +0100 2024-07-31 14:34:21.744 +0200 {"end_offset": "00:00:00", "start_offset": "28 days", "mat_hypertable_id": 24} job crash detected, see server logs
299 1453 false _timescaledb_functions policy_refresh_continuous_aggregate 3907195 2000-01-01 01:00:00.000 +0100 2024-07-31 14:34:21.744 +0200 {"end_offset": "00:00:00", "start_offset": "28 days", "mat_hypertable_id": 28} job crash detected, see server logs
329 1481 false _timescaledb_functions policy_refresh_continuous_aggregate 3907971 2000-01-01 01:00:00.000 +0100 2024-07-31 14:40:19.588 +0200 {"end_offset": "00:00:00", "start_offset": "28 days", "mat_hypertable_id": 18} job crash detected, see server logs
330 1484 false _timescaledb_functions policy_refresh_continuous_aggregate 3907971 2000-01-01 01:00:00.000 +0100 2024-07-31 14:40:19.589 +0200 {"end_offset": "00:00:00", "start_offset": "28 days", "mat_hypertable_id": 16} job crash detected, see server logs
331 1485 false _timescaledb_functions policy_refresh_continuous_aggregate 3907971 2000-01-01 01:00:00.000 +0100 2024-07-31 14:40:19.589 +0200 {"end_offset": "00:00:00", "start_offset": "28 days", "mat_hypertable_id": 24} job crash detected, see server logs
332 1486 false _timescaledb_functions policy_refresh_continuous_aggregate 3907971 2000-01-01 01:00:00.000 +0100 2024-07-31 14:40:19.589 +0200 {"end_offset": "00:00:00", "start_offset": "28 days", "mat_hypertable_id": 28} job crash detected, see server logs
333 1514 false _timescaledb_functions policy_refresh_continuous_aggregate 3909582 2000-01-01 01:00:00.000 +0100 2024-07-31 14:53:52.791 +0200 {"end_offset": "00:00:00", "start_offset": "28 days", "mat_hypertable_id": 18} job crash detected, see server logs
334 1517 false _timescaledb_functions policy_refresh_continuous_aggregate 3909582 2000-01-01 01:00:00.000 +0100 2024-07-31 14:53:52.791 +0200 {"end_offset": "00:00:00", "start_offset": "28 days", "mat_hypertable_id": 16} job crash detected, see server logs
335 1518 false _timescaledb_functions policy_refresh_continuous_aggregate 3909582 2000-01-01 01:00:00.000 +0100 2024-07-31 14:53:52.791 +0200 {"end_offset": "00:00:00", "start_offset": "28 days", "mat_hypertable_id": 24} job crash detected, see server logs
336 1519 false _timescaledb_functions policy_refresh_continuous_aggregate 3909582 2000-01-01 01:00:00.000 +0100 2024-07-31 14:53:52.791 +0200 {"end_offset": "00:00:00", "start_offset": "28 days", "mat_hypertable_id": 28} job crash detected, see server logs
337 1547 false _timescaledb_functions policy_refresh_continuous_aggregate 3912453 2000-01-01 01:00:00.000 +0100 2024-07-31 15:18:00.563 +0200 {"end_offset": "00:00:00", "start_offset": "28 days", "mat_hypertable_id": 18} job crash detected, see server logs
338 1550 false _timescaledb_functions policy_refresh_continuous_aggregate 3912453 2000-01-01 01:00:00.000 +0100 2024-07-31 15:18:00.563 +0200 {"end_offset": "00:00:00", "start_offset": "28 days", "mat_hypertable_id": 16} job crash detected, see server logs
339 1551 false _timescaledb_functions policy_refresh_continuous_aggregate 3912453 2000-01-01 01:00:00.000 +0100 2024-07-31 15:18:00.564 +0200 {"end_offset": "00:00:00", "start_offset": "28 days", "mat_hypertable_id": 24} job crash detected, see server logs
340 1552 false _timescaledb_functions policy_refresh_continuous_aggregate 3912453 2000-01-01 01:00:00.000 +0100 2024-07-31 15:18:00.564 +0200 {"end_offset": "00:00:00", "start_offset": "28 days", "mat_hypertable_id": 28} job crash detected, see server logs
370 1580 false _timescaledb_functions policy_refresh_continuous_aggregate 3912538 2000-01-01 01:00:00.000 +0100 2024-07-31 15:18:10.703 +0200 {"end_offset": "00:00:00", "start_offset": "28 days", "mat_hypertable_id": 18} job crash detected, see server logs
371 1583 false _timescaledb_functions policy_refresh_continuous_aggregate 3912538 2000-01-01 01:00:00.000 +0100 2024-07-31 15:18:10.704 +0200 {"end_offset": "00:00:00", "start_offset": "28 days", "mat_hypertable_id": 16} job crash detected, see server logs
372 1584 false _timescaledb_functions policy_refresh_continuous_aggregate 3912538 2000-01-01 01:00:00.000 +0100 2024-07-31 15:18:10.704 +0200 {"end_offset": "00:00:00", "start_offset": "28 days", "mat_hypertable_id": 24} job crash detected, see server logs
373 1585 false _timescaledb_functions policy_refresh_continuous_aggregate 3912538 2000-01-01 01:00:00.000 +0100 2024-07-31 15:18:10.704 +0200 {"end_offset": "00:00:00", "start_offset": "28 days", "mat_hypertable_id": 28} job crash detected, see server logs
374 1613 false _timescaledb_functions policy_refresh_continuous_aggregate 260 2000-01-01 01:00:00.000 +0100 2024-07-31 16:16:49.179 +0200 {"end_offset": "00:00:00", "start_offset": "28 days", "mat_hypertable_id": 18} job crash detected, see server logs
375 1616 false _timescaledb_functions policy_refresh_continuous_aggregate 260 2000-01-01 01:00:00.000 +0100 2024-07-31 16:16:49.182 +0200 {"end_offset": "00:00:00", "start_offset": "28 days", "mat_hypertable_id": 16} job crash detected, see server logs
376 1617 false _timescaledb_functions policy_refresh_continuous_aggregate 260 2000-01-01 01:00:00.000 +0100 2024-07-31 16:16:49.182 +0200 {"end_offset": "00:00:00", "start_offset": "28 days", "mat_hypertable_id": 24} job crash detected, see server logs
377 1618 false _timescaledb_functions policy_refresh_continuous_aggregate 260 2000-01-01 01:00:00.000 +0100 2024-07-31 16:16:49.182 +0200 {"end_offset": "00:00:00", "start_offset": "28 days", "mat_hypertable_id": 28} job crash detected, see server logs
fabriziomello commented 3 months ago

@cosimomeli looks like your background job for cagg refresh is leading to segfault then the job_errors output is correct. Would be great if you open another issue related to this segmentation for for further investigation.

pgloader commented 3 weeks ago

SELECT * FROM timescaledb_information.job_history ORDER BY start_time;

875 | 1005 | t | _timescaledb_functions | policy_compression | 1481534 | 2024-10-20 09:15:00.004445-04 | 2024-10-20 09:15:00.016282-04 | {"hypertable_id": 1, "compress_after": "14 days"} | | 876 | 1000 | t | _timescaledb_functions | policy_refresh_continuous_aggregate | 1504011 | 2024-10-20 09:30:00.003659-04 | 2024-10-20 09:33:19.511167-04 | {"end_offset": "00:01:00", "start_offset": "01:01:00", "mat_hypertable_id": 2} | | 877 | 1016 | t | _timescaledb_functions | policy_refresh_continuous_aggregate | 1504012 | 2024-10-20 09:30:00.004353-04 | 2024-10-20 09:32:59.072322-04 | {"end_offset": "00:01:00", "start_offset": "01:01:00", "mat_hypertable_id": 14} | | 878 | 1008 | t | _timescaledb_functions | policy_refresh_continuous_aggregate | 1504013 | 2024-10-20 09:30:00.00464-04 | 2024-10-20 09:30:15.16198-04 | {"end_offset": "00:01:00", "start_offset": "01:01:00", "mat_hypertable_id": 8} | | 879 | 1021 | t | _timescaledb_functions | policy_compression | 1534815 | 2024-10-20 09:45:00.002833-04 | 2024-10-20 09:45:00.013852-04 | {"hypertable_id": 13, "compress_after": "14 days"} | | 880 | 2 | f | _timescaledb_functions | policy_job_error_retention | 1544833 | 2024-10-20 09:52:29.102847-04 | 2024-10-20 09:52:29.109349-04 | {"drop_after": "1 month"} | 42883 | function _timescaledb_functions.policy_job_error_retention(integer, jsonb) does not exist 881 | 1008 | t | _timescaledb_functions | policy_refresh_continuous_aggregate | 1558331 | 2024-10-20 10:00:00.000606-04 | 2024-10-20 10:00:16.541826-04 | {"end_offset": "00:01:00", "start_offset": "01:01:00", "mat_hypertable_id": 8} | | 882 | 1016 | t | _timescaledb_functions | policy_refresh_continuous_aggregate | 1558332 | 2024-10-20 10:00:00.001381-04 | 2024-10-20 10:03:01.800141-04 | {"end_offset": "00:01:00", "start_offset": "01:01:00", "mat_hypertable_id": 14} | | 883 | 1000 | t | _timescaledb_functions | policy_refresh_continuous_aggregate | 1558333 | 2024-10-20 10:00:00.001708-04 | 2024-10-20 10:03:21.682105-04 | {"end_offset": "00:01:00", "start_offset": "01:01:00", "mat_hypertable_id": 2} | | 884 | 1008 | t | _timescaledb_functions | policy_refresh_continuous_aggregate | 1612695 | 2024-10-20 10:30:00.004533-04 | 2024-10-20 10:30:16.592725-04 | {"end_offset": "00:01:00", "start_offset": "01:01:00", "mat_hypertable_id": 8} | | 885 | 1016 | t | _timescaledb_functions | policy_refresh_continuous_aggregate | 1612696 | 2024-10-20 10:30:00.005279-04 | 2024-10-20 10:33:04.315277-04 | {"end_offset": "00:01:00", "start_offset": "01:01:00", "mat_hypertable_id": 14} | | 886 | 1000 | t | _timescaledb_functions | policy_refresh_continuous_aggregate | 1612697 | 2024-10-20 10:30:00.005631-04 | 2024-10-20 10:33:31.534959-04 | {"end_offset": "00:01:00", "start_offset": "01:01:00", "mat_hypertable_id": 2} | | (886 rows)

These are the most recent

pgloader commented 3 weeks ago

@pgloader can u please copy/paste the result of the following query?

SELECT * FROM timescaledb_information.job_history ORDER BY start_time;

Please see the attached outouts crash.txt

antekresic commented 3 weeks ago

Could you take a look at your logs for these failures?

 825 |   1016 | f         | _timescaledb_functions | policy_refresh_continuous_aggregate | 3882667 | 2024-10-20 01:30:00.002859-04 | 2024-10-20 01:30:00.002877-04 |                                                                                 |            | job crash detected, see server logs
 826 |   1000 | f         | _timescaledb_functions | policy_refresh_continuous_aggregate | 3882667 | 2024-10-20 01:30:00.003178-04 | 2024-10-20 01:30:00.003195-04 |                                                                                 |            | job crash detected, see server logs

Some time around 2024-10-20 01:30:00.002877-04 should give us some log lines about what happened here.

Or do the logs not contain anything this time either?

Thanks.

pgloader commented 3 weeks ago

2024-10-20 01:30:00.010 EDT @ LOG: continuous aggregate refresh (individual invalidation) on "cag_30m_metric_data_300" in window [ 2024-10-20 00:30:00-04, 2024-10-20 01:00:00-04 ] 2024-10-20 01:30:00.010 EDT @ LOG: continuous aggregate refresh (individual invalidation) on "cag_30m_metric_data_60" in window [ 2024-10-20 00:30:00-04, 2024-10-20 01:00:00-04 ] 2024-10-20 01:30:00.010 EDT @ LOG: continuous aggregate refresh (individual invalidation) on "cag_30m_metric_data_10" in window [ 2024-10-20 00:30:00-04, 2024-10-20 01:00:00-04 ] 2024-10-20 01:30:00.472 EDT @ LOG: deleted 0 row(s) from materialization table "_timescaledb_internal._materialized_hypertable_8"

fabriziomello commented 3 weeks ago

2024-10-20 01:30:00.010 EDT @ LOG: continuous aggregate refresh (individual invalidation) on "cag_30m_metric_data_300" in window [ 2024-10-20 00:30:00-04, 2024-10-20 01:00:00-04 ] 2024-10-20 01:30:00.010 EDT @ LOG: continuous aggregate refresh (individual invalidation) on "cag_30m_metric_data_60" in window [ 2024-10-20 00:30:00-04, 2024-10-20 01:00:00-04 ] 2024-10-20 01:30:00.010 EDT @ LOG: continuous aggregate refresh (individual invalidation) on "cag_30m_metric_data_10" in window [ 2024-10-20 00:30:00-04, 2024-10-20 01:00:00-04 ] 2024-10-20 01:30:00.472 EDT @ LOG: deleted 0 row(s) from materialization table "_timescaledb_internal._materialized_hypertable_8"

Is that possible you to cut your logs from 2024-10-20 01:00:00.000 EDT to 2024-10-20 02:00:00.000 EDT and attach here? But please make sure there's no sensible information on it.

Also recently I've made some refactoring on the code that capture and record logs executions and exceptions and would be nice if you can try it out by updating the extension to the 2.17.1.

pgloader commented 3 weeks ago

2024-10-20 01:30:00.010 EDT @ LOG: continuous aggregate refresh (individual invalidation) on "cag_30m_metric_data_300" in window [ 2024-10-20 00:30:00-04, 2024-10-20 01:00:00-04 ] 2024-10-20 01:30:00.010 EDT @ LOG: continuous aggregate refresh (individual invalidation) on "cag_30m_metric_data_60" in window [ 2024-10-20 00:30:00-04, 2024-10-20 01:00:00-04 ] 2024-10-20 01:30:00.010 EDT @ LOG: continuous aggregate refresh (individual invalidation) on "cag_30m_metric_data_10" in window [ 2024-10-20 00:30:00-04, 2024-10-20 01:00:00-04 ] 2024-10-20 01:30:00.472 EDT @ LOG: deleted 0 row(s) from materialization table "_timescaledb_internal._materialized_hypertable_8"

Is that possible you to cut your logs from 2024-10-20 01:00:00.000 EDT to 2024-10-20 02:00:00.000 EDT and attach here? But please make sure there's no sensible information on it.

Also recently I've made some refactoring on the code that capture and record logs executions and exceptions and would be nice if you can try it out by updating the extension to the 2.17.1.

Please see the attached 0102.log

fabriziomello commented 3 weeks ago

2024-10-20 01:30:00.010 EDT @ LOG: continuous aggregate refresh (individual invalidation) on "cag_30m_metric_data_300" in window [ 2024-10-20 00:30:00-04, 2024-10-20 01:00:00-04 ] 2024-10-20 01:30:00.010 EDT @ LOG: continuous aggregate refresh (individual invalidation) on "cag_30m_metric_data_60" in window [ 2024-10-20 00:30:00-04, 2024-10-20 01:00:00-04 ] 2024-10-20 01:30:00.010 EDT @ LOG: continuous aggregate refresh (individual invalidation) on "cag_30m_metric_data_10" in window [ 2024-10-20 00:30:00-04, 2024-10-20 01:00:00-04 ] 2024-10-20 01:30:00.472 EDT @ LOG: deleted 0 row(s) from materialization table "_timescaledb_internal._materialized_hypertable_8"

Is that possible you to cut your logs from 2024-10-20 01:00:00.000 EDT to 2024-10-20 02:00:00.000 EDT and attach here? But please make sure there's no sensible information on it. Also recently I've made some refactoring on the code that capture and record logs executions and exceptions and would be nice if you can try it out by updating the extension to the 2.17.1.

Please see the attached 0102.log

A quick grep into your logs showed the following:

/tmp 
➜ grep -i 'Continuous Aggregate' 0102.log
2024-10-20 01:00:00.010 EDT @ LOG:  continuous aggregate refresh (individual invalidation) on "cag_30m_metric_data_300" in window [ 2024-10-20 00:00:00-04, 2024-10-20 00:30:00-04 ]
2024-10-20 01:00:00.010 EDT @ LOG:  continuous aggregate refresh (individual invalidation) on "cag_30m_metric_data_60" in window [ 2024-10-20 00:00:00-04, 2024-10-20 00:30:00-04 ]
2024-10-20 01:00:00.010 EDT @ LOG:  continuous aggregate refresh (individual invalidation) on "cag_30m_metric_data_10" in window [ 2024-10-20 00:00:00-04, 2024-10-20 00:30:00-04 ]
2024-10-20 01:30:00.010 EDT @ LOG:  continuous aggregate refresh (individual invalidation) on "cag_30m_metric_data_300" in window [ 2024-10-20 00:30:00-04, 2024-10-20 01:00:00-04 ]
2024-10-20 01:30:00.010 EDT @ LOG:  continuous aggregate refresh (individual invalidation) on "cag_30m_metric_data_60" in window [ 2024-10-20 00:30:00-04, 2024-10-20 01:00:00-04 ]
2024-10-20 01:30:00.010 EDT @ LOG:  continuous aggregate refresh (individual invalidation) on "cag_30m_metric_data_10" in window [ 2024-10-20 00:30:00-04, 2024-10-20 01:00:00-04 ]
2024-10-20 01:30:10.012 EDT @ FATAL:  terminating background worker "Refresh Continuous Aggregate Policy [1000]" due to administrator command
2024-10-20 01:30:12.440 EDT @ LOG:  background worker "Refresh Continuous Aggregate Policy [1000]" (PID 621682) exited with exit code 1
2024-10-20 01:30:12.445 EDT @ FATAL:  terminating background worker "Refresh Continuous Aggregate Policy [1016]" due to administrator command
2024-10-20 01:30:15.075 EDT @ LOG:  background worker "Refresh Continuous Aggregate Policy [1016]" (PID 621681) exited with exit code 1

Looks like another process cancel the execution of the job so the error history is correct.

pgloader commented 3 weeks ago

Most likely conflicted with the pg_dump backup. Should we manually re-run these aggregations?

fabriziomello commented 2 weeks ago

Most likely conflicted with the pg_dump backup. Should we manually re-run these aggregations?

If you have successful executions after the failure then you're safe since the next execution will process all invalidation logs created even it it was before the window range executed by policy. The downside of it is that it will take more time to refresh because it have more buckets to aggregate.