pgloader commented 4 months ago

What type of bug is this?

Crash

What subsystems and features are affected?

Background worker

What happened?

job crash detected, see server logs but there was no information in the PostrgeSQL statement log TimescaleDB: 2.15.2 PostgreSQL: 16.3 log_min_error_statement: log

Besides, the message in the job_history would disappear

it was 5 minutes ago =# select * from job_errors; job_id | proc_schema | proc_name | pid | start_time | finish_time | sqlerrcode | err_message --------+------------------------+-------------------------------------+---------+-------------------------------+-------------------------------+------------+------------------------------------- 1003 | _timescaledb_functions | policy_refresh_continuous_aggregate | 1116242 | 2024-06-28 09:06:39.752447-04 | 2024-06-28 09:06:39.752542-04 | | job crash detected, see server logs 1002 | _timescaledb_functions | policy_refresh_continuous_aggregate | 1116242 | 2024-06-28 10:46:50.699781-04 | 2024-06-28 10:46:50.699857-04 | | job crash detected, see server logs 1025 | _timescaledb_functions | policy_refresh_continuous_aggregate | 2128427 | 2024-07-01 06:30:00.006238-04 | 2024-07-01 06:30:00.006365-04 | | job crash detected, see server logs 1023 | _timescaledb_functions | policy_refresh_continuous_aggregate | 2128427 | 2024-07-01 07:00:00.00073-04 | 2024-07-01 07:00:00.000763-04 | | job crash detected, see server logs (4 rows)

Now

select * from job_errors;

job_id | proc_schema | proc_name | pid | start_time | finish_time | sqlerrcode | err_message --------+------------------------+-------------------------------------+---------+-------------------------------+-------------------------------+------------+------------------------------------- 1003 | _timescaledb_functions | policy_refresh_continuous_aggregate | 1116242 | 2024-06-28 09:06:39.752447-04 | 2024-06-28 09:06:39.752542-04 | | job crash detected, see server logs 1002 | _timescaledb_functions | policy_refresh_continuous_aggregate | 1116242 | 2024-06-28 10:46:50.699781-04 | 2024-06-28 10:46:50.699857-04 | | job crash detected, see server logs 1025 | _timescaledb_functions | policy_refresh_continuous_aggregate | 2128427 | 2024-07-01 06:30:00.006238-04 | 2024-07-01 06:30:00.006365-04 | | job crash detected, see server logs 1023 | _timescaledb_functions | policy_refresh_continuous_aggregate | 2128427 | 2024-07-01 09:00:00.008187-04 | 2024-07-01 09:00:00.008324-04 | | job crash detected, see server logs (4 rows)

The message 2024-07-01 06:30:00.006238-04 was gone

TimescaleDB version affected

2.15.2

PostgreSQL version used

16.3

What operating system did you use?

RHEL8.6

What installation method did you use?

Source

What platform did you run on?

On prem/Self-hosted

Relevant log output and stack trace

#                                     select * from job_errors;
 job_id |      proc_schema       |              proc_name              |   pid   |          start_time           |          finish_time          | sqlerrcode |             err_message
--------+------------------------+-------------------------------------+---------+-------------------------------+-------------------------------+------------+-------------------------------------
   1003 | _timescaledb_functions | policy_refresh_continuous_aggregate | 1116242 | 2024-06-28 09:06:39.752447-04 | 2024-06-28 09:06:39.752542-04 |            | job crash detected, see server logs
   1002 | _timescaledb_functions | policy_refresh_continuous_aggregate | 1116242 | 2024-06-28 10:46:50.699781-04 | 2024-06-28 10:46:50.699857-04 |            | job crash detected, see server logs
   1025 | _timescaledb_functions | policy_refresh_continuous_aggregate | 2128427 | 2024-07-01 06:30:00.006238-04 | 2024-07-01 06:30:00.006365-04 |            | job crash detected, see server logs
   1023 | _timescaledb_functions | policy_refresh_continuous_aggregate | 2128427 | 2024-07-01 07:00:00.00073-04  | 2024-07-01 07:00:00.000763-04 |            | job crash detected, see server logs
(4 rows)

How can we reproduce the bug?

# select * from job_errors;
 job_id |      proc_schema       |              proc_name              |   pid   |          start_time           |          finish_time          | sqlerrcode |             err_message
--------+------------------------+-------------------------------------+---------+-------------------------------+-------------------------------+------------+-------------------------------------
   1003 | _timescaledb_functions | policy_refresh_continuous_aggregate | 1116242 | 2024-06-28 09:06:39.752447-04 | 2024-06-28 09:06:39.752542-04 |            | job crash detected, see server logs
   1002 | _timescaledb_functions | policy_refresh_continuous_aggregate | 1116242 | 2024-06-28 10:46:50.699781-04 | 2024-06-28 10:46:50.699857-04 |            | job crash detected, see server logs
   1025 | _timescaledb_functions | policy_refresh_continuous_aggregate | 2128427 | 2024-07-01 06:30:00.006238-04 | 2024-07-01 06:30:00.006365-04 |            | job crash detected, see server logs
   1023 | _timescaledb_functions | policy_refresh_continuous_aggregate | 2128427 | 2024-07-01 09:00:00.008187-04 | 2024-07-01 09:00:00.008324-04 |            | job crash detected, see server logs
(4 rows)

fabriziomello commented 3 months ago

@pgloader can u please copy/paste the result of the following query?

SELECT * FROM timescaledb_information.job_history ORDER BY start_time;

cosimomeli commented 3 months ago

Hi, I think I could have the same issues with background jobs. In my case, I have this issue with continuous aggregations:

2024-07-31 14:16:48 UTC [51]: [15-1] 66aa4740.33 0     LOG:  background worker "Refresh Continuous Aggregate Policy [1613]" (PID 222) was terminated by signal 11: Segmentation fault
2024-07-31 14:16:48 UTC [51]: [16-1] 66aa4740.33 0     DETAIL:  Failed process was running: CALL _timescaledb_functions.policy_refresh_continuous_aggregate()
2024-07-31 14:16:48 UTC [51]: [17-1] 66aa4740.33 0     LOG:  terminating any other active server processes

After the segfault, the whole Postgres restarts and enters recovery mode for a while. I'm using Timescale 2.15.3

This is the result of the query you asked.

id	job_id	succeeded	proc_schema	proc_name	pid	start_time	finish_time	config	sqlerrcode
1	1001	false	_timescaledb_functions	policy_compression	2672843	2000-01-01 01:00:00.000 +0100	2024-07-24 09:09:00.356 +0200	{"hypertable_id": 2, "compress_after": "01:00:00"}	job crash detected, see server logs
2	1004	false	_timescaledb_functions	policy_compression	2672843	2000-01-01 01:00:00.000 +0100	2024-07-24 09:09:00.362 +0200	{"hypertable_id": 5, "compress_after": "01:00:00"}	job crash detected, see server logs
3	1005	false	_timescaledb_functions	policy_compression	2672843	2000-01-01 01:00:00.000 +0100	2024-07-24 09:09:00.362 +0200	{"hypertable_id": 1, "compress_after": "01:00:00"}	job crash detected, see server logs
4	1010	false	_timescaledb_functions	policy_retention	2672843	2000-01-01 01:00:00.000 +0100	2024-07-24 09:09:00.362 +0200	{"drop_after": "30 days", "hypertable_id": 5}	job crash detected, see server logs
5	1025	false	_timescaledb_functions	policy_refresh_continuous_aggregate	2672843	2000-01-01 01:00:00.000 +0100	2024-07-24 09:09:00.362 +0200	{"end_offset": "00:00:00", "start_offset": "28 days", "mat_hypertable_id": 18}	job crash detected, see server logs
6	1028	false	_timescaledb_functions	policy_refresh_continuous_aggregate	2672843	2000-01-01 01:00:00.000 +0100	2024-07-24 09:09:00.362 +0200	{"end_offset": "00:00:00", "start_offset": "28 days", "mat_hypertable_id": 16}	job crash detected, see server logs
7	1029	false	_timescaledb_functions	policy_refresh_continuous_aggregate	2672843	2000-01-01 01:00:00.000 +0100	2024-07-24 09:09:00.362 +0200	{"end_offset": "00:00:00", "start_offset": "28 days", "mat_hypertable_id": 24}	job crash detected, see server logs
34	1070	false	_timescaledb_functions	policy_retention	2673025	2000-01-01 01:00:00.000 +0100	2024-07-24 09:09:32.354 +0200	{"drop_after": "30 days", "hypertable_id": 5}	job crash detected, see server logs
35	1085	false	_timescaledb_functions	policy_refresh_continuous_aggregate	2673025	2000-01-01 01:00:00.000 +0100	2024-07-24 09:09:32.355 +0200	{"end_offset": "00:00:00", "start_offset": "28 days", "mat_hypertable_id": 18}	job crash detected, see server logs
36	1088	false	_timescaledb_functions	policy_refresh_continuous_aggregate	2673025	2000-01-01 01:00:00.000 +0100	2024-07-24 09:09:32.355 +0200	{"end_offset": "00:00:00", "start_offset": "28 days", "mat_hypertable_id": 16}	job crash detected, see server logs
37	1089	false	_timescaledb_functions	policy_refresh_continuous_aggregate	2673025	2000-01-01 01:00:00.000 +0100	2024-07-24 09:09:32.355 +0200	{"end_offset": "00:00:00", "start_offset": "28 days", "mat_hypertable_id": 24}	job crash detected, see server logs
38	1004	false	_timescaledb_functions	policy_compression	2698766	2000-01-01 01:00:00.000 +0100	2024-07-24 11:17:20.310 +0200	{"hypertable_id": 5, "compress_after": "01:00:00"}	job crash detected, see server logs
39	1005	false	_timescaledb_functions	policy_compression	2698766	2000-01-01 01:00:00.000 +0100	2024-07-24 11:17:20.313 +0200	{"hypertable_id": 1, "compress_after": "01:00:00"}	job crash detected, see server logs
40	1118	false	_timescaledb_functions	policy_refresh_continuous_aggregate	2698766	2000-01-01 01:00:00.000 +0100	2024-07-24 11:17:20.314 +0200	{"end_offset": "00:00:00", "start_offset": "28 days", "mat_hypertable_id": 18}	job crash detected, see server logs
41	1121	false	_timescaledb_functions	policy_refresh_continuous_aggregate	2698766	2000-01-01 01:00:00.000 +0100	2024-07-24 11:17:20.314 +0200	{"end_offset": "00:00:00", "start_offset": "28 days", "mat_hypertable_id": 16}	job crash detected, see server logs
42	1122	false	_timescaledb_functions	policy_refresh_continuous_aggregate	2698766	2000-01-01 01:00:00.000 +0100	2024-07-24 11:17:20.314 +0200	{"end_offset": "00:00:00", "start_offset": "28 days", "mat_hypertable_id": 24}	job crash detected, see server logs
43	1123	false	_timescaledb_functions	policy_refresh_continuous_aggregate	2698766	2000-01-01 01:00:00.000 +0100	2024-07-24 11:17:20.314 +0200	{"end_offset": "00:00:00", "start_offset": "28 days", "mat_hypertable_id": 28}	job crash detected, see server logs
71	1151	false	_timescaledb_functions	policy_refresh_continuous_aggregate	2698852	2000-01-01 01:00:00.000 +0100	2024-07-24 11:17:31.238 +0200	{"end_offset": "00:00:00", "start_offset": "28 days", "mat_hypertable_id": 18}	job crash detected, see server logs
72	1154	false	_timescaledb_functions	policy_refresh_continuous_aggregate	2698852	2000-01-01 01:00:00.000 +0100	2024-07-24 11:17:31.239 +0200	{"end_offset": "00:00:00", "start_offset": "28 days", "mat_hypertable_id": 16}	job crash detected, see server logs
73	1155	false	_timescaledb_functions	policy_refresh_continuous_aggregate	2698852	2000-01-01 01:00:00.000 +0100	2024-07-24 11:17:31.239 +0200	{"end_offset": "00:00:00", "start_offset": "28 days", "mat_hypertable_id": 24}	job crash detected, see server logs
74	1156	false	_timescaledb_functions	policy_refresh_continuous_aggregate	2698852	2000-01-01 01:00:00.000 +0100	2024-07-24 11:17:31.239 +0200	{"end_offset": "00:00:00", "start_offset": "28 days", "mat_hypertable_id": 28}	job crash detected, see server logs
104	1184	false	_timescaledb_functions	policy_refresh_continuous_aggregate	2698927	2000-01-01 01:00:00.000 +0100	2024-07-24 11:17:40.968 +0200	{"end_offset": "00:00:00", "start_offset": "28 days", "mat_hypertable_id": 18}	job crash detected, see server logs
105	1187	false	_timescaledb_functions	policy_refresh_continuous_aggregate	2698927	2000-01-01 01:00:00.000 +0100	2024-07-24 11:17:40.969 +0200	{"end_offset": "00:00:00", "start_offset": "28 days", "mat_hypertable_id": 16}	job crash detected, see server logs
106	1188	false	_timescaledb_functions	policy_refresh_continuous_aggregate	2698927	2000-01-01 01:00:00.000 +0100	2024-07-24 11:17:40.969 +0200	{"end_offset": "00:00:00", "start_offset": "28 days", "mat_hypertable_id": 24}	job crash detected, see server logs
107	1189	false	_timescaledb_functions	policy_refresh_continuous_aggregate	2698927	2000-01-01 01:00:00.000 +0100	2024-07-24 11:17:40.969 +0200	{"end_offset": "00:00:00", "start_offset": "28 days", "mat_hypertable_id": 28}	job crash detected, see server logs
108	1004	false	_timescaledb_functions	policy_compression	2771978	2000-01-01 01:00:00.000 +0100	2024-07-24 19:47:40.848 +0200	{"hypertable_id": 5, "compress_after": "01:00:00"}	job crash detected, see server logs
109	1217	false	_timescaledb_functions	policy_refresh_continuous_aggregate	2771978	2000-01-01 01:00:00.000 +0100	2024-07-24 19:47:40.855 +0200	{"end_offset": "00:00:00", "start_offset": "28 days", "mat_hypertable_id": 18}	job crash detected, see server logs
110	1220	false	_timescaledb_functions	policy_refresh_continuous_aggregate	2771978	2000-01-01 01:00:00.000 +0100	2024-07-24 19:47:40.855 +0200	{"end_offset": "00:00:00", "start_offset": "28 days", "mat_hypertable_id": 16}	job crash detected, see server logs
111	1221	false	_timescaledb_functions	policy_refresh_continuous_aggregate	2771978	2000-01-01 01:00:00.000 +0100	2024-07-24 19:47:40.855 +0200	{"end_offset": "00:00:00", "start_offset": "28 days", "mat_hypertable_id": 24}	job crash detected, see server logs
112	1222	false	_timescaledb_functions	policy_refresh_continuous_aggregate	2771978	2000-01-01 01:00:00.000 +0100	2024-07-24 19:47:40.855 +0200	{"end_offset": "00:00:00", "start_offset": "28 days", "mat_hypertable_id": 28}	job crash detected, see server logs
141	1250	false	_timescaledb_functions	policy_refresh_continuous_aggregate	2772125	2000-01-01 01:00:00.000 +0100	2024-07-24 19:48:16.886 +0200	{"end_offset": "00:00:00", "start_offset": "28 days", "mat_hypertable_id": 18}	job crash detected, see server logs
142	1253	false	_timescaledb_functions	policy_refresh_continuous_aggregate	2772125	2000-01-01 01:00:00.000 +0100	2024-07-24 19:48:16.886 +0200	{"end_offset": "00:00:00", "start_offset": "28 days", "mat_hypertable_id": 16}	job crash detected, see server logs
143	1254	false	_timescaledb_functions	policy_refresh_continuous_aggregate	2772125	2000-01-01 01:00:00.000 +0100	2024-07-24 19:48:16.886 +0200	{"end_offset": "00:00:00", "start_offset": "28 days", "mat_hypertable_id": 24}	job crash detected, see server logs
144	1255	false	_timescaledb_functions	policy_refresh_continuous_aggregate	2772125	2000-01-01 01:00:00.000 +0100	2024-07-24 19:48:16.886 +0200	{"end_offset": "00:00:00", "start_offset": "28 days", "mat_hypertable_id": 28}	job crash detected, see server logs
145	1253	false	_timescaledb_functions	policy_refresh_continuous_aggregate	2957163	2000-01-01 01:00:00.000 +0100	2024-07-25 17:56:17.357 +0200	{"end_offset": "00:00:00", "start_offset": "28 days", "mat_hypertable_id": 16}	job crash detected, see server logs
178	1283	false	_timescaledb_functions	policy_refresh_continuous_aggregate	2957254	2000-01-01 01:00:00.000 +0100	2024-07-25 17:56:28.657 +0200	{"end_offset": "00:00:00", "start_offset": "28 days", "mat_hypertable_id": 18}	job crash detected, see server logs
179	1286	false	_timescaledb_functions	policy_refresh_continuous_aggregate	2957254	2000-01-01 01:00:00.000 +0100	2024-07-25 17:56:28.658 +0200	{"end_offset": "00:00:00", "start_offset": "28 days", "mat_hypertable_id": 16}	job crash detected, see server logs
180	1287	false	_timescaledb_functions	policy_refresh_continuous_aggregate	2957254	2000-01-01 01:00:00.000 +0100	2024-07-25 17:56:28.658 +0200	{"end_offset": "00:00:00", "start_offset": "28 days", "mat_hypertable_id": 24}	job crash detected, see server logs
181	1288	false	_timescaledb_functions	policy_refresh_continuous_aggregate	2957254	2000-01-01 01:00:00.000 +0100	2024-07-25 17:56:28.659 +0200	{"end_offset": "00:00:00", "start_offset": "28 days", "mat_hypertable_id": 28}	job crash detected, see server logs
182	1286	false	_timescaledb_functions	policy_refresh_continuous_aggregate	3275070	2000-01-01 01:00:00.000 +0100	2024-07-27 17:11:53.277 +0200	{"end_offset": "00:00:00", "start_offset": "28 days", "mat_hypertable_id": 16}	job crash detected, see server logs
183	1287	false	_timescaledb_functions	policy_refresh_continuous_aggregate	3427181	2000-01-01 01:00:00.000 +0100	2024-07-28 15:41:31.390 +0200	{"end_offset": "00:00:00", "start_offset": "28 days", "mat_hypertable_id": 24}	job crash detected, see server logs
216	1316	false	_timescaledb_functions	policy_refresh_continuous_aggregate	3427265	2000-01-01 01:00:00.000 +0100	2024-07-28 15:41:41.907 +0200	{"end_offset": "00:00:00", "start_offset": "28 days", "mat_hypertable_id": 18}	job crash detected, see server logs
217	1319	false	_timescaledb_functions	policy_refresh_continuous_aggregate	3427265	2000-01-01 01:00:00.000 +0100	2024-07-28 15:41:41.907 +0200	{"end_offset": "00:00:00", "start_offset": "28 days", "mat_hypertable_id": 16}	job crash detected, see server logs
218	1320	false	_timescaledb_functions	policy_refresh_continuous_aggregate	3427265	2000-01-01 01:00:00.000 +0100	2024-07-28 15:41:41.907 +0200	{"end_offset": "00:00:00", "start_offset": "28 days", "mat_hypertable_id": 24}	job crash detected, see server logs
219	1321	false	_timescaledb_functions	policy_refresh_continuous_aggregate	3427265	2000-01-01 01:00:00.000 +0100	2024-07-28 15:41:41.907 +0200	{"end_offset": "00:00:00", "start_offset": "28 days", "mat_hypertable_id": 28}	job crash detected, see server logs
249	1349	false	_timescaledb_functions	policy_refresh_continuous_aggregate	3427339	2000-01-01 01:00:00.000 +0100	2024-07-28 15:41:51.902 +0200	{"end_offset": "00:00:00", "start_offset": "28 days", "mat_hypertable_id": 18}	job crash detected, see server logs
250	1352	false	_timescaledb_functions	policy_refresh_continuous_aggregate	3427339	2000-01-01 01:00:00.000 +0100	2024-07-28 15:41:51.902 +0200	{"end_offset": "00:00:00", "start_offset": "28 days", "mat_hypertable_id": 16}	job crash detected, see server logs
251	1353	false	_timescaledb_functions	policy_refresh_continuous_aggregate	3427339	2000-01-01 01:00:00.000 +0100	2024-07-28 15:41:51.902 +0200	{"end_offset": "00:00:00", "start_offset": "28 days", "mat_hypertable_id": 24}	job crash detected, see server logs
252	1354	false	_timescaledb_functions	policy_refresh_continuous_aggregate	3427339	2000-01-01 01:00:00.000 +0100	2024-07-28 15:41:51.902 +0200	{"end_offset": "00:00:00", "start_offset": "28 days", "mat_hypertable_id": 28}	job crash detected, see server logs
253	1349	false	_timescaledb_functions	policy_refresh_continuous_aggregate	3571752	2000-01-01 01:00:00.000 +0100	2024-07-29 13:04:27.233 +0200	{"end_offset": "00:00:00", "start_offset": "28 days", "mat_hypertable_id": 18}	job crash detected, see server logs
254	1352	false	_timescaledb_functions	policy_refresh_continuous_aggregate	3746649	2000-01-01 01:00:00.000 +0100	2024-07-30 14:57:14.466 +0200	{"end_offset": "00:00:00", "start_offset": "28 days", "mat_hypertable_id": 16}	job crash detected, see server logs
287	1382	false	_timescaledb_functions	policy_refresh_continuous_aggregate	3746753	2000-01-01 01:00:00.000 +0100	2024-07-30 14:57:30.255 +0200	{"end_offset": "00:00:00", "start_offset": "28 days", "mat_hypertable_id": 18}	job crash detected, see server logs
288	1385	false	_timescaledb_functions	policy_refresh_continuous_aggregate	3746753	2000-01-01 01:00:00.000 +0100	2024-07-30 14:57:30.255 +0200	{"end_offset": "00:00:00", "start_offset": "28 days", "mat_hypertable_id": 16}	job crash detected, see server logs
289	1386	false	_timescaledb_functions	policy_refresh_continuous_aggregate	3746753	2000-01-01 01:00:00.000 +0100	2024-07-30 14:57:30.255 +0200	{"end_offset": "00:00:00", "start_offset": "28 days", "mat_hypertable_id": 24}	job crash detected, see server logs
290	1382	false	_timescaledb_functions	policy_refresh_continuous_aggregate	3889928	2000-01-01 01:00:00.000 +0100	2024-07-31 12:08:50.999 +0200	{"end_offset": "00:00:00", "start_offset": "28 days", "mat_hypertable_id": 18}	job crash detected, see server logs
291	1386	false	_timescaledb_functions	policy_refresh_continuous_aggregate	3889928	2000-01-01 01:00:00.000 +0100	2024-07-31 12:08:51.001 +0200	{"end_offset": "00:00:00", "start_offset": "28 days", "mat_hypertable_id": 24}	job crash detected, see server logs
292	1415	false	_timescaledb_functions	policy_refresh_continuous_aggregate	3894415	2000-01-01 01:00:00.000 +0100	2024-07-31 12:46:04.729 +0200	{"end_offset": "00:00:00", "start_offset": "28 days", "mat_hypertable_id": 18}	job crash detected, see server logs
293	1418	false	_timescaledb_functions	policy_refresh_continuous_aggregate	3894415	2000-01-01 01:00:00.000 +0100	2024-07-31 12:46:04.730 +0200	{"end_offset": "00:00:00", "start_offset": "28 days", "mat_hypertable_id": 16}	job crash detected, see server logs
294	1419	false	_timescaledb_functions	policy_refresh_continuous_aggregate	3894415	2000-01-01 01:00:00.000 +0100	2024-07-31 12:46:04.730 +0200	{"end_offset": "00:00:00", "start_offset": "28 days", "mat_hypertable_id": 24}	job crash detected, see server logs
295	1420	false	_timescaledb_functions	policy_refresh_continuous_aggregate	3894415	2000-01-01 01:00:00.000 +0100	2024-07-31 12:46:04.730 +0200	{"end_offset": "00:00:00", "start_offset": "28 days", "mat_hypertable_id": 28}	job crash detected, see server logs
296	1448	false	_timescaledb_functions	policy_refresh_continuous_aggregate	3907195	2000-01-01 01:00:00.000 +0100	2024-07-31 14:34:21.743 +0200	{"end_offset": "00:00:00", "start_offset": "28 days", "mat_hypertable_id": 18}	job crash detected, see server logs
297	1451	false	_timescaledb_functions	policy_refresh_continuous_aggregate	3907195	2000-01-01 01:00:00.000 +0100	2024-07-31 14:34:21.743 +0200	{"end_offset": "00:00:00", "start_offset": "28 days", "mat_hypertable_id": 16}	job crash detected, see server logs
298	1452	false	_timescaledb_functions	policy_refresh_continuous_aggregate	3907195	2000-01-01 01:00:00.000 +0100	2024-07-31 14:34:21.744 +0200	{"end_offset": "00:00:00", "start_offset": "28 days", "mat_hypertable_id": 24}	job crash detected, see server logs
299	1453	false	_timescaledb_functions	policy_refresh_continuous_aggregate	3907195	2000-01-01 01:00:00.000 +0100	2024-07-31 14:34:21.744 +0200	{"end_offset": "00:00:00", "start_offset": "28 days", "mat_hypertable_id": 28}	job crash detected, see server logs
329	1481	false	_timescaledb_functions	policy_refresh_continuous_aggregate	3907971	2000-01-01 01:00:00.000 +0100	2024-07-31 14:40:19.588 +0200	{"end_offset": "00:00:00", "start_offset": "28 days", "mat_hypertable_id": 18}	job crash detected, see server logs
330	1484	false	_timescaledb_functions	policy_refresh_continuous_aggregate	3907971	2000-01-01 01:00:00.000 +0100	2024-07-31 14:40:19.589 +0200	{"end_offset": "00:00:00", "start_offset": "28 days", "mat_hypertable_id": 16}	job crash detected, see server logs
331	1485	false	_timescaledb_functions	policy_refresh_continuous_aggregate	3907971	2000-01-01 01:00:00.000 +0100	2024-07-31 14:40:19.589 +0200	{"end_offset": "00:00:00", "start_offset": "28 days", "mat_hypertable_id": 24}	job crash detected, see server logs
332	1486	false	_timescaledb_functions	policy_refresh_continuous_aggregate	3907971	2000-01-01 01:00:00.000 +0100	2024-07-31 14:40:19.589 +0200	{"end_offset": "00:00:00", "start_offset": "28 days", "mat_hypertable_id": 28}	job crash detected, see server logs
333	1514	false	_timescaledb_functions	policy_refresh_continuous_aggregate	3909582	2000-01-01 01:00:00.000 +0100	2024-07-31 14:53:52.791 +0200	{"end_offset": "00:00:00", "start_offset": "28 days", "mat_hypertable_id": 18}	job crash detected, see server logs
334	1517	false	_timescaledb_functions	policy_refresh_continuous_aggregate	3909582	2000-01-01 01:00:00.000 +0100	2024-07-31 14:53:52.791 +0200	{"end_offset": "00:00:00", "start_offset": "28 days", "mat_hypertable_id": 16}	job crash detected, see server logs
335	1518	false	_timescaledb_functions	policy_refresh_continuous_aggregate	3909582	2000-01-01 01:00:00.000 +0100	2024-07-31 14:53:52.791 +0200	{"end_offset": "00:00:00", "start_offset": "28 days", "mat_hypertable_id": 24}	job crash detected, see server logs
336	1519	false	_timescaledb_functions	policy_refresh_continuous_aggregate	3909582	2000-01-01 01:00:00.000 +0100	2024-07-31 14:53:52.791 +0200	{"end_offset": "00:00:00", "start_offset": "28 days", "mat_hypertable_id": 28}	job crash detected, see server logs
337	1547	false	_timescaledb_functions	policy_refresh_continuous_aggregate	3912453	2000-01-01 01:00:00.000 +0100	2024-07-31 15:18:00.563 +0200	{"end_offset": "00:00:00", "start_offset": "28 days", "mat_hypertable_id": 18}	job crash detected, see server logs
338	1550	false	_timescaledb_functions	policy_refresh_continuous_aggregate	3912453	2000-01-01 01:00:00.000 +0100	2024-07-31 15:18:00.563 +0200	{"end_offset": "00:00:00", "start_offset": "28 days", "mat_hypertable_id": 16}	job crash detected, see server logs
339	1551	false	_timescaledb_functions	policy_refresh_continuous_aggregate	3912453	2000-01-01 01:00:00.000 +0100	2024-07-31 15:18:00.564 +0200	{"end_offset": "00:00:00", "start_offset": "28 days", "mat_hypertable_id": 24}	job crash detected, see server logs
340	1552	false	_timescaledb_functions	policy_refresh_continuous_aggregate	3912453	2000-01-01 01:00:00.000 +0100	2024-07-31 15:18:00.564 +0200	{"end_offset": "00:00:00", "start_offset": "28 days", "mat_hypertable_id": 28}	job crash detected, see server logs
370	1580	false	_timescaledb_functions	policy_refresh_continuous_aggregate	3912538	2000-01-01 01:00:00.000 +0100	2024-07-31 15:18:10.703 +0200	{"end_offset": "00:00:00", "start_offset": "28 days", "mat_hypertable_id": 18}	job crash detected, see server logs
371	1583	false	_timescaledb_functions	policy_refresh_continuous_aggregate	3912538	2000-01-01 01:00:00.000 +0100	2024-07-31 15:18:10.704 +0200	{"end_offset": "00:00:00", "start_offset": "28 days", "mat_hypertable_id": 16}	job crash detected, see server logs
372	1584	false	_timescaledb_functions	policy_refresh_continuous_aggregate	3912538	2000-01-01 01:00:00.000 +0100	2024-07-31 15:18:10.704 +0200	{"end_offset": "00:00:00", "start_offset": "28 days", "mat_hypertable_id": 24}	job crash detected, see server logs
373	1585	false	_timescaledb_functions	policy_refresh_continuous_aggregate	3912538	2000-01-01 01:00:00.000 +0100	2024-07-31 15:18:10.704 +0200	{"end_offset": "00:00:00", "start_offset": "28 days", "mat_hypertable_id": 28}	job crash detected, see server logs
374	1613	false	_timescaledb_functions	policy_refresh_continuous_aggregate	260	2000-01-01 01:00:00.000 +0100	2024-07-31 16:16:49.179 +0200	{"end_offset": "00:00:00", "start_offset": "28 days", "mat_hypertable_id": 18}	job crash detected, see server logs
375	1616	false	_timescaledb_functions	policy_refresh_continuous_aggregate	260	2000-01-01 01:00:00.000 +0100	2024-07-31 16:16:49.182 +0200	{"end_offset": "00:00:00", "start_offset": "28 days", "mat_hypertable_id": 16}	job crash detected, see server logs
376	1617	false	_timescaledb_functions	policy_refresh_continuous_aggregate	260	2000-01-01 01:00:00.000 +0100	2024-07-31 16:16:49.182 +0200	{"end_offset": "00:00:00", "start_offset": "28 days", "mat_hypertable_id": 24}	job crash detected, see server logs
377	1618	false	_timescaledb_functions	policy_refresh_continuous_aggregate	260	2000-01-01 01:00:00.000 +0100	2024-07-31 16:16:49.182 +0200	{"end_offset": "00:00:00", "start_offset": "28 days", "mat_hypertable_id": 28}	job crash detected, see server logs

fabriziomello commented 3 months ago

@cosimomeli looks like your background job for cagg refresh is leading to segfault then the job_errors output is correct. Would be great if you open another issue related to this segmentation for for further investigation.

pgloader commented 3 weeks ago

SELECT * FROM timescaledb_information.job_history ORDER BY start_time;

875 | 1005 | t | _timescaledb_functions | policy_compression | 1481534 | 2024-10-20 09:15:00.004445-04 | 2024-10-20 09:15:00.016282-04 | {"hypertable_id": 1, "compress_after": "14 days"} | | 876 | 1000 | t | _timescaledb_functions | policy_refresh_continuous_aggregate | 1504011 | 2024-10-20 09:30:00.003659-04 | 2024-10-20 09:33:19.511167-04 | {"end_offset": "00:01:00", "start_offset": "01:01:00", "mat_hypertable_id": 2} | | 877 | 1016 | t | _timescaledb_functions | policy_refresh_continuous_aggregate | 1504012 | 2024-10-20 09:30:00.004353-04 | 2024-10-20 09:32:59.072322-04 | {"end_offset": "00:01:00", "start_offset": "01:01:00", "mat_hypertable_id": 14} | | 878 | 1008 | t | _timescaledb_functions | policy_refresh_continuous_aggregate | 1504013 | 2024-10-20 09:30:00.00464-04 | 2024-10-20 09:30:15.16198-04 | {"end_offset": "00:01:00", "start_offset": "01:01:00", "mat_hypertable_id": 8} | | 879 | 1021 | t | _timescaledb_functions | policy_compression | 1534815 | 2024-10-20 09:45:00.002833-04 | 2024-10-20 09:45:00.013852-04 | {"hypertable_id": 13, "compress_after": "14 days"} | | 880 | 2 | f | _timescaledb_functions | policy_job_error_retention | 1544833 | 2024-10-20 09:52:29.102847-04 | 2024-10-20 09:52:29.109349-04 | {"drop_after": "1 month"} | 42883 | function _timescaledb_functions.policy_job_error_retention(integer, jsonb) does not exist 881 | 1008 | t | _timescaledb_functions | policy_refresh_continuous_aggregate | 1558331 | 2024-10-20 10:00:00.000606-04 | 2024-10-20 10:00:16.541826-04 | {"end_offset": "00:01:00", "start_offset": "01:01:00", "mat_hypertable_id": 8} | | 882 | 1016 | t | _timescaledb_functions | policy_refresh_continuous_aggregate | 1558332 | 2024-10-20 10:00:00.001381-04 | 2024-10-20 10:03:01.800141-04 | {"end_offset": "00:01:00", "start_offset": "01:01:00", "mat_hypertable_id": 14} | | 883 | 1000 | t | _timescaledb_functions | policy_refresh_continuous_aggregate | 1558333 | 2024-10-20 10:00:00.001708-04 | 2024-10-20 10:03:21.682105-04 | {"end_offset": "00:01:00", "start_offset": "01:01:00", "mat_hypertable_id": 2} | | 884 | 1008 | t | _timescaledb_functions | policy_refresh_continuous_aggregate | 1612695 | 2024-10-20 10:30:00.004533-04 | 2024-10-20 10:30:16.592725-04 | {"end_offset": "00:01:00", "start_offset": "01:01:00", "mat_hypertable_id": 8} | | 885 | 1016 | t | _timescaledb_functions | policy_refresh_continuous_aggregate | 1612696 | 2024-10-20 10:30:00.005279-04 | 2024-10-20 10:33:04.315277-04 | {"end_offset": "00:01:00", "start_offset": "01:01:00", "mat_hypertable_id": 14} | | 886 | 1000 | t | _timescaledb_functions | policy_refresh_continuous_aggregate | 1612697 | 2024-10-20 10:30:00.005631-04 | 2024-10-20 10:33:31.534959-04 | {"end_offset": "00:01:00", "start_offset": "01:01:00", "mat_hypertable_id": 2} | | (886 rows)

These are the most recent

pgloader commented 3 weeks ago

@pgloader can u please copy/paste the result of the following query?
SELECT * FROM timescaledb_information.job_history ORDER BY start_time;

Please see the attached outouts crash.txt

antekresic commented 3 weeks ago

Could you take a look at your logs for these failures?

 825 |   1016 | f         | _timescaledb_functions | policy_refresh_continuous_aggregate | 3882667 | 2024-10-20 01:30:00.002859-04 | 2024-10-20 01:30:00.002877-04 |                                                                                 |            | job crash detected, see server logs
 826 |   1000 | f         | _timescaledb_functions | policy_refresh_continuous_aggregate | 3882667 | 2024-10-20 01:30:00.003178-04 | 2024-10-20 01:30:00.003195-04 |                                                                                 |            | job crash detected, see server logs

Some time around 2024-10-20 01:30:00.002877-04 should give us some log lines about what happened here.

Or do the logs not contain anything this time either?

Thanks.

pgloader commented 3 weeks ago

2024-10-20 01:30:00.010 EDT @ LOG: continuous aggregate refresh (individual invalidation) on "cag_30m_metric_data_300" in window [ 2024-10-20 00:30:00-04, 2024-10-20 01:00:00-04 ] 2024-10-20 01:30:00.010 EDT @ LOG: continuous aggregate refresh (individual invalidation) on "cag_30m_metric_data_60" in window [ 2024-10-20 00:30:00-04, 2024-10-20 01:00:00-04 ] 2024-10-20 01:30:00.010 EDT @ LOG: continuous aggregate refresh (individual invalidation) on "cag_30m_metric_data_10" in window [ 2024-10-20 00:30:00-04, 2024-10-20 01:00:00-04 ] 2024-10-20 01:30:00.472 EDT @ LOG: deleted 0 row(s) from materialization table "_timescaledb_internal._materialized_hypertable_8"

fabriziomello commented 3 weeks ago

2024-10-20 01:30:00.010 EDT @ LOG: continuous aggregate refresh (individual invalidation) on "cag_30m_metric_data_300" in window [ 2024-10-20 00:30:00-04, 2024-10-20 01:00:00-04 ] 2024-10-20 01:30:00.010 EDT @ LOG: continuous aggregate refresh (individual invalidation) on "cag_30m_metric_data_60" in window [ 2024-10-20 00:30:00-04, 2024-10-20 01:00:00-04 ] 2024-10-20 01:30:00.010 EDT @ LOG: continuous aggregate refresh (individual invalidation) on "cag_30m_metric_data_10" in window [ 2024-10-20 00:30:00-04, 2024-10-20 01:00:00-04 ] 2024-10-20 01:30:00.472 EDT @ LOG: deleted 0 row(s) from materialization table "_timescaledb_internal._materialized_hypertable_8"

Is that possible you to cut your logs from 2024-10-20 01:00:00.000 EDT to 2024-10-20 02:00:00.000 EDT and attach here? But please make sure there's no sensible information on it.

Also recently I've made some refactoring on the code that capture and record logs executions and exceptions and would be nice if you can try it out by updating the extension to the 2.17.1.

pgloader commented 3 weeks ago

2024-10-20 01:30:00.010 EDT @ LOG: continuous aggregate refresh (individual invalidation) on "cag_30m_metric_data_300" in window [ 2024-10-20 00:30:00-04, 2024-10-20 01:00:00-04 ] 2024-10-20 01:30:00.010 EDT @ LOG: continuous aggregate refresh (individual invalidation) on "cag_30m_metric_data_60" in window [ 2024-10-20 00:30:00-04, 2024-10-20 01:00:00-04 ] 2024-10-20 01:30:00.010 EDT @ LOG: continuous aggregate refresh (individual invalidation) on "cag_30m_metric_data_10" in window [ 2024-10-20 00:30:00-04, 2024-10-20 01:00:00-04 ] 2024-10-20 01:30:00.472 EDT @ LOG: deleted 0 row(s) from materialization table "_timescaledb_internal._materialized_hypertable_8"

Is that possible you to cut your logs from 2024-10-20 01:00:00.000 EDT to 2024-10-20 02:00:00.000 EDT and attach here? But please make sure there's no sensible information on it.

Also recently I've made some refactoring on the code that capture and record logs executions and exceptions and would be nice if you can try it out by updating the extension to the 2.17.1.

Please see the attached 0102.log

fabriziomello commented 3 weeks ago

2024-10-20 01:30:00.010 EDT @ LOG: continuous aggregate refresh (individual invalidation) on "cag_30m_metric_data_300" in window [ 2024-10-20 00:30:00-04, 2024-10-20 01:00:00-04 ] 2024-10-20 01:30:00.010 EDT @ LOG: continuous aggregate refresh (individual invalidation) on "cag_30m_metric_data_60" in window [ 2024-10-20 00:30:00-04, 2024-10-20 01:00:00-04 ] 2024-10-20 01:30:00.010 EDT @ LOG: continuous aggregate refresh (individual invalidation) on "cag_30m_metric_data_10" in window [ 2024-10-20 00:30:00-04, 2024-10-20 01:00:00-04 ] 2024-10-20 01:30:00.472 EDT @ LOG: deleted 0 row(s) from materialization table "_timescaledb_internal._materialized_hypertable_8"

Is that possible you to cut your logs from 2024-10-20 01:00:00.000 EDT to 2024-10-20 02:00:00.000 EDT and attach here? But please make sure there's no sensible information on it. Also recently I've made some refactoring on the code that capture and record logs executions and exceptions and would be nice if you can try it out by updating the extension to the 2.17.1.

Please see the attached 0102.log

A quick grep into your logs showed the following:

/tmp 
➜ grep -i 'Continuous Aggregate' 0102.log
2024-10-20 01:00:00.010 EDT @ LOG:  continuous aggregate refresh (individual invalidation) on "cag_30m_metric_data_300" in window [ 2024-10-20 00:00:00-04, 2024-10-20 00:30:00-04 ]
2024-10-20 01:00:00.010 EDT @ LOG:  continuous aggregate refresh (individual invalidation) on "cag_30m_metric_data_60" in window [ 2024-10-20 00:00:00-04, 2024-10-20 00:30:00-04 ]
2024-10-20 01:00:00.010 EDT @ LOG:  continuous aggregate refresh (individual invalidation) on "cag_30m_metric_data_10" in window [ 2024-10-20 00:00:00-04, 2024-10-20 00:30:00-04 ]
2024-10-20 01:30:00.010 EDT @ LOG:  continuous aggregate refresh (individual invalidation) on "cag_30m_metric_data_300" in window [ 2024-10-20 00:30:00-04, 2024-10-20 01:00:00-04 ]
2024-10-20 01:30:00.010 EDT @ LOG:  continuous aggregate refresh (individual invalidation) on "cag_30m_metric_data_60" in window [ 2024-10-20 00:30:00-04, 2024-10-20 01:00:00-04 ]
2024-10-20 01:30:00.010 EDT @ LOG:  continuous aggregate refresh (individual invalidation) on "cag_30m_metric_data_10" in window [ 2024-10-20 00:30:00-04, 2024-10-20 01:00:00-04 ]
2024-10-20 01:30:10.012 EDT @ FATAL:  terminating background worker "Refresh Continuous Aggregate Policy [1000]" due to administrator command
2024-10-20 01:30:12.440 EDT @ LOG:  background worker "Refresh Continuous Aggregate Policy [1000]" (PID 621682) exited with exit code 1
2024-10-20 01:30:12.445 EDT @ FATAL:  terminating background worker "Refresh Continuous Aggregate Policy [1016]" due to administrator command
2024-10-20 01:30:15.075 EDT @ LOG:  background worker "Refresh Continuous Aggregate Policy [1016]" (PID 621681) exited with exit code 1

Looks like another process cancel the execution of the job so the error history is correct.

pgloader commented 3 weeks ago

Most likely conflicted with the pg_dump backup. Should we manually re-run these aggregations?

fabriziomello commented 2 weeks ago

Most likely conflicted with the pg_dump backup. Should we manually re-run these aggregations?

If you have successful executions after the failure then you're safe since the next execution will process all invalidation logs created even it it was before the window range executed by policy. The downside of it is that it will take more time to refresh because it have more buckets to aggregate.

timescale / timescaledb

[Bug]: job crash detected, see server logs #7085

What type of bug is this?

What subsystems and features are affected?

What happened?

select * from job_errors;

TimescaleDB version affected

PostgreSQL version used

What operating system did you use?

What installation method did you use?

What platform did you run on?

Relevant log output and stack trace

How can we reproduce the bug?