timescale / timescaledb

An open-source time-series SQL database optimized for fast ingest and complex queries. Packaged as a PostgreSQL extension.
https://www.timescale.com/

Segfault when drop_chunks with 1.7.1 #1986

Closed akamensky closed 4 years ago

akamensky commented 4 years ago

Relevant system information:

Describe the bug
After upgrading PostgreSQL (10.9 -> 12.3) and the TimescaleDB extension (1.6.1 -> 1.7.1), we see repeated segfaults when executing drop_chunks:
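SELECT drop_chunks(interval '1 day');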

Under GDB the segfault yields:

Program received signal SIGSEGV, Segmentation fault.
0x00007fa710efc259 in cmp_slices_by_dimension_id () from /usr/pgsql-12/lib/timescaledb-1.7.1.so

To Reproduce
Not certain how to reproduce it on any other setup. This currently happens in our staging environment.

Expected behavior
Chunks dropped and disk space freed.

Actual behavior
Segfault.

akamensky commented 4 years ago

@mkindahl I checked which table those chunks belong to; they are all from the same table. And we have a few log entries for this table in the last few days:

2020-08-31 07:52:14.632 HKT [34841] LOG:  connection authorized: user=postgres database=db_name application_name=psql
2020-08-31 07:52:26.006 HKT [34841] WARNING:  unexpected state for chunk _timescaledb_internal._hyper_3_7451_chunk, dropping anyway
2020-08-31 07:52:26.006 HKT [34841] DETAIL:  The integrity of hypertable public.table_name might be compromised since one of its chunks lacked a dimension slice.
2020-08-31 07:52:26.006 HKT [34841] STATEMENT:  truncate table_name;
2020-08-31 07:52:28.412 HKT [34841] LOG:  duration: 5546.165 ms  statement: truncate table_name;
2020-09-01 10:00:01.473 HKT [21247] LOG:  connection authorized: user=postgres database=db_name application_name=psql
2020-09-01 10:00:01.524 HKT [8706] ERROR:  could not open relation with OID 2913115
2020-09-01 10:00:01.524 HKT [8706] STATEMENT:  INSERT INTO "public"."table_name"("col_1","col_2","col_3","col_4","col_5","col_6","col_7","col_8","col_9","col_10") VALUES($1,$2,$3,$4,$5,$6,$7,$8,$9,$10)

But there were no segfaults until now (that is roughly ~100 drop_chunks calls on this database). It looks like it does not segfault when the chunks with missing dimension slices are NOT yet due for deletion. But once chunks that need to be deleted have missing dimension slices -- segfault.
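In case it is useful, here is the kind of catalog query I would expect to reveal the broken state. This is just a sketch against the _timescaledb_catalog layout as I understand it in 1.7, not an official check:

-- chunks that no longer have any dimension slice reachable in the catalog
SELECT c.id, c.schema_name, c.table_name
FROM _timescaledb_catalog.chunk c
WHERE NOT EXISTS (
    SELECT 1
    FROM _timescaledb_catalog.chunk_constraint cc
    JOIN _timescaledb_catalog.dimension_slice ds
      ON ds.id = cc.dimension_slice_id
    WHERE cc.chunk_id = c.id
);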

Hope that helps. Please let me know if I can provide any more information here.

akamensky commented 4 years ago

Also, I do think shipping debug symbols along with the libraries would be very helpful, as that would tell us the exact line where the segfault happened. Right now there are no debug symbols in the provided RPMs (nor is there a separate -debuginfo RPM, which is the common way to provide debug symbols separately).

akamensky commented 4 years ago

Any updates on this issue? It is still happening in a few of our environments.

akamensky commented 4 years ago

@erimatnor not sure why this was closed? As I posted above, the issue is still happening on 1.7.3. We also upgraded to 1.7.4, but the issue appears to still be there.

erimatnor commented 4 years ago

@akamensky perhaps it was a mistake to close. The issue was targeted for 1.7.3 and there was a fix merged. I think it might also be a duplicate of https://github.com/timescale/timescaledb/issues/2140. If so, can we track progress in that issue? Otherwise, it is OK to reopen this one.

akamensky commented 4 years ago

@erimatnor I feel it is something different; we don't get an error on drop_chunks. It either works or we get a segfault with the stack trace attached above. The issue you referenced doesn't seem to involve segfaults.

erimatnor commented 4 years ago

@akamensky OK, reopening. I think, however, that we need a bit more info to move forward on this issue. We need to be able to reproduce it locally. Can you provide the following information:

akamensky commented 4 years ago

> Schema/table definition of the hypertable on which this happens

We have quite a large number of hypertables, and it happens on different ones, mostly the high-traffic ones. Unfortunately, I can't really provide a table definition, as that would break compliance requirements. A highly obfuscated table definition for one of them was provided in #1841; please see if that is helpful.

> create_hypertable command to create the hypertable

All of them are of the form SELECT create_hypertable('public.table_name', 'timestamp_column', chunk_time_interval => interval '1 day'); some have a 1 hour chunk_time_interval, but nothing else fancy is happening there.
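For completeness, a minimal sketch of that setup (all table/column names here are placeholders, not our real schema):

-- hypothetical table roughly matching the description above
CREATE TABLE public.table_name (
    timestamp_column timestamptz NOT NULL,
    col_1 text,
    col_2 double precision
);
SELECT create_hypertable('public.table_name', 'timestamp_column',
                         chunk_time_interval => interval '1 day');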

> Concurrent processes running in the system against the table (from previous information it sounds as if there's one or more insert processes and one drop_chunks process)

As I already mentioned in this ticket, the concurrent write process running in our system is a set of Kafka Connect processes writing data from Kafka topics to Timescale tables. You should be able to reproduce this setup easily on your side; we do, however, have quite a high volume of Kafka messages. On the Kafka side it amounts to about 350 MB/s (megabytes, not megabits) of inbound data; converted to inbound data on the Timescale side, that is close to 40-50k rows per second inserted on average and about 200k rows per second at peak. Translated to disk IO, that is approximately 600 MB/s of disk writes on average; at peak we are maxing out our current disk IO capabilities (we are really looking forward to the distributed Timescale solution...).

As for drop_chunks, it has already been provided above. It is a cron entry of the form:

# output redirect is needed to get cron emails when an error happens
psql -U postgres -d db_name -c "SELECT drop_chunks(interval '1 day');" 2>&1

> Ideally a script or instructions to reproduce the crash. (Realize this might be hard given the likely concurrency-related nature of the issue)

We don't even know how to reliably reproduce it ourselves. From the discussion above, once there are some missing dimension slices, calling drop_chunks segfaults 100% of the time. Other than doing high-concurrency writes and waiting for it to happen, I don't know how to get the database into this state. The closest thing to a recipe I can offer is the sketch below.
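This is purely a guess at a reproduction path: fabricate the broken state by hand, assuming a superuser is allowed to edit the 1.7 catalog tables directly and that the foreign keys cascade the way the observed corruption suggests. Only try it on a disposable test database:

-- DANGER: deliberately orphan a chunk on a throwaway database,
-- then run the same retention call that segfaults for us.
DELETE FROM _timescaledb_catalog.dimension_slice
WHERE id = (SELECT min(dimension_slice_id)
            FROM _timescaledb_catalog.chunk_constraint
            WHERE dimension_slice_id IS NOT NULL);
SELECT drop_chunks(interval '1 day');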

I think it is good to try to find a way to reproduce the error, but given its nature, wouldn't it be better to look at where the segfault is happening and eliminate it there? The stack trace identifying where the segfault happens has already been provided above as well.

akamensky commented 4 years ago

@erimatnor @mkindahl I see the latest PR has been merged. Is there any ETA on it making it into a 1.7 patch release?

akamensky commented 3 years ago

@erimatnor @mkindahl this was closed, but the changes are unreleased. What is the current ETA for these changes landing on the 1.7 branch?

mfreed commented 3 years ago

These changes are currently released as part of 2.0-rc2. Let me discuss with the team whether there are plans to cherry-pick them back onto the 1.7.x branch at some point.

akamensky commented 3 years ago

@mfreed My understanding from the comments at https://github.com/timescale/timescaledb/pull/2514#issuecomment-705945990 was that this fix would be backported to the 1.7 branch. We currently run 1.7 in multiple environments/systems, and jumping to 2.x will be a lengthy upgrade process for us. In the meantime this issue is still affecting us (and has been since before it was reported).