timescale / timescaledb

An open-source time-series SQL database optimized for fast ingest and complex queries. Packaged as a PostgreSQL extension.
https://www.timescale.com/

[Bug]: reading a chunk ends up in DataFileRead hanging, other chunks work fine #7054

Open · fordfrog opened 1 week ago

fordfrog commented 1 week ago

What type of bug is this?

Data corruption, Performance issue

What subsystems and features are affected?

Command processing, Partitioning

What happened?

I can't read from a single chunk in my database; other chunks work fine. Anything (SELECT, VACUUM, compress, ...) that I try to do on the chunk ends up hanging in DataFileRead. That process then can't even be killed, and the whole database has to be restarted, which does not go smoothly because of the hanging process.
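
For anyone diagnosing the same symptom: a backend stuck like this shows up in pg_stat_activity with wait_event = 'DataFileRead'. A minimal check using only standard PostgreSQL catalogs:

-- list backends currently waiting on data file reads, with the query they are running
SELECT pid, state, wait_event_type, wait_event, left(query, 80) AS query
FROM pg_stat_activity
WHERE wait_event_type = 'IO' AND wait_event = 'DataFileRead';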

TimescaleDB version affected

2.15.2

PostgreSQL version used

16.3

What operating system did you use?

gentoo linux

What installation method did you use?

Source

What platform did you run on?

Not applicable

Relevant log output and stack trace

# grep 2_95 postgresql-*.csv 
postgresql-Tue.csv:2024-06-18 17:47:07.158 GMT,,,29568,,6671b827.7380,1,,2024-06-18 16:39:03 GMT,34/12760452,246338502,FATAL,57P01,"terminating background worker ""pg_background"" due to administrator command",,,,,"while updating tuple (813381,18) in relation ""_hyper_2_95_chunk""
postgresql-Tue.csv:2024-06-18 17:47:07.158 GMT,,,29567,,6671b827.737f,1,,2024-06-18 16:39:03 GMT,33/4746705,246338503,FATAL,57P01,"terminating background worker ""pg_background"" due to administrator command",,,,,"while updating tuple (813378,22) in relation ""_hyper_2_95_chunk""
postgresql-Tue.csv:2024-06-18 17:47:07.158 GMT,,,29572,,6671b827.7384,1,,2024-06-18 16:39:03 GMT,62/2136623,246338506,FATAL,57P01,"terminating background worker ""pg_background"" due to administrator command",,,,,"while updating tuple (813379,38) in relation ""_hyper_2_95_chunk""
postgresql-Tue.csv:2024-06-18 17:47:07.158 GMT,,,29570,,6671b827.7382,1,,2024-06-18 16:39:03 GMT,53/8301273,246338509,FATAL,57P01,"terminating background worker ""pg_background"" due to administrator command",,,,,"while updating tuple (813380,26) in relation ""_hyper_2_95_chunk""
postgresql-Tue.csv:2024-06-18 17:47:07.158 GMT,,,29569,,6671b827.7381,1,,2024-06-18 16:39:03 GMT,41/8912638,246338507,FATAL,57P01,"terminating background worker ""pg_background"" due to administrator command",,,,,"while updating tuple (813378,46) in relation ""_hyper_2_95_chunk""
postgresql-Tue.csv:2024-06-18 17:47:07.158 GMT,,,29575,,6671b827.7387,1,,2024-06-18 16:39:03 GMT,71/1757346,246338505,FATAL,57P01,"terminating background worker ""pg_background"" due to administrator command",,,,,"while updating tuple (813380,34) in relation ""_hyper_2_95_chunk""
postgresql-Tue.csv:2024-06-18 17:47:07.158 GMT,,,29574,,6671b827.7386,1,,2024-06-18 16:39:03 GMT,69/921512,246338504,FATAL,57P01,"terminating background worker ""pg_background"" due to administrator command",,,,,"while updating tuple (813383,49) in relation ""_hyper_2_95_chunk""
postgresql-Tue.csv:2024-06-18 17:47:07.158 GMT,,,29573,,6671b827.7385,1,,2024-06-18 16:39:03 GMT,66/2498710,246338508,FATAL,57P01,"terminating background worker ""pg_background"" due to administrator command",,,,,"while updating tuple (813378,57) in relation ""_hyper_2_95_chunk""
postgresql-Tue.csv:2024-06-18 17:47:07.158 GMT,,,29566,,6671b827.737e,1,,2024-06-18 16:39:03 GMT,32/10371971,246338511,FATAL,57P01,"terminating background worker ""pg_background"" due to administrator command",,,,,"while updating tuple (813378,27) in relation ""_hyper_2_95_chunk""
postgresql-Tue.csv:2024-06-18 23:55:26.257 GMT,,,13543,,66721ad3.34e7,1,,2024-06-18 23:40:03 GMT,44/107414,0,LOG,00000,"automatic vacuum of table ""fxtrader._timescaledb_internal._hyper_2_95_chunk"": index scans: 1
postgresql-Tue.csv:index ""_hyper_2_95_chunk_ticks_bid_ask_ticker_timestamp_position_idx"": pages: 748048 in total, 0 newly deleted, 0 currently deleted, 0 reusable
postgresql-Tue.csv:index ""_hyper_2_95_chunk_ticks_bid_ask_timestamp_idx"": pages: 133248 in total, 0 newly deleted, 0 currently deleted, 0 reusable
postgresql-Wed.csv:2024-06-19 11:10:20.704 GMT,"fxtrader","fxtrader",10125,"[local]",6671ce47.278d,905,"SELECT",2024-06-18 18:13:27 GMT,24/793,246461755,LOG,00000,"acquiring locks for compressing ""_timescaledb_internal._hyper_2_95_chunk""",,,,,,"select compress_chunk('_timescaledb_internal._hyper_2_95_chunk');",,,"psql","client backend",,0
postgresql-Wed.csv:2024-06-19 11:27:39.741 GMT,"fxtrader","fxtrader",10125,"[local]",6671ce47.278d,906,"SELECT",2024-06-18 18:13:27 GMT,24/793,246461755,ERROR,57014,"canceling statement due to user request",,,,,,"select compress_chunk('_timescaledb_internal._hyper_2_95_chunk');",,,"psql","client backend",,0
postgresql-Wed.csv:2024-06-19 15:53:15.247 GMT,"fordfrog","fxtrader",15284,"[local]",6671cf50.3bb4,434,"SELECT",2024-06-18 18:17:52 GMT,25/1780,246492784,LOG,00000,"acquiring locks for compressing ""_timescaledb_internal._hyper_2_95_chunk""",,,,,,"select compress_chunk('_timescaledb_internal._hyper_2_95_chunk');",,,"psql","client backend",,0
postgresql-Wed.csv:2024-06-19 16:02:05.968 GMT,"fordfrog","fxtrader",15284,"[local]",6671cf50.3bb4,435,"SELECT",2024-06-18 18:17:52 GMT,25/1780,246492784,ERROR,57014,"canceling statement due to user request",,,,,,"select compress_chunk('_timescaledb_internal._hyper_2_95_chunk');",,,"psql","client backend",,0
postgresql-Thu.csv:2024-06-20 14:39:13.084 GMT,"fxtrader","fxtrader",10125,"[local]",6671ce47.278d,909,"SELECT",2024-06-18 18:13:27 GMT,24/798,246641516,LOG,00000,"acquiring locks for compressing ""_timescaledb_internal._hyper_2_95_chunk""",,,,,,"select compress_chunk('_timescaledb_internal._hyper_2_95_chunk');",,,"psql","client backend",,0
postgresql-Thu.csv:2024-06-20 14:45:08.230 GMT,"fxtrader","fxtrader",10125,"[local]",6671ce47.278d,910,"SELECT",2024-06-18 18:13:27 GMT,24/798,246641516,ERROR,57014,"canceling statement due to user request",,,,,,"select compress_chunk('_timescaledb_internal._hyper_2_95_chunk');",,,"psql","client backend",,0
postgresql-Thu.csv:2024-06-20 14:46:31.797 GMT,"fxtrader","fxtrader",10125,"[local]",6671ce47.278d,911,"SELECT",2024-06-18 18:13:27 GMT,24/799,246642480,LOG,00000,"acquiring locks for compressing ""_timescaledb_internal._hyper_2_95_chunk""",,,,,,"select compress_chunk('_timescaledb_internal._hyper_2_95_chunk');",,,"psql","client backend",,0
postgresql-Thu.csv:2024-06-20 14:58:27.132 GMT,,,3982,,6671cdab.f8e,7,,2024-06-18 18:10:51 GMT,,0,LOG,00000,"server process (PID 29899) was terminated by signal 9: Killed","Failed process was running: autovacuum: VACUUM _timescaledb_internal._hyper_2_95_chunk",,,,,,,,"","postmaster",,0
postgresql-Thu.csv:2024-06-20 14:59:53.181 GMT,"fordfrog","fxtrader",2520,"[local]",667443d9.9d8,5,"SELECT",2024-06-20 14:59:37 GMT,4/249,246643595,LOG,00000,"acquiring locks for compressing ""_timescaledb_internal._hyper_2_95_chunk""",,,,,,"select compress_chunk('_timescaledb_internal._hyper_2_95_chunk');",,,"psql","client backend",,0
postgresql-Thu.csv:2024-06-20 14:59:53.182 GMT,"fordfrog","fxtrader",2520,"[local]",667443d9.9d8,6,"SELECT",2024-06-20 14:59:37 GMT,4/249,246643595,LOG,00000,"locks acquired for compressing ""_timescaledb_internal._hyper_2_95_chunk""",,,,,,"select compress_chunk('_timescaledb_internal._hyper_2_95_chunk');",,,"psql","client backend",,0
postgresql-Thu.csv:2024-06-20 14:59:53.200 GMT,"fordfrog","fxtrader",2520,"[local]",667443d9.9d8,7,"SELECT",2024-06-20 14:59:37 GMT,4/249,246643595,LOG,00000,"new compressed chunk ""_timescaledb_internal.compress_hyper_3_174_chunk"" created",,,,,,"select compress_chunk('_timescaledb_internal._hyper_2_95_chunk');",,,"psql","client backend",,0
postgresql-Thu.csv:2024-06-20 14:59:53.200 GMT,"fordfrog","fxtrader",2520,"[local]",667443d9.9d8,8,"SELECT",2024-06-20 14:59:37 GMT,4/249,246643595,LOG,00000,"using tuplesort to scan rows from ""_hyper_2_95_chunk"" for compression",,,,,,"select compress_chunk('_timescaledb_internal._hyper_2_95_chunk');",,,"psql","client backend",,0
postgresql-Thu.csv:2024-06-20 16:31:34.905 GMT,,,1294,,667443c3.50e,13,,2024-06-20 14:59:15 GMT,,0,LOG,00000,"server process (PID 2520) was terminated by signal 9: Killed","Failed process was running: select compress_chunk('_timescaledb_internal._hyper_2_95_chunk');",,,,,,,,"","postmaster",,0
postgresql-Thu.csv:2024-06-20 16:48:57.538 GMT,"fordfrog","fxtrader",30720,"[local]",66745c2c.7800,10,"SELECT",2024-06-20 16:43:24 GMT,20/154,0,ERROR,57014,"canceling statement due to user request",,,,,,"select * from _timescaledb_internal._hyper_2_95_chunk limit 1;",,,"psql","client backend",,0
postgresql-Thu.csv:2024-06-20 17:08:08.914 GMT,,,16505,,66745989.4079,7,,2024-06-20 16:32:09 GMT,,0,LOG,00000,"server process (PID 17417) was terminated by signal 9: Killed","Failed process was running: vacuum full verbose analyze _timescaledb_internal._hyper_2_95_chunk ;",,,,,,,,"","postmaster",,0

How can we reproduce the bug?

As other chunks work fine, I suspect some corruption, and I have no idea how to replicate it. The log shows what was going on with the chunk during the last week. I recall that earlier today I noticed a vacuum hanging on the chunk for two days or so, so I killed it, but the chunk was probably already broken at that time.

EDIT: I just found out that dropping the broken chunk and re-inserting the data that belongs to it should recreate the chunk.

EDIT2: I managed to get rid of the chunk, though the database got stuck again, and I restored the data from backup. Still checking the data...
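
For reference, a minimal sketch of that drop-and-reinsert path. The hypertable name ticks_bid_ask is only inferred from the index names in the log above, the timestamps are placeholders for the chunk's actual range, and ticks_recovered is a hypothetical staging table holding the recovered rows:

-- look up the broken chunk's time range (timescaledb_information.chunks is a standard view)
SELECT chunk_name, range_start, range_end
FROM timescaledb_information.chunks
WHERE chunk_name = '_hyper_2_95_chunk';

-- drop only that chunk by bounding drop_chunks() to its range
SELECT drop_chunks('ticks_bid_ask',
                   older_than => TIMESTAMPTZ '2024-05-08 00:00+00',
                   newer_than => TIMESTAMPTZ '2024-05-01 00:00+00');

-- re-insert the affected rows into the hypertable; TimescaleDB recreates the chunk automatically
INSERT INTO ticks_bid_ask SELECT * FROM ticks_recovered;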

tjcutajar commented 1 week ago

We also ran into this DataFileRead hang issue a couple of days ago and had to perform a restore.

@fordfrog this might be unrelated, but did you upgrade to version 2.15.2 recently without running this script as per the release notes? Just wondering if this is somehow related.

fordfrog commented 1 week ago

@fordfrog this might be unrelated, but did you upgrade to version 2.15.2 recently without running this script as per the release notes? Just wondering if this is somehow related.

In fact, 2.15.2 is the first version I used (so I didn't run the script). I'm in the process of migrating a table with close to 5 billion ticks to a hypertable (71 chunks as of now), and so far this is the only issue I've encountered.
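
As a side note, the migration described above typically looks roughly like the sketch below; all table and column names here are hypothetical and the real schema may differ:

-- create the target table, turn it into a hypertable, then copy the data over
CREATE TABLE ticks_new (LIKE ticks_old INCLUDING ALL);
SELECT create_hypertable('ticks_new', 'timestamp', chunk_time_interval => INTERVAL '7 days');
INSERT INTO ticks_new SELECT * FROM ticks_old;  -- in practice, batched by time range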

fordfrog commented 1 week ago

Also, not sure if it is important, but I have the database server replicated to another PostgreSQL instance, and there the cluster was fine. So I used the replica to recover the records (dump, then insert back into the production database after removing the broken chunk).
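
For completeness, a rough sketch of that dump-and-reinsert step using psql's \copy. The hypertable name ticks_bid_ask is inferred from the index names above and may differ:

-- on the replica: export the rows of the affected chunk to a file
\copy (SELECT * FROM _timescaledb_internal._hyper_2_95_chunk) TO 'chunk_95.csv' CSV

-- on production, after the broken chunk has been removed: load the rows back into the
-- hypertable so TimescaleDB recreates the chunk
\copy ticks_bid_ask FROM 'chunk_95.csv' CSV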

nikkhils commented 1 week ago

@fordfrog as can be seen in the logs, someone ran a VACUUM FULL (VF) on this chunk. VF takes an exclusive lock on the chunk, and ANY other operation will not be able to do anything until VF completes.

Failed process was running: vacuum full verbose analyze _timescaledb_internal._hyper_2_95_chunk ;

Why was the VF command initiated? It's generally not recommended practice to use VF.
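
A quick way to tell a lock wait behind VACUUM FULL apart from the IO wait (DataFileRead) reported above, using only standard PostgreSQL functions:

-- sessions blocked on a lock, and the PIDs blocking them;
-- a backend stuck purely in DataFileRead will not appear here
SELECT pid, pg_blocking_pids(pid) AS blocked_by, wait_event_type, wait_event,
       left(query, 60) AS query
FROM pg_stat_activity
WHERE cardinality(pg_blocking_pids(pid)) > 0;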

fordfrog commented 1 week ago

@fordfrog as can be seen in the logs, someone ran a VACUUM FULL (VF) on this chunk. VF takes an exclusive lock on the chunk, and ANY other operation will not be able to do anything until VF completes.

Failed process was running: vacuum full verbose analyze _timescaledb_internal._hyper_2_95_chunk ;

Why was the VF command initiated? It's generally not recommended practice to use VF.

It was already after the chunk was broken; it was an attempt to see whether a complete rewrite of the chunk would help or not... but it froze, like any other operation on the chunk.