timescale / timescaledb

An open-source time-series SQL database optimized for fast ingest and complex queries. Packaged as a PostgreSQL extension.
https://www.timescale.com/

[Bug]: Compression policy job keeps hanging and uses a lot of CPU #6223

Closed axpobastian closed 1 year ago

axpobastian commented 1 year ago

What type of bug is this?

Performance issue

What subsystems and features are affected?

Compression

What happened?

We have a compression_policy set on a hypertable. Currently, the job gets stuck (it runs for hours if we don't cancel it) and consumes a significant amount of CPU. If we remove the compression_policy and re-add it, it immediately gets stuck again.
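
For reference, removing and re-adding a compression policy goes through the standard policy API; a rough sketch of the statements involved (the hypertable name 'measurements' and the interval are placeholders, not the reporter's actual values):

-- Placeholder hypertable name and interval; the actual values differ.
SELECT remove_compression_policy('measurements');
SELECT add_compression_policy('measurements', compress_after => INTERVAL '1 day');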

TimescaleDB version affected

2.11.1

PostgreSQL version used

14.9

What operating system did you use?

not applicable

What installation method did you use?

Not applicable

What platform did you run on?

Managed Service for TimescaleDB (MST/Aiven)

Relevant log output and stack trace

No response

How can we reproduce the bug?

We do not know concrete steps to reproduce it with certainty.

Before the problem occurred we accidentally added a lot of old data (i.e. with timestamps far in the past) to our table, which resulted in a large number of new, uncompressed chunks (approximately 1,000). We have already removed all those chunks manually, but with no improvement.

The accidental import of old data happened because we modified the code feeding the data to our database, so currently every now and then we insert data that belongs to chunks that are already compressed.

konskov commented 1 year ago

hello @axpobastian, thank you for reaching out. Have you checked whether compression is waiting for any locks? What does a query like the following return?

SELECT * FROM pg_locks l JOIN pg_stat_activity a ON a.pid = l.pid;
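
For reference, a narrower variant of that check that shows only sessions still waiting on a lock (all columns are standard pg_locks / pg_stat_activity fields) would be:

SELECT a.pid, a.state, a.wait_event_type, a.wait_event, l.locktype, l.mode, a.query
FROM pg_locks l
JOIN pg_stat_activity a ON a.pid = l.pid
WHERE NOT l.granted;   -- only sessions still waiting for a lock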

axpobastian commented 1 year ago

Thank you for getting back to us. I just ran the query you suggested, and it does not seem to be waiting for any locks. Attached is the output of the query:

output.csv

mkindahl commented 1 year ago

@axpobastian Thanks for providing the output above. It seems like the compression job is compressing a lot of data, which is why it takes such a long time. There are a few things you can do to investigate whether the job's performance can be improved and to pinpoint where the issue is.

You can get the uncompressed chunks using this query (here I am looking at the conditions table).

select format('%I.%I', chunk_schema, chunk_name) from timescaledb_information.chunks  where is_compressed = false and hypertable_name = 'conditions';

You can then try to compress a single one and see the performance of it:

SELECT compress_chunk('_timescaledb_internal._hyper_1_1_chunk');

Unfortunately, the compression policy job is single-threaded (we have issue #3077 open to improve this), which means that it has to compress the chunks one at a time.
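
If you want to drive the same work manually rather than waiting for the policy, a minimal sketch that compresses the remaining uncompressed chunks one at a time follows (the hypertable name 'conditions' is reused from the example above; note that a DO block runs in a single transaction, so locks are held until it finishes):

DO $$
DECLARE
    c regclass;
BEGIN
    FOR c IN
        SELECT format('%I.%I', chunk_schema, chunk_name)::regclass
        FROM timescaledb_information.chunks
        WHERE hypertable_name = 'conditions' AND NOT is_compressed
    LOOP
        RAISE NOTICE 'compressing %', c;
        -- if_not_compressed avoids an error if the chunk got compressed in the meantime
        PERFORM compress_chunk(c, if_not_compressed => true);
    END LOOP;
END
$$;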

Since you're compressing so much data, it looks like you've just done a backfill or just turned on compression, which would explain the amount of data that the job needs to compress.

axpobastian commented 1 year ago

Thanks for looking into it.

We tried the query to list uncompressed chunks, and we didn't find any results other than the most recent chunks (we have daily chunks that normally get compressed every day). We didn't quite do a backfill; maybe I can give a quick summary of what we did to get here and what happened. I feel like our situation is complex and something is going very wrong.

In the beginning, we had some external systems constantly feeding data into our TimescaleDB. We had a filter in the external system that wrote only data younger than 24 hours (since we use daily chunks and it used to be impossible to insert into compressed chunks). All of our daily chunks were compressed (as reported by the query above), except for the chunk for the current day. Then we removed the filter for data older than 24 hours, so if our source system sent data older than 24 hours, we inserted it as well. Note that the amount of old data we get from our source system is very small compared to the amount of recent data we write all the time.

After we removed the filter, we noticed a huge amount of errors on writing new data (DeadlockDetectedError, UniqueViolationError).

What we did then was: we removed all the uncompressed chunks using drop_chunks(), since they were not relevant for us, and we put the filter in our source system back in place so we do not write old data anymore. This helped with the errors on writing new data; however, even though the output of the above query now told us that we have no more uncompressed chunks, the compression policy job still kept getting stuck. At this point we removed the compression policy job and switched to compressing the new daily chunk manually.
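
For reference, dropping chunks for a bounded time window can be done in one drop_chunks call; a sketch (the hypertable name and timestamps are placeholders, not the exact values used here):

-- Drops only chunks that lie entirely between the two bounds.
SELECT drop_chunks('measurements',
                   older_than => TIMESTAMPTZ '2023-09-01',
                   newer_than => TIMESTAMPTZ '2015-01-01');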

A little later we also realized that, even without any compression policy job running, our query performance for some queries deteriorated massively (mostly queries of the type select ts from measurements where signal_id = 8062 and ts >= '2023-10-20 21:07:05.881000 +00:00' order by ts asc limit 1).

Sorry for the wall of text, I guess the tl;dr is that we started inserting a very small amount of data into already compressed chunks and everything started to fall apart.

edit: We also noticed that there is a recompress_chunk function; could this be related to our issue?

mkindahl commented 1 year ago

@axpobastian Thanks for the thorough explanation, it gives much better insight into how you ended up in this situation.

Given the number of jobs running at the same time and the fact that you did not see any deadlocks, I wrote the suggestions because it looked like you just had a lot of data to compress and the jobs were not really stuck.

We also noticed a huge amount of errors on writing new data (DeadlockDetectedError, UniqueViolationError)

Even though this is about the situation preceding the current issue, it would be good to see these deadlock reports. They contain the lock as well as the OID of the relation and the query. Understanding them might help us find out what went wrong here and may also offer insight into what is happening now.

A little later we also realized that, even without any compression policy job running, our query performance for some queries deteriorated massively (mostly queries of the type select ts from measurements where signal_id = 8062 and ts >= '2023-10-20 21:07:05.881000 +00:00' order by ts asc limit 1).

If you have the explain plan for one of those deteriorating queries, it is possible to see what the issue is.
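
For example, the plan of the query quoted above can be captured with run-time and buffer statistics like this (table and column names taken from that query):

EXPLAIN (ANALYZE, BUFFERS)
SELECT ts
FROM measurements
WHERE signal_id = 8062
  AND ts >= '2023-10-20 21:07:05.881000 +00:00'
ORDER BY ts ASC
LIMIT 1;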

edit: We also noticed that there is a recompress_chunk function; could this be related to our issue?

When you have a mix of compressed and uncompressed data in a chunk, this function is used to decompress the compressed data, incorporate the new uncompressed data, and re-compress it. We had issues with it previously because it was executed inside a transaction and hence could grab locks in an order that would lead to a deadlock.

You can try to run it manually on the old chunks just to make sure that all data is compressed. See instructions under recompress_chunk. Although we haven't seen any issues with this, it is not entirely impossible that this could be part of your problem.
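
In the TimescaleDB versions discussed in this thread, recompress_chunk is a procedure, so a manual run on a single chunk looks roughly like this (the chunk name reuses the placeholder from the earlier example):

CALL recompress_chunk('_timescaledb_internal._hyper_1_1_chunk');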

axpobastian commented 1 year ago

@mkindahl thank you so much for getting back to us, we think we might know how to tackle our problem.

We played around with the recompress_chunk procedure, and we found the following: on some chunks, it takes literal hours to complete (we observed 6 hours and 10 hours). Could this be the reason our compression_policy got stuck? I.e. it wasn't really stuck, it was just recompressing chunks which took a looong time. We also looked at the plans of our slow queries: if the timespan also has non-recompressed chunks, then it appears to me that it somehow goes through all the chunks, gets the data, and sorts later. Of course that would be very slow.

Based on this, we are trying to recompress all our chunks. We are still unsure how to deal with the occasional chunk that takes hours to recompress; we would be very thankful for suggestions. Most chunks are done in seconds to minutes, though. Also, we would appreciate a comment on whether our reasoning makes sense :)

If we are right about this, it would not be a TimescaleDB bug, but rather something that the documentation could perhaps point out as a major pitfall.

Result_1.csv Result_2.csv

mkindahl commented 1 year ago

on some chunks, it takes literal hours to complete (we observed 6 hours and 10 hours). Could this be the reason our compression_policy got stuck? I.e. it wasn't really stuck, it was just recompressing chunks which took a looong time.

This was my suspicion when you said you did not have any visible deadlocks, yes.

If it also has non-recompressed chunks in the timespan, then it appears to me that it somehow goes through all the chunks, gets the data, and sorts later. Of course that would be very slow.

Yes, this is the problem. Since you have data that is both compressed and uncompressed, it decompresses all data (into memory), incorporates the uncompressed data, and then sorts it all, possibly spilling to disk, which will just make it worse. You can see it clearly in the second plan, while the first plan just visits the compressed chunks.

We are still unsure how to deal with the occasional chunk that takes hours to recompress, we would be very thankful for suggestions.

Not sure what is causing it to take hours to recompress. It sounds odd, but it could have to do with other updates or inserts interfering with the process, or with disk I/O being throttled if it spills to disk. I suspect that you just have to let it finish, or run it when the machine is not busy.
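
If it helps with planning, one way to anticipate which chunks are likely to be slow is to check their on-disk size first and schedule the biggest ones for a quiet period; a sketch using chunks_detailed_size (the hypertable name is a placeholder):

SELECT chunk_schema, chunk_name, pg_size_pretty(total_bytes) AS total_size
FROM chunks_detailed_size('measurements')
ORDER BY total_bytes DESC
LIMIT 10;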

mkindahl commented 1 year ago

Closing the issue since it seems you have a path to fixing this. If you still have issues, feel free to re-open.

zadeviggers commented 9 months ago

Hi @mkindahl,

I'm having a similar issue, except that on my largest hypertable all the chunks seem to have rolled themselves into one huge chunk (~1.3 TB) and uncompressed themselves (they were in several chunks and around ~600 GB compressed the other day). I've been trying to figure out how to get the compression to run, or even to tell whether it is running. I changed next_start from -infinity to one minute in the future, and it soon changed back to -infinity, so I assume that means the task is running, but it could also just have failed and not cleaned up after itself.

Any guidance would be appreciated!

sensor_data=# SELECT show_chunks('sensor_data_4');
                show_chunks
-------------------------------------------
 _timescaledb_internal._hyper_29_338_chunk

sensor_data=# SELECT * FROM timescaledb_information.jobs
  WHERE proc_name='policy_compression' AND hypertable_name='sensor_data_4';
 job_id |     application_name      | schedule_interval | max_runtime | max_retries | retry_period |      proc_schema      |     proc_name      |  owner   | scheduled | fixed_schedule |                       config                        | next_start | initial_start | hypertable_schema | hypertable_name |     check_schema      |        check_name
--------+---------------------------+-------------------+-------------+-------------+--------------+-----------------------+--------------------+----------+-----------+----------------+-----------------------------------------------------+------------+---------------+-------------------+-----------------+-----------------------+--------------------------
   1006 | Compression Policy [1006] | 02:00:00          | 00:00:00    |          -1 | 01:00:00     | _timescaledb_internal | policy_compression | postgres | t         | f              | {"hypertable_id": 29, "compress_after": "02:00:00"} | -infinity  |               | public            | sensor_data_4   | _timescaledb_internal | policy_compression_check
(1 row)

mkindahl commented 9 months ago

Hi there @zadeviggers! Since this is quite a different problem compared to the original issue, could you please open a new issue with the information you have? It might be a good idea to look at _timescaledb_config.bgw_job as well since it contains the source information.

I would expect next_start to be set to a later time after the job has executed; if that does not happen, something is wrong.
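
For a quick look at the job's run history alongside the catalog entry mentioned above, something like the following should show whether the job actually ran and whether it failed (job id 1006 is taken from the output above):

SELECT job_id, last_run_status, last_run_started_at,
       last_successful_finish, next_start, total_failures
FROM timescaledb_information.job_stats
WHERE job_id = 1006;

-- The underlying catalog entry referenced in the comment above:
SELECT * FROM _timescaledb_config.bgw_job WHERE id = 1006;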

zadeviggers commented 9 months ago

I added another TB of disk space and it seems to have sorted itself out now. If this comes up again though, I'll open an issue.