Hello @axpobastian, thank you for reaching out. Have you checked whether compression is waiting for any locks? What does a query like the following return?
SELECT * FROM pg_locks l JOIN pg_stat_activity a ON a.pid = l.pid;
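If that returns a lot of rows, a narrowed-down variant like this (just a sketch, adjust the columns as needed) shows only the sessions that are actually waiting on a lock:
SELECT a.pid, a.state, a.wait_event_type, a.wait_event, l.locktype, l.mode, a.query
FROM pg_locks l
JOIN pg_stat_activity a ON a.pid = l.pid
WHERE NOT l.granted;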
Thank you for getting back to us. I just ran the query you suggested, and it seems it is not waiting for any locks. Attached is the output of the query:
@axpobastian Thanks for providing the output above. It seems like the compression job is compressing a lot of data, which is why it takes such a long time. There are a few other things you can do to see whether you can improve the performance of the job and to pinpoint where the issue is.
You can get the uncompressed chunks using this query (here I am looking at the conditions table):
select format('%I.%I', chunk_schema, chunk_name) from timescaledb_information.chunks where is_compressed = false and hypertable_name = 'conditions';
You can then try to compress a single one and see how it performs:
SELECT compress_chunk('_timescaledb_internal._hyper_1_1_chunk');
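You can also check the before/after sizes of the chunks with chunk_compression_stats (here again for the conditions hypertable):
SELECT chunk_name, compression_status,
       pg_size_pretty(before_compression_total_bytes) AS before,
       pg_size_pretty(after_compression_total_bytes) AS after
FROM chunk_compression_stats('conditions');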
Unfortunately, the compression policy job is single-threaded (we have an issue #3077 to improve this), which means that it has to compress each chunk in serial order.
Since you're compressing so much data, it looks like you've just done a backfill of data, or just turned on compression, which would explain the amount of data that the jobs need to compress.
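To get a feeling for how big the backlog is, something like this (a sketch, again using the conditions hypertable) sums up the chunks that are still uncompressed:
SELECT count(*) AS uncompressed_chunks,
       pg_size_pretty(sum(d.total_bytes)) AS total_size
FROM chunks_detailed_size('conditions') d
JOIN timescaledb_information.chunks c
  ON c.chunk_schema = d.chunk_schema AND c.chunk_name = d.chunk_name
WHERE c.hypertable_name = 'conditions' AND NOT c.is_compressed;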
Thanks for looking into it.
We tried the query to see uncompressed chunks, and we didn't find any results other than the most recent chunks (we have daily chunks that normally get compressed every day). We didn't quite do a backfill; maybe I can give a quick summary of what we did to get here and what happened. I feel like our situation is complex and that something is going very wrong.
In the beginning, we had some external systems constantly feeding data into our TimescaleDB. We had a filter in the external system so that it wrote only data younger than 24 hours (since we use daily chunks and it used to be impossible to insert into compressed chunks). All of our daily chunks were compressed (as reported by the query above), except for the chunk for the current day. Then we removed the filter for data older than 24 hours, so if our source system sent data that was older than 24 hours, we also inserted it. Note that the amount of that old data we get from our source system is very small compared to the amount of recent data we write all the time.
After we removed the filter, the following happened:
What we did then was: we removed all the uncompressed chunks using drop_chunks(), since they were not relevant for us, and we put the filter in our source system back in so we do not write old data anymore. This helped with the errors on writing new data; however, even though the output of the above query now told us that we had no more uncompressed chunks, the compression policy job still kept getting stuck. At this point we removed the compression policy job and switched to compressing the new daily chunk manually.
A little later we also realized that, even without any compression policy job running, our query performance for some queries deteriorated massively (mostly queries of the type select ts from measurements where signal_id = 8062 and ts >= '2023-10-20 21:07:05.881000 +00:00' order by ts asc limit 1).
Sorry for the wall of text, I guess the tl;dr is that we started inserting a very small amount of data into already compressed chunks and everything started to fall apart.
edit: We also noticed that there is a recompress_chunk function; is this possibly related to our issue?
@axpobastian Thanks for the thorough explanation, it gives much better insight into how you ended up in this situation.
Given the number of jobs running at the same time and the fact that you did not see any deadlocks, I wrote the suggestions because it looked like you just had a lot of data to compress and the jobs were not really stuck.
We also noticed a huge amount of errors on writing new data (DeadlockDetectedError, UniqueViolationError)
Even though this is for the situation preceding the current issue, it would be good to see these deadlock reports. They contain the lock as well as the OID of the relation and the query. Understanding these might help find out what went wrong here and maybe also offer insight into what is happening now.
A little later we also realized that, even without any compression policy job running, our query performance for some queries deteriorated massively (mostly queries of the type select ts from measurements where signal_id = 8062 and ts >= '2023-10-20 21:07:05.881000 +00:00' order by ts asc limit 1).
If you can share the explain plan for one of those deteriorating queries, it should be possible to see what the issue is.
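For example, something like this (note that ANALYZE actually executes the query, so expect it to be slow):
EXPLAIN (ANALYZE, BUFFERS)
SELECT ts FROM measurements
WHERE signal_id = 8062 AND ts >= '2023-10-20 21:07:05.881000 +00:00'
ORDER BY ts ASC
LIMIT 1;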
edit: We also noticed that there is a recompress_chunk function; is this possibly related to our issue?
When you have a mix of compressed and uncompressed data in a chunk, this function is used to decompress the compressed data, incorporate the new uncompressed data, and re-compress it. We had issues with it previously because it was executed inside a transaction and hence could grab locks in an order that would lead to a deadlock.
You can try to run it manually on the old chunks just to make sure that all data is compressed. See instructions under recompress_chunk. Although we haven't seen any issues with this, it is not entirely impossible that this could be part of your problem.
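For example, using the chunk name from the compress_chunk example above (recompress_chunk is a procedure, so it is invoked with CALL; the chunk names for your hypertable come from the timescaledb_information.chunks query earlier):
CALL recompress_chunk('_timescaledb_internal._hyper_1_1_chunk');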
@mkindahl thank you so much for getting back to us, we think we might know how to tackle our problem.
We played around with the recompress_chunk procedure and found the following:
select ts from measurements where signal_id = 8062 and ts >= '2023-10-20 21:07:05.881000 +00:00' order by ts asc limit 1
As we can see, if the query only touches recompressed chunks, it correctly looks at just one chunk (where it finds the single data point) and skips all the others. If it also has non-recompressed chunks in the timespan, it appears that it somehow goes through all the chunks, gets the data, and sorts it later. Of course that would be very slow.
Based on this, we are trying to recompress all our chunks. We are still unsure how to deal with the occasional chunk that takes hours to recompress; we would be very thankful for suggestions. Most chunks are done in minutes or seconds though. Also, we would appreciate a comment on whether our reasoning makes sense :)
If we are right about this, it would not be a TimescaleDB bug, but something that the documentation could perhaps point out as a major pitfall.
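Something like this (an untested psql sketch, using a hypothetical hypertable name 'measurements') could generate one recompression call per compressed chunk; the if_not_compressed argument should make it skip chunks that have nothing to recompress instead of raising an error, if we read the docs correctly:
SELECT format('CALL recompress_chunk(%L, if_not_compressed => true);',
              format('%I.%I', chunk_schema, chunk_name))
FROM timescaledb_information.chunks
WHERE hypertable_name = 'measurements' AND is_compressed;
\gexec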
on some chunks, it takes literal hours to complete (we observed 6 hours and 10 hours). Could this be the reason our compression_policy got stuck? I.e. it wasn't really stuck, it was just recompressing chunks which took a looong time.
This was my suspicion when you said you did not have any visible deadlocks, yes.
If it also has non-recompressed chunks in the timespan, it appears that it somehow goes through all the chunks, gets the data, and sorts it later. Of course that would be very slow.
Yes, this is the problem. Since you have data that is both compressed and uncompressed, it decompresses all data (into memory), incorporates the uncompressed data, and then sorts it all, possibly spilling to disk, which will just make it worse. You can see it clearly in the second plan, while the first plan just visits the compressed chunks.
We are still unsure how to deal with the occasional chunk that takes hours to recompress; we would be very thankful for suggestions.
I am not sure what is causing it to take hours to recompress. It sounds odd, but it could have to do with other updates or inserts interfering with the process, or with disk IO being throttled if it spills to disk. I suspect that you just have to let it finish, or run it when the machine is not busy.
Closing the issue since it seems you have a path to fixing this. If you still have issues, feel free to re-open.
Hi @mkindahl,
I'm having a similar issue, except on my largest hypertable all the chunks seem to have rolled themselves into one huge chunk (~1.3 TB) and uncompressed themselves (they were in several chunks and around ~600 GB compressed the other day). I've been trying to figure out how to get the compression to run, or even to tell whether it's running. I changed next_start from -infinity to one minute in the future, and it soon changed back to -infinity, so I assume that means the task is running, but it could also just have failed and not cleaned up after itself.
Any guidance would be appreciated!
sensor_data=# SELECT show_chunks('sensor_data_4');
show_chunks
-------------------------------------------
_timescaledb_internal._hyper_29_338_chunk
sensor_data=# SELECT * FROM timescaledb_information.jobs
WHERE proc_name='policy_compression' AND hypertable_name='sensor_data_4';
job_id | application_name | schedule_interval | max_runtime | max_retries | retry_period | proc_schema | proc_name | owner | scheduled | fixed_schedule | config | next_start | initial_start | hypertable_schema | hypertable_name | check_schema | check_name
--------+---------------------------+-------------------+-------------+-------------+--------------+-----------------------+--------------------+----------+-----------+----------------+-----------------------------------------------------+------------+---------------+-------------------+-----------------+-----------------------+--------------------------
1006 | Compression Policy [1006] | 02:00:00 | 00:00:00 | -1 | 01:00:00 | _timescaledb_internal | policy_compression | postgres | t | f | {"hypertable_id": 29, "compress_after": "02:00:00"} | -infinity | | public | sensor_data_4 | _timescaledb_internal | policy_compression_check
(1 row)
Hi there @zadeviggers! Since this is quite a different problem compared to the original issue, could you please open a new issue with the information you have? It might be a good idea to look at _timescaledb_config.bgw_job as well, since it contains the source information.
I would expect next_start to be set to a later time after the job has executed; if that does not happen, something is wrong.
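For example, something like this (using job id 1006 from your output above) should show whether the job has actually run and whether its last run failed:
SELECT job_id, last_run_status, last_run_started_at, last_successful_finish,
       next_start, total_runs, total_failures
FROM timescaledb_information.job_stats
WHERE job_id = 1006;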
I added another TB of disk space and it seems to have sorted itself out now. If this comes up again though, I'll open an issue.
What type of bug is this?
Performance issue
What subsystems and features are affected?
Compression
What happened?
We have a compression_policy set on a hypertable. Currently, the job gets stuck (it runs for hours if we don't cancel it) and consumes a significant amount of CPU. If we remove the compression_policy and re-add it, it immediately gets stuck again.
TimescaleDB version affected
2.11.1
PostgreSQL version used
14.9
What operating system did you use?
not applicable
What installation method did you use?
Not applicable
What platform did you run on?
Managed Service for TimescaleDB (MST/Aiven)
Relevant log output and stack trace
No response
How can we reproduce the bug?
We do not know concrete steps to reproduce it with certainty.
Before the problem occurred we accidentally added a lot of old data (i.e. with timestamps way in the past) to our table, which resulted in a large number of new, uncompressed chunks (approx. 1 thousand). We already removed all those chunks manually, but with no improvement.
The accidental import of old data happened because we modified the code feeding the data to our database, so currently every now and then we insert data that belongs to chunks that are already compressed.
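A rough sketch of the write pattern (hypothetical table, column names, and values based on the queries above, not our actual schema):
CREATE TABLE measurements (ts timestamptz NOT NULL, signal_id int, value double precision);
SELECT create_hypertable('measurements', 'ts', chunk_time_interval => interval '1 day');
ALTER TABLE measurements SET (timescaledb.compress, timescaledb.compress_segmentby = 'signal_id');
SELECT add_compression_policy('measurements', compress_after => interval '1 day');
-- Recent data arrives continuously and lands in the current (uncompressed) chunk.
INSERT INTO measurements VALUES (now(), 8062, 1.0);
-- Occasionally an out-of-order row arrives and lands in a chunk that is already
-- compressed, leaving that chunk with a mix of compressed and uncompressed data.
INSERT INTO measurements VALUES (now() - interval '7 days', 8062, 1.0);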