timescale / timescaledb

An open-source time-series SQL database optimized for fast ingest and complex queries. Packaged as a PostgreSQL extension.
https://www.timescale.com/

CAGG refresh policy stuck after deleting data from the main table. #3047

Open BilalS10 opened 3 years ago

BilalS10 commented 3 years ago

I am running the latest TimescaleDB (2.1) and PostgreSQL (13.2) on Debian 10, on a GCP VM with 16 vCPUs, 64 GB RAM, and a 1 TB SSD. I have one main hypertable that stores the raw data and two CAGG views: the first stores hourly aggregated data and the second stores daily aggregated data. I have also configured refresh policies for both with the following settings: start offset: null (both); end offset: 1 hour (hourly) and 1 day (daily); schedule interval: every minute (hourly) and every 2 minutes (daily).
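
Roughly, the policies were created along these lines (the view names `hourly_view` and `daily_view` are placeholders, not my actual object names):

```sql
-- Sketch of the described policies: a NULL start_offset makes every
-- scheduled refresh cover the continuous aggregate's entire history.
SELECT add_continuous_aggregate_policy('hourly_view',
  start_offset      => NULL,
  end_offset        => INTERVAL '1 hour',
  schedule_interval => INTERVAL '1 minute');

SELECT add_continuous_aggregate_policy('daily_view',
  start_offset      => NULL,
  end_offset        => INTERVAL '1 day',
  schedule_interval => INTERVAL '2 minutes');
```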

I am facing this issue when deleting specific data ranging from two years back up until today. After running the delete query, I observe the refresh jobs (SELECT * FROM timescaledb_information.job_stats) and the jobs keep running forever for both views. (Two screenshots of the job_stats output from 2021-03-22 are attached.)

After around half an hour, the job for the daily view finished successfully, but I could still see the aggregated data in the daily view. The hourly job did not finish at all, and after more than 2 hours I killed it as it seemed stuck. Next I ran a manual refresh for the hourly view, but even that query got stuck and I had to kill that process as well after an hour. This has happened to me twice. Kindly help me out with this.

I am attaching screenshots of the job stats query and the invalidation logs table for reference.
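
For reference, the queries behind the screenshots were along these lines (the internal catalog table names reflect my reading of the 2.x schema and may differ between versions):

```sql
-- Background job status for the refresh policies.
SELECT * FROM timescaledb_information.job_stats;

-- Pending invalidations recorded against the raw hypertable and against
-- each continuous aggregate's materialization (internal catalog tables).
SELECT * FROM _timescaledb_catalog.continuous_aggs_hypertable_invalidation_log;
SELECT * FROM _timescaledb_catalog.continuous_aggs_materialization_invalidation_log;
```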

Thank you.

erimatnor commented 3 years ago

@BilalS10 thanks for reporting this issue. One thing to note is that your continuous aggregate policies appear to be configured to refresh more or less everything in the continuous aggregate. This can be quite expensive when you make large changes to the underlying data, such as deleting lots of it.
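
One way to bound that work, sketched here with a placeholder view name, is a finite start_offset on the policy plus a one-off manual refresh whenever older history is corrected:

```sql
-- Replace the unbounded policy with one that only re-materializes
-- recent data on each scheduled run.
SELECT remove_continuous_aggregate_policy('hourly_view');
SELECT add_continuous_aggregate_policy('hourly_view',
  start_offset      => INTERVAL '7 days',
  end_offset        => INTERVAL '1 hour',
  schedule_interval => INTERVAL '1 minute');

-- After a historical backfill or delete, refresh just the affected window.
CALL refresh_continuous_aggregate('hourly_view',
  '2019-07-13'::timestamptz, '2021-03-01'::timestamptz);
```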

It would be useful to know how you deleted the data in the hypertable. Was it, e.g., via DELETE or drop_chunks? Can you also provide some information on how much data was dropped in terms of storage size?
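
For context, the two deletion paths behave quite differently; a rough sketch, using the sensor_data hypertable named later in this thread and hypothetical column names:

```sql
-- Row-level DELETE: every affected time range is recorded in the
-- invalidation log and must be re-materialized by the next refresh.
DELETE FROM sensor_data
WHERE device_id = 42                              -- hypothetical columns
  AND ts BETWEEN '2019-07-13' AND '2021-02-28';

-- drop_chunks: removes whole chunks at once instead of individual rows.
SELECT drop_chunks('sensor_data', older_than => INTERVAL '2 years');
```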

BilalS10 commented 3 years ago

Hi @erimatnor ,

Yeah, we have a requirement that historical data can be added as well as deleted, hence the start offset has been kept as null. I do understand that we could consider a manual refresh for data affecting such a huge range, but before going that route we wanted to see how the tsdb refresh policy performs over such a range.

As for the info you asked for: it was a programmatic DELETE query, and the number of points deleted was only 963, but the range spanned from 13 July 2019 to February 2021.

Also, I tried the manual refresh for the hourly aggregate again, and this time it did finish successfully, but it took 1 hour 45 minutes. Another weird thing to note, as I mentioned, is that after the refresh completed I could still see the first date's data in both aggregated views, i.e. the data for 13 July is still present. Ideally the refresh should have removed all the deleted points, including the first and the last ones.
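
The manual refresh was essentially an unbounded call like this (view name is a placeholder):

```sql
-- NULL window bounds refresh the continuous aggregate's entire range.
CALL refresh_continuous_aggregate('hourly_view', NULL, NULL);
```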

I am attaching the results of all three queries for reference (sensor_data is the main hypertable and the other two are the CAGGs). (Three screenshots from 2021-03-22 are attached.)