timescale / timescaledb

An open-source time-series SQL database optimized for fast ingest and complex queries. Packaged as a PostgreSQL extension.
https://www.timescale.com/

Use space dimension as a retention setting. #563

Open PIdaho opened 6 years ago

PIdaho commented 6 years ago

I am evaluating TimescaleDB and am interested in using the space dimension as a retention policy field.

CREATE TABLE sample_data (
  ts TIMESTAMPTZ NOT NULL,
  ret_policy SMALLINT NOT NULL,
  device SMALLINT NOT NULL,
  sample DOUBLE PRECISION NOT NULL);

/*
Where ret_policy would be:
0 = 1 month
1 = 3 months
2 = 6 months
3 = 1 year
4 = 3 years
5 = 5 years
*/

SELECT create_hypertable('sample_data', 'ts', 'ret_policy', 6, chunk_time_interval => interval '7 days');

The idea being that we could use a function similar to drop_chunks() to drop chunks from particular space partitions based on their retention policies. Something like this:

SELECT drop_chunks(interval '1 month', 'sample_data', dimension := 0);
SELECT drop_chunks(interval '3 month', 'sample_data', dimension := 1);
...

I know from reading that you hash the partition column into the different chunks. If I read correctly you turn that hash into an integer and then divide that integer by the number of partitions for assignment.

I assume that if I have dropped chunks and a new value then comes in with an out-of-range timestamp from the past, a new chunk would be created regardless of the scheme described above.
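
For reference, a rough sketch of how the hash-to-partition mapping could be inspected (not authoritative - the internal schema names vary across TimescaleDB versions); the hash keyspace is split into contiguous ranges, one per space partition:

-- hash value the default space-partitioning function produces for a given ret_policy
-- (newer versions expose this in the _timescaledb_functions schema instead)
SELECT _timescaledb_internal.get_partition_hash(3::smallint);

-- the hash ranges backing each space partition
SELECT ds.id, ds.range_start, ds.range_end
FROM _timescaledb_catalog.dimension_slice ds
JOIN _timescaledb_catalog.dimension d ON d.id = ds.dimension_id
WHERE d.column_name = 'ret_policy';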

mfreed commented 6 years ago

Hi @PIdaho, would you mind sharing a bit more, at a high level, about the desired functionality you want, or the use case you have?

The above feels like it's trying to shoehorn some existing mechanism into a use case, and I'd want to make sure there isn't a more elegant or first-class way of supporting your desired functionality.

PIdaho commented 6 years ago

We collect time-series data from many devices. These data samples are basically the same, but some are at higher resolution and some are at lower resolution. We have requirements for different retention policies based on certain devices. Currently we manage these retention policies by putting data for the same policy in the same partitions. We can then drop partitions for quick deletes. It is basically one table with all of the related time-series data, where each kind of data has its own retention policy.

It just seems like adding a partition column for the retention policy would make this work, as long as I know how partition columns are assigned to chunks and have a safe way of dropping chunks by retention policy.

TimescaleDB has some really nice features. We like being able to use the COPY command on the parent table instead of having to figure out which partition to COPY into. We also like the dynamic partition creation. These would greatly simplify our partition management.
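
For context, a rough sketch of what the partition-per-policy setup described above might look like with plain PostgreSQL declarative partitioning (names and dates are illustrative, reusing the columns from the first comment):

CREATE TABLE sample_data_plain (
  ts TIMESTAMPTZ NOT NULL,
  ret_policy SMALLINT NOT NULL,
  device SMALLINT NOT NULL,
  sample DOUBLE PRECISION NOT NULL
) PARTITION BY LIST (ret_policy);

-- one branch per retention policy, sub-partitioned by time
CREATE TABLE sample_data_p0 PARTITION OF sample_data_plain FOR VALUES IN (0)
  PARTITION BY RANGE (ts);
CREATE TABLE sample_data_p0_2018_04 PARTITION OF sample_data_p0
  FOR VALUES FROM ('2018-04-01') TO ('2018-05-01');
CREATE TABLE sample_data_p0_2018_05 PARTITION OF sample_data_p0
  FOR VALUES FROM ('2018-05-01') TO ('2018-06-01');

-- expiring old data for policy 0 (1 month) is then a cheap drop of its oldest sub-partition
DROP TABLE sample_data_p0_2018_04;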

mfreed commented 6 years ago

It seems like the approach you might prefer is better support for non-hash partitioning, so that you can specify a distinct "space" partition by setting some column value, and then an extension of drop_chunks that lets one subsequently select on those partitions. Got it.

Will discuss with team how that fits into roadmap. Thanks!

andrew-blake commented 6 years ago

Mike, non-hash partitioning would also help with multi-tenanted systems where you frequently need different retention policies. We currently have to retain everyone's data for as long as the longest retention period any customer pays for, which isn't ideal.

PIdaho commented 6 years ago

@mfreed, thanks for your consideration on this issue. But what can I do today?

I am confident that I can manipulate the partition column and a partitioning function to get data into the correct partitions. The question really comes down to whether I can do the deletes with TimescaleDB's current implementation. I don't think so with drop_chunks(). I could write my own version of drop_chunks() in plpgsql, but I need to know whether it is safe to manipulate the chunk metadata or whether that could cause problems with TimescaleDB. If you don't consider that safe, I could always submit a C version to the project, but that would take a little longer to get spun up.

michaelmckay83 commented 6 years ago

I am also going to need similar functionality. One possible solution I have thought about is storing a timestamp for when each row should be deleted instead; that way you are back to a single column for partitioning. Then when you drop chunks, you can drop anything whose expiry is now() or sooner. Can anyone see any issues with this approach?
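
A rough sketch of that idea (names are illustrative; the drop_chunks call uses the current named-argument form, while older versions take the interval as the first argument):

CREATE TABLE sample_data_exp (
  ts TIMESTAMPTZ NOT NULL,
  expires_at TIMESTAMPTZ NOT NULL,  -- ts + the row's retention interval, set at insert time
  device SMALLINT NOT NULL,
  sample DOUBLE PRECISION NOT NULL);

SELECT create_hypertable('sample_data_exp', 'expires_at', chunk_time_interval => interval '7 days');

-- drop every chunk whose expiry range lies entirely in the past
SELECT drop_chunks('sample_data_exp', older_than => now());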

evan-burke commented 6 years ago

+1 to this request.

andriyfomenko commented 5 years ago

We happen to have exactly the same use case as @andrew-blake described above: different tenants have different retention expectations, although the whole set of retention policies is quite small (under 20 entries), so dropping chunks on the combination of space/time, rather than time only, would work very efficiently.

Working backwards to "set the time when it has to expire" would work for this one particular use case, but it would not help us much from the data-read perspective, which is why we would use TimescaleDB in the first place, so it is not really a solution.

Ideally, drop_chunks() should accept additional/optional arguments for a spatial-component filter to apply alongside the time component.

cocowalla commented 4 years ago

I'm also really interested in this. Consider a multi-tenant system where each tenant has its own retention policy (between 3 months and 10 years) - being able to instantly drop chunks belonging to a single tenant would be incredibly efficient!

erikns commented 4 years ago

+1

sergyv commented 4 years ago

+1

mfreed commented 3 years ago

For some follow-up -- we currently recommend using separate hypertables if you want this functionality, e.g., being able to have different data retention policies. There are also secondary advantages to separate hypertables, especially in multi-tenant scenarios.

This is actually the approach taken by Promscale, where different Prometheus metrics are stored in separate hypertables: https://github.com/timescale/promscale
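
A minimal sketch of that recommendation, assuming TimescaleDB 2.x (table names are illustrative): one hypertable per retention class, each with its own retention policy.

CREATE TABLE metrics_30d (ts TIMESTAMPTZ NOT NULL, device SMALLINT NOT NULL, sample DOUBLE PRECISION NOT NULL);
CREATE TABLE metrics_90d (LIKE metrics_30d);

SELECT create_hypertable('metrics_30d', 'ts');
SELECT create_hypertable('metrics_90d', 'ts');

-- each hypertable gets its own drop-after interval
SELECT add_retention_policy('metrics_30d', INTERVAL '30 days');
SELECT add_retention_policy('metrics_90d', INTERVAL '90 days');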

cocowalla commented 3 years ago

I switched to the described solution after receiving the same advice via Slack (for the multi-tenant problem I described earlier in this thread).

This wouldn't feel strange for non-multi-tenant scenarios, but "table-per-tenant" does feel like an odd setup; still, it does work and does provide the stated advantages.

cchengubnt commented 3 years ago

+1, any plan for this request? In my multi-tenant scenario, I need different retention policies for multiple data-security levels in different areas, as required by the laws of different states. So the table design would be like this:

create table area_A(id, data, time, level);
create table area_B(id, data, time, level);
SELECT create_hypertable('area_A', 'time', 'level', 4, partitioning_func => 'level_value_hash');

-- desired retention policies (proposed syntax)
SELECT drop_chunks(interval '3 month', 'area_A', level := 1);
SELECT drop_chunks(interval '6 month', 'area_A', level := 4);

By the way, the timescaledb_information.chunks view only exposes the primary dimension; I can't locate any secondary-dimension column with its partition range, so I can't drop chunk tables manually.
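
For what it's worth, the per-chunk slices of the secondary dimension do exist in the internal catalog. A rough sketch of how one might list them (these are internal _timescaledb_catalog tables, not a supported interface, and may differ between versions):

SELECT c.schema_name, c.table_name, ds.range_start, ds.range_end
FROM _timescaledb_catalog.chunk c
JOIN _timescaledb_catalog.chunk_constraint cc ON cc.chunk_id = c.id
JOIN _timescaledb_catalog.dimension_slice ds ON ds.id = cc.dimension_slice_id
JOIN _timescaledb_catalog.dimension d ON d.id = ds.dimension_id
WHERE d.column_name = 'level';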

kgonia commented 3 years ago

I have the same need but for a different use case. I have data from different devices, but some of them are less interesting to me than others. Still, I don't want to delete the less interesting data; I would like to move the less interesting devices to another Postgres tablespace.

I already have data in the table, and the program that ingests the data doesn't differentiate between devices. I could move some data to another hypertable and run a daily cron job, but I'm looking for a more 'native' solution.
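
One possibility, assuming the hypertable were space-partitioned by device so that chunks align with device groups, might be the move_chunk function (community-licensed in TimescaleDB 2.x); the chunk and tablespace names below are illustrative:

SELECT move_chunk(
  chunk => '_timescaledb_internal._hyper_1_4_chunk',
  destination_tablespace => 'slow_storage');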

rotten commented 2 years ago

If we follow the current recommendations for a multi-tenant solution where we use different tables for different tenants who are expecting different retention policies, then we might have a "30 day" table, a "60 day" table, and a "90 day" table. This could work pretty well as long as we know the retention at insertion time (which might require an additional lookup).

However, when a tenant changes their mind and switches their contract from 30 days to 90 days, we'd have to copy all of their data out of the 30-day table and into the 90-day table instead of simply updating the retention policy for that tenant. This would probably leave a lot of holes in the old table. (Is that a bad thing?) It would also be tricky to keep the data consistent during the copy and the cutover to the new table.

The alternative is that every tenant gets their own table, which could prove to be rather unwieldy to manage when you get into the thousands of clients.

Most likely how I would implement this is to retain the data for the full length of time for everyone, but hide it from the customer in application logic. This doesn't save any storage, but is probably the least painful.

I think ideally you could set policies based on WHERE conditions. Without any WHERE condition it would be "where the age of the insert is > some interval". Otherwise it would be "where the age is > some interval AND the user_agent_string is null AND the customer is Big Corp".
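
A rough sketch of the contract-change migration described above, in plain SQL (assuming per-retention hypertables with a tenant_id column; names are illustrative, and in practice this would need to be coordinated with ongoing ingest):

BEGIN;
-- move the tenant's rows from the 30-day table to the 90-day table
INSERT INTO metrics_90d SELECT * FROM metrics_30d WHERE tenant_id = 42;
DELETE FROM metrics_30d WHERE tenant_id = 42;
COMMIT;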

muntdan commented 2 years ago

-- find which chunk each (time, space-partition) combination lives in, then drop that chunk's table directly:
SELECT tableoid::regclass AS chunk, datetime_col__cast_as_interval, dimension_location_column
FROM <hypertable> GROUP BY tableoid, datetime_col__cast_as_interval, dimension_location_column;

DROP TABLE <chunk>;

Arcturuss commented 1 year ago

Any progress on this?

The workaround with "one hypertable per client" is really just a workaround, and won't work for every case.

yarkoyarok commented 6 months ago

Since one hypertable per client felt, to our team's taste, like an overblown solution, we decided to use direct deletion from compressed chunks by segment_by.

This gist can be helpful to understand how to find compressed chunk names by date range:

https://gist.github.com/yarkoyarok/3277a27987415b40368b53d70f348add

There you can find rows consisting of columns with compressed data plus the raw, uncompressed segment_by fields.

These rows are, in fact, segments.

Deleting them happens without decompression and proceeds very fast (<1 sec to drop a segment vs ~1:30 minutes for the same amount of data deleted from the hypertable directly, in our case).

So you can not only drop chunks; if you work with compressed data, you can also drop more granular parts - segments.

More information on this technique can also be found in #5802, where I've proposed that the query planner make use of it.
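
A very rough sketch of the segment-level delete described above (the chunk and column names are purely illustrative, not taken from the gist; this touches internal tables and is not an officially supported operation):

-- rows of the internal compressed chunk are whole segments; the segment_by columns
-- (assumed here to be tenant_id) are stored uncompressed, so this removes whole
-- segments without decompressing them
DELETE FROM _timescaledb_internal.compress_hyper_2_10_chunk
WHERE tenant_id = 42;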

adriangb commented 5 months ago

+1 for this.

A good first step would be making it easy to find chunks by both primary (time) and secondary dimension so that we can at least manually implement retention policies.

SeaRoll commented 4 months ago

+1 on this. The workaround seems very prone to SQL injection if implemented badly in SQL drivers.

oliora commented 2 weeks ago

+1