Continuous joining and filtering

WalkerWalker commented 1 year ago

What problem does the new feature solve?

I have a time series table of sensor data and I have an assignment table defining at what time the device is worn by which person. Now I am interested in the data of one or many people.

What does the feature do?

say sensor_data table has columns (time, device_id, temperature, location_lang, location_lat) and assignments table has columns (assignment_id, person_id, device_id, time_range)

I want a materialized view of the following query.

SELECT 
    sensor_data.time, 
    assignments.person_id, 
    assignments.device_id, 
    sensor_data.temperature, 
    sensor_data.location_lang, 
    sensor_data.location_lat 
FROM sensor_data
LEFT JOIN assignments
   ON sensor_data.device_id = assignments.device_id
WHERE sensor_data.time in assignments.timerange

Apparently the query result can be as big as the senor_data table, making it impractical to run once the senor_data table is too big (over 1TB before decompression). But the query result should be very stable. Essentially as the new records coming into the sensor_data table, the materialized view table should also have new records, according to the assignement table. But if history never changes and assignments are never modified, the old data stays the same.

Maybe the result can be materialized day by day (chunk by chunk) and having the lastest day result queried live? Should I call this continuous joining and filtering?

Implementation challenges

No response

WalkerWalker commented 1 year ago

Hope to get some discussions

fabriziomello commented 1 year ago

Hey @WalkerWalker, thanks for reaching out.

There are an ongoing PR already: https://github.com/timescale/timescaledb/pull/4874

WalkerWalker commented 1 year ago

There are an ongoing PR already: https://github.com/timescale/timescaledb/pull/4874

Thanks for the reply. After reading this PR, my understanding is that this is allowing continuous aggregation on a join result. If my understanding is correct, then it is not what my feature request is about. This feature request is to compute the join, not to compute the continuous aggregation on top of join.

Here is a simple reason why computing this join is hard. The join result can be as big as the sensor data and therefore it is impractical to compute i in one go if sensor data is too large (for example over 1TB before compression). Hence I hope the join result can be continuously, or incrementally, computed.

WalkerWalker commented 1 year ago

I did a bit of research and notice that this might be what I want. https://github.com/sraoss/pg_ivm

WalkerWalker commented 1 year ago

would be nice to get some feedback, is this feature requested understood? meaningful? implementable?

WalkerWalker commented 1 year ago

There are an ongoing PR already: #4874

Hello @fabriziomello Now that this PR is merged, could you please shine a light on this issue? Do you think this new feature of 2.10.0 can solve my problem? Again, I only want to compute the join, but not aggregation on top of the join. Do you think I can use a super small bucket bucket, for example 10ms so that no down-sampling is actually performed ?

fabriziomello commented 1 year ago

There are an ongoing PR already: #4874

Hello @fabriziomello Now that this PR is merged, could you please shine a light on this issue? Do you think this new feature of 2.10.0 can solve my problem? Again, I only want to compute the join, but not aggregation on top of the join. Do you think I can use a super small bucket bucket, for example 10ms so that no down-sampling is actually performed ?

Currently we don't support Continuous Aggregate without time dimension and aggregation. So there's no option other than add a small bucket. Also the current JOIN support is restricted to INNER JOIN with equality operator. Have a look at the documentation.

timescale / timescaledb