Support for adding columns to continuous views

derekjn commented 7 years ago

This has come up quite a few times, and it's probably time to add built-in support for it. It doesn't make sense to support this at the syntax level (i.e. ALTER CONTINUOUS VIEW), just as this can't really be done with regular VIEWs via ALTER VIEW.

The right approach probably involves just exposing functionality that allows users to achieve the desired semantics in a simple, reliable way. One way to do this would be to expose a function for combining two continuous views:

SELECT combine_continuous_views('result_cv', 'temp_cv');

This would combine the result_cv and temp_cv CVs into result_cv, and we'd probably want to drop temp_cv after a successful combine. Obviously this requires that both CVs:

Must have the same grouping
Must SELECT from the same stream/relation
Don't have any column name conflicts
Anything else?

The reason I think that this approach would work well is because it's conducive to backfilling the new aggregates/columns separately in their own CV, until it's time to combine them, and it's easily transactional as well.

/cc @usmanm @schapirama

schapirama commented 7 years ago

Thanks @derekjn : The combine function sounds like a good solution. However, I am concerned about one of the requirements --about grouping.

Let's say we have a CV with ts_minute, dimension1, count(*) GROUP BY ts_minute, dimension1

Now we want to add dimension2 to it --so we'd have ts_minute, dimension1, dimension2, count(*) GROUP BY ts_minute, dimension1, dimension2

Would this break the requirement? This is precisely the kind of migration we need to support.

derekjn commented 7 years ago

@schapirama if you're fine with a default value being used for the new group column for all pre-existing data then it should work. Is that what you were thinking?

schapirama commented 7 years ago

@derekjn NULL makes sense.... we just didn't have that information, it's "our fault". That's undestandable/logical.

derekjn commented 7 years ago

Gotcha, we'll think through this for a bit and adapt the original approach to support this. Adding a new grouping column is going to make this slower because it's going to require rewriting the entire mrel.

schapirama commented 7 years ago

@derekjn : Another question --we might also need to change the STREAM from which CV consumes.... would that be OK (via ALTER STREAM or a DROP without CASCADE)?

schapirama commented 7 years ago

@derekjn : Slower in terms of development ? Or in terms of execution time? The latter doesn't matter so much, at least in our use cases.

derekjn commented 7 years ago

Streams can already be altered, although only columns can be added:

ALTER STREAM s ADD column integer;

derekjn commented 7 years ago

@derekjn : Slower in terms of development ? Or in terms of execution time? The latter doesn't matter so much, at least in our use cases.

Both :) but I was mainly referring to execution time.

usmanm commented 7 years ago

I think the word combine might not be appropriate here since we use combine to denote combining two transition states. Maybe merge instead?

The hard requirements I think are:

Have the same FROM clause
Have the same WHERE clause
Both should have DISTINCT or both shouldn't
No column name conflicts

@schapirama, I'm not sure if adding a new grouping is going to be too useful when all the existing rows just get NULL for the new dimension. In that case, all new incoming data will never touch the old rows, unless the value of dimension2 was legitimately NULL. Or is the use of this only to have both old and new version of the query in the same view? In the sense that v1 is WHERE dimension2 IS NULL and v2 is WHERE dimension2 IS NOT NULL?

schapirama commented 7 years ago

Good point, @usmanm . It seems that I misunderstood @derekjn 's proposal.

Here's what I want. Let's say I have the following CV:

# CREATE CONTINUOUS VIEW myview WITH (ttl = '90 minutes', ttl_column = 'ts') AS SELECT date_round(event_timestamp, '1 minute') AS ts, deviceType, COUNT(*) AS event_count FROM mystream GROUP BY ts, deviceType;

# select * from myview;

ts	deviceType	count(*)
2017-01-01 00:00	Roku	100
2017-01-01 00:00	xBox	300
2017-01-01 00:00	iPhone	500

Now I want to do something like

# ALTER CONTINUOUS VIEW ADD COLUMN appVersion GROUP BY ts, deviceType, appVersion;

(I undestand that I am mixing things here --the new column AND using it in the GROUP BY-- ... I just want to describe our need)

The field appVersion must indeed be present in the stream (or we would add it before running this command). The FROM doesn't change, the WHERE doesn't change (although that would be a nice thing to be able to do at some point ;-), and there are no column name conflicts.

After running this, we'd like to see

# select * from myview;

ts	deviceType	appVersion	count(*)
2017-01-01 00:00	Roku	NULL	100
2017-01-01 00:00	xBox	NULL	300
2017-01-01 00:00	iPhone	NULL	500
2017-01-01 00:01	Roku	0.2	50
2017-01-01 00:01	Roku	0.3	40
2017-01-01 00:01	Roku	0.6	15
2017-01-01 00:01	xBox	1.0	90
2017-01-01 00:01	xBox	1.4	100
2017-01-01 00:01	iPhone	6	300
2017-01-01 00:01	iPhone	7	140

What I was planning to do was to

Change the stream (and drop+recreate the underlying TRANSFORM) to include the new field (which will be ignored by the existing CVIEWs)
Create a new cview myview_v2 with the new field
After 90 minutes, once I have all the data in the new view, I'd drop myview and start querying myview_v2

The biggest problem with this is that PipelineDB would be doing double the work for those 90 minutes (and in fact we're talking about longer TTLs and several CVIEWs at once). The other problem is that we won't have the new field available for querying until the 90 minutes have elapsed.

If I now understand correctly, @derekjn 's proposal would help us to step #3 ... which is nice in terms of letting us keep the old CVIEW's name ... but it would not address the two issues above.

My preference instead (but I don't know if this is feasible or even reasonable) would be to stop the world, alter the underlying mrels and setting the new columns to NULL, alter the definition of the view, and restart the world.

Thanks ;-)

esatterwhite commented 5 years ago

Adding a column without losing existing data would be great. That is one of my biggest concerns with pipielinedb currently. Even if there wasn't any back filling and just started processing new values.

pipelinedb / pipelinedb

Support for adding columns to continuous views #1736