Open derekjn opened 7 years ago
Thanks @derekjn : The combine function sounds like a good solution. However, I am concerned about one of the requirements --about grouping.
Let's say we have a CV with ts_minute, dimension1, count(*) GROUP BY ts_minute, dimension1
Now we want to add dimension2 to it --so we'd have ts_minute, dimension1, dimension2, count(*) GROUP BY ts_minute, dimension1, dimension2
Would this break the requirement? This is precisely the kind of migration we need to support.
@schapirama if you're fine with a default value being used for the new group column for all pre-existing data then it should work. Is that what you were thinking?
@derekjn NULL makes sense.... we just didn't have that information, it's "our fault". That's understandable/logical.
Gotcha, we'll think through this for a bit and adapt the original approach to support this. Adding a new grouping column is going to make this slower because it's going to require rewriting the entire mrel.
@derekjn : Another question --we might also need to change the STREAM from which CV consumes.... would that be OK (via ALTER STREAM or a DROP without CASCADE)?
Streams can already be altered, although only columns can be added:
ALTER STREAM s ADD column integer;
@derekjn : Slower in terms of development ? Or in terms of execution time? The latter doesn't matter so much, at least in our use cases.
Both :) but I was mainly referring to execution time.
I think the word `combine` might not be appropriate here since we use `combine` to denote combining two transition states. Maybe `merge` instead?
The hard requirements I think are:

- Same `FROM` clause
- Same `WHERE` clause
- Either both use `DISTINCT` or both shouldn't

@schapirama, I'm not sure if adding a new grouping is going to be too useful when all the existing rows just get `NULL` for the new dimension. In that case, all new incoming data will never touch the old rows, unless the value of `dimension2` was legitimately `NULL`. Or is the use of this only to have both old and new version of the query in the same view? In the sense that v1 is `WHERE dimension2 IS NULL` and v2 is `WHERE dimension2 IS NOT NULL`?
Good point, @usmanm . It seems that I misunderstood @derekjn 's proposal.
Here's what I want. Let's say I have the following CV:
# CREATE CONTINUOUS VIEW myview WITH (ttl = '90 minutes', ttl_column = 'ts') AS SELECT date_round(event_timestamp, '1 minute') AS ts, deviceType, COUNT(*) AS event_count FROM mystream GROUP BY ts, deviceType;
# select * from myview;
ts | deviceType | count(*) |
---|---|---|
2017-01-01 00:00 | Roku | 100 |
2017-01-01 00:00 | xBox | 300 |
2017-01-01 00:00 | iPhone | 500 |
Now I want to do something like
# ALTER CONTINUOUS VIEW ADD COLUMN appVersion GROUP BY ts, deviceType, appVersion;
(I understand that I am mixing things here --the new column AND using it in the GROUP BY-- ... I just want to describe our need)
The field `appVersion` must indeed be present in the stream (or we would add it before running this command). The `FROM` doesn't change, the `WHERE` doesn't change (although that would be a nice thing to be able to do at some point ;-), and there are no column name conflicts.
After running this, we'd like to see
# select * from myview;
ts | deviceType | appVersion | count(*) |
---|---|---|---|
2017-01-01 00:00 | Roku | NULL | 100 |
2017-01-01 00:00 | xBox | NULL | 300 |
2017-01-01 00:00 | iPhone | NULL | 500 |
2017-01-01 00:01 | Roku | 0.2 | 50 |
2017-01-01 00:01 | Roku | 0.3 | 40 |
2017-01-01 00:01 | Roku | 0.6 | 15 |
2017-01-01 00:01 | xBox | 1.0 | 90 |
2017-01-01 00:01 | xBox | 1.4 | 100 |
2017-01-01 00:01 | iPhone | 6 | 300 |
2017-01-01 00:01 | iPhone | 7 | 140 |
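With a result shaped like this, the original per-device totals stay recoverable by summing over the new dimension. A minimal sketch, assuming the count column is exposed under the `event_count` alias from the view definition above:

```sql
-- Collapse appVersion again to get back the pre-migration shape.
-- Old rows (appVersion IS NULL) and new, versioned rows land in the
-- same (ts, deviceType) group, so nothing is lost or double counted.
SELECT ts, deviceType, sum(event_count) AS event_count
FROM myview
GROUP BY ts, deviceType
ORDER BY ts, deviceType;
```

For the sample rows above, Roku at 00:01 would total 105 (50 + 40 + 15).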
What I was planning to do was to
The biggest problem with this is that PipelineDB would be doing double the work for those 90 minutes (and in fact we're talking about longer TTLs and several CVIEWs at once). The other problem is that we won't have the new field available for querying until the 90 minutes have elapsed.
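A minimal sketch of running a second view in parallel, which is what the "double the work" cost implies. The name `myview_v2` is hypothetical, and `appVersion` is assumed to already exist on `mystream`:

```sql
-- myview_v2 is a hypothetical name used only for illustration.
CREATE CONTINUOUS VIEW myview_v2 WITH (ttl = '90 minutes', ttl_column = 'ts') AS
    SELECT date_round(event_timestamp, '1 minute') AS ts,
           deviceType,
           appVersion,
           COUNT(*) AS event_count
    FROM mystream
    GROUP BY ts, deviceType, appVersion;

-- Both views now read from mystream, so every event is aggregated twice.
-- Only after a full TTL window (90 minutes here) does myview_v2 cover the
-- same time range as myview; at that point the old view can be dropped.
DROP CONTINUOUS VIEW myview;
```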
If I now understand correctly, @derekjn 's proposal would help us to step #3 ... which is nice in terms of letting us keep the old CVIEW's name ... but it would not address the two issues above.
My preference instead (but I don't know if this is feasible or even reasonable) would be to stop the world, alter the underlying mrels and set the new columns to NULL, alter the definition of the view, and restart the world.
Thanks ;-)
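The NULL-backfill semantics of that "stop the world" idea can be illustrated with a plain PostgreSQL table. This is only a stand-in: `myview_mrel_demo` is a hypothetical table, not PipelineDB's actual materialization relation, and the sketch says nothing about how the view definition itself would be changed.

```sql
-- Hypothetical stand-in table with the same shape as the view's output.
CREATE TABLE myview_mrel_demo (
    ts          timestamp,
    deviceType  text,
    event_count bigint
);

INSERT INTO myview_mrel_demo VALUES
    ('2017-01-01 00:00', 'Roku',   100),
    ('2017-01-01 00:00', 'xBox',   300),
    ('2017-01-01 00:00', 'iPhone', 500);

BEGIN;
-- No DEFAULT is given, so every pre-existing row reads back NULL.
ALTER TABLE myview_mrel_demo ADD COLUMN appVersion text;
COMMIT;

-- Returns all three old rows.
SELECT * FROM myview_mrel_demo WHERE appVersion IS NULL;
```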
Adding a column without losing existing data would be great. That is one of my biggest concerns with PipelineDB currently. It would help even if there wasn't any backfilling and the new column just started processing new values.
This has come up quite a few times, and it's probably time to add built-in support for it. It doesn't make sense to support this at the syntax level (i.e. `ALTER CONTINUOUS VIEW`), just as this can't really be done with regular `VIEW`s via `ALTER VIEW`.

The right approach probably involves just exposing functionality that allows users to achieve the desired semantics in a simple, reliable way. One way to do this would be to expose a function for combining two continuous views:
This would combine the `result_cv` and `temp_cv` CVs into `result_cv`, and we'd probably want to drop `temp_cv` after a successful combine. Obviously this requires that both CVs:

- `SELECT` from the same stream/relation

The reason I think that this approach would work well is because it's conducive to backfilling the new aggregates/columns separately in their own CV, until it's time to combine them, and it's easily transactional as well.
/cc @usmanm @schapirama
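For a `count(*)` aggregate, combining the rows of two views amounts to summing the counts of matching groups. A sketch of that semantics using plain PostgreSQL tables follows; `result_demo` and `temp_demo` are hypothetical stand-ins, and an actual combine function would presumably operate on transition states instead of final values:

```sql
-- Hypothetical stand-ins for the output of result_cv and temp_cv.
CREATE TABLE result_demo (ts timestamp, deviceType text, event_count bigint);
CREATE TABLE temp_demo   (ts timestamp, deviceType text, event_count bigint);

INSERT INTO result_demo VALUES ('2017-01-01 00:00', 'Roku', 100);
INSERT INTO temp_demo   VALUES ('2017-01-01 00:00', 'Roku', 20),
                               ('2017-01-01 00:00', 'xBox', 5);

-- Groups present in both inputs are added together;
-- groups present in only one input pass through unchanged.
SELECT ts, deviceType, sum(event_count) AS event_count
FROM (
    SELECT * FROM result_demo
    UNION ALL
    SELECT * FROM temp_demo
) AS both_views
GROUP BY ts, deviceType;
-- Roku: 120, xBox: 5
```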