timeplus-io / proton

A streaming SQL engine, a fast and lightweight alternative to ksqlDB and Apache Flink, 🚀 powered by ClickHouse.
https://timeplus.com
Apache License 2.0
1.56k stars 68 forks source link

Do not allow select directly from a MV that uses an external stream as target stream #593

Open zliang-min opened 8 months ago

zliang-min commented 8 months ago

Describe what enhancement you'd like to have

Gvien an kafka external stream called ex_stream. Let's say a user creates two MVs using that external stream as target stream CREATE MATERIALIZED VIEW mv_one INTO ex_stream AS ... and CREATE MATERIALIZED VIEW mv_two INTO ex_stream AS .... Then, the SQL SELECT * FROM table(mv_one) does not return the results as the user expects. The user probably thinks that SELECT will only returns data generated by mv_one, but this is not the case, it will return all the data in ex_stream, no matter who sent the data to the underlying kafka topic ( there could even be other clients running somewhere sending data to that topic too ). So, running SELECT from the MV directly could be confusing, we should ban it.

zliang-min commented 8 months ago

In fact, this issue also applies to MVs which use a target stream, no matter if that target stream is an external stream, a Proton stream, or an external table. We probably should not allow select from a MV directly as long as it uses a custom target stream.

chenziliang commented 8 months ago

proxy to the pointed target stream is consistent in all cases and makes more sense vs throw exception for different cases to me. it is like a pointer, if other updated the things pointed, accessing the pointer gets the updates

chenziliang commented 8 months ago

same for ingesting to a stream, there are multiple producers, when select, everyone gots all data other inserted.

chenziliang commented 8 months ago

Documentation wise, easier : selecting a mv always proxying to the underlying stream it points to in all cases

zliang-min commented 8 months ago

proxy to the pointed target stream is consistent in all cases and makes more sense vs throw exception for different cases to me. it is like a pointer, if other updated the things pointed, accessing the pointer gets the updates

I think it depends on what "consistent" means here. IMO, what you described here is from the technical implementation point of view, but this may not be the "consistency" that users expect. For example, given a MV that does not use a custom target:

CREATE MATERIALIZED VIEW hot_weather AS
  SELECT * FROM weather_history WHERE temperature > 30

When the user runs SELECT * FROM hot_weather, what does they expect? They would expect that the returned results all have temperature > 30. And now imagine, the MV uses a target stream:

CREATE MATERIALIZED VIEW hot_weather INTO ex_stream AS
  SELECT * FROM weather_history WHERE temperature > 30

Now, SELECT * FROM hot_weather would return everything. Of course, we can explain why it behaves in this way, but I think this likely surprises users at first ( even we have it explained in docs, but who would read all the docs before using proton? ). From this point of view, one can argue that it's not consistent.

chenziliang commented 8 months ago

Then why user like to send all hot things to one target stream ?

zliang-min commented 8 months ago

Then why user like to send all hot things to one target stream ?

I am not sure if I understand your question. It's just an example, users can send anything they want to a target stream. I don't see why it's not a valid use case.