martint opened this issue 5 days ago
cc @nineinchnick, @raunaqmorarka
I think this is more about the implied limit than the predicates. There's a default limit used as a safeguard. Does it also violate SQL semantics?
The fact that the table always produces 1000 rows regardless of what predicate you use is, from a SQL semantics perspective, incorrect. Conceptually, the table contains a certain number of rows, already predetermined. Any query over that table should behave as if the table were fully materialized and then the query executed over those values.
The second and third query should produce the same results. The expression `rand() >= 0` is always true, so it should not affect the results.
In #24098, I'm proposing to automatically create a view with additional predicates when using CTAS. If we always created this extra view, we could add a default `LIMIT` there, and make it required when selecting from the regular table. WDYT?
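A rough sketch of what that might look like (the table, view, and predicate here are made up for illustration; #24098 defines the actual shape of the proposal):

```sql
-- CTAS into the generator catalog, copying the schema from an existing table:
CREATE TABLE faker.default.customers AS
SELECT name, age FROM production.public.customers;

-- The connector would additionally create a view that carries the inferred
-- predicates plus a default LIMIT, and plain SELECTs would go through it:
CREATE VIEW faker.default.customers_view AS
SELECT *
FROM faker.default.customers
WHERE age BETWEEN 18 AND 90
LIMIT 1000;
```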
I'm not convinced about the view approach yet. You can make a separate PR for storing the row count from CTAS in a table property. For CREATE TABLE, I recommend we either drop the default limit or make the relevant table property compulsory for CREATE. I also think it would be clearer to rename `default-limit` to `row-count`.
I'll remove the `default-limit` property and will make the `LIMIT` clause mandatory. There's no other way of applying a default limit than with a view.
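Under that scheme, every query against the table would need an explicit limit, e.g. (an illustrative query, with made-up names):

```sql
-- Without the LIMIT clause, this query would be rejected.
SELECT name, age
FROM faker.default.customers
LIMIT 500;
```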
Making `LIMIT` mandatory for all SELECTs on the connector is also awkward, e.g. I might have a query which I know should filter out everything in the table, but I'll be forced to put a `LIMIT` clause on it for no reason.

To me, having a required `row-count` property on the table seems reasonable.
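For example, something along these lines (the property name and values are hypothetical, just to illustrate the idea):

```sql
-- The table declares up front how many rows it conceptually contains,
-- so queries against it don't need an explicit LIMIT.
CREATE TABLE faker.default.customers (
    name varchar,
    age integer
)
WITH (row_count = 1000);
```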
I agree, but I don't see another way to avoid abusing SQL semantics.
Maybe I'm missing something, but having a required row count property on the table doesn't seem like an abuse of SQL semantics to me. It seems to align with:

> Conceptually, the table contains a certain number of rows, already predetermined. Any query over that table should behave as if the table were fully materialized and then the query executed over those values.
It won't work for applying value ranges with predicates. We'd have to remove predicate pushdown entirely and add column properties for specifying ranges, but that requires implementing parsing of expressions or literal values. I don't know how to do this without copying a lot of code.
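For illustration, range column properties could look something like this (the property names and the string-literal encoding are hypothetical; that encoding is exactly the parsing problem mentioned above):

```sql
-- Ranges are declared on the columns instead of being derived from WHERE clauses.
-- The bounds are passed as strings, so the connector has to parse them itself.
CREATE TABLE faker.default.orders (
    total_price double WITH (min = '0.0', max = '10000.0'),
    order_date  date   WITH (min = '2020-01-01', max = '2024-12-31')
)
WITH (row_count = 1000);
```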
We need to make a decision here about how the connector should limit data ranges and row count; we can only pick one:
I don't know how to implement 2. If going with 1 would make the UX significantly worse, I'm ok with removing the connector entirely.
@losipiuk helped me understand why we should not use predicate pushdown to influence the generated data. I'll remove any predicate pushdown from this connector and reimplement that functionality with column properties.
The connector seems to be using the WHERE clause to make decisions on which values to produce. This is an abuse of SQL semantics, and it leads to inconsistent and nonsensical results.
For example, given a table with a single integer column `x` defined through this connector, a query with no predicate produces 1000 rows of random data, as expected.
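A minimal sketch of the setup and the queries discussed below (the catalog, schema, and table names are placeholders):

```sql
-- A single-column table in the data-generating connector.
CREATE TABLE faker.default.t (x integer);

-- No predicate: returns 1000 rows of random data, as expected.
SELECT * FROM faker.default.t;

-- Predicate on x: returns 1000 rows, all with x = 1.
SELECT * FROM faker.default.t WHERE x = 1;

-- Predicate extended with an always-true disjunct: returns random data again.
SELECT * FROM faker.default.t WHERE x = 1 OR rand() >= 0;
```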
However, adding the predicate `x = 1` on the same table produces 1000 rows, all with a value of `1`. By SQL semantics, that should not be the case unless the table contains 1000 rows with `x = 1`, which contradicts the first query. Additionally, the predicate `x = 1 OR rand() >= 0` produces the same data as the first query. By SQL semantics, `x = 1` is equivalent to `x = 1 OR rand() >= 0`, so both queries should produce the same results.

See related discussion in https://github.com/trinodb/trino/pull/24078.