trinodb / trino

Official repository of Trino, the distributed SQL query engine for big data, formerly known as PrestoSQL (https://trino.io)
Apache License 2.0

Faker connector produces incorrect results #24147

Open martint opened 5 days ago

martint commented 5 days ago

The connector seems to be using the WHERE clause to make decisions on which values to produce. This is an abuse of SQL semantics, and it leads to inconsistent and nonsensical results.

For example, given:

CREATE TABLE t (x bigint)

The following query produces 1000 values of random data, as expected:

SELECT * FROM t

However,

SELECT * FROM t WHERE x = 1

on the same table, produces 1000 rows with a value of 1. By SQL semantics, that should not be the case unless the table contains 1000 rows with x = 1, which contradicts the first query.

Additionally,

SELECT * FROM t WHERE x = 1 OR rand() >= 0

produces the same data as the first query. By SQL semantics, x = 1 is equivalent to x = 1 OR rand() >= 0, so both queries should produce the same results.
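To make the inconsistency concrete, compare these two aggregations (count_if is a real Trino aggregate; the counts in the comments are what the behavior described above implies, not measured results):

-- Counts the rows where x = 1 after filtering on x = 1; with the current behavior this is 1000.
SELECT count_if(x = 1) FROM t WHERE x = 1

-- Counts the rows where x = 1 after a predicate that is always true; with the current behavior
-- the data stays random and this is ~0, yet over a fixed table both counts must be equal.
SELECT count_if(x = 1) FROM t WHERE x = 1 OR rand() >= 0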

See related discussion in https://github.com/trinodb/trino/pull/24078

martint commented 5 days ago

cc @nineinchnick , @raunaqmorarka

nineinchnick commented 5 days ago

I think this is more about the implied limit than the predicates. There's a default limit used as a safeguard. Does it also violate SQL semantics?

martint commented 5 days ago
  1. The fact that the table always produces 1000 rows regardless of what predicate you use is, from a SQL semantics perspective, incorrect. Conceptually, the table contains a certain number of rows, already predetermined. Any query over that table should behave as if the table were being fully materialized and then the query executed over those values. (A sketch of that follows below the list.)

  2. The second and third queries should produce the same results. The expression rand() >= 0 is always true, so it should not affect the results.
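For illustration, one way to state that expectation, assuming a memory catalog is available to hold a snapshot (the catalog and table names are only placeholders):

-- Materialize the faker table once...
CREATE TABLE memory.default.t_snapshot AS SELECT * FROM t

-- ...then every query over t should behave like the same query over the snapshot,
-- e.g. this count is what SELECT * FROM t WHERE x = 1 should reflect.
SELECT count(*) FROM memory.default.t_snapshot WHERE x = 1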

nineinchnick commented 4 days ago

In #24098, I'm proposing to automatically create a view with additional predicates when using CTAS. If we always created this extra view, we could add a default LIMIT there and make it required when selecting from the regular table. WDYT?
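Roughly, the generated view could look something like the sketch below; the name, predicate, and limit are only placeholders for whatever the CTAS would bake in:

CREATE VIEW t_bounded AS
SELECT *
FROM t
WHERE x BETWEEN 1 AND 1000
LIMIT 1000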

raunaqmorarka commented 3 days ago

> In #24098, I'm proposing to automatically create a view with additional predicates when using CTAS. If we always created this extra view, we could add a default LIMIT there and make it required when selecting from the regular table. WDYT?

I'm not convinced about the view approach yet. You can make a separate PR for storing the row count from CTAS into a table property. For CREATE TABLE, I recommend we either drop the default limit or make the relevant table property compulsory. I also think it would be clearer to rename "default-limit" to "row-count".
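Something along these lines, for example; the property name is just a placeholder for whatever we settle on:

-- Row count recorded as a table property instead of an implicit default limit.
CREATE TABLE t (x bigint) WITH (row_count = 1000)

-- Unfiltered scans would then return exactly that many rows.
SELECT count(*) FROM t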

nineinchnick commented 3 days ago

I'll remove the default-limit property and make the LIMIT clause mandatory. There's no way of applying a default limit other than with a view.

raunaqmorarka commented 3 days ago

> I'll remove the default-limit property and make the LIMIT clause mandatory. There's no way of applying a default limit other than with a view.

Making LIMIT mandatory for all SELECTs on the connector is also awkward, e.g. I might have a query which I know should filter out everything in the table, but I'll be forced to put a LIMIT clause on it for no reason (example below). To me, having a required row-count property on the table seems reasonable.
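For example, a predicate that can never match anything would still need a LIMIT just to be accepted:

SELECT * FROM t WHERE x < 0 AND x > 0 LIMIT 1000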

nineinchnick commented 3 days ago

I agree, but I don't see another way to avoid abusing the SQL semantics.

raunaqmorarka commented 3 days ago

> I agree, but I don't see another way to avoid abusing the SQL semantics.

Maybe I'm missing something, but having a required row count property on the table doesn't seem like an abuse of SQL semantics to me. It seems to align with:

> Conceptually, the table contains a certain number of rows, already predetermined. Any query over that table should behave as if the table were being fully materialized and then the query executed over those values.

nineinchnick commented 3 days ago

It won't work for applying ranges via predicates. We'd have to remove predicate pushdown entirely and add column properties for specifying ranges, but that requires implementing parsing of expressions or literal values. I don't know how to do this without copying a lot of code.

We need to decide how the connector should limit data ranges and row count; we can only pick one of the following:

  1. Using the WHERE and LIMIT clauses.
  2. Using column properties.

I don't know how to implement option 2. If going with option 1 would make the UX significantly worse, I'm OK with removing the connector entirely.

nineinchnick commented 2 days ago

@losipiuk helped me understand why we should not use predicate pushdown to influence the generated data. I'll remove predicate pushdown from this connector and reimplement that functionality with column properties.
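As a rough sketch of that direction (the column property names here are hypothetical, just to show the shape of it):

-- Ranges come from column properties rather than from WHERE clauses.
CREATE TABLE t (
    x bigint WITH (min = '0', max = '1000')
)

-- With pushdown removed, predicates filter the generated rows as ordinary SQL would.
SELECT * FROM t WHERE x = 1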