zpuller opened this issue 9 months ago
cc: @alexjo2144 @findinpath
Can this be related in any way to https://github.com/trinodb/trino/pull/19479?
One note: we originally encountered this in v414, which I believe predates that PR.
@zpuller I've tried the following scenario:
testing/bin/ptl env up --environment singlenode-spark-iceberg
In spark-sql
Create the table:
CREATE TABLE spark_catalog.default.t1 (
id INT,
name STRING,
age INT,
address STRUCT<street: STRING NOT NULL, address_info: STRUCT<city: STRING, county: STRING, state: STRING>>)
USING iceberg;
Write data into the table:
INSERT INTO t1
SELECT
*
FROM VALUES
(1, 'John Doe', 35, CAST(NULL AS STRUCT<street: STRING, address_info: STRUCT<city: STRING, county: STRING, state: STRING>>)),
(2, 'Jane Doe', 27, named_struct(
'street',
CAST('456 Lane' AS STRING),
'address_info',
CAST(struct('San Francisco', 'San Francisco', 'CA') AS STRUCT<city: STRING, county: STRING, state: STRING>)
)),
(3, 'Mary Johnson', 30, named_struct(
'street',
CAST('789 Boulevard' AS STRING),
'address_info',
CAST(struct('Portland', 'Multnomah', 'OR') AS STRUCT<city: STRING, county: STRING, state: STRING>)
)) AS t(id, name, age, address);
The statement fails with the following exception message:
Cannot write incompatible data to table 'spark_catalog.default.t1':
- Cannot write nullable values to non-null field: 'address.street'.
Can you please sketch in more detail how to reproduce the issue? You can use Scala if you are more comfortable with it.
The SELECT statement returns the expected result on master and 437.
spark-sql> CREATE TABLE t USING iceberg AS SELECT
*
FROM VALUES
(1, 'John Doe', 35, CAST(NULL AS STRUCT<street: STRING, address_info: STRUCT<city: STRING, county: STRING, state: STRING>>)),
(2, 'Jane Doe', 27, named_struct(
'street',
CAST('456 Lane' AS STRING),
'address_info',
CAST(struct('San Francisco', 'San Francisco', 'CA') AS STRUCT<city: STRING, county: STRING, state: STRING>)
)),
(3, 'Mary Johnson', 30, named_struct(
'street',
CAST('789 Boulevard' AS STRING),
'address_info',
CAST(struct('Portland', 'Multnomah', 'OR') AS STRUCT<city: STRING, county: STRING, state: STRING>)
)) AS t(id, name, age, address);
trino:default> SET SESSION iceberg.projection_pushdown_enabled=true;
trino:default> SELECT id FROM t WHERE address.street is null;
id
----
1
trino:default> SET SESSION iceberg.projection_pushdown_enabled=false;
trino:default> SELECT id FROM t WHERE address.street is null;
id
----
1
I think I will need to write some custom Scala code, as mentioned above, to provide a more consistent repro of the issue. Let me take some time to figure that out and I will post back here.
Ideally, it should be written in SQL. We don't use Scala code in our tests.
@zpuller Gentle reminder.
Sorry for the late response. Regarding what @findinpath tried: it might just be that Spark does not like the cast in the insert statement; the CAST target type omits NOT NULL on street, so the values come out nullable and cannot be written to the required field. Maybe try:
INSERT INTO t1
SELECT
*
FROM VALUES
(1, 'John Doe', 35, CAST(NULL AS STRUCT<street: STRING NOT NULL, address_info: STRUCT<city: STRING, county: STRING, state: STRING>>)),
(2, 'Jane Doe', 27, named_struct(
'street',
CAST('456 Lane' AS STRING NOT NULL),
'address_info',
CAST(struct('San Francisco', 'San Francisco', 'CA') AS STRUCT<city: STRING, county: STRING, state: STRING>)
)),
(3, 'Mary Johnson', 30, named_struct(
'street',
CAST('789 Boulevard' AS STRING NOT NULL),
'address_info',
CAST(struct('Portland', 'Multnomah', 'OR') AS STRUCT<city: STRING, county: STRING, state: STRING>)
)) AS t(id, name, age, address);
@ebyhr, your create table statement omits the NOT NULL schema declarations (the explicit CREATE TABLE in the first comment declares street NOT NULL); that's why it doesn't yield the inconsistent results.
could also try:
INSERT INTO t1
SELECT
id,
name,
age,
CASE
WHEN id = 1 THEN NULL
ELSE address
END
FROM (
SELECT
*
FROM VALUES
(1, 'Jane Doe', 27, named_struct(
'street',
CAST('456 Lane' AS STRING NOT NULL),
'address_info',
CAST(struct('San Francisco', 'San Francisco', 'CA') AS STRUCT<city: STRING, county: STRING, state: STRING>)
)),
(2, 'Jane Doe', 27, named_struct(
'street',
CAST('456 Lane' AS STRING NOT NULL),
'address_info',
CAST(struct('San Francisco', 'San Francisco', 'CA') AS STRUCT<city: STRING, county: STRING, state: STRING>)
)),
(3, 'Mary Johnson', 30, named_struct(
'street',
CAST('789 Boulevard' AS STRING NOT NULL),
'address_info',
CAST(struct('Portland', 'Multnomah', 'OR') AS STRUCT<city: STRING, county: STRING, state: STRING>)
)) AS t(id, name, age, address)
);
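If that insert goes through, a quick sanity check in spark-sql that row 1 really ended up with a NULL address struct (a sketch, assuming the same t1 table as above):

SELECT id FROM t1 WHERE address IS NULL;
-- expected to return only the row with id = 1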
Iceberg connector gives different results for nested null checks with projection pushdown enabled vs disabled, specifically in the case of a schema with an optional struct containing a required inner field, e.g. an optional address struct whose street field is declared NOT NULL.
To reproduce, it is necessary to use something like Spark to write the table; we observed this with the Parquet format specifically. The table cannot be created with Trino (to my knowledge) because Trino's SQL syntax does not permit specifying NOT NULL on nested fields.
I created an Iceberg table using a Spark SQL query along the lines sketched below (I also had to manually tweak the schema to get the right set of optional and required fields):
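A sketch of the shape of the table, mirroring the DDL from the comments above (table and column names here are illustrative):

CREATE TABLE spark_catalog.default.t1 (
  id INT,
  name STRING,
  age INT,
  address STRUCT<street: STRING NOT NULL, address_info: STRUCT<city: STRING, county: STRING, state: STRING>>)
USING iceberg;

-- plus data containing at least one row where the whole address struct is NULL
-- (see the INSERT variants discussed in the comments above)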
Then queried it from Trino as follows:
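Queries along these lines (mirroring the session shown earlier in the thread; the session property assumes the catalog is named iceberg):

SET SESSION iceberg.projection_pushdown_enabled = true;
SELECT id FROM t1 WHERE address.street IS NULL;

SET SESSION iceberg.projection_pushdown_enabled = false;
SELECT id FROM t1 WHERE address.street IS NULL;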
This returns 1 row when pushdown is disabled, and 0 rows with pushdown enabled.
I verified this behavior on v437.