trinodb / trino

Official repository of Trino, the distributed SQL query engine for big data, formerly known as PrestoSQL (https://trino.io)
https://trino.io
Apache License 2.0
10.32k stars 2.97k forks source link

Incorrect query behavior when using Parquet float statistics #21754

Open mojodna opened 5 months ago

mojodna commented 5 months ago

The Overture Maps Foundation published a beta release recently as a set of Parquet files hosted on S3 (s3://overturemaps-us-west-2/release/2024-04-16-beta.0; 322 objects, 315GB, sorted roughly by geometry to improve data locality). As part of this release, elements of the bbox struct were changed from doubles to floats because the precision was unnecessary (the switch saved ~50GB across the whole dataset). Care was taken to floor / ceil geometry bounding boxes in order to account for reduced precision.

We received a bug report indicating that this change had unanticipated consequences when queried by Trino and Amazon Athena: where a query against the previous release returned 26,023 results, the same query against the beta release (with very similar data, especially in the target area) returns 2.

SELECT id,
       ST_GeomFromBinary(geometry) AS geometry
FROM
    overture.release.v2024_04_16_beta_0
WHERE theme='places'
    AND type='place'
    AND bbox.xmin > -122.44
        AND bbox.xmax < -122.25
        AND bbox.ymin > 47.56
        AND bbox.ymax < 47.71

This query is included in the documentation (along with instructions for registering Overture tables): https://docs.overturemaps.org/getting-data/cloud-services/

Setting 'parquet.ignore.statistics' = 'true' on the table causes the original query to return the correct number of rows (at the expense of a lot more data scanned: 540.50 MB vs. 8.21 MB (against the alpha release, since that works)).

We were able to reproduce this in Amazon EMR running Trino 435. https://github.com/trinodb/trino/pull/20981 appears related, but querying against a system with this patch resulted in the same behavior.

Additional context on why GeoParquet intends to make this change (prototyped by Overture to get large-scale feedback): https://github.com/opengeospatial/geoparquet/discussions/188#discussioncomment-7569062

mojodna commented 5 months ago

Key finding: Trino appears to be skipping row group 7 within s3://overturemaps-us-west-2/release/2024-04-16-beta.0/theme=places/type=place/part-00001-29b4e626-9eee-4912-9d91-89d3cef80839-c000.zstd.parquet.

My initial hypothesis was that this is related to Trino's use of Parquet statistics for row group filtering. If I modify the query so that it's unable to benefit from the statistics (08f28d554665a66b038f21323384850e is the ID of a place present in that bounding box but is not one of the 2 results returned), it returns the correct number of rows:

SELECT
  id,
  ST_GeomFromBinary(geometry) AS geometry
FROM overture.release.v2024_04_16_beta_0
WHERE theme='places'
  AND type='place'
  AND 
  (
    (
      bbox.xmin > -122.44
      AND bbox.xmax < -122.25
      AND bbox.ymin > 47.56
      AND bbox.ymax < 47.71
    )
    OR id='08f28d554665a66b038f21323384850e'
  )

All of the potential results that should be returned are contained in s3://overturemaps-us-west-2/release/2024-04-16-beta.0/theme=places/type=place/part-00001-29b4e626-9eee-4912-9d91-89d3cef80839-c000.zstd.parquet (it's 960MB)

I used DuckDB to determine this:

INSTALL spatial;
LOAD spatial;
INSTALL https;
LOAD httpfs;
SET s3_region='us-west-2';
.maxwidth 10000

SELECT
  filename,
  file_row_number,
  id,
  ST_GeomFromWkb(geometry) as geometry
FROM read_parquet('s3://overturemaps-us-west-2/release/2024-04-16-beta.0/theme=places/type=place/*', filename=true, hive_partitioning=1, file_row_number=true)
WHERE id IN ('08f28d541dd6e63003a77ace5be7e109', '08f28d5419ce94c203a7d77ee8ada7d2', '08f28d554665a66b038f21323384850e');
┌─────────────────┬───────────────────────────────────────────────────────────────────┬──────────────────────────────────┬─────────────────────────────────┐
│ file_row_number │                             filename                              │                id                │            geometry             │
│      int64      │                              varchar                              │             varchar              │            geometry             │
├─────────────────┼───────────────────────────────────────────────────────────────────┼──────────────────────────────────┼─────────────────────────────────┤
│         1019387 │ part-00001-29b4e626-9eee-4912-9d91-89d3cef80839-c000.zstd.parquet │ 08f28d554665a66b038f21323384850e │ POINT (-122.409775 47.5674384)  │
│         1096710 │ part-00001-29b4e626-9eee-4912-9d91-89d3cef80839-c000.zstd.parquet │ 08f28d541dd6e63003a77ace5be7e109 │ POINT (-122.2519391 47.6845154) │
│         1096726 │ part-00001-29b4e626-9eee-4912-9d91-89d3cef80839-c000.zstd.parquet │ 08f28d5419ce94c203a7d77ee8ada7d2 │ POINT (-122.2516924 47.7094373) │
└─────────────────┴───────────────────────────────────────────────────────────────────┴──────────────────────────────────┴─────────────────────────────────┘

There are 77 row groups in this file; 08f28d554665a66b038f21323384850e (which was not included in the results, but should have been) is in row group 7 (containing rows starting at 959130); the 2 returned are in row group 8 (rows starting at 1095090). (I used rowIndexOffset from parquet footer part-00001-29b4e626-9eee-4912-9d91-89d3cef80839-c000.zstd.parquet (available from parquet-cli in Homebrew) to determine this.)

Trino appears to be skipping row group 7. I further confirmed this with a pair of DuckDB queries:

select count(*) from (SELECT file_row_number, file_row_number < 1095090 rg_7, file_row_number >= 1095090 rg_8,
    filename,
    id,
    ST_GeomFromWkb(geometry) as geometry
  FROM read_parquet('part-00001-29b4e626-9eee-4912-9d91-89d3cef80839-c000.zstd.parquet', filename=true, hive_partitioning=1, file_row_number=true)
  WHERE bbox.xmin > -122.44 AND bbox.xmax < -122.25 AND bbox.ymin > 47.56 AND bbox.ymax < 47.71) as _ where rg_7;
┌──────────────┐
│ count_star() │
│    int64     │
├──────────────┤
│        25897 │
└──────────────┘
select count(*) from (SELECT file_row_number, file_row_number < 1095090 rg_7, file_row_number >= 1095090 rg_8,
    filename,
    id,
    ST_GeomFromWkb(geometry) as geometry
  FROM read_parquet('part-00001-29b4e626-9eee-4912-9d91-89d3cef80839-c000.zstd.parquet', filename=true, hive_partitioning=1, file_row_number=true)
  WHERE bbox.xmin > -122.44 AND bbox.xmax < -122.25 AND bbox.ymin > 47.56 AND bbox.ymax < 47.71) as _ where rg_8;

┌──────────────┐
│ count_star() │
│    int64     │
├──────────────┤
│            2 │
└──────────────┘

The relevant statistics for row group 7 are (using parquet meta):

Row group 7:  count: 135960  93.62 B records  start: 90886904  total(compressed): 12.139 MB total(uncompressed):32.996 MB
--------------------------------------------------------------------------------
                                                     type      encodings count     avg size   nulls   min / max
bbox.xmin                                            FLOAT     Z   _     135960    2.27 B     0       "-123.75" / "-120.938225"
bbox.xmax                                            FLOAT     Z   _     135960    2.26 B     0       "-123.75" / "-120.93821"
bbox.ymin                                            FLOAT     Z   _     135960    2.42 B     0       "45.0" / "47.812466"
bbox.ymax                                            FLOAT     Z   _     135960    2.42 B     0       "45.0" / "47.812473"

And row group 8:

Row group 8:  count: 137440  94.13 B records  start: 103615740  total(compressed): 12.338 MB total(uncompressed):32.681 MB
--------------------------------------------------------------------------------
                                                     type      encodings count     avg size   nulls   min / max
bbox.xmin                                            FLOAT     Z   _     137440    2.42 B     0       "-123.75" / "-118.12669"
bbox.xmax                                            FLOAT     Z   _     137440    2.42 B     0       "-123.75" / "-118.12668"
bbox.ymin                                            FLOAT     Z   _     137440    2.48 B     0       "45.000034" / "49.218746"
bbox.ymax                                            FLOAT     Z   _     137440    2.48 B     0       "45.00004" / "49.218754"
mojodna commented 5 months ago

The DDL to register the table (w/o 'parquet.ignore.statistics' = 'true') looks like this:

CREATE EXTERNAL TABLE `v2024_04_16_beta_0` (
  `id` string,
  `geometry` binary,
  `bbox` struct<xmin:float,xmax:float,ymin:float,ymax:float>,
  `admin_level` int,
  `is_maritime` boolean,
  `geopol_display` string,
  `version` int,
  `update_time` string,
  `sources` array<struct<property:string,dataset:string,record_id:string,confidence:double>>,
  `subtype` string,
  `locality_type` string,
  `wikidata` string,
  `context_id` string,
  `population` int,
  `iso_country_code_alpha_2` string,
  `iso_sub_country_code` string,
  `default_language` string,
  `driving_side` string,
  `names` struct<primary:string,common:map<string,string>,rules:array<struct<variant:string,language:string,value:string,at:array<double>,side:string>>>,
  `locality_id` string,
  `class` string,
  `elevation` int,
  `source_tags` map<string,string>,
  `surface` string,
  `is_salt` boolean,
  `is_intermittent` boolean,
  `level` int,
  `has_parts` boolean,
  `height` double,
  `num_floors` int,
  `facade_color` string,
  `facade_material` string,
  `roof_material` string,
  `roof_shape` string,
  `roof_direction` double,
  `roof_orientation` string,
  `roof_color` string,
  `eave_height` double,
  `min_height` double,
  `building_id` string,
  `categories` struct<main:string,alternate:array<string>>,
  `confidence` double,
  `websites` array<string>,
  `socials` array<string>,
  `emails` array<string>,
  `phones` array<string>,
  `brand` struct<wikidata:string,names:struct<primary:string,common:map<string,string>,rules:array<struct<variant:string,language:string,value:string,at:array<double>,side:string>>>>,
  `addresses` array<struct<freeform:string,locality:string,postcode:string,region:string,country:string>>,
  `connector_ids` array<string>,
  `road` string)
PARTITIONED BY (
  `theme` string,
  `type` string)
STORED AS PARQUET
LOCATION
  's3://overturemaps-us-west-2/release/2024-04-16-beta.0'
findepi commented 5 months ago

are you able to reproduce the problem with a single Parquet file?

mojodna commented 5 months ago

Yes.

I copied s3://overturemaps-us-west-2/release/2024-04-16-beta.0/theme=places/type=place/part-00001-29b4e626-9eee-4912-9d91-89d3cef80839-c000.zstd.parquet and created a new table definition that points to that path (only containing that file):

CREATE EXTERNAL TABLE `v2024_04_16_beta_0_subset` (
  `id` string,
  `geometry` binary,
  `bbox` struct<xmin:float,xmax:float,ymin:float,ymax:float>,
  `admin_level` int,
  `is_maritime` boolean,
  `geopol_display` string,
  `version` int,
  `update_time` string,
  `sources` array<struct<property:string,dataset:string,record_id:string,confidence:double>>,
  `subtype` string,
  `locality_type` string,
  `wikidata` string,
  `context_id` string,
  `population` int,
  `iso_country_code_alpha_2` string,
  `iso_sub_country_code` string,
  `default_language` string,
  `driving_side` string,
  `names` struct<primary:string,common:map<string,string>,rules:array<struct<variant:string,language:string,value:string,at:array<double>,side:string>>>,
  `locality_id` string,
  `class` string,
  `elevation` int,
  `source_tags` map<string,string>,
  `surface` string,
  `is_salt` boolean,
  `is_intermittent` boolean,
  `level` int,
  `has_parts` boolean,
  `height` double,
  `num_floors` int,
  `facade_color` string,
  `facade_material` string,
  `roof_material` string,
  `roof_shape` string,
  `roof_direction` double,
  `roof_orientation` string,
  `roof_color` string,
  `eave_height` double,
  `min_height` double,
  `building_id` string,
  `categories` struct<main:string,alternate:array<string>>,
  `confidence` double,
  `websites` array<string>,
  `socials` array<string>,
  `emails` array<string>,
  `phones` array<string>,
  `brand` struct<wikidata:string,names:struct<primary:string,common:map<string,string>,rules:array<struct<variant:string,language:string,value:string,at:array<double>,side:string>>>>,
  `addresses` array<struct<freeform:string,locality:string,postcode:string,region:string,country:string>>,
  `connector_ids` array<string>,
  `road` string)
STORED AS PARQUET
LOCATION
  's3://<bucket>/trino-21754'

The equivalent SQL query against that table again only returned 2 rows:

SELECT id,
       ST_GeomFromBinary(geometry) AS geometry
FROM
    v2024_04_16_beta_0_subset
WHERE bbox.xmin > -122.44
        AND bbox.xmax < -122.25
        AND bbox.ymin > 47.56
        AND bbox.ymax < 47.71
raunaqmorarka commented 4 months ago

I downloaded the above file locally using aws s3 cp --region us-west-2 --no-sign-request s3://overturemaps-us-west-2/release/2024-04-16-beta.0/theme=places/type=place/part-00001-29b4e626-9eee-4912-9d91-89d3cef80839-c000.zstd.parquet . and created a table against it. I couldn't reproduce the problem with latest master, for me the output was the same 25899 rows with or without parquet statistics enabled.

SELECT count(*)
            -> FROM
            ->     v2024_04_16_beta_0
            -> WHERE bbox.xmin > -122.44
            ->         AND bbox.xmax < -122.25
            ->         AND bbox.ymin > 47.56
            ->         AND bbox.ymax < 47.71;
 _col0
-------
 25899
(1 row)