stac-utils / pgstac

Schema, functions and a python library for storing and accessing STAC collections and items in PostgreSQL
MIT License
153 stars 39 forks source link

CQL2 filters resulting in full scans despite indexed properties #252

Closed drnextgis closed 7 months ago

drnextgis commented 8 months ago

Even when the property is indexed in certain filter requests, it is bypassed. Consider the following example:

postgis=> SELECT * from queryables WHERE name = 'platform';
 id |   name   | collection_ids |     definition     | property_path | property_wrapper | property_index_type 
----+----------+----------------+--------------------+---------------+------------------+---------------------
 12 | platform |                | {"type": "string"} |               | to_text          | BTREE
(1 row)

Queries like this one perform very quickly:

postgis=> EXPLAIN SELECT * FROM items WHERE to_text(content->'properties'->'platform') = 'p1';
                                                              QUERY PLAN                                                               
---------------------------------------------------------------------------------------------------------------------------------------
 Append  (cost=0.28..160108.20 rows=162294 width=1059)
   ->  Index Scan using _items_399165_to_text_idx1 on _items_399165 items_1  (cost=0.28..2.49 rows=1 width=1759)
         Index Cond: (to_text(((content -> 'properties'::text) -> 'platform'::text)) = 'newsat26'::text)
         ...

However, using a CQL2 filter to search for more than a single value leads to a full scan:

postgis=> SELECT stac_search_to_where('{"filter-lang": "cql2-json", "filter": {"op": "a_overlaps","args": [{"property": "platform"}, ["p1", "p2"]]}}');
                     stac_search_to_where                      
---------------------------------------------------------------
 to_text_array(content->'properties'->'platform') && '{p1,p2}'
(1 row)

postgis=> EXPLAIN SELECT * FROM items WHERE to_text_array(content->'properties'->'platform') && '{p1,p2}';
                                                    QUERY PLAN                                                     
-------------------------------------------------------------------------------------------------------------------
 Gather  (cost=1000.00..307488051.71 rows=571130 width=1038)
   Workers Planned: 2
   ->  Parallel Append  (cost=0.00..307429938.71 rows=237960 width=1038)
         ->  Parallel Seq Scan on _items_1 items_128  (cost=0.00..60093820.23 rows=46785 width=971)
...

Doesn't it seem like the CQL2 filter generates a suboptimal WHERE clause?

drnextgis commented 8 months ago

It seems that a_overlaps should be used for properties containing lists of values, while IN is the appropriate operator for properties holding individual values.

postgis=> SELECT stac_search_to_where('{"filter-lang":"cql2-json","filter":{"op": "in", "args": [{"property": "platform"}, ["p1", "p2"]]}}');
                              stac_search_to_where                               
---------------------------------------------------------------------------------
 to_text(content->'properties'->'platform') IN (to_text('"p1"'),to_text('"p2"'))
(1 row)

postgis=> EXPLAIN SELECT * from items WHERE to_text(content->'properties'->'platform') IN (to_text('"p1"'),to_text('"p2"'));
                                                          QUERY PLAN                                                           
-------------------------------------------------------------------------------------------------------------------------------
 Append  (cost=0.28..530.61 rows=183 width=703)
   ->  Index Scan using _items_399165_to_text_idx1 on _items_399165 items_1  (cost=0.28..3.87 rows=1 width=1759)
         Index Cond: (to_text(((content -> 'properties'::text) -> 'platform'::text)) = ANY ('{p1,p2}'::text[]))
bitner commented 8 months ago

a_overlaps is for comparing an array to an array, so it must cast platform into a single element array. In this case property is not an array, so it must be cast to an array before comparing with the postgres && operator. There would need to be a GIN index along with the cast to_text_array in order to use an index for the a_overlaps operator. Right now, the way the queryables index tooling works, you can only have a single index type for a property, so you would need to chose whether you wanted to query that field always as an array using the array operators OR if you want to keep it as you have it now with a BTREE index casting that property just as a text value that you can use with = or IN operators.