mapbox / Hecate

Fast Geospatial Feature Storage API
MIT License

History Retrieval Schema #206

Closed ingalls closed 4 years ago

ingalls commented 4 years ago

Context

We currently support full feature history retrieval based on the FeatureCollection stored within a given delta.

Although this allows us to return the full geometry for any given feature at any point in time, it is very slow, since we must iterate through every feature in a given delta to find the requested one.

This PR scopes breaking the features in a given delta out into their own table.

The new feature history query would then simply select the feature's history out of a single table, with no additional parsing.

Feature History

SELECT
  *
FROM
  geo_history
WHERE
  id = 123
ORDER BY
  version DESC
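As a self-contained sketch of the proposed table, here is the same lookup against an in-memory SQLite database (Hecate itself targets PostgreSQL). The column names (`id`, `version`, `delta`, `props`) follow the queries in this issue; the exact schema and the sample rows are assumptions for illustration only.

```python
import sqlite3

db = sqlite3.connect(":memory:")

# Hypothetical per-feature history table: one row per (feature, version),
# keyed so the history lookup is a single indexed scan.
db.execute("""
    CREATE TABLE geo_history (
        id      INTEGER NOT NULL,  -- feature id
        version INTEGER NOT NULL,  -- feature version within its history
        delta   INTEGER NOT NULL,  -- delta that produced this version
        props   TEXT,              -- feature properties as JSON
        PRIMARY KEY (id, version)
    )
""")
db.executemany(
    "INSERT INTO geo_history (id, version, delta, props) VALUES (?, ?, ?, ?)",
    [(123, 1, 10, '{"name":"a"}'),
     (123, 2, 11, '{"name":"b"}'),
     (123, 3, 12, '{"name":"c"}'),
     (456, 1, 10, '{}')],
)

# The feature history query from the issue: no per-delta iteration needed.
rows = db.execute(
    "SELECT version, delta FROM geo_history WHERE id = ? ORDER BY version DESC",
    (123,),
).fetchall()
print(rows)  # newest version first
```

The composite primary key on `(id, version)` is what lets the `WHERE id = ... ORDER BY version DESC` query avoid scanning unrelated features.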

Delta Retrieval

SELECT
  deltas.id,
  deltas.uid,
  users.username,
  geo_history.id,
  deltas.affected,
  deltas.props,
  deltas.created
FROM
  deltas
  INNER JOIN users
    ON deltas.uid = users.id
  INNER JOIN geo_history
    ON geo_history.delta = deltas.id
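A minimal runnable sketch of that three-way join, again using SQLite for a self-contained demo. Table and column names follow the query above; the sample data is invented.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, username TEXT);
    CREATE TABLE deltas (id INTEGER PRIMARY KEY, uid INTEGER, props TEXT);
    CREATE TABLE geo_history (id INTEGER, version INTEGER, delta INTEGER);

    INSERT INTO users VALUES (1, 'ingalls');
    INSERT INTO deltas VALUES (10, 1, '{}');
    INSERT INTO geo_history VALUES (123, 1, 10);
""")

# Delta retrieval: join each delta to its author and to the feature
# versions it produced.
rows = db.execute("""
    SELECT deltas.id, users.username, geo_history.id
    FROM deltas
    INNER JOIN users ON deltas.uid = users.id
    INNER JOIN geo_history ON geo_history.delta = deltas.id
""").fetchall()
print(rows)
```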

cc/ @ingalls

lizziegooding commented 4 years ago

Posting the query plans/execution time for the current feature history query vs the new proposed feature history query:

Current schema

                                                                 QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=3942.07..3942.08 rows=1 width=32) (actual time=99.199..99.199 rows=1 loops=1)
   ->  Subquery Scan on t  (cost=2608.76..3940.75 rows=264 width=49) (actual time=25.665..99.173 rows=3 loops=1)
         Filter: (((t.feat ->> 'id'::text))::bigint = 123)
         Rows Removed by Filter: 11756
         ->  ProjectSet  (cost=2608.76..2886.75 rows=52700 width=57) (actual time=24.044..68.504 rows=11759 loops=1)
               ->  Result  (cost=2608.76..2615.35 rows=527 width=45) (actual time=0.462..0.467 rows=3 loops=1)
                     ->  Sort  (cost=2608.76..2610.08 rows=527 width=45) (actual time=0.461..0.463 rows=3 loops=1)
                           Sort Key: deltas.id DESC
                           Sort Method: quicksort  Memory: 25kB
                           ->  Hash Join  (cost=1310.40..2584.93 rows=527 width=45) (actual time=0.445..0.452 rows=3 loops=1)
                                 Hash Cond: (deltas.uid = users.id)
                                 ->  Bitmap Heap Scan on deltas  (cost=1308.09..2575.37 rows=527 width=36) (actual time=0.413..0.416 rows=3 loops=1)
                                       Recheck Cond: (affected @> '{123}'::bigint[])
                                       Heap Blocks: exact=3
                                       ->  Bitmap Index Scan on deltas_affected_idx  (cost=0.00..1307.95 rows=527 width=0) (actual time=0.408..0.408 rows=3 loops=1)
                                             Index Cond: (affected @> '{123}'::bigint[])
                                 ->  Hash  (cost=2.14..2.14 rows=14 width=17) (actual time=0.024..0.024 rows=14 loops=1)
                                       Buckets: 1024  Batches: 1  Memory Usage: 9kB
                                       ->  Seq Scan on users  (cost=0.00..2.14 rows=14 width=17) (actual time=0.008..0.015 rows=14 loops=1)
 Planning Time: 0.218 ms
 Execution Time: 99.391 ms

New schema

> The new feature history query would then be able to simply select the feature history out of a single table with no additional parsing

This is not exactly true: it still needs to pull data from the deltas and users tables and convert everything to JSON. Here's the query I'm using now to replicate our existing functionality:

SELECT json_agg(
    json_build_object(
        'id', deltas.id,
        'uid', deltas.uid,
        'feat', json_build_object(
            'id', geo_history.id,
            'action', geo_history.action,
            'key', geo_history.key,
            'type', 'Feature',
            'version', geo_history.version,
            'geometry', ST_AsGeoJSON(geo_history.geom)::JSON,
            'properties', geo_history.props
        ),
        'username', users.username
    )
)
FROM
    geo_history
    INNER JOIN deltas
        ON geo_history.delta = deltas.id
    INNER JOIN users
        ON deltas.uid = users.id
WHERE
    geo_history.id = 366672
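For reference, the document that `json_agg` / `json_build_object` assemble has this shape, sketched in plain Python. All field values below are invented, and the `geometry` object stands in for `ST_AsGeoJSON` output.

```python
import json

# One object per matching geo_history row; json_agg collects them
# into a single JSON array.
row = {
    "id": 10,                      # deltas.id
    "uid": 1,                      # deltas.uid
    "feat": {
        "id": 366672,              # geo_history.id
        "action": "modify",        # geo_history.action (invented value)
        "key": None,               # geo_history.key
        "type": "Feature",
        "version": 2,              # geo_history.version
        "geometry": {"type": "Point", "coordinates": [0.0, 0.0]},
        "properties": {"name": "example"},
    },
    "username": "ingalls",         # users.username
}
history = [row]
print(json.dumps(history))
```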

I'd welcome suggestions on how to make this more performant.

                                                                 QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=188344.67..188344.68 rows=1 width=32) (actual time=3.987..3.987 rows=1 loops=1)
   ->  Hash Join  (cost=1474.94..113572.67 rows=29760 width=192) (actual time=3.918..3.925 rows=2 loops=1)
         Hash Cond: (geo_history.delta = deltas.id)
         ->  Bitmap Heap Scan on geo_history  (cost=1411.48..112342.15 rows=46312 width=152) (actual time=0.023..0.026 rows=2 loops=1)
               Recheck Cond: (id = 366672)
               Heap Blocks: exact=2
               ->  Bitmap Index Scan on geo_history_pkey  (cost=0.00..1399.90 rows=46312 width=0) (actual time=0.018..0.018 rows=2 loops=1)
                     Index Cond: (id = 366672)
         ->  Hash  (cost=61.85..61.85 rows=129 width=48) (actual time=3.887..3.887 rows=1500 loops=1)
               Buckets: 2048 (originally 1024)  Batches: 1 (originally 1)  Memory Usage: 99kB
               ->  Hash Join  (cost=1.32..61.85 rows=129 width=48) (actual time=0.042..2.890 rows=1500 loops=1)
                     Hash Cond: (deltas.uid = users.id)
                     ->  Seq Scan on deltas  (cost=0.00..52.36 rows=1836 width=16) (actual time=0.009..0.955 rows=1500 loops=1)
                     ->  Hash  (cost=1.14..1.14 rows=14 width=40) (actual time=0.026..0.026 rows=14 loops=1)
                           Buckets: 1024  Batches: 1  Memory Usage: 9kB
                           ->  Seq Scan on users  (cost=0.00..1.14 rows=14 width=40) (actual time=0.005..0.014 rows=14 loops=1)
 Planning time: 0.180 ms
 Execution time: 4.046 ms

It still improves query time by a factor of roughly 25: 99.391 ms vs 4.046 ms.

ingalls commented 4 years ago

:taco: