rustprooflabs / pgosm-flex

PgOSM Flex provides high quality OpenStreetMap datasets in PostGIS (Postgres) using the osm2pgsql Flex output.
MIT License
100 stars 20 forks source link

Enable selective indexing #344

Closed rustprooflabs closed 1 year ago

rustprooflabs commented 1 year ago

Initial PoC to enable configuring which indexes users want. RE #286.

This illustrates how indexes can be defined via an .ini file. To test, checkout this branch and docker build a local image with the changes. The test is in the place tables, there is a --layerset=place that exists now that will be deleted before merging.

docker exec -it     pgosm python3 docker/pgosm_flex.py     --ram=8     --region=north-america/us     --subregion=district-of-columbia --layerset=place 

Notes on current path

Currently putting a lot of logic in place.lua. Need to work on generalizing this into a function in the helpers module. Ultimately this should result in less Lua code per layer with indexes defined in an INI and parsed by a common helper.

spatial indexes

Always creates spatial index (gist or spgist). Currently the decision of gist/spgist is per layer, so setting for place sets for point/line/polygon. Not ideal, but a huge improvement over having that setting defined for all tables! As the logic becomes more reusable, setting gist/spgist per table should be possible. There's no reason why there should be a forced gist index other than that's the way it is right now.

primary keys

I don't think the primary keys can be created directly via osm2pgsql flex at this time. That means the PK index is still only controlled by post-processing SQL. Also, osm.place_polygon_nested is hard coded in post-processing SQL.

Next steps

Put logic to build the indexes table into a function. This should include support for the most commonly used columns throughout PgOSM Flex layers. This is a rough starting point of what might be good candidates. Suggestions welcome!

Many of the indexes indicated above will likely be partial indexes such as the name column index.

if index_name then
    indexes[next_index_id] = { column = 'name', method = 'btree', where = 'name IS NOT NULL ' }
    next_index_id = next_index_id + 1
end

Customization is expected to be enabled much like custom layersets. Map a volume to a path, and tell PgOSM Flex to look there.

rustprooflabs commented 1 year ago

To make this work for road without calling it a breaking change it needs to support per-table customization. The road_line table has an index on the major column. The road_point table does not have that column.

Per table customization (not per layer) is also needed for gist/spgist.

rustprooflabs commented 1 year ago

With the latest commit (https://github.com/rustprooflabs/pgosm-flex/commit/4b05a27407f156899a0c7ecada1a6b033abe3db2 not showing up in this PR yet?) this feature is essentially functional. The new docs page has the basic examples and details. A couple folders with custom index schemes are provided. The indexes/examples/noindexes/* example creates tables with only the primary keys. The lotsofindexes example indexes nearly every column.

Needs some cleanup, and removing of the original SP-GIST code. Plus improved documentation.

rustprooflabs commented 1 year ago

Rebased and force pushed after fixing #345.

rustprooflabs commented 1 year ago

This PR captures everything outlined by #286. The best explanation of the work here is outlined in the new custom indexes page in the docs. That link is the MD here in GH. It'll be live on https://pgosm-flex.com when merged into main.

This PR does not preserve the exact indexes created in 0.9.0 and prior. As I started looking at recreating exactly what was there, I realized many of the indexes don't make much sense as defaults. This PR standardizes and simplifies the default indexes.

With this in place, I want to to look at timing differences and database size differences between default indexes (geom and osm_type), examples/noindexes, and examples/lotsofindexes.

I'm not planning on writing a migration script for dropping / changing indexes. The indexes are only defined during initial table creation. Current users of --replication can evaluate their indexes and decide on a) leaving everything as is, b) manually drop unwanted indexes, or c) Start over with a fresh import. Now that it's possible (and hopefully easy!) to create custom indexes, it's a good time for users to evaluate their performance needs and potential indexes.

rustprooflabs commented 1 year ago

Rebased to main

rustprooflabs commented 1 year ago

D.C. indexes, 0.9.0. 41 seconds.

┌────────┬─────────────┬─────────────┬────────────┐
│ s_name │ table_count │ index_count │ index_size │
╞════════╪═════════════╪═════════════╪════════════╡
│ osm    │          42 │         151 │ 35 MB      │
└────────┴─────────────┴─────────────┴────────────┘

After, 40 seconds. 27 fewer indexes.

┌────────┬─────────────┬─────────────┬────────────┐
│ s_name │ table_count │ index_count │ index_size │
╞════════╪═════════════╪═════════════╪════════════╡
│ osm    │          42 │         124 │ 34 MB      │
└────────┴─────────────┴─────────────┴────────────┘
rustprooflabs commented 1 year ago

After convincing myself this wasn't a breaking change, it obviously is. This removes the --sp-gist option.