openstreetmap / openstreetmap-website

The Rails application that powers OpenStreetMap
https://www.openstreetmap.org/
GNU General Public License v2.0

Partition large tables #2076

Open pnorman opened 5 years ago

pnorman commented 5 years ago

From https://github.com/openstreetmap/openstreetmap-website/issues/375#issuecomment-441660239

We can't just say that we'll never change anything in our tables ever again! If this does involve large table changes, then we can work out what the zero-downtime migration path would be, and do that. We could practise multi-stage migrations on the smaller tables where there is less impact.

Whenever I describe our large API tables to a PostgreSQL person, the number one comment is to partition them. This allows maintenance to be done without needing to sequentially scan billions of rows.

Are there tools for doing this easily in Ruby? Should this be a site-specific thing for osm.org? Most users won't have multi-TB tables, and it's possible to partition in a way that the Rails port doesn't see.

The benefits of partitioning are that maintenance can be done partition by partition with much shorter locks, that most INSERT and UPDATE activity (and therefore bloat) is confined to one partition, that pg_dump and pg_restore are no longer bottlenecked on a single huge table, and that schema changes can use the parallelism of the server.
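For illustration, a minimal sketch of what PostgreSQL 10+ declarative range partitioning on id could look like; the table and column names below are simplified placeholders, not the actual openstreetmap-website schema:

```sql
-- Hypothetical parent table, range-partitioned on the node id.
CREATE TABLE nodes_part (
    node_id   bigint  NOT NULL,
    latitude  integer NOT NULL,
    longitude integer NOT NULL,
    version   bigint  NOT NULL
) PARTITION BY RANGE (node_id);

-- One leaf per billion ids; further leaves are added as ids grow.
CREATE TABLE nodes_part_0 PARTITION OF nodes_part
    FOR VALUES FROM (0) TO (1000000000);
CREATE TABLE nodes_part_1 PARTITION OF nodes_part
    FOR VALUES FROM (1000000000) TO (2000000000);
```

Queries and application code would continue to target the parent table; PostgreSQL routes rows to the matching leaf.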

tomhughes commented 5 years ago

Partition them on what?

I don't see how it helps with maintenance anyway - if a column needs to be rewritten then it needs to be rewritten and whether the table is in one chunk or a million chunks there is still the same amount of data to rewrite.

pnorman commented 5 years ago

Partition them on what?

ID for the OSM data tables. I'm not sure about GPX.

I don't see how it helps with maintenance anyway

It avoids doing it in one big chunk that requires a long lock. Even for operations that don't require a lock, it's easier to deal with a reasonable number of shorter operations than one very long operation.

The bloat benefit comes from most of the INSERT and UPDATE activity being concentrated in one partition.

It speeds up pg_dump and pg_restore because parallel dumps and restores are bottlenecked on the largest single table.

if a column needs to be rewritten then it needs to be rewritten and whether the table is in one chunk or a million chunks there is still the same amount of data to rewrite.

Schema changes requiring a rewrite aren't the primary reason everyone is suggesting this, but they will be faster because we can use the parallelism of the server rather than being stuck single-threaded.
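As a sketch of the maintenance argument, reusing the hypothetical leaf names from the example above: routine operations can be issued per leaf, so each one is comparatively short, and since parallel pg_dump and pg_restore jobs work table by table, one multi-billion-row table no longer bounds the wall-clock time.

```sql
-- Each of these touches roughly a billion rows instead of 8-12 billion,
-- and different leaves can be handled in different maintenance windows.
VACUUM (VERBOSE, ANALYZE) nodes_part_0;
REINDEX TABLE nodes_part_1;
```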

tomhughes commented 5 years ago

But you'd have to fragment them into a vast number of chunks to make them small enough that locks aren't an issue.

And are you really saying that rewriting a column doesn't need to lock the whole set of chunks anyway? Can it really cope with half the chunks being in the old schema and half in the new schema?

pnorman commented 5 years ago

And are you really saying that rewriting a column doesn't need to lock the whole set of chunks anyway? Can it really cope with half the chunks being in the old schema and half in the new schema?

Taking the nodes table, with its 8-12 billion rows, as the worst case: if we had 1-billion-node partitions, we could do the column rewrite on all of them in parallel, and then rebuild the indexes on all of them at the same time.
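To illustrate just the index part of that claim (the names are the hypothetical ones from the earlier sketch, not the real schema): index builds can be issued per leaf in separate sessions, so they proceed in parallel rather than as one single-threaded pass over the whole table.

```sql
-- Run each statement in its own session so they execute concurrently.
-- Session 1:
CREATE INDEX CONCURRENTLY nodes_part_0_version_idx ON nodes_part_0 (version);
-- Session 2:
CREATE INDEX CONCURRENTLY nodes_part_1_version_idx ON nodes_part_1 (version);
```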

tomhughes commented 5 years ago

I was thinking more like 10 to 100 thousand rows in a chunk if we want to make them small enough that a lock doesn't cause an issue!

Sure, billion-node partitions might reduce the time from a day to a few hours, say, if there's enough I/O concurrency to get a more or less linear speedup, but it's still not something you'd want to do in production.

Also, if the indexes are separate doesn't that mean that all lookups become more expensive because (apart from any index that includes the id) you now have to check every partition for matches on every query...

pnorman commented 5 years ago

Sure, billion-node partitions might reduce the time from a day to a few hours, say, if there's enough I/O concurrency to get a more or less linear speedup, but it's still not something you'd want to do in production.

Unless you're saying that the tables can never be changed, we have to be able to have locks and read-only periods for maintenance. Partitioning lessens their impact.

Also, if the indexes are separate doesn't that mean that all lookups become more expensive because (apart from any index that includes the id) you now have to check every partition for matches on every query...

Not really. My experience is that index size and performance remain about the same, unless there is a strong correlation between the partition criteria and the indexed value, in which case it is faster. It's at most like a fraction of an extra node on the btree.
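A small sketch of why id lookups stay cheap when the partition key is the id (hypothetical table name as in the earlier sketch): a constant id comparison lets the planner prune to a single leaf, so only one smaller btree is probed.

```sql
-- The plan should show a scan of only the one leaf whose range contains the id.
EXPLAIN SELECT * FROM nodes_part WHERE node_id = 123456789;
```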

mmd-osm commented 5 years ago

One strategy to deal with data type changes is outlined in the following blog posts:

Instead of "Change the type of a column", " Add a new column, change the code to write to both columns, and backfill the new column", etc.

For sure that's more effort; it all depends a bit on how long read-only or offline periods may last at most.
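A hedged sketch of that multi-stage pattern in SQL terms; big_table, old_col and new_col are placeholders rather than real columns, and the batching and locking details would need care on production-sized tables:

```sql
-- 1. Add the new column (nullable, no default, so no full-table rewrite).
ALTER TABLE big_table ADD COLUMN new_col bigint;

-- 2. Deploy application code that writes to both old_col and new_col.

-- 3. Backfill in small batches so no single transaction holds locks for long.
UPDATE big_table
   SET new_col = old_col
 WHERE id BETWEEN 1 AND 100000
   AND new_col IS NULL;
-- ...repeat for further id ranges...

-- 4. Once backfilled and verified, switch readers to new_col, then drop old_col.
ALTER TABLE big_table DROP COLUMN old_col;
```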

Regarding the number of partitions for a table, the PostgreSQL docs suggest keeping it in the hundreds at most; 1000+ is not recommended at all, as it severely impacts optimizer performance.

gravitystorm commented 5 years ago

Instead of "Change the type of a column", " Add a new column, change the code to write to both columns, and backfill the new column", etc.

Yeah, that's what I was thinking of when I mentioned multi-stage migrations.

Of course, if there's some way of using partitioning to help, that would be good too. I'm not sure how well rails supports partitioning, and I don't know enough about partitioning yet to get a good feel for it. There are some gems available, but many of them mention pg10, and that's another problem that we would need to solve.

mmd-osm commented 5 years ago

The only thing I'm not sure about ~this approach~ the multi-stage approach is whether all relevant consumers (cgimap, osmosis, planetdump-ng, etc.) are happy with additional columns. This requires dedicated column names in SELECT statements, instead of some generic SELECT * FROM xyz and referencing results by column number.
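A tiny example of the difference (table and column names are illustrative only): naming the columns keeps a consumer stable when an extra backfill column appears mid-migration, whereas SELECT * combined with positional access breaks as soon as the column count or order changes.

```sql
-- Robust against added columns:
SELECT id, latitude, longitude, version FROM current_nodes;

-- Fragile if the results are then read by column number:
SELECT * FROM current_nodes;
```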

tomhughes commented 5 years ago

Really? The partitions are visible to people querying the table?

mmd-osm commented 5 years ago

Sorry for mixing up topics, my comment was concerning the "multi-stage migration" approach. I fixed my post to make this point clear.

Partitions are invisible to people querying the table, otherwise you would have to change application code when adjusting partitioning criteria.

tomhughes commented 5 years ago

Having reviewed the documentation, I see no reason at all why SELECT * on the parent table shouldn't work, or why the result wouldn't have columns named as expected.

I am much more concerned by the limitations listed at https://www.postgresql.org/docs/10/ddl-partitioning.html#DDL-PARTITIONING-DECLARATIVE-LIMITATIONS which suggest that many of our constraints would have to be dropped.

pnorman commented 5 years ago

I am much more concerned by the limitations listed at https://www.postgresql.org/docs/10/ddl-partitioning.html#DDL-PARTITIONING-DECLARATIVE-LIMITATIONS which suggest that many of our constraints would have to be dropped.

Do we have unique or exclusion constraints on the big tables aside from those which involve the partition key as the leading term? Looking at schema.sql I suspect the biggest table with a non-ID unique index is users, and that's not even at 10 million rows. All the primary keys have ID as the leading term, so if there are duplicates they would have to be in the same leaf.

tomhughes commented 5 years ago

Well I imagine they are mostly, if not all, on the id column but I don't see anything which allows that in that text?

pnorman commented 5 years ago

Well I imagine they are mostly, if not all, on the id column but I don't see anything which allows that in that text?

If you partition on id and have a primary key on id, any two duplicate ids will have to go to the same leaf, where they would violate the unique constraint on the leaf.
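As a sketch of why that works under PostgreSQL 10, where unique constraints cannot be declared on the partitioned parent itself (leaf names and key columns are the hypothetical ones from earlier): declaring the primary key on each leaf is sufficient, because two rows sharing an id are always routed to the same leaf and rejected there.

```sql
-- Per-leaf primary keys; the partition key (node_id) is the leading column,
-- so duplicate keys can never be split across different leaves.
ALTER TABLE nodes_part_0 ADD PRIMARY KEY (node_id, version);
ALTER TABLE nodes_part_1 ADD PRIMARY KEY (node_id, version);
```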

mmd-osm commented 5 years ago

I am much more concerned by the limitations listed at https://www.postgresql.org/docs/10/ddl-partitioning.html#DDL-PARTITIONING-DECLARATIVE-LIMITATIONS which suggest that many of our constraints would have to be dropped.

I don't find those restrictions in the 9.5 documentation (https://www.postgresql.org/docs/9.5/ddl-partitioning.html). It seems Declarative partitioning was only introduced in Postgres 10.

pnorman commented 5 years ago

I don't find those restrictions in the 9.5 documentation (https://www.postgresql.org/docs/9.5/ddl-partitioning.html). It seems Declarative partitioning was only introduced in Postgres 10.

Yep, 10 added a 3rd way to do partitions. Perhaps it's worth waiting until we've moved off of 9.5.

mmd-osm commented 2 months ago

@pnorman : is this topic still relevant today? Should we move it to the operations repo instead of keeping it here?

pnorman commented 2 months ago

I don't see this as an ops issue, as it is website-specific and will impact anyone with billions of nodes or tags.

The uniform recommendation I've gotten from every PostgreSQL person who I've discussed our table size with is to partition them.