hapostgres / pg_auto_failover

Postgres extension and service for automated failover and high-availability

Add support for tablespaces #853

Closed: rheaton closed this 2 years ago

rheaton commented 2 years ago

This allows up to 16 tablespace-mappings to be defined when creating nodes.

Fixes #844

Notes:

I'll be around a few days next week and this week to make updates from feedback/changes. 🙂

DimCitus commented 2 years ago

Thanks a lot for working on this @rheaton ! So my first question is: what happens if non-default (extra) table spaces have been created on an instance and the new option --tablespace-mappings is not used at all? My current (albeit quick) reading of your patch makes me think we have the same problem as before in that case, is that true?

Reading through pg_basebackup source code and then streaming replication protocol BASE_BACKUP docs I'm left wondering if we can make it work our way.

The best option I can think of right now is to use the --format=tar option of pg_basebackup, so that we fetch the files from upstream as tar archives in our replication.backup_directory location. That would make sense actually, and it would also help explain our two-phase approach. We would get a list of table spaces that way (one tar archive file per table space), and we can also inspect the base.tar file for the tablespace_map file and have our default mapping from there.
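As an illustration of that inspection step, the tablespace_map file can be read out of base.tar without unpacking the whole archive. A minimal sketch, simulating the archive locally since no live cluster is assumed (the OID and path are made up):

```shell
# Simulate what `pg_basebackup --format=tar` would leave in the backup
# directory: a base.tar whose tablespace_map lists "OID location" pairs.
# The OID 17165 and /mnt/superfastdisk are invented for the example.
workdir=$(mktemp -d)
cd "$workdir"

printf '17165 /mnt/superfastdisk\n' > tablespace_map
tar -cf base.tar tablespace_map

# Inspect the mapping without extracting the rest of the archive:
tar -xOf base.tar tablespace_map
# 17165 /mnt/superfastdisk
```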

Then about providing a table space mapping. Do we actually want to do that? I mean, I like the idea that every node in the HA system should be the same, right?

In passing, another internal discussion led us to rethink the way we handle the base backup transition in pg_autoctl. At the moment, I made the choice to first download the new PGDATA copy, and only once we know we have a complete new copy, delete the old PGDATA and replace it. That comes from my years as a DBA: I want to be able to do forensics on the old PGDATA copy for as long as possible. The drawback is that you now need double the disk space. The idea is to make this behaviour a configuration option: the default would be to remove PGDATA first and then fetch a new copy, and a tuning knob would allow for delete-as-late-as-possible.
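The two orderings can be sketched with plain directories standing in for PGDATA, and a local copy standing in for pg_basebackup (all names here are invented for illustration):

```shell
# Hypothetical sketch of the two strategies; `cp -R` stands in for
# pg_basebackup, and the directory names are made up for the example.
workdir=$(mktemp -d); cd "$workdir"
mkdir -p upstream pgdata
echo new > upstream/PG_VERSION
echo old > pgdata/PG_VERSION

# Strategy 1 (delete-as-late-as-possible): fetch a full new copy first,
# then swap it in; needs roughly double the disk space while it runs.
cp -R upstream pgdata.new
rm -rf pgdata && mv pgdata.new pgdata

# Strategy 2 (proposed default): remove PGDATA first, then fetch in
# place; no extra space needed, but no old copy left for forensics.
rm -rf pgdata
cp -R upstream pgdata

cat pgdata/PG_VERSION
# new
```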

rheaton commented 2 years ago

I like keeping the directory structure the same; the idea with this patch is that the tablespace-mappings parameter is required if you are using tablespaces, so we can make sure the directories are cleared. I believe that if the user configures the tablespace-mappings to map each location to itself (e.g. originaldir=originaldir) we get the desired behaviour. Which makes me think we should just require the user to give us tablespaces=/mnt/superfastdisk;/mnt/superhugedisk, and then we can make sure those locations exist and are empty.

I worry about the tar solution for the same reason as the double-the-disk-space issue you mentioned (and a related issue was filed today: https://github.com/citusdata/pg_auto_failover/issues/857).

I think users who have tablespaces can give us the list and we can make this work pretty simply. We are going to have to figure out a way to have separate containers with matching sets of mount dirs to automate these tests in a more realistic scenario.

What do you think about requiring the user to specify their tablespaces like so: tablespaces=/mnt/superfastdisk;/mnt/superhugedisk and trying to figure out a way to test it? I'm willing to take this on, but it might not be completed before the Christmas-time holiday.
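The "exist and are empty" check could be as simple as this sketch (exists_and_empty is a hypothetical helper name, not part of the patch):

```shell
# Hypothetical helper: succeed only if the given path is an existing,
# empty directory, as required for a tablespace target location.
exists_and_empty() {
    [ -d "$1" ] && [ -z "$(ls -A "$1")" ]
}

dir=$(mktemp -d)
exists_and_empty "$dir" && echo "ok: usable"      # empty dir passes
touch "$dir/leftover"
exists_and_empty "$dir" || echo "refusing: not empty"  # non-empty fails
```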

DimCitus commented 2 years ago

My first reaction is still the same: we should find a way for pg_auto_failover to just work without user intervention, even when they're using non-default table spaces. Maybe what we could do as a first step is implement delete PGDATA first, then pg_basebackup in place, so that it works for everyone. And then we can add the option of keeping PGDATA around for as long as possible, which is very costly in terms of disk space, and in that case we're good with using pg_basebackup --format=tar, I suppose. After all, this would be for paranoid users who know they have the necessary disk space, or accept the failures when they don't.

rheaton commented 2 years ago

I think that we'd have the same issue for tablespaces, as there isn't currently a reasonable way to get the tablespace locations prior to backing up. We won't be able to delete those locations like we do for PGDATA.

I agree that we should do the PGDATA deletion, in addition to something for tablespaces. It would be nice if Postgres allowed a replication user to get tablespace information prior to a pg_basebackup. :-/

DimCitus commented 2 years ago

> I think that we'd have the same issue for tablespaces, as there isn't currently a reasonable way to get the tablespace locations prior to backing up. We won't be able to delete those locations like we do for PGDATA.

There must be a way to read the list of table spaces from the local PGDATA. A quick test locally seems to show that we can read the local pg_tblspc directory. It contains a list of symlinks and that's where we need to remove stuff. If a new table space has been created on the new primary, it's not a problem, because if we don't have it listed in the local PGDATA that we want to destroy then we know we won't have a conflict at pg_basebackup time.

```shell
$ ls -l monitor/pg_tblspc/
total 0
lrwx------  1 dim  staff  8 Dec 15 21:01 17165 -> /tmp/foo

$ psql -X -p 5500 -d postgres -c "select oid, spcname, pg_catalog.pg_tablespace_location(oid) AS location from pg_tablespace"
  oid  |  spcname   | location
-------+------------+----------
  1663 | pg_default |
  1664 | pg_global  |
 17165 | foo        | /tmp/foo
(3 rows)
```
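A cleanup step could enumerate those symlinks directly. A sketch against a mocked-up PGDATA (the OID and target path are invented):

```shell
# Mock up a PGDATA with one tablespace symlink, then resolve each entry
# the way a cleanup step could before removing the old cluster's files.
pgdata=$(mktemp -d)
mkdir -p "$pgdata/pg_tblspc" /tmp/foo
ln -s /tmp/foo "$pgdata/pg_tblspc/17165"

for link in "$pgdata"/pg_tblspc/*; do
    echo "tablespace oid $(basename "$link") -> $(readlink "$link")"
done
# tablespace oid 17165 -> /tmp/foo
```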

What do you think?

rheaton commented 2 years ago

Oh good idea! I didn't think about just looking in the existing directory. Obviously there can be cases where this won't work, but I think pg_basebackup will just fail out with a decent error message.

Where should we go next for this? Would you like me to remove the mapping bit and have it "just work" for tablespaces with in-place restore, or should we wait for the PGDATA in-place work to be complete first? And if so, is there a timeline for that?

DimCitus commented 2 years ago

Awesome, I'm happy that you like the idea. I think we should make the thing “just work” with table spaces first, which means introducing the configuration knob to default to removing PGDATA first. I suppose the PR that does that could skip the implementation of configuration knob turned to paranoid/forensic mode (keep PGDATA for as long as possible, use pg_basebackup --format=tar) and just raise an error for now. Another PR could implement that behaviour I suppose.

In terms of timeline, we are planning our next release to happen sometime in January. If this work misses the release schedule, then it will be part of the March release.

jchampio commented 2 years ago

My concern with ever using pg_basebackup --format=tar is that it not only requires you to have double the disk space available, but it also requires you to be able to fit the data from all tablespaces on a single filesystem, and then pay the cost of recopying all those bits across any mount points you have (since mv/rename() won't work, atomically at least). I think that's likely to be impossible for several real-world tablespace use cases.

DimCitus commented 2 years ago

> My concern with ever using pg_basebackup --format=tar is that it not only requires you to have double the disk space available, but it also requires you to be able to fit the data from all tablespaces on a single filesystem, and then pay the cost of recopying all those bits across any mount points you have (since mv/rename() won't work, atomically at least). I think that's likely to be impossible for several real-world tablespace use cases.

I share the concern. I think using -z -Z9 and dealing with .tar.gz files is going to be necessary, but it won't alleviate the problem entirely. At the moment I'm not seeing a better option for the forensics mode, though, and that's also why I'm saying it should then stop being the default: activate it only if you know what you're doing (either you're not using table spaces at all, or you live within the disk-space requirements of the forensics abilities, etc.).
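For illustration, reading the tablespace map back out of a compressed archive works the same way as with plain tar; a sketch simulating the base.tar.gz locally (the OID and path are made up):

```shell
# Simulate the compressed archive that `pg_basebackup -Ft -z -Z9` would
# produce (base.tar.gz) and read tablespace_map out of it directly.
workdir=$(mktemp -d); cd "$workdir"
printf '17165 /mnt/superfastdisk\n' > tablespace_map
tar -czf base.tar.gz tablespace_map

# -z decompresses on the fly; -O streams the one file to stdout.
tar -xzOf base.tar.gz tablespace_map
# 17165 /mnt/superfastdisk
```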

rheaton commented 2 years ago

Closing in favour of #870, per the discussion above.