sorintlab / stolon

PostgreSQL cloud native High Availability and more.
https://talk.stolon.io
Apache License 2.0

stolon in docker swarm breaks after upgrading docker 19.x to 20.x with multiple ips #835

Open johannesboon opened 3 years ago

johannesboon commented 3 years ago

What happened:

After upgrading Docker from 19.03.13 to 20.10.6, pg_proxy switches multiple times per minute between 2 different IP addresses for the master keeper, breaking any existing connections, transactions, and queries.

Probably caused by: https://github.com/moby/moby/pull/39204

This sounds similar but is not directly related as the 2 IP-addresses are seen from the outside as well: https://github.com/moby/moby/issues/30963 (or at least the 2021 comment: https://github.com/moby/moby/issues/30963#issuecomment-774527403 from https://github.com/pwFoo )

What you expected to happen:

  1. Not break my cluster with the reference configuration.
  2. Consistently handle cases where multiple IP addresses are available (ordered numerically, instead of in the random order returned by Docker DNS?)
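
To illustrate what "ordered numerically" in point 2 could mean, here is a minimal Python sketch. It is not stolon code; the function names and the sample addresses are hypothetical. It only shows that sorting by the numeric value of each address gives the same pick no matter in which order DNS returns the answers.

```python
import ipaddress


def order_addresses(addresses):
    """Return the addresses sorted numerically, IPv4 before IPv6."""
    parsed = [ipaddress.ip_address(a) for a in addresses]
    # Sort by (version, integer value) so mixed IPv4/IPv6 lists stay comparable.
    parsed.sort(key=lambda ip: (ip.version, int(ip)))
    return [str(ip) for ip in parsed]


def pick_stable_address(addresses):
    """Pick the numerically lowest address, independent of DNS answer order."""
    if not addresses:
        raise ValueError("no addresses to choose from")
    return order_addresses(addresses)[0]


if __name__ == "__main__":
    # Hypothetical answers for the same name, arriving in different orders.
    first_answer = ["10.0.3.12", "10.0.1.7"]
    second_answer = ["10.0.1.7", "10.0.3.12"]
    print(pick_stable_address(first_answer))
    print(pick_stable_address(second_answer))
```

A plain string sort would not be enough here, since it would place `10.0.10.2` before `10.0.9.2`.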

How to reproduce it (as minimally and precisely as possible):

With our setup based on this example: https://github.com/sorintlab/stolon/blob/master/examples/swarm/docker-compose-pg.yml#L24 we also defined:

We also added placement constraints, among others.

Anything else we need to know?:

Environment:

Main yum repositories involved:

Upgrading Docker CE from 19.x to 20.x involved these packages (I included anything I thought could be remotely related):


    Updated     bind-export-libs-32:9.11.4-26.P2.el7_9.4.x86_64          @ol7_latest
    Update                       32:9.11.4-26.P2.el7_9.5.x86_64          @ol7_latest
    Updated     bind-libs-32:9.11.4-26.P2.el7_9.4.x86_64                 @ol7_latest
    Update                32:9.11.4-26.P2.el7_9.5.x86_64                 @ol7_latest
    Updated     bind-libs-lite-32:9.11.4-26.P2.el7_9.4.x86_64            @ol7_latest
    Update                     32:9.11.4-26.P2.el7_9.5.x86_64            @ol7_latest
    Updated     container-selinux-2:2.77-5.el7.noarch                    @ol7_addons
    Update                        2:2.119.2-1.911c772.el7_8.noarch       @extras
    Updated     containerd.io-1.3.7-3.1.el7.x86_64                       @docker-ce-stable
    Update                    1.4.4-3.1.el7.x86_64                       @docker-ce-stable
    Updated     docker-ce-3:19.03.13-3.el7.x86_64                        @docker-ce-stable
    Update                3:20.10.6-3.el7.x86_64                         @docker-ce-stable
    Updated     docker-ce-cli-1:19.03.13-3.el7.x86_64                    @docker-ce-stable
    Update                    1:20.10.6-3.el7.x86_64                     @docker-ce-stable
    Dep-Install docker-ce-rootless-extras-20.10.6-3.el7.x86_64           @docker-ce-stable
    Dep-Install docker-scan-plugin-0.7.0-3.el7.x86_64                    @docker-ce-stable
    Updated     firewalld-filesystem-0.6.3-12.0.1.el7.noarch             @ol7_latest
    Update                           0.6.3-13.0.1.el7_9.noarch           @ol7_latest
    Dep-Install fuse-overlayfs-0.7.2-6.el7_8.x86_64                      @extras
    Dep-Install fuse3-libs-3.6.1-4.el7.x86_64                            @extras
    Erase       kernel-uek-4.14.35-2025.404.1.1.el7uek.x86_64            @ol7_UEKR5
    Install     kernel-uek-4.14.35-2047.503.1.el7uek.x86_64              @ol7_UEKR5
    Updated     kernel-uek-tools-4.14.35-2047.501.1.el7uek.x86_64        @ol7_UEKR5
    Update                       4.14.35-2047.503.1.el7uek.x86_64        @ol7_UEKR5
    Updated     lvm2-7:2.02.187-6.0.3.el7_9.3.x86_64                     @ol7_latest
    Update           7:2.02.187-6.0.3.el7_9.5.x86_64                     @ol7_latest
    Updated     lvm2-libs-7:2.02.187-6.0.3.el7_9.3.x86_64                @ol7_latest
    Update                7:2.02.187-6.0.3.el7_9.5.x86_64                @ol7_latest
    Updated     nss-3.53.1-3.el7_9.x86_64                                @ol7_latest
    Update          3.53.1-7.el7_9.x86_64                                @ol7_latest
    Updated     nss-sysinit-3.53.1-3.el7_9.x86_64                        @ol7_latest
    Update                  3.53.1-7.el7_9.x86_64                        @ol7_latest
    Updated     nss-tools-3.53.1-3.el7_9.x86_64                          @ol7_latest
    Update                3.53.1-7.el7_9.x86_64                          @ol7_latest
    Updated     selinux-policy-3.13.1-268.0.1.el7_9.2.noarch             @ol7_latest
    Update                     3.13.1-268.0.3.el7_9.2.noarch             @ol7_latest
    Updated     selinux-policy-targeted-3.13.1-268.0.1.el7_9.2.noarch    @ol7_latest
    Update                              3.13.1-268.0.3.el7_9.2.noarch    @ol7_latest
    Dep-Install slirp4netns-0.4.3-4.el7_8.x86_64                         @extras

johannesboon commented 3 years ago

Ah, this was also reported in: https://github.com/sorintlab/stolon/issues/826

sgotti commented 3 years ago

@johannesboon I think the main issue is just that the Docker example uses hostnames instead of IPs. When DNS becomes round-robin, as in this case, you get such issues. A working fix that avoids an advertise address (the approach taken in #836) could be to use the container's IP as the listen address instead of the hostname, as in the k8s example, where we use the pod IP.
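
A rough sketch of that suggestion, assuming a wrapper script picks the container's address before starting the keeper. The helper names, the sample addresses, and the overlay subnet are hypothetical; only the `--pg-listen-address` keeper flag comes from the stolon examples. Because a swarm container can have several addresses, the sketch selects the one inside a given network instead of relying on a hostname lookup.

```python
import ipaddress


def select_listen_address(local_addresses, network_cidr):
    """Return the container address that lies inside network_cidr."""
    network = ipaddress.ip_network(network_cidr)
    candidates = [ipaddress.ip_address(a) for a in local_addresses]
    # Keep only addresses on the wanted network; sort for a deterministic pick.
    matches = sorted((ip for ip in candidates if ip in network), key=int)
    if not matches:
        raise ValueError(f"no local address inside {network_cidr}")
    return str(matches[0])


def build_keeper_args(listen_address):
    """Build only the listen-address part of a stolon-keeper command line."""
    return ["stolon-keeper", "--pg-listen-address", listen_address]


if __name__ == "__main__":
    # Hypothetical addresses of one container: overlay network plus a bridge.
    addresses = ["172.18.0.3", "10.0.1.5"]
    listen = select_listen_address(addresses, "10.0.1.0/24")
    print(" ".join(build_keeper_args(listen)))
```

The remaining keeper flags from the compose file would be appended unchanged; the only difference is that the listen address is a fixed IP, so round-robin DNS answers no longer affect it.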