sorintlab / stolon

PostgreSQL cloud native High Availability and more.
https://talk.stolon.io
Apache License 2.0

gh-849: To check the cluster replication type based on cluster spec instead of masterdb spec #850

Open viggy28 opened 2 years ago

viggy28 commented 2 years ago

Addresses https://github.com/sorintlab/stolon/issues/849

  1. The primary failed while the synchronous replica (SR) was also failing, and the sentinel elected the SR as the new primary.

    2021-10-06T14:44:31.522-0700    WARN    cmd/sentinel.go:276 no keeper info available    {"db": "f9aca2fc", "keeper": "postgres1"}
    2021-10-06T14:44:31.522-0700    WARN    cmd/sentinel.go:276 no keeper info available    {"db": "1fe806fe", "keeper": "postgres3"}
    2021-10-06T14:44:36.740-0700    WARN    cmd/sentinel.go:276 no keeper info available    {"db": "f9aca2fc", "keeper": "postgres1"}
    2021-10-06T14:44:36.740-0700    WARN    cmd/sentinel.go:276 no keeper info available    {"db": "1fe806fe", "keeper": "postgres3"}
    2021-10-06T14:44:36.750-0700    INFO    cmd/sentinel.go:995 master db is failed {"db": "f9aca2fc", "keeper": "postgres1"}
    2021-10-06T14:44:36.750-0700    INFO    cmd/sentinel.go:1006    trying to find a new master to replace failed master
    2021-10-06T14:44:36.750-0700    INFO    cmd/sentinel.go:1032    electing db as the new master   {"db": "1fe806fe", "keeper": "postgres3"}
    2021-10-06T14:44:42.018-0700    WARN    cmd/sentinel.go:276 no keeper info available    {"db": "f9aca2fc", "keeper": "postgres1"}
  2. However, the SR also failed, so the sentinel found no eligible master.

    2021-10-06T14:44:47.334-0700    INFO    cmd/sentinel.go:1006    trying to find a new master to replace failed master
    2021-10-06T14:44:47.334-0700    WARN    cmd/sentinel.go:1016    cannot choose synchronous standby since there are no common elements between the latest master reported synchronous standbys and the db spec ones   {"reported": [], "spec": ["f9aca2fc"]}
    2021-10-06T14:44:47.334-0700    ERROR   cmd/sentinel.go:1035    no eligible masters
    2021-10-06T14:44:52.581-0700    WARN    cmd/sentinel.go:276 no keeper info available    {"db": "f9aca2fc", "keeper": "postgres1"}
    2021-10-06T14:44:52.581-0700    WARN    cmd/sentinel.go:276 no keeper info available    {"db": "1fe806fe", "keeper": "postgres3"}
  3. With synchronous replication disabled, the sentinel picked the asynchronous replica (ASR) as the new primary.

    2021-10-06T14:45:39.779-0700    INFO    cmd/sentinel.go:995 master db is failed {"db": "1fe806fe", "keeper": "postgres3"}
    2021-10-06T14:45:39.779-0700    INFO    cmd/sentinel.go:1001    db not converged    {"db": "1fe806fe", "keeper": "postgres3"}
    2021-10-06T14:45:39.779-0700    INFO    cmd/sentinel.go:1006    trying to find a new master to replace failed master
    2021-10-06T14:45:39.779-0700    INFO    cmd/sentinel.go:1032    electing db as the new master   {"db": "9665a7da", "keeper": "postgres2"}
    2021-10-06T14:45:45.068-0700    WARN    cmd/sentinel.go:276 no keeper info available    {"db": "f9aca2fc", "keeper": "postgres1"}
    2021-10-06T14:45:45.068-0700    WARN    cmd/sentinel.go:276 no keeper info available    {"db": "1fe806fe", "keeper": "postgres3"}
    2021-10-06T14:45:50.335-0700    WARN    cmd/sentinel.go:276 no keeper info available    {"db": "1fe806fe", "keeper": "postgres3"}
    2021-10-06T14:45:50.335-0700    WARN    cmd/sentinel.go:276 no keeper info available    {"db": "f9aca2fc", "keeper": "postgres1"}
    2021-10-06T14:45:50.345-0700    INFO    cmd/sentinel.go:1151    removing old master db  {"db": "1fe806fe", "keeper": "postgres3"}
    2021-10-06T14:45:50.345-0700    INFO    cmd/sentinel.go:1151    removing old master db  {"db": "f9aca2fc", "keeper": "postgres1"}
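
The distinction named in the PR title can be sketched in Go. This is a minimal illustration under stated assumptions, not stolon's actual code: the types `ClusterSpec`, `DBSpec`, `DBStatus` and the functions `commonElements` and `eligibleStandbys` are hypothetical, simplified stand-ins. It assumes that, after synchronous replication is disabled at the cluster level, the failed master's own spec may still say "synchronous" because the db has not converged.

```go
package main

import "fmt"

// Hypothetical, simplified stand-ins for the specs involved (not stolon's real types).
type ClusterSpec struct {
	SynchronousReplication bool
}

type DBSpec struct {
	SynchronousReplication bool
	SynchronousStandbys    []string
}

type DBStatus struct {
	SynchronousStandbys []string // standbys the master last reported as synchronous
}

// commonElements returns the values of b that also appear in a.
func commonElements(a, b []string) []string {
	seen := make(map[string]struct{}, len(a))
	for _, v := range a {
		seen[v] = struct{}{}
	}
	var out []string
	for _, v := range b {
		if _, ok := seen[v]; ok {
			out = append(out, v)
		}
	}
	return out
}

// eligibleStandbys filters failover candidates. When syncRepl is true, only
// standbys present in both the master's reported and spec'd synchronous
// standbys remain eligible; otherwise every candidate is eligible.
func eligibleStandbys(syncRepl bool, master DBSpec, status DBStatus, candidates []string) []string {
	if !syncRepl {
		return candidates
	}
	common := commonElements(status.SynchronousStandbys, master.SynchronousStandbys)
	return commonElements(common, candidates)
}

func main() {
	// Values mirror the logs: reported [], spec ["f9aca2fc"], candidate "9665a7da".
	cluster := ClusterSpec{SynchronousReplication: false} // disabled by the operator
	master := DBSpec{SynchronousReplication: true, SynchronousStandbys: []string{"f9aca2fc"}}
	status := DBStatus{}
	candidates := []string{"9665a7da"}

	byMasterSpec := eligibleStandbys(master.SynchronousReplication, master, status, candidates)
	byClusterSpec := eligibleStandbys(cluster.SynchronousReplication, master, status, candidates)

	fmt.Println(len(byMasterSpec), byClusterSpec) // prints: 0 [9665a7da]
}
```

In this sketch, deciding on the stale master db spec leaves no eligible candidate, while deciding on the cluster spec lets the asynchronous replica be elected.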
viggy28 commented 2 years ago

The integration tests are failing, and I'm not sure why.

Also, I am unable to restart them (the only way I can trigger a rerun right now is by pushing a commit).

sgotti commented 2 years ago

@viggy28 They're all failing in the same 4 test cases, so your changes are affecting them in some way. You should run the specific tests locally to better understand the reason.