teamhephy / workflow

Hephy Workflow - An open source fork of Deis Workflow - The open source PaaS for Kubernetes.
MIT License

error upgrading from very old install, hephy/postgres:v2.7.3 boots but hephy/postgres:v2.7.6 does not. #142

Closed n3wscott closed 3 years ago

n3wscott commented 3 years ago

Totally could be an edge case, but I am upgrading from a v2.12 install of workflow and I am hitting a looping error in the database controller:

2021-03-27 18:00:54.279 UTC [1] LOG:  server process (PID 1349) exited with exit code 2
2021-03-27 18:00:54.279 UTC [1337] LOG:  redo done at 4F7/4C026808
2021-03-27 18:00:54.279 UTC [1] LOG:  terminating any other active server processes
2021-03-27 18:00:54.280 UTC [1] LOG:  all server processes terminated; reinitializing
2021-03-27 18:00:54.289 UTC [1351] LOG:  database system was interrupted while in recovery at log time 2021-03-27 17:30:59 UTC
2021-03-27 18:00:54.289 UTC [1351] HINT:  If this has occurred more than once some data might be corrupted and you might need to choose an earlier recovery target.
2021-03-27 18:00:54.399 UTC [1351] LOG:  starting archive recovery
wal_e.operator.backup INFO     MSG: begin wal restore
        STRUCTURED: time=2021-03-27T18:00:54.695379-00 pid=1352 action=wal-fetch key=s3:***.lzo prefix= seg=00000001000004F70000004C state=begin
wal_e.main   CRITICAL MSG: An unprocessed exception has avoided all error handling
        DETAIL: Traceback (most recent call last):
          File "/usr/lib/python3.8/site-packages/wal_e/cmd.py", line 659, in main
            res = backup_cxt.wal_restore(args.WAL_SEGMENT,
          File "/usr/lib/python3.8/site-packages/wal_e/operator/backup.py", line 314, in wal_restore
            started = start_prefetches(seg, pd, prefetch_max)
          File "/usr/lib/python3.8/site-packages/wal_e/operator/backup.py", line 584, in start_prefetches
            with daemon.DaemonContext(stderr=open(os.devnull, 'w')):
          File "/usr/lib/python3.8/site-packages/wal_e/pep3143daemon/daemon.py", line 116, in __init__
            else detach_required()
          File "/usr/lib/python3.8/site-packages/wal_e/pep3143daemon/daemon.py", line 398, in detach_required
            if parent_is_inet() or parent_is_init():
          File "/usr/lib/python3.8/site-packages/wal_e/pep3143daemon/daemon.py", line 376, in parent_is_inet
            sock = socket.fromfd(
          File "/usr/lib/python3.8/socket.py", line 544, in fromfd
            return socket(family, type, proto, nfd)
          File "/usr/lib/python3.8/socket.py", line 231, in __init__
            _socket.socket.__init__(self, family, type, proto, fileno)
        OSError: [Errno 88] Not a socket

These errors loop seemingly forever.

And v2.7.3 seems to boot after it performed an upgrade.
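The last frames of the traceback can be illustrated in isolation. This is a minimal sketch, assuming Python 3.7 or later on Linux: wrapping a file descriptor that is not a socket with `socket.fromfd` raises `OSError` with `ENOTSOCK` (errno 88 on Linux), the same error shown in the log. The pipe and the helper name `probe_fd` are illustrative assumptions, not part of wal-e, and the sketch does not establish why v2.7.6 in particular hits this path.

```python
import errno
import os
import socket


def probe_fd(fd):
    """Return None if fd can be wrapped as a socket, else the errno raised."""
    try:
        # Same call shape as the traceback: wrap an existing descriptor.
        sock = socket.fromfd(fd, socket.AF_INET, socket.SOCK_RAW)
    except OSError as err:
        return err.errno
    sock.close()
    return None


# A pipe is a valid file descriptor but not a socket (illustrative stand-in).
read_fd, write_fd = os.pipe()
result = probe_fd(read_fd)
os.close(read_fd)
os.close(write_fd)

print("ENOTSOCK raised:", result == errno.ENOTSOCK)
```

Because the exception is raised by `socket.fromfd` itself, any error handling that only wraps a later call on the socket object would not catch it.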

Cryptophobia commented 3 years ago

Has this issue been resolved, @n3wscott? I noticed you mentioned in the Slack channel threads that it worked...

Maybe it was a case of not having a good backup or the particular latest wal-e backup was corrupt when the update was tried?

kingdonb commented 3 years ago

The failures could have been related to the invalid database password that we found. Since the K8s control plane had already been upgraded past K8s 1.16, an in-place upgrade from Deis Workflow to the latest Hephy Workflow (now with built-in support for K8s v1 APIs) would have failed to start the controller and connect to the database from it.

So we did a helm uninstall / helm reinstall, after taking backups of anything that might be needed to put Deis back the way it was: every configmap, secret, deployment, daemonset, service, ingress, etc., plus the output of helm get values [release-name].
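The backup step described above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the exact procedure that was run: it assumes Helm 3 and `kubectl` are on the PATH, that Workflow lives in a namespace called `deis`, and that the release is named `deis-workflow` (both names are hypothetical placeholders). The script only builds and prints the commands; it does not execute them.

```python
import shlex

# Resource kinds named in the comment above.
KINDS = ["configmap", "secret", "deployment", "daemonset", "service", "ingress"]


def backup_commands(namespace, release, kinds=KINDS):
    """Build (argv, output_file) pairs for a manual backup; nothing is run."""
    commands = []
    for kind in kinds:
        argv = ["kubectl", "get", kind, "--namespace", namespace, "--output", "yaml"]
        commands.append((argv, f"{kind}.yaml"))
    # Helm 3 syntax; the release name is an assumption.
    helm_argv = ["helm", "get", "values", release, "--namespace", namespace]
    commands.append((helm_argv, "values.yaml"))
    return commands


# "deis" and "deis-workflow" are placeholder names, adjust to the real install.
for argv, outfile in backup_commands("deis", "deis-workflow"):
    print(f"{shlex.join(argv)} > {shlex.quote(outfile)}")
```

Keeping the secret dump is the part that mattered here, since it is what allows database-creds to be restored after a reinstall.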

When it came back online after the reinstall, the database secret database-creds had apparently been wiped and overwritten by a newly generated credential, but the s3 credentials were correct, so the database restore succeeded. (The new creds did not match the creds required by the restored database backup.)

But this was no problem: restoring the secret database-creds worked, and the controller connected successfully.

I'm not sure though. The errors given in @n3wscott's top post weren't from the controller; they are from the postgres database.

Did you try upgrading to v2.7.6 again after that password issue was all resolved? Did it turn out to have anything to do with the database-creds issue, or is it still something else?

Cryptophobia commented 3 years ago

Sounds like database-creds was the issue here and may have been a configuration issue during the upgrade process... If we don't hear back, we should close this.

n3wscott commented 3 years ago

Thanks for the comments, let me try upgrading back to v2.7.6 and report back.

Cryptophobia commented 3 years ago

@n3wscott, we never heard anything back. If it is still an issue, feel free to reopen it.