truenas / charts

TrueNAS SCALE Apps Catalogs & Charts
BSD 3-Clause "New" or "Revised" License

Immich upgrade always fails with "timed out waiting for the condition" #1687

Closed Chaphasilor closed 11 months ago

Chaphasilor commented 11 months ago

I've been running Immich on my Truenas SCALE machine for a few weeks now and in that time updated about 5 times.
Every time, I got the following error:

Full error log:

Error: Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/middlewared/job.py", line 427, in run
    await self.future
  File "/usr/lib/python3/dist-packages/middlewared/job.py", line 465, in __run_body
    rv = await self.method(*([self] + args))
  File "/usr/lib/python3/dist-packages/middlewared/schema.py", line 1379, in nf
    return await func(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/middlewared/schema.py", line 1247, in nf
    res = await f(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/middlewared/plugins/chart_releases_linux/upgrade.py", line 115, in upgrade
    await self.upgrade_chart_release(job, release, options)
  File "/usr/lib/python3/dist-packages/middlewared/plugins/chart_releases_linux/upgrade.py", line 298, in upgrade_chart_release
    await self.middleware.call('chart.release.helm_action', release_name, chart_path, config, 'upgrade')
  File "/usr/lib/python3/dist-packages/middlewared/main.py", line 1368, in call
    return await self._call(
  File "/usr/lib/python3/dist-packages/middlewared/main.py", line 1328, in _call
    return await self.run_in_executor(prepared_call.executor, methodobj, *prepared_call.args)
  File "/usr/lib/python3/dist-packages/middlewared/main.py", line 1231, in run_in_executor
    return await loop.run_in_executor(pool, functools.partial(method, *args, **kwargs))
  File "/usr/lib/python3.9/concurrent/futures/thread.py", line 52, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/usr/lib/python3/dist-packages/middlewared/plugins/chart_releases_linux/helm.py", line 44, in helm_action
    raise CallError(f'Failed to {tn_action} chart release: {stderr.decode()}')
middlewared.service_exception.CallError: [EFAULT] Failed to upgrade chart release: Error: UPGRADE FAILED: pre-upgrade hooks failed: timed out waiting for the condition

It seems like the upgrade itself is usually working though, and the version is up to date after refreshing the apps list.
For the upgrade from v1.81.1 to v1.82.1, my installation got corrupted and didn't come back up anymore. Luckily the pgBackup pod did create a DB dump, and I was able to re-install and import the old database.

I'm running TrueNAS-SCALE-22.12.4.2, and the issue also happened with an older version of Bluefin. The error also happens with two separate fresh installs, so the installation itself doesn't seem to be the cause either.

If someone could look into this, I would really appreciate it. Recovering my database wasn't easy because it involved messing around with kubectl, so I would be relieved if I didn't have to worry anymore when upgrading to a new version of Immich :)

If any additional info is needed, please let me know!

stavros-k commented 11 months ago

Hello, it looks like the pre-upgrade job is failing (it's the job that takes a database backup before the upgrade). I've tried a few upgrades myself and all succeeded.

Reason that I can think of that this could fail is:

But you mention that the db dump was created which makes me super curious.

Once the job is started it runs these: https://github.com/truenas/charts/blob/58967953ec2f1e4b3fffdcc7238dc2e1f0f0cb7b/library/common/templates/app_functions/_postgres.tpl#L120-L123

Next time you get the error can you please check the logs of that job? If you create a Debug artifact right after the failure it should include those logs to retrieve easily.

You can create the Debug artifact from System Settings -> Advanced -> Save Debug button on the top right corner. Note: Do not share publicly the debug file as it might contain private details.

Chaphasilor commented 11 months ago

Thanks for the response, I'll do it for 1.83.1 once that is released. Could you clarify when and where exactly I should run those four commands? In the TrueNAS shell? Within the pgBackup pod? Or somewhere else entirely? 🤔

stavros-k commented 11 months ago

Oh, sorry, you don't have to run those! I was just referencing what the pre-upgrade job is running. Not much can go wrong there, unless the database is not responding or the backup is taking too long.

Chaphasilor commented 11 months ago

Right, I mistook "it runs these" for "run these". My bad. I'll get back to you after the update!

Just FYI, my Immich library has about 8k photos, 300 videos and 41 GB, and the uncompressed SQL dump is ~90 MB. It took only a few seconds to create when I did it manually before this update

stavros-k commented 11 months ago

Yea, the size and duration you mention are in the ballpark I would expect.

Chaphasilor commented 11 months ago

Just updated to v1.84 and this time there was no error, naturally 😅
What I did differently this time is not upgrading the machine learning container beforehand. I have set up the immich-machine-learning docker container separately and disabled the bundled ML container because that was giving me trouble at first. Usually I first shut down immich, then upgrade immich-machine-learning, then upgrade immich and start it back up (in hopes of not running into incompatibilities between immich and the immich-machine-learning).
I'll try to do that again next time to reproduce the issue

stavros-k commented 11 months ago

Ah, I see, if you trigger the update while the app is stopped, the backup job cannot complete, because the database is stopped.
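
To illustrate why a stopped database shows up as a timeout rather than a clear error, here is a minimal Python sketch of a wait-until-reachable loop, which is roughly the kind of check a backup step needs before it can dump anything. This is not the chart's actual code; `wait_for_port` and its parameters are hypothetical names made up for this example.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout_s: float, interval_s: float = 0.5) -> bool:
    """Poll a TCP port until it accepts a connection or the deadline passes.

    Hypothetical helper for illustration only, not part of the chart.
    Returns True as soon as a connection succeeds, False on timeout.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval_s):
                return True
        except OSError:
            # Refused or unreachable: wait a bit and try again.
            time.sleep(interval_s)
    return False
```

If nothing is listening (for example because the app, and with it the database, is stopped), the loop can only keep retrying until the deadline, so the caller sees a timeout and nothing more specific.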

Chaphasilor commented 11 months ago

That does make sense. So you'll probably need to spin up the immich-postgres container too?
Btw, getting debug logs didn't work either 😅

stavros-k commented 11 months ago

There isn't a clean way to do that.

Case 1: Postgres container runs already and is reachable -> Proceed with the backup -> No problem!

Case 2: Postgres container does not run and is not reachable -> Proceed with starting a container in the background, do the backup. Stop the container. -> No problem!

Case 3: Postgres container runs already but is not reachable. If you try to spin up a postgres container using the same data directory, there is a big chance of corrupting data, since it will try to replay the write-ahead log while the already running container is managing it too, putting both containers in a bad state. -> Problem!!
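
The three cases can be written down as a small decision table. This is only a sketch in Python to make the reasoning explicit, not code from the chart; `BackupPlan` and `plan_backup` are hypothetical names.

```python
from enum import Enum


class BackupPlan(Enum):
    BACKUP_NOW = "back up against the running database"
    START_TEMP_THEN_BACKUP = "start a temporary container, back up, stop it"
    REFUSE = "do not start a second container (risk of data corruption)"


def plan_backup(running: bool, reachable: bool) -> BackupPlan:
    """Map the state of the Postgres container to a backup action."""
    if running and reachable:
        # Case 1: the normal path.
        return BackupPlan.BACKUP_NOW
    if not running and not reachable:
        # Case 2: safe to start a temporary container on the data directory.
        return BackupPlan.START_TEMP_THEN_BACKUP
    # Case 3 (running but unreachable), plus the inconsistent state
    # "not running but reachable": never start a second instance.
    return BackupPlan.REFUSE
```

The catch is that this function needs `running` as an input. A job inside the cluster can only observe `reachable`, so Case 2 and Case 3 look identical from its point of view, and it cannot safely pick between them.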

The next SCALE release will include some extra metadata available to the Chart, so it should handle upgrades from a stopped state better along with this. But this also means that there won't be a backup during an upgrade from a stopped state, as the job will never fire, or will fire but never complete (need to check this!). We can probably detect that with the extra metadata and provide a more useful error message. TL;DR: it's better to upgrade from a running state, so the flow completes as it should.

In case you are wondering.. "Why not just check if the X container is running using X command?"

Well, Apps are just a GUI for generating manifests using Helm, which then get sent to Kubernetes. Some shell scripts can be added in the containers to run at startup, but you cannot interact with the host to check what is running or not. Well, you can, but only if you give all the nasty elevated permissions to the container, which for obvious reasons you don't want to do.


Regarding the save debug process not completing, I'd suggest opening a ticket in https://jira.ixsystems.com in order to be checked.

Chaphasilor commented 11 months ago

Alright, I was hoping the new SCALE release would also bring some improvements for apps under the hood and not just a UI overhaul, good to hear!
TrueCharts always put a checkbox at the end of the config/edit dialog, "I have checked the documentation" and such, maybe you could also include something similar where you mention that it's best to upgrade from a running state without shutting down? 😁

Chaphasilor commented 11 months ago

Sweet, thanks for that! I take it this behavior will be available after refreshing the chart for the next Immich upgrade? If so, I'd go ahead and try it out asap! 😁

stavros-k commented 11 months ago

Not yet, the "code" will be on the app on the next immich App release, but the actual metadata will only exist on the next Cobia release. Until then it will work as it does now.

Chaphasilor commented 11 months ago

Got it. If I don't forget about it, I'll check back once I've upgraded to Cobia!