Closed: iamtherobin closed this issue 4 months ago
There are a few things going on here. Firstly, I believe your database is corrupted, caused by an unclean shutdown around 2024-06-12 20:20:03.454 UTC. Can you restore from backup?
2024-06-12 20:20:07.459 UTC [28] LOG: incorrect resource manager data checksum in record at 0/2F2AC0C0
2024-06-12 20:20:07.459 UTC [28] LOG: invalid primary checkpoint record
2024-06-12 20:20:07.459 UTC [28] PANIC: could not locate a valid checkpoint record
To fix the FATAL: role "root" does not exist error, can you try implementing the fix in #10221 and report back?
Also, you can try removing the healthcheck section if the above suggestion doesn't work.
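For reference, a minimal sketch of the database healthcheck block in question, assuming your compose file resembles the v1.106 release template (the exact image tag and test command in your file may differ). The role "root" error happens because pg_isready without an explicit --username defaults to the container's OS user, which is root:

```yaml
services:
  database:
    image: docker.io/tensorchord/pgvecto-rs:pg14-v0.2.0  # tag may differ in your file
    # ...
    # "Removing the healthcheck" means deleting or commenting out this whole block.
    # Passing --username explicitly avoids pg_isready falling back to the OS
    # user (root), which is what produces the FATAL: role "root" log line.
    healthcheck:
      test: pg_isready --dbname='${DB_DATABASE_NAME}' --username='${DB_USERNAME}' || exit 1
```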
I have pretty much reached the same conclusion about the db being corrupted. I don't have a backup, so I will have to dump the server and start fresh. It's annoying, even though nothing is lost since I have all the original data. But it does worry me if I cannot figure out where I went wrong, or whether the update caused this.
I was already using the change posted in that ticket as of the time I posted this one. I don't remember if I added that health check to the compose file before or after the last time the FATAL line shows up in the log, since I was attempting a sequence of compose edits to try to get the containers to start again.
I am now noticing that the last time the FATAL line shows up, it is timestamped to fall inside the previous log file, which is interesting.
Ok, I THINK I figured out what happened here.
tl;dr: You cannot change the ".env" filename. It seems docker compose does not care what you supply in the env_file: property; it will always load the variables used inside the compose file from "./.env". It seems the property can even be dropped completely from the compose file and it will still read from .env.
The fact that the corruption apparently happened in a log file whose timestamps fit inside the previous log leads me to believe the database got corrupted by launching two postgres containers.
When I attempted to update from v1.105 to v1.106, the container was constantly restarting, likely (but not certainly) due to manual changes in the compose file. I attempted to make the yaml changes manually by referencing the new compose file. In hindsight, I should have pulled a whole new compose file.
Since I started reading about others also having restarting containers with this update, I figured it was probably not my compose file that was the issue but maybe something else. So I thought I would test upgrading v1.105 to v1.106 on a fresh deployment and see if it would update cleanly without the container restarting.
To do that, I created a new compose yaml file based on v1.105 with separate container names, a new environment file called "old.env" with properties for the v1.105 tag, and a new library/postgres location so it would not conflict with my existing one. In the compose file I added "env_file: - old.env" to all the containers with variables. Logically, I assumed that if there is an env_file: property, compose would obey it, right? Turns out it just loads the already-present ".env" anyway. Which means it just attempted to pull v1.106 again and run it pointed at the same library and database, while using the compose template intended for v1.105.
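For anyone attempting the same side-by-side test: as far as I can tell, env_file: only injects variables into the container at run time; the file compose uses to resolve ${...} references in the yaml itself defaults to .env and can only be changed on the command line. A minimal sketch of what should work (the file name docker-compose.old.yaml and project name immich-test are just examples):

```yaml
# docker-compose.old.yaml -- hypothetical side-by-side test stack
services:
  immich-server:
    container_name: immich_server_old
    # ${IMMICH_VERSION} is resolved when compose PARSES this file, from .env
    # by default -- not from the env_file listed below.
    image: ghcr.io/immich-app/immich-server:${IMMICH_VERSION:-release}
    env_file:
      - old.env   # only sets variables inside the running container

# To make compose interpolate from old.env instead of .env, pass it on the
# command line, and use a separate project name so nothing collides:
#
#   docker compose -p immich-test --env-file old.env -f docker-compose.old.yaml up -d
```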
While I cannot be absolutely certain at this point what actually happened, dual postgres instances pointing to the same database is the only explanation I can think of for why one log is timestamped to occur in between the previous one. It is also the same point where the checksum error occurs.
The saving grace here is that I deployed immich only a week ago and was still in the process of testing it for exactly this type of quirky situation before actually relying on it for production use (which is why I hadn't set up a backup process yet).
I have now launched a fresh v1.106.3 and it is working.
Docker compose will read .env, and those settings can be used inside the compose file itself. That is unrelated to the env_file property, which passes the file to the container.
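To illustrate the two mechanisms with a minimal, hypothetical service definition:

```yaml
services:
  app:
    # 1) Interpolation: ${TAG} is resolved by compose while parsing this file,
    #    read from .env (or from a file given via --env-file) on the host.
    image: alpine:${TAG:-latest}
    # 2) env_file: its contents are handed to the container as environment
    #    variables at run time; they play no part in resolving ${TAG} above.
    env_file:
      - old.env
```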
The bug
I tried to upgrade from v1.105.1 to v1.106.2 by modifying the yaml file accordingly and pulling the new images. The result was a constantly restarting immich server container. Then I tried reverting back to v1.105.1, but now I am getting a postgres container error. I cannot get either v1.105.1 or newer to load anymore.
I am currently trying to load v1.106.3 and postgres keeps restarting at "PostgreSQL Database directory appears to contain a database". Both the postgres and server containers repeat the errors shown in the docker log.
I also attached the postgres logs from the database log folder. The postgres logs start from where I attempted the v1.106 update up to the point where they just keep repeating.
The OS that Immich Server is running on
Arch, kernel 6.8.9
Version of Immich Server
v1.106.3
Version of Immich Mobile App
latest
Platform with the issue
Your docker-compose.yml content
Your .env content
Reproduction steps
Relevant log output
Additional information