langgenius / dify

Dify is an open-source LLM app development platform. Dify's intuitive interface combines AI workflow, RAG pipeline, agent capabilities, model management, observability features and more, letting you quickly go from prototype to production.
https://dify.ai

The pg database running inside docker container fails to start up after Host computer restarting #8796

Closed GangLiCN closed 1 month ago

GangLiCN commented 1 month ago

Dify version

0.8.3

Cloud or Self Hosted

Self Hosted (Docker)

Steps to reproduce

Issue:
I have a local Dify web app running on a Windows-based VM instance (Ubuntu 22.04 on WSL2). Unfortunately, this environment does not come back up after the VM restarts (the host laptop is shut down every day).

Error logs: From the output of "docker-compose logs", it appears that the PostgreSQL database fails to start because of an invalid checkpoint record.

Analysis: This looks like the result of an abnormal PostgreSQL shutdown, but I can't find any official documentation that covers it. Does this mean that the user must shut down the database (the "dify_db" container) cleanly and take a full backup, in case the database does not start up correctly?

✔️ Expected Behavior

A better PostgreSQL backup and recovery strategy, especially for Docker or Kubernetes environments.

❌ Actual Behavior

The PostgreSQL database used by the app has crashed and cannot be recovered unless you have a full database backup or a correct and complete set of WAL files for recovery.

[Screenshot: dify_docker_pgdb_env_crashed_after_reboot]

dosubot[bot] commented 1 month ago

@GangLiCN Hello there! I'm here to assist you with any bugs, questions, or contributions. Let's tackle this issue together while we wait for a human maintainer.

To ensure that the PostgreSQL database used by the dify_db container starts correctly after a physical box restart or shutdown, you can use the restart: always policy in your Docker Compose configuration. This policy ensures that the container will always restart if it stops, including after a system reboot.

Here is the relevant part of your docker-compose.yaml:

services:
  pgvecto-rs:
    image: tensorchord/pgvecto-rs:pg16-v0.2.0
    restart: always
    environment:
      PGUSER: postgres
      POSTGRES_PASSWORD: difyai123456
      POSTGRES_DB: dify
      PGDATA: /var/lib/postgresql/data/pgdata
    volumes:
      - ./volumes/pgvectors/data:/var/lib/postgresql/data
    ports:
      - "5431:5432"
    healthcheck:
      test: [ "CMD", "pg_isready" ]
      interval: 1s
      timeout: 3s
      retries: 30

For best practices in backup and recovery in a Docker or Kubernetes environment, you should:

  1. Regular Backups: Schedule regular backups of your PostgreSQL database. You can use tools like pg_dump for logical backups or pg_basebackup for physical backups.

  2. Volume Mounts: Ensure that your PostgreSQL data is stored in a Docker volume or a persistent volume in Kubernetes. This allows the data to persist even if the container is destroyed.

  3. Automated Backup Scripts: Create automated scripts to back up your data and store it in a secure location, such as an S3 bucket or another cloud storage service.

  4. Disaster Recovery Plan: Have a disaster recovery plan in place that includes steps to restore your database from backups.

  5. Health Checks: Use health checks to monitor the status of your PostgreSQL container and ensure it is running correctly.

Here is an example of how you might set up a backup script in a Docker environment:

#!/bin/bash
# Abort on errors, unset variables, and failed pipeline stages
set -euo pipefail

# Define backup directory and filename
BACKUP_DIR=/path/to/backup/dir
BACKUP_FILE="$BACKUP_DIR/backup_$(date +%Y%m%d%H%M%S).sql"

# Run pg_dumpall to dump all databases (no -t: a TTY would add carriage returns to the dump)
docker exec dify_db pg_dumpall -c -U postgres > "$BACKUP_FILE"

# Optional: Upload the backup to S3
aws s3 cp "$BACKUP_FILE" "s3://your-bucket-name/$(basename "$BACKUP_FILE")"

In a Kubernetes environment, you can use CronJobs to schedule backups:

apiVersion: batch/v1
kind: CronJob
metadata:
  name: postgres-backup
spec:
  schedule: "0 2 * * *"  # Every day at 2 AM
  jobTemplate:
    spec:
      template:
        spec:
          containers:
          - name: postgres-backup
            image: postgres:15-alpine
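            env:
            - name: PGHOST
              value: "your-postgres-service"  # placeholder: Service name of your PostgreSQL instance
            - name: PGUSER
              value: "postgres"
            - name: PGPASSWORD
              value: "difyai123456"
            command: ["/bin/sh", "-c", "pg_dumpall -c -U $PGUSER > /backup/backup_$(date +%Y%m%d%H%M%S).sql"]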
            volumeMounts:
            - name: backup-volume
              mountPath: /backup
          restartPolicy: OnFailure
          volumes:
          - name: backup-volume
            persistentVolumeClaim:
              claimName: your-pvc-name

By following these practices, you can ensure that your PostgreSQL database is resilient and can be recovered in case of failures [1][2][3][4].

GangLiCN commented 1 month ago

First, thanks for the AI-based agent's quick response! But it still doesn't fix my issue:

1) The "restart: always" policy is already set in my docker-compose configuration file;

2) Based on the current state, it looks like the only effective option is to re-create the containers, because there is no database backup?

crazywoola commented 1 month ago

You can manually back up the database using pg_dump or other tools. Sorry, we do not provide a backup strategy for the community version. This is a system-specific problem; we only make sure it passes CI.

GangLiCN commented 1 month ago

No, I still think this is a very "bad" stability issue. Is it mandatory for the user to shut down the PostgreSQL database before closing the VM? If so, I would say this is a very "bad" design or architecture...

kurokobo commented 1 month ago

@crazywoola @GangLiCN Hi, sorry for adding a new comment to the closed issue, but what do you think about making stop_grace_period configurable, since Docker sends SIGKILL 10 seconds after SIGTERM by default?

I mean, during the shutdown of your PC, if PostgreSQL takes more than 10 seconds to shut down, it will be killed by SIGKILL, which may cause data corruption. Increasing stop_grace_period makes Docker wait longer than 10 seconds before sending SIGKILL. It may help reduce the risk of data corruption in an environment that is shut down and started up frequently.
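
A minimal sketch of that suggestion, written as a Compose override fragment. The service name db and the 60-second value are illustrative assumptions and do not come from this thread; use the name of the PostgreSQL service in your own docker-compose.yaml and a timeout that suits your data volume.

```yaml
# Hypothetical override fragment; merge it into your own Compose setup.
services:
  db:                        # assumed name of the PostgreSQL service
    stop_grace_period: 60s   # illustrative value; the Compose default is 10s
```

With this in place, a normal container stop gives PostgreSQL up to the configured period to finish shutting down before SIGKILL is sent. It cannot help when the host or VM is powered off without stopping the containers first.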