vfedotovs / sslv_web_scraper

ss.lv web scraping app that automates scraping and filtering of classified ads, emails the results, and stores the scraped data in a database
GNU General Public License v3.0
5 stars 3 forks

FEAT(CICD): Improve DB backup workflow in EC2 instance #287

Open vfedotovs opened 3 months ago

vfedotovs commented 3 months ago

Current behavior

Backup files do not contain hour_min_sec in the filename:

122712 Aug 8 08:05 pg_backup_2024_08_08.sql
122689 Aug 7 08:05 pg_backup_2024_08_07.sql

The backup workflow has no logic to upload the latest backup file.

 crontab -l
5 6 * * * docker exec -t $(docker ps| grep db-1| awk '{print $NF}')  pg_dump -U  DB-USER -d DB-NAME > /tmp/pg_backup_$(date +\%Y_\%m_\%d).sql
7 6 * * * aws s3 cp /tmp/pg_backup_$(date +\%Y_\%m_\%d).sql  s3://bucket-name-pg-backups/pg_backup_$(date +\%Y_\%m_\%d).sql

Possible solution

Improved cron job version example:

5 6 * * * docker exec -t $(docker ps --filter "name=db-1" --format "{{.Names}}") pg_dump -U DB-USER -d DB-NAME | gzip > /tmp/pg_backup_$(date +\%Y_\%m_\%d).sql.gz 2>> /var/log/pg_backup_error.log && echo "$(date +\%Y_\%m_\%d_\%H:\%M:\%S) Backup successful" >> /var/log/pg_backup_success.log

Notes:

$(docker ps --filter "name=db-1" --format "{{.Names}}"): This refines the docker ps command to specifically target the container name that matches "db-1", which is more reliable than using grep.

gzip: The backup is compressed with gzip to save space.

/tmp/pg_backup_$(date +\%Y_\%m_\%d).sql.gz: The backup file is saved with a .sql.gz extension to indicate that it’s compressed.

Error Logging: 2>> /var/log/pg_backup_error.log redirects any errors to a specific log file.

Success Logging: echo "$(date +\%Y_\%m_\%d_\%H:\%M:\%S) Backup successful" >> /var/log/pg_backup_success.log logs a success message along with a timestamp if the backup completes successfully.

Backup Retention Policy: Consider setting up a job to remove old backups after a certain period (e.g., 30 days).

0 0 * * * find /tmp/pg_backup_*.sql.gz -mtime +30 -exec rm {} \;
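
As a rough sketch, both points from the issue description (hour/min/sec in the filename and uploading exactly the file that was just created) could be handled in a single cron entry, reusing the DB-USER / DB-NAME / bucket placeholders from above (untested):

5 6 * * * f=/tmp/pg_backup_$(date +\%Y_\%m_\%d_\%H\%M\%S).sql.gz; docker exec -t $(docker ps --filter "name=db-1" --format "{{.Names}}") pg_dump -U DB-USER -d DB-NAME | gzip > "$f" && aws s3 cp "$f" s3://bucket-name-pg-backups/

Because the filename is computed once and reused for the upload, the dump and the upload can no longer drift apart the way two separate cron entries can.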

vfedotovs commented 6 days ago

Containerised solution proposal:

The following script creates a PostgreSQL dump with pg_dump and uploads it to an S3 bucket:

# backup_upload.py
import boto3
import subprocess
import datetime
import os

def backup_postgres():
    # Define backup filename with current date
    date_str = datetime.datetime.now().strftime("%Y_%m_%d")
    backup_filename = f"/tmp/pg_backup_{date_str}.sql"

    # Run pg_dump command to create the backup
    subprocess.run([
        "pg_dump", 
        "-U", "DB-USER", 
        "-h", "db",           # Assuming the db container is named `db` in the network
        "-d", "DB-NAME",
        "-f", backup_filename
    ], check=True)

    return backup_filename

def upload_to_s3(file_path, bucket_name, object_name=None):
    # Initialize the boto3 client
    s3_client = boto3.client('s3', region_name='your-region')  # Replace 'your-region' with your S3 region

    # Define object name in S3 if not provided
    if not object_name:
        object_name = os.path.basename(file_path)

    # Upload the file to the specified S3 bucket
    s3_client.upload_file(file_path, bucket_name, object_name)
    print(f"Uploaded {file_path} to S3 bucket {bucket_name}")

if __name__ == "__main__":
    # Generate the backup
    backup_file = backup_postgres()

    # Upload backup to S3
    s3_bucket_name = "bucket-name-pg-backups"  # Replace with your bucket name
    upload_to_s3(backup_file, s3_bucket_name)
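
A minimal way to exercise the script outside the container, assuming the database requires a password (PGPASSWORD is the standard environment variable that pg_dump reads; all values below are placeholders):

# Assumed local test run; the script itself does not handle passwords
export PGPASSWORD=your-db-password
export AWS_ACCESS_KEY_ID=your-access-key-id
export AWS_SECRET_ACCESS_KEY=your-secret-access-key
python backup_upload.py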

Dockerfile for the container:

# Dockerfile
FROM python:3.9

# Install PostgreSQL client
RUN apt-get update && apt-get install -y postgresql-client && rm -rf /var/lib/apt/lists/*

# Set working directory
WORKDIR /app

# Copy the Python script into the container
COPY backup_upload.py /app

# Install Python dependencies
RUN pip install boto3

# Environment variables for AWS credentials
ENV AWS_ACCESS_KEY_ID=your-access-key-id
ENV AWS_SECRET_ACCESS_KEY=your-secret-access-key

# Run the Python script
CMD ["python", "backup_upload.py"]

Docker compose example

version: '3.8'
services:
  db-backup:
    build: .
    environment:
      - AWS_ACCESS_KEY_ID=${AWS_ACCESS_KEY_ID}
      - AWS_SECRET_ACCESS_KEY=${AWS_SECRET_ACCESS_KEY}
    depends_on:
      - db
    restart: on-failure
    entrypoint: ["cron", "-f"]
    volumes:
      - db_data:/var/lib/postgresql/data

volumes:
  db_data:
    external: true

cronfile

0 6 * * * python /app/backup_upload.py
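
The ${AWS_ACCESS_KEY_ID} and ${AWS_SECRET_ACCESS_KEY} references above are resolved by docker compose from the shell environment or from a .env file placed next to docker-compose.yml, so the keys never have to be baked into the image. For example (placeholder values, keep this file out of version control):

# .env
AWS_ACCESS_KEY_ID=your-access-key-id
AWS_SECRET_ACCESS_KEY=your-secret-access-key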

Run docker compose

docker-compose up --build -d
vfedotovs commented 6 days ago

Updated Dockerfile that will run a scheduled cron backup:

FROM python:3.9

# Install PostgreSQL client and cron
RUN apt-get update && apt-get install -y postgresql-client cron && rm -rf /var/lib/apt/lists/*

# Set working directory
WORKDIR /app

# Copy application code and cronfile into the container
COPY backup_upload.py /app
COPY cronfile /etc/cron.d/db-backup-cron

# Install Python dependencies
RUN pip install boto3

# Set permissions on the cronfile and register it with crontab
RUN chmod 0644 /etc/cron.d/db-backup-cron && crontab /etc/cron.d/db-backup-cron

# Create a log file for cron output
RUN touch /var/log/cron.log

# Run the cron daemon in foreground (for Docker compatibility)
CMD cron -f

New docker compose file:

version: '3.8'
services:
  db-backup:
    build: .
    environment:
      - AWS_ACCESS_KEY_ID=${AWS_ACCESS_KEY_ID}
      - AWS_SECRET_ACCESS_KEY=${AWS_SECRET_ACCESS_KEY}
    depends_on:
      - db
    restart: on-failure
    volumes:
      - db_data:/var/lib/postgresql/data
volumes:
  db_data:
    external: true

cronfile

0 6 * * * python /app/backup_upload.py >> /var/log/cron.log 2>&1

How the Cron Job Works Inside the Container

Container Starts and Runs cron: When the container starts, the CMD cron -f in the Dockerfile starts the cron daemon in the foreground, allowing it to continue running as the container’s main process.

Cron Reads the cronfile: The crontab loads the cronfile with the scheduled task(s), setting it to execute the Python backup script at the specified time.

Daily Execution: At the scheduled time, cron triggers the backup_upload.py script inside the container. The Python script runs, generates a backup, and uploads it to S3, as defined in your script.

Log Output: The cron task’s output is logged to /var/log/cron.log inside the container, which can be reviewed for success or errors.
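
To confirm the schedule is in place and to watch it run, the job can be inspected from the host (db-backup is the service name from the compose file above):

# Check that the cron entry is registered inside the container
docker-compose exec db-backup crontab -l

# Follow the cron log to see backup runs and errors
docker-compose exec db-backup tail -f /var/log/cron.log

# Trigger a one-off backup without waiting for the schedule
docker-compose exec db-backup python /app/backup_upload.py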