mikeizbicki / cmc-csci143

big data course materials

Disk Quota, logging issues #566

Closed ains-arch closed 6 months ago

ains-arch commented 6 months ago

I'm trying to add millions of rows to my prod database. I'm generating my own dataset rather than using the twitter dataset, so I think I may be doing a bad job of managing how the data is stored. I have taken down the containers, removed the volumes, and restarted my process, but I've run into -bash: cannot create temp file for here-document: Disk quota exceeded problems twice.

I have run the du -hd1 command in my home directory, and the issue seems to come from the size of the .local/share/docker directory, specifically the containers and overlay2 subdirectories.

$ du -hd1 .local/share/docker | sort -rh
du: cannot read directory '.local/share/docker/overlay2/04c129b226f15c6105df277e89d13d431fa12df1df948df36e40d2d479b47bc1/work/work': Permission denied
...
11G     .local/share/docker
5.4G    .local/share/docker/containers
5.0G    .local/share/docker/overlay2
20M     .local/share/docker/image
...
120K    .local/share/docker/volumes
...
4.0K    .local/share/docker/runtimes

Here is my prod compose file:

$ cat docker-compose.prod.yml
version: '3.8'

services:
  web:
    build:
      context: ./services/web
      dockerfile: Dockerfile.prod
    command: gunicorn --bind 0.0.0.0:5000 manage:app
    volumes:
      - static_volume:/home/app/web/project/static
      - media_volume:/home/app/web/project/media
    expose:
      - 5000
    env_file:
      - ./.env.prod
    depends_on:
      - db
  db:
    build:
      context: ./services/postgres
      dockerfile: Dockerfile.prod
    ports:
      - 1467:5432
    volumes:
      - $HOME/bigdata/postgres_data_prod:/home/app/postgres/data
    env_file:
      - ./.env.prod.db
  nginx:
    build: ./services/nginx
    volumes:
      - static_volume:/home/app/web/project/static
      - media_volume:/home/app/web/project/media
    ports:
      - 1447:80
    depends_on:
      - web

volumes:
  postgres_data_prod:
  static_volume:
  media_volume:

It seems like I'm currently mounting the database to the bigdata folder, which doesn't count against my disk quota. The other volumes should be very small.

I think the problem may be the size of the logs. My random data generation and insertion code is very... bad: the way I'm handling the constraints of the database is to attempt each random insert and, if it violates a unique or foreign key constraint, roll it back. But when I look at

$ docker-compose -f docker-compose.prod.yml logs db

it contains every single failed insertion attempt and is, therefore, huge. I tried to find where the logs are stored from inside the docker container's command line by doing

$ docker exec -it 3bd3694c1bc5 /bin/bash

on the container with the prod database, but there doesn't seem to be anything in /var/log/postgresql, and I don't know where else the logs would be.

Would appreciate any tips on how to handle clearing the logs, or disk usage in general, or if I just need to rewrite my data insertion script so that it doesn't constantly throw and ignore errors.
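If rewriting the insertion script is the answer, one idea I've seen is to have Postgres skip conflicting rows instead of raising an error, with INSERT ... ON CONFLICT DO NOTHING, so nothing gets logged as a failure in the first place. A rough sketch of what I mean (the table and column names here are made up, not my real schema, and this only covers unique violations; foreign key violations would still raise and need handling):

```python
# Hypothetical sketch: build an INSERT that silently skips rows violating a
# unique constraint, instead of failing and spamming the Postgres log.
# Table/column names are invented for illustration.

def build_insert(table, columns):
    """Return an INSERT statement with ON CONFLICT DO NOTHING.

    Note: ON CONFLICT only covers unique/exclusion constraints;
    foreign-key violations would still raise an error.
    """
    cols = ", ".join(columns)
    placeholders = ", ".join(["%s"] * len(columns))
    return (
        f"INSERT INTO {table} ({cols}) "
        f"VALUES ({placeholders}) "
        f"ON CONFLICT DO NOTHING"
    )

sql = build_insert("users", ["id", "name"])
print(sql)
# With a driver like psycopg2 this would then be executed as, e.g.:
#   cur.execute(sql, (123, "alice"))
```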

cc: @westondcrewe https://github.com/mikeizbicki/cmc-csci143/issues/561#issuecomment-2094942254


update: I took down the containers and removed my volumes, since I wasn't close to 10 million rows anyway. That directly reclaimed 1.727GB and seems to have indirectly cleared the containers folder. There was still 5G in .local/share/docker/overlay2, so I also ran docker system prune, and I'm just crossing my fingers that deleting networks and images to reclaim another 2.05 GB didn't break anything.

I would still really like to know how to avoid getting to this point again, especially if I'm right that part of my problem is the log files.
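For reference, my current understanding (not something I've fully confirmed against the Docker docs) is that with the default json-file logging driver the logs aren't inside the container at all: the daemon writes them on the host under its data root, which for rootless Docker is ~/.local/share/docker. That would explain why the containers directory was 5.4G. A small sketch of the path layout:

```python
# Sketch of where the json-file logging driver stores a container's log on
# the host. For rootless Docker the data root is ~/.local/share/docker;
# for a system-wide daemon it would be /var/lib/docker instead.
from pathlib import Path

def container_log_path(container_id, data_root="~/.local/share/docker"):
    """Return the host-side path of a container's json-file log."""
    root = Path(data_root).expanduser()
    return root / "containers" / container_id / f"{container_id}-json.log"

# The directory name uses the full 64-character container ID, not the
# short ID that `docker ps` prints.
print(container_log_path("3bd3694c1bc5"))
```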

mikeizbicki commented 6 months ago

I was going to suggest all of the things you just included in your edit. So it seems to me like you're on the right track.

Log files shouldn't be consuming much disk space (if everything is working correctly). But if you are concerned about it, you can disable logging for a container by following these instructions: https://stackoverflow.com/questions/34590317/disable-logging-for-one-container-in-docker-compose.
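Following those instructions, the compose-level change would look something like the sketch below, applied to the db service from the compose file above. Setting the driver to none drops the logs entirely; alternatively, the commented-out json-file options cap their size via rotation instead:

```yaml
  db:
    # ... existing build/ports/volumes/env_file keys unchanged ...
    logging:
      driver: "none"        # discard logs entirely, or instead:
      # driver: "json-file"
      # options:
      #   max-size: "10m"   # rotate once a log file reaches 10 MB
      #   max-file: "3"     # keep at most 3 rotated files
```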