microsimulation / ijm

A central place for general issues, documents, scripts and resources for the IJM
https://microsimulation.org/ijm/
MIT License

Website slow, actions failing, seeing a lot of 500s in logs #202

Closed erkannt closed 9 months ago

erkannt commented 10 months ago

Refs: #200

erkannt commented 10 months ago

The error message from the failed main.yml GH Action run indicates that our EC2 node has run out of disk space. As I can't access the node via SSH, I will attempt to recreate the node using the ijm-infra repo. Previously this caused the IP to change, which we then needed to specify via GH Action secrets so that the deploy pipeline of this repo could succeed (see #160).

erkannt commented 10 months ago

The culprit of the space issue is the overlay2 folder of docker (/var/lib/docker/overlay2).

Investigating root cause:

Identify which folders are consuming space in overlay2:

du -s /var/lib/docker/overlay2/*/diff |sort -n -r   # identify critical folder(s)
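A minimal sketch of the same size check wrapped in a helper, assuming a POSIX shell. The function name `largest` is hypothetical. It uses `-k` so all sizes are in KiB and the numeric sort compares like with like, and it caps the output at the N biggest entries:

```shell
# Hypothetical helper: print the N largest of the given paths, biggest first.
# Usage: largest N PATH...
largest() {
  n=$1
  shift
  # -s summarises each path, -k reports sizes in KiB.
  du -sk "$@" 2>/dev/null | sort -n -r | head -n "$n"
}
```

For the case above this could be called as `largest 5 /var/lib/docker/overlay2/*/diff`.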

Find the corresponding docker container:

docker inspect $(docker container ls -q) | grep PART-OF-OFFENDING-FOLDER-NAME -B 300 -A 300 | less  

Find out which folders were added or changed since container creation:

docker diff ijm-prod_journal_1 | grep '^A\|^C' | cut -f 2 -d " " | sort
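The `\|` alternation in the grep pattern is a GNU extension, and splitting on a space truncates any path that contains one. A sketch of a more portable filter, assuming `docker diff` prints one change per line as a type letter (A, C or D), a space, then the path. The function name `changed_paths` is hypothetical:

```shell
# Hypothetical filter: read `docker diff` output on stdin and print the
# added (A) and changed (C) paths, sorted. Deleted (D) entries are dropped.
changed_paths() {
  # Keep lines starting with "A " or "C ", then strip the two-character prefix
  # so that paths containing spaces survive intact.
  grep -E '^[AC] ' | cut -c 3- | sort
}
```

Usage would be `docker diff ijm-prod_journal_1 | changed_paths`.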
erkannt commented 10 months ago

Looking at the output of docker exec -it ijm-prod_journal_1 sh -c 'du -sh /app/var/*', we should probably keep an eye on the size of the cache. Given that I found a 2GB log file on the server that looks like it was created by the application, the logs folder also needs watching.

109M    /app/var/cache
576K    /app/var/logs
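One way to keep an eye on these folders is a small filter that flags anything above a size limit. This is only a sketch: the function name `over_limit` and the limit value are assumptions, and it expects `du -sk` style input (size in KiB, a tab, then the path) on stdin:

```shell
# Hypothetical check: read "SIZE_KB<TAB>PATH" lines on stdin and warn about
# every path larger than LIMIT_KB. Returns 1 if any path is over the limit.
# Usage: du -sk PATH... | over_limit LIMIT_KB
over_limit() {
  awk -v limit="$1" -F '\t' '
    $1 + 0 > limit + 0 { print "WARNING: " $2 " uses " $1 "K"; found = 1 }
    END { exit found + 0 }
  '
}
```

It could be fed from the container, for example `docker exec ijm-prod_journal_1 du -sk /app/var/cache /app/var/logs | over_limit 512000`, where 512000 (roughly 500M) is an arbitrary example threshold.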
erkannt commented 10 months ago

After a deploy the fresh container's var dir is a decent chunk smaller:

25M     /app/var/cache
36K     /app/var/logs
erkannt commented 10 months ago

Closing as we have mitigated the issue and now have ways to resolve this more quickly in the future:

  1. Access ec2 node via SSH using credentials stored in Sciety 1password
  2. Run docker system prune -a, or recreate the journal container using deploy.sh in /home/ec2-user/ijm-prod

Currently the node disk is at 50% with 8.6G free space. Assets are currently 2.5G, the containers and images in a clean state seem to consume 3.7G.
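To re-check the disk figure later without reading `df` output by hand, something like the following could be used. It is a sketch assuming a POSIX `df`, and the function name `disk_use_percent` is hypothetical:

```shell
# Hypothetical helper: print the used-space percentage (without the % sign)
# of the filesystem holding the given path. Defaults to the root filesystem.
disk_use_percent() {
  # df -P prints a header line, then one line per filesystem;
  # column 5 is the capacity, e.g. "50%".
  df -P "${1:-/}" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}
```

For example, `[ "$(disk_use_percent /)" -gt 80 ] && echo "disk is filling up"` would print a warning once usage passes 80%, a threshold chosen here purely for illustration.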

/cc @BlueReZZ @pbronka

pbronka commented 10 months ago

Thank you @erkannt and @BlueReZZ for looking into this and fixing the problem. We're all very happy that the website works well again.

pbronka commented 10 months ago

Hi @erkannt, I'm going to re-open this issue because I tried making a very small change (fixing a typo in an article) and the CI tests failed. Do you have any thoughts on what might be going on here: https://github.com/microsimulation/ijm/actions/runs/7151707128/job/19476285635 ?

erkannt commented 9 months ago

The pipeline is failing due to an unrelated issue: there is a failing feature test. I have created a new ticket (#204) rather than polluting this one.