Docker daemon for ERDDAP hosted on AWS keeps crashing

ioos / erddap-gold-standard

Contains the 'gold standard' ERDDAP configuration, with datasets compliant with IOOS Metadata Profile 1.2

https://standards.sensors.ioos.us/erddap/index.html


Docker daemon for ERDDAP hosted on AWS keeps crashing #69

Open MathewBiddle opened 2 months ago

MathewBiddle commented 2 months ago

When running erddap-gold-standard on AWS, the Docker daemon behind the deployment crashes every few weeks. Once it is down, docker-compose can no longer reach it:

$ docker-compose restart
ERROR: Couldn't connect to Docker daemon at http+docker://localhost - is it running?
If it's at a non-standard location, specify the URL with the DOCKER_HOST environment variable.

It's a simple fix to get it up and running again using:

$ sudo systemctl start docker
$ docker-compose restart

I'm curious whether other folks have experienced this before with an ERDDAP deployed using Docker on AWS?

I've discussed this with @patrick-tripp, and the current workaround would be to set up a cron job that checks the URL and restarts Docker if the check fails.
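A minimal sketch of that cron workaround, reusing the two recovery commands shown above. The URL default, the compose directory, the 30-second timeout, and the `--run` switch are assumptions for illustration, not settings confirmed in this issue. It also assumes the script runs from root's crontab, so `systemctl` needs no `sudo`.

```shell
#!/bin/sh
# Watchdog sketch: restart Docker when the ERDDAP URL stops responding.
# ERDDAP_URL and COMPOSE_DIR are assumptions; override them as needed.
ERDDAP_URL="${ERDDAP_URL:-https://standards.sensors.ioos.us/erddap/index.html}"
COMPOSE_DIR="${COMPOSE_DIR:-/usr/local/erddap-gold-standard}"

# Succeeds only if the URL answers without an HTTP error within 30 seconds.
url_is_up() {
    curl --silent --fail --max-time 30 --output /dev/null "$1"
}

# Start the daemon, then restart the compose services.
restart_erddap() {
    systemctl start docker
    (cd "$COMPOSE_DIR" && docker-compose restart)
}

# Do nothing while the site is healthy; otherwise log and recover.
watchdog() {
    if url_is_up "$ERDDAP_URL"; then
        return 0
    fi
    echo "$(date -u +%Y-%m-%dT%H:%M:%SZ) ERDDAP unreachable, restarting docker" >&2
    restart_erddap
}

# Act only when called with --run, so the functions can be sourced safely.
if [ "${1:-}" = "--run" ]; then
    watchdog
fi
```

A crontab entry such as `*/5 * * * * /usr/local/bin/erddap-watchdog.sh --run` (hypothetical path and interval) would run the check every five minutes. This only shortens the outage; it does not explain why the daemon exits.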

cc: @mwengren, @ocefpaf, @patrick-tripp.

MathewBiddle commented 2 months ago

Maybe live restore?

https://docs.docker.com/config/containers/live-restore/

MathewBiddle commented 2 months ago

Okay, testing live-restore:

$ more /etc/docker/daemon.json
{
        "live-restore": true
}

$ sudo systemctl start docker
$ docker ps
CONTAINER ID   IMAGE                                    COMMAND                  CREATED      STATUS         PORTS                                                                            NAMES
ec3b94b319fe   axiom/docker-erddap:2.23-jdk17-openjdk   "/entrypoint.sh cata…"   7 days ago   Up 5 seconds   0.0.0.0:80->8080/tcp, :::80->8080/tcp, 0.0.0.0:443->8443/tcp, :::443->8443/tcp   erddap_gold_standard

I will check back in a few weeks to see if this fixes the issue. Luckily we have plenty of checks hitting this server, so we will know quickly when it breaks.

MathewBiddle commented 2 months ago

To confirm the change was accepted:

$ docker info | grep Live
 Live Restore Enabled: true
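As I read the live-restore documentation linked above, on a systemd host the setting can also be applied to a daemon that is still running by reloading it rather than restarting it. The sketch below checks the config file first and only then reloads. The function names are hypothetical, and the `grep` pattern is a rough text match, not a JSON parser.

```shell
# Succeeds if the file sets "live-restore": true.
# Defaults to the path shown above; pass another file as $1 to check a copy.
live_restore_enabled() {
    grep -Eq '"live-restore"[[:space:]]*:[[:space:]]*true' "${1:-/etc/docker/daemon.json}"
}

# Reload the daemon only if the setting is present.
# A reload re-reads daemon.json and leaves running containers alone.
apply_live_restore() {
    if live_restore_enabled "$@"; then
        sudo systemctl reload docker
    else
        echo "live-restore is not set to true" >&2
        return 1
    fi
}
```

After `apply_live_restore`, the `docker info | grep Live` check above should still report `true`.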

srstsavage commented 2 months ago

Do you have access to the docker daemon logs? Also what are the docker and kernel versions?

MathewBiddle commented 2 months ago

> Do you have access to the docker daemon logs?

I have access to /var/log, which has a few messages files. I think those are the logs as documented here.
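One way to narrow down those messages files is to filter them for lines that commonly accompany a daemon exit. This is a rough sketch: the function name and the search pattern are guesses. An out-of-memory kill is only one hypothesis, and nothing in this thread confirms it.

```shell
# Print log lines that may explain a daemon exit.
# The pattern is an assumption: kernel OOM-killer messages, systemd lines
# about docker.service, and dockerd lines mentioning a panic or fatal error.
scan_for_daemon_clues() {
    grep -Ei 'oom-killer|out of memory|docker\.service|dockerd.*(panic|fatal)' "$@"
}

# Example, reading the current and rotated files (may need root):
#   scan_for_daemon_clues /var/log/messages*
# On a systemd host the daemon's own journal may also be available:
#   sudo journalctl -u docker.service --since "2024-04-01" --no-pager
```

Comparing the timestamps of any hits with the time the site went down would show whether they are related.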

> Also what are the docker and kernel versions?

$ docker --version
Docker version 20.10.25, build b82b9f3
$ uname -sr
Linux 5.10.210-201.852.amzn2.x86_64

MathewBiddle commented 2 months ago

Live Restore seems to be working. From status:

Current time is 2024-05-06T15:44:10+00:00
Startup was at  2024-04-17T13:24:27+00:00

I'll keep this open until 2 months have passed without the daemon crashing.

MathewBiddle commented 3 weeks ago

Boo... looks like it crashed again.

$ docker ps
Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?

Restarted with:

/usr/local/erddap-gold-standard$ sudo systemctl start docker
/usr/local/erddap-gold-standard$ docker-compose restart
Restarting erddap_gold_standard ... done
/usr/local/erddap-gold-standard$ docker info | grep Live
 Live Restore Enabled: true

ocefpaf commented 3 weeks ago

> Boo... looks like it crashed again.

Same frequency as before, sooner, or later? We need to inspect the logs here to see if we can understand what is going on.

MathewBiddle commented 3 weeks ago

much later - almost 3 months vs a few weeks. I looked at the logs and they are gobbledygook to me 😵

ocefpaf commented 3 weeks ago

> much later - almost 3 months vs a few weeks. I looked at the logs and they are gobbledygook to me 😵

Well, maybe that is a (small) win. I have never looked into ERDDAP logs, so we should probably ask the experts (Ben, Chris, Shane) for help here.