marinebon / mbon-dashboard-server

server software for MBON early alert dashboard using Docker

fgbnms influxdb unexpected outage #33

Closed 7yl4r closed 2 years ago

7yl4r commented 3 years ago

The Grafana dashboard is showing an error:

Templating init failed
Network Error: Bad Gateway(502)

Looks like the influx docker container is down:

[tylarmurray@fgb-dashboard ~]$ docker container ls --all
CONTAINER ID   IMAGE                                      COMMAND                  CREATED       STATUS                     PORTS                                                 NAMES
9e8ad2767b8f   mbon-dashboard-server_airflow-webserver    "/usr/bin/dumb-init …"   2 weeks ago   Up 2 weeks (unhealthy)     0.0.0.0:8888->8080/tcp, :::8888->8080/tcp             mbon-dashboard-server_airflow-webserver_1
df786467c4ae   mbon-dashboard-server_airflow-scheduler    "/usr/bin/dumb-init …"   2 weeks ago   Up 2 weeks                 8080/tcp                                              mbon-dashboard-server_airflow-scheduler_1
38e031bc8865   mbon-dashboard-server_airflow-worker       "/usr/bin/dumb-init …"   2 weeks ago   Up 2 weeks                 8080/tcp                                              mbon-dashboard-server_airflow-worker_1
4c3ffe371115   mbon-dashboard-server_flower               "/usr/bin/dumb-init …"   2 weeks ago   Up 2 weeks (unhealthy)     0.0.0.0:5555->5555/tcp, :::5555->5555/tcp, 8080/tcp   mbon-dashboard-server_flower_1
1e56a3f6090a   mbon-dashboard-server_airflow-init         "/usr/bin/dumb-init …"   2 weeks ago   Exited (0) 2 weeks ago                                                           mbon-dashboard-server_airflow-init_1
1698f54f8302   4900d7864343                               "/bin/bash -o pipefa…"   2 weeks ago   Exited (100) 2 weeks ago                                                         gracious_fermat
8ba916ad97e8   4900d7864343                               "/bin/bash -o pipefa…"   2 weeks ago   Exited (1) 2 weeks ago                                                           goofy_pare
3261d574b126   4900d7864343                               "/bin/bash -o pipefa…"   2 weeks ago   Exited (100) 2 weeks ago                                                         affectionate_kare
204dc15d2393   influxdb:1.8                               "/entrypoint.sh infl…"   4 weeks ago   Exited (137) 4 days ago                                                          influxdb
ee0190582fb5   mbon-dashboard-server_erddap               "/entrypoint.sh cata…"   4 weeks ago   Exited (143) 4 weeks ago                                                         erddap
e6748b243614   grafana/grafana:6.7.3                      "/run.sh"                4 weeks ago   Up 2 weeks                 0.0.0.0:3000->3000/tcp, :::3000->3000/tcp             grafana
acd49f004cd4   mbon-dashboard-server_nginx                "/docker-entrypoint.…"   4 weeks ago   Up 2 weeks                 0.0.0.0:80->80/tcp, :::80->80/tcp                     nginx
25d78a967daf   mbon-dashboard-server_mbon_data_uploader   "waitress-serve --po…"   4 weeks ago   Up 2 weeks                 0.0.0.0:5000->5000/tcp, :::5000->5000/tcp             mbon_data_uploader
dea30dbe308a   postgres:13                                "docker-entrypoint.s…"   4 weeks ago   Up 2 weeks (healthy)       5432/tcp                                              mbon-dashboard-server_postgres_1
6f3ced987248   redis:latest                               "docker-entrypoint.s…"   4 weeks ago   Up 2 weeks (healthy)       0.0.0.0:6379->6379/tcp, :::6379->6379/tcp             mbon-dashboard-server_redis_1

Looks like it went down Aug 18 around 20:00

image

image

docker logs influxdb is crammed full of repeating info messages that aren't helpful. I tried getting a more specific log, but even running docker logs --since 2021-08-18T19:59 --until 2021-08-18T20:01 influxdb returns an unwieldy giant string that doesn't pipe easily to a file.
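
One possible reason the output does not pipe cleanly: docker logs replays the container's stderr stream on its own stderr, so a plain > or | only sees stdout. A minimal sketch, assuming InfluxDB writes its log lines to stderr (not confirmed in this thread); save_container_logs is a hypothetical helper name, not something from this repo:

```shell
# Hypothetical helper: save a time window of a container's logs to a file.
# "2>&1" merges stderr into stdout so both streams end up in the file.
save_container_logs() {
    container="$1"
    since_ts="$2"
    until_ts="$3"
    outfile="$4"
    docker logs --timestamps --since "$since_ts" --until "$until_ts" "$container" > "$outfile" 2>&1
}

# Example (docker treats timestamps without a Z or offset as client-local time):
#   save_container_logs influxdb 2021-08-18T19:59:00 2021-08-18T20:01:00 influxdb-outage.log
#   less influxdb-outage.log
```

With the window saved to a file, less or grep can be used on it instead of scrolling the terminal.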

7yl4r commented 3 years ago

Brought it back up with a docker-compose up --build -d. Things look good for now.

Leaving this open to see if it happens again. If it happens again the logs may be more useful now that I have configured them.

7yl4r commented 3 years ago

Down again. Logs are empty. 😞

[tylarmurray@fgb-dashboard ~]$ docker container ls --all
CONTAINER ID   IMAGE                                      COMMAND                  CREATED       STATUS                     PORTS                                                 NAMES
d443e4c6060f   influxdb:1.8                               "/entrypoint.sh infl…"   3 days ago    Exited (137) 2 days ago                                                          influxdb
9e8ad2767b8f   mbon-dashboard-server_airflow-webserver    "/usr/bin/dumb-init …"   3 weeks ago   Up 3 weeks (unhealthy)     0.0.0.0:8888->8080/tcp, :::8888->8080/tcp             mbon-dashboard-server_airflow-webserver_1
df786467c4ae   mbon-dashboard-server_airflow-scheduler    "/usr/bin/dumb-init …"   3 weeks ago   Up 3 weeks                 8080/tcp                                              mbon-dashboard-server_airflow-scheduler_1
38e031bc8865   mbon-dashboard-server_airflow-worker       "/usr/bin/dumb-init …"   3 weeks ago   Up 3 weeks                 8080/tcp                                              mbon-dashboard-server_airflow-worker_1
4c3ffe371115   mbon-dashboard-server_flower               "/usr/bin/dumb-init …"   3 weeks ago   Up 3 weeks (unhealthy)     0.0.0.0:5555->5555/tcp, :::5555->5555/tcp, 8080/tcp   mbon-dashboard-server_flower_1
1e56a3f6090a   mbon-dashboard-server_airflow-init         "/usr/bin/dumb-init …"   3 weeks ago   Exited (0) 3 days ago                                                            mbon-dashboard-server_airflow-init_1
1698f54f8302   4900d7864343                               "/bin/bash -o pipefa…"   3 weeks ago   Exited (100) 3 weeks ago                                                         gracious_fermat
8ba916ad97e8   4900d7864343                               "/bin/bash -o pipefa…"   3 weeks ago   Exited (1) 3 weeks ago                                                           goofy_pare
3261d574b126   4900d7864343                               "/bin/bash -o pipefa…"   3 weeks ago   Exited (100) 3 weeks ago                                                         affectionate_kare
ee0190582fb5   mbon-dashboard-server_erddap               "/entrypoint.sh cata…"   5 weeks ago   Exited (143) 4 weeks ago                                                         erddap
e6748b243614   grafana/grafana:6.7.3                      "/run.sh"                5 weeks ago   Up 3 weeks                 0.0.0.0:3000->3000/tcp, :::3000->3000/tcp             grafana
acd49f004cd4   mbon-dashboard-server_nginx                "/docker-entrypoint.…"   5 weeks ago   Up 3 weeks                 0.0.0.0:80->80/tcp, :::80->80/tcp                     nginx
25d78a967daf   mbon-dashboard-server_mbon_data_uploader   "waitress-serve --po…"   5 weeks ago   Up 3 weeks                 0.0.0.0:5000->5000/tcp, :::5000->5000/tcp             mbon_data_uploader
dea30dbe308a   postgres:13                                "docker-entrypoint.s…"   5 weeks ago   Up 3 weeks (healthy)       5432/tcp                                              mbon-dashboard-server_postgres_1
6f3ced987248   redis:latest                               "docker-entrypoint.s…"   5 weeks ago   Up 3 weeks (healthy)       0.0.0.0:6379->6379/tcp, :::6379->6379/tcp             mbon-dashboard-server_redis_1
[tylarmurray@fgb-dashboard ~]$ docker logs influxdb
[tylarmurray@fgb-dashboard ~]$ 

So... not much to use to figure out what is going on here. The only clue I see is that this outage again was around 8PM.

7yl4r commented 3 years ago

Went out at 8PM again. Worth noting here that these times are 8PM central time, which is 1AM UTC.

image

Nothing in docker logs influxdb. Let's try clearing the container and starting from scratch.

[tylarmurray@fgb-dashboard mbon-dashboard-server]$ docker container prune
WARNING! This will remove all stopped containers.
Are you sure you want to continue? [y/N] y
Deleted Containers:
d443e4c6060f5925dc329ee9bc9efc4b549497d487714689630b61459da7144a
1e56a3f6090a2fb80b37440e9f8161c52d35d9c48a7edcf8b199f7f12c072cc9
1698f54f83021efc40b724a43ff4d59aeff0cfa046418608d07752d913b04870
8ba916ad97e82bbdaa28ec140999de3e65a9a932fdaadbdd8104044546a48071
3261d574b126c61e760ed4a11e0a434a37c66821cc10fd42a92ffb4fb2e5545f
ee0190582fb54a64df08ede942f0f96cd2603d48a87933c6c8a6dcd80e84ec51
Total reclaimed space: 995.6MB

[tylarmurray@fgb-dashboard mbon-dashboard-server]$ docker image prune
WARNING! This will remove all dangling images.
Are you sure you want to continue? [y/N] y
Deleted Images:
deleted: sha256:9962265bcc2db0154a46f142923e2791ea5c613e999f78fc9109a2dcfbabc764
deleted: sha256:f726e7bcf9a92c0616b5d655cc034538ceb81250cd5330cc28458d696c01b2d6
deleted: sha256:1d8898270a37dc3615831966bf1b0726a93e5ac27333bc7d59ebaf12b58dc81a
deleted: sha256:39c73224c98dd516566669e490567ca5fd80b9c26fdb15785ffd85aa9e947ae8
deleted: sha256:9ad3c011fa481055272cad564e751fb818b2d05dd5bf216bf94b81f4a03d0d46
deleted: sha256:fc9ee1ea3b1fa366124c21555ba21f4b99ef5cca612e2c7e4013fb5a9be5ebc7
deleted: sha256:12c6e1d79bec8067eb72f9673c7416d70279406b6c25d4561bf70082b457dc03
deleted: sha256:945f7cdaee1ad4a205da875954f93a2e3140c5839fdcae8b41358bc092a53e07
deleted: sha256:897f44374e8cc4623f1d91b49a3a585760531533a9557a3192d1d0cd00dd96e4
deleted: sha256:6dd06a9806ab547d20ffc22ce70c44d77cee857a44898edd461b5d9e866a0471
deleted: sha256:eab922a70d49b8a104b0f382cf94514baf5551afe2839f93a10038a10e8765c8
deleted: sha256:2b35a01ef07a221396528c5074c5f8b905b6c7b81d4e651d303a298def209d06
deleted: sha256:3dedbbef41224ff676a267f603dddfd9401854ade86cff7c07283b13537f2bd8
deleted: sha256:711d28c17ba89ad1f959807199872bce4ad1d98c7d927ceb2a5a43dbd4f707d7
deleted: sha256:3e31e04befd5c3dc131542c368997c3a48bc8ad0f3d48c631b25117816b57e5d
deleted: sha256:aace65a70a957503ac3bb471b4a0ce7982cc886d37c0dff8500b1d69defe093b
deleted: sha256:9b2ab841cd0a6683bd80d845f8b8dd03ea796d5b0ae54b8dcc26eed5aeea3bdd
deleted: sha256:359d2c703ffffe17c8ef56319ec418b8051e98843c5ef01811f3ecc2ab5e8b57
Total reclaimed space: 295.1MB
[tylarmurray@fgb-dashboard mbon-dashboard-server]$ docker volume prune
WARNING! This will remove all local volumes not used by at least one container.
Are you sure you want to continue? [y/N] y
Total reclaimed space: 0B

Building nginx
Sending build context to Docker daemon  2.048kB

Step 1/2 : FROM nginx:stable
 ---> c2c45d506085
Step 2/2 : RUN apt-get update; apt-get --assume-yes install git
 ---> Using cache
 ---> 7858550fe189
Successfully built 7858550fe189
Successfully tagged mbon-dashboard-server_nginx:latest
Building mbon_data_uploader
Sending build context to Docker daemon  52.74kB

Step 1/12 : FROM python:3.8
 ---> 02583ab5c95e
Step 2/12 : COPY . /opt/mbon_data_uploader
 ---> Using cache
 ---> c6745937de3e
Step 3/12 : WORKDIR /opt/mbon_data_uploader
 ---> Using cache
 ---> 8a810111b5e5
Step 4/12 : RUN pip install -r requirements.txt
 ---> Using cache
 ---> cd8c6ee141b2
Step 5/12 : WORKDIR /opt/go-ipfs
 ---> Using cache
 ---> 3f5f09fb6962
Step 6/12 : RUN wget https://dist.ipfs.io/go-ipfs/v0.6.0/go-ipfs_v0.6.0_linux-amd64.tar.gz
 ---> Using cache
 ---> 4ca60f194b77
Step 7/12 : RUN tar xvfz go-ipfs_v0.6.0_linux-amd64.tar.gz
 ---> Using cache
 ---> 145c402ba3e4
Step 8/12 : WORKDIR /opt/go-ipfs/go-ipfs
 ---> Using cache
 ---> 1c236e081b1f
Step 9/12 : RUN ./install.sh
 ---> Using cache
 ---> aa3e758ec3e8
Step 10/12 : RUN ipfs init
 ---> Using cache
 ---> 9aabcc1a4dec
Step 11/12 : WORKDIR /opt/mbon_data_uploader
 ---> Using cache
 ---> f76dd3045f08
Step 12/12 : ENTRYPOINT ["waitress-serve", "--port=5000", "--call", "mbon_data_uploader:create_app"]
 ---> Using cache
 ---> a52207802f51
Successfully built a52207802f51
Successfully tagged mbon-dashboard-server_mbon_data_uploader:latest
Building airflow-webserver
Sending build context to Docker daemon  2.365GB

Step 1/3 : FROM apache/airflow:2.1.2
 ---> 4900d7864343
Step 2/3 : USER root
 ---> Using cache
 ---> fea475cf2c9f
Step 3/3 : RUN apt-get update &&     apt-get install --yes --no-install-recommends wget build-essential &&     wget https://curl.se/download/curl-7.78.0.tar.gz &&     tar -xvf curl-7.78.0.tar.gz && cd curl-7.78.0 &&     ./configure --with-gnutls && make && make install
 ---> Using cache
 ---> 63ec63a3dc93
Successfully built 63ec63a3dc93
Successfully tagged mbon-dashboard-server_airflow-webserver:latest
Building airflow-scheduler
Sending build context to Docker daemon  2.365GB

Step 1/3 : FROM apache/airflow:2.1.2
 ---> 4900d7864343
Step 2/3 : USER root
 ---> Using cache
 ---> fea475cf2c9f
Step 3/3 : RUN apt-get update &&     apt-get install --yes --no-install-recommends wget build-essential &&     wget https://curl.se/download/curl-7.78.0.tar.gz &&     tar -xvf curl-7.78.0.tar.gz && cd curl-7.78.0 &&     ./configure --with-gnutls && make && make install
 ---> Using cache
 ---> 63ec63a3dc93
Successfully built 63ec63a3dc93
Successfully tagged mbon-dashboard-server_airflow-scheduler:latest
Building airflow-worker
Sending build context to Docker daemon  2.365GB

Step 1/3 : FROM apache/airflow:2.1.2
 ---> 4900d7864343
Step 2/3 : USER root
 ---> Using cache
 ---> fea475cf2c9f
Step 3/3 : RUN apt-get update &&     apt-get install --yes --no-install-recommends wget build-essential &&     wget https://curl.se/download/curl-7.78.0.tar.gz &&     tar -xvf curl-7.78.0.tar.gz && cd curl-7.78.0 &&     ./configure --with-gnutls && make && make install
 ---> Using cache
 ---> 63ec63a3dc93
Successfully built 63ec63a3dc93
Successfully tagged mbon-dashboard-server_airflow-worker:latest
Building airflow-init
ERRO[0001] Can't add file /home/tylarmurray/mbon-dashboard-server/airflow/logs/dag_processor_manager/dag_processor_manager.log to tar: archive/tar: write too long 
Sending build context to Docker daemon  2.365GB
Step 1/3 : FROM apache/airflow:2.1.2
 ---> 4900d7864343
Step 2/3 : USER root
 ---> Using cache
 ---> fea475cf2c9f
Step 3/3 : RUN apt-get update &&     apt-get install --yes --no-install-recommends wget build-essential &&     wget https://curl.se/download/curl-7.78.0.tar.gz &&     tar -xvf curl-7.78.0.tar.gz && cd curl-7.78.0 &&     ./configure --with-gnutls && make && make install
 ---> Using cache
 ---> 63ec63a3dc93
Successfully built 63ec63a3dc93
Successfully tagged mbon-dashboard-server_airflow-init:latest
Building flower
Sending build context to Docker daemon  2.365GB
Step 1/3 : FROM apache/airflow:2.1.2
 ---> 4900d7864343
Step 2/3 : USER root
 ---> Using cache
 ---> fea475cf2c9f
Step 3/3 : RUN apt-get update &&     apt-get install --yes --no-install-recommends wget build-essential &&     wget https://curl.se/download/curl-7.78.0.tar.gz &&     tar -xvf curl-7.78.0.tar.gz && cd curl-7.78.0 &&     ./configure --with-gnutls && make && make install
 ---> Using cache
 ---> 63ec63a3dc93
Successfully built 63ec63a3dc93
Successfully tagged mbon-dashboard-server_flower:latest
mbon-dashboard-server_redis_1 is up-to-date
mbon_data_uploader is up-to-date
mbon-dashboard-server_postgres_1 is up-to-date
grafana is up-to-date
nginx is up-to-date
Creating influxdb ... 
mbon-dashboard-server_airflow-worker_1 is up-to-date
mbon-dashboard-server_airflow-scheduler_1 is up-to-date
Creating mbon-dashboard-server_airflow-init_1 ... 
Creating influxdb                             ... done
Creating mbon-dashboard-server_airflow-init_1 ... done

7yl4r commented 3 years ago

I suspect this may be because the FGB gcloud instance is "preemptible". When looking at the console I see the message: This instance is preemptible and will live at most 24 hours.

Curious that this reset each night at 00:00 UTC affects the influxdb container and not the others though.

This VM is preemptible because that saves us a lot of money. Details from google here.

7yl4r commented 3 years ago

I installed the following crontab to start things back up each night at 00:05 following the 00:00 outage:

[tylarmurray@fgb-dashboard ~]$ crontab -l
05 00 * * * cd /home/tylarmurray/mbon-dashboard-server/ && docker-compose up --build -d

7yl4r commented 3 years ago

Seems like the attempted fix did not work. Things are down again this morning.

Looking at /var/log/cron, the job was successfully triggered:

[tylarmurray@fgb-dashboard ~]$ sudo less /var/log/cron
[...]
Sep 10 00:01:01 fgb-dashboard anacron[2629141]: Normal exit (0 jobs run)
Sep 10 00:05:01 fgb-dashboard CROND[2632778]: (tylarmurray) CMD (cd /home/tylarmurray/mbon-dashboard-server/ && docker-compose up --build -d)
Sep 10 01:01:01 fgb-dashboard CROND[2661595]: (root) CMD (run-parts /etc/cron.hourly)

Looking at docker, I think the container exited again shortly after I fixed it yesterday and was not started by the cron job after that.

[tylarmurray@fgb-dashboard ~]$ sudo docker container ls --all
CONTAINER ID   IMAGE                                      COMMAND                  CREATED       STATUS                      PORTS                                                 NAMES
[...]
2c9ead08b05e   influxdb:1.8                               "/entrypoint.sh infl…"   10 days ago   Exited (137) 23 hours ago                                                         influxdb
[...]

[tylarmurray@fgb-dashboard ~]$ sudo docker container inspect influxdb
[
    {
[...]
        "State": {
            "Status": "exited",
[...]
            "ExitCode": 137,
            "Error": "",
            "StartedAt": "2021-09-09T17:17:15.606768941Z",
            "FinishedAt": "2021-09-09T17:22:02.31855507Z"
        },
[...]

docker logs influxdb is empty.

7yl4r commented 3 years ago

I brought it back up just now and modified the crontab to keep a log:

[tylarmurray@fgb-dashboard mbon-dashboard-server]$ crontab -l
05 00 * * * cd /home/tylarmurray/mbon-dashboard-server/ && docker-compose up --build -d > /home/tylarmurray/cronjob-server-restart.log

7yl4r commented 3 years ago

Down again. Nothing in the output file. Nothing in the docker logs.

[tylarmurray@fgb-dashboard ~]$ docker container ls --all
CONTAINER ID   IMAGE                                      COMMAND                  CREATED       STATUS                    PORTS                                                 NAMES
[...]
2c9ead08b05e   influxdb:1.8                               "/entrypoint.sh infl…"   2 weeks ago   Exited (137) 7 days ago                                                         influxdb
[...]
[tylarmurray@fgb-dashboard ~]$ cat cronjob-server-restart.log 
[tylarmurray@fgb-dashboard ~]$ docker logs influxdb

The exit status reported by docker for the container (137) corresponds to SIGKILL, which means the process is being terminated externally. I think this supports the theory that gcloud is killing this process, but I can't explain why the cron job isn't bringing it back up or why the logs are all empty.
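
The mapping from 137 to SIGKILL follows the usual convention that an exit status of 128 + N means the process was ended by signal N. A small sketch of that arithmetic; signal_from_exit_code is a made-up name for illustration. Note that 137 only says that something sent SIGKILL: the kernel out-of-memory killer is another possible sender, and docker records that case in the container's State.OOMKilled field, so it may be worth checking before blaming gcloud.

```shell
# Hypothetical helper: decode an exit status into the signal that ended
# the process, if any. Statuses 129..159 map to signals 1..31.
signal_from_exit_code() {
    code="$1"
    if [ "$code" -gt 128 ] 2>/dev/null && [ "$code" -le 159 ]; then
        # kill -l N prints the name of signal number N (e.g. 9 -> KILL)
        kill -l "$((code - 128))"
    else
        echo "none"
    fi
}

# Example against the stopped container:
#   signal_from_exit_code "$(docker inspect --format '{{.State.ExitCode}}' influxdb)"
#   docker inspect --format '{{.State.OOMKilled}}' influxdb
```

The same arithmetic explains the erddap container's status in the earlier listing: 143 is 128 + 15, i.e. SIGTERM.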

I am doing a test run of the crontab like this:

crontab -l | grep -v '^#' | cut -f 6- -d ' ' | while read CMD; do eval $CMD; done

That worked. After it ran cronjob-server-restart.log had all the log content you would expect to be there.

Why is the log empty when cron triggers? The cron is definitely running as it should:

[tylarmurray@fgb-dashboard ~]$ sudo grep docker /var/log/cron
Sep 13 00:05:01 fgb-dashboard CROND[666448]: (tylarmurray) CMD (cd /home/tylarmurray/mbon-dashboard-server/ && docker-compose up --build -d > /home/tylarmurray/cronjob-server-restart.log)
Sep 14 00:05:01 fgb-dashboard CROND[1410973]: (tylarmurray) CMD (cd /home/tylarmurray/mbon-dashboard-server/ && docker-compose up --build -d > /home/tylarmurray/cronjob-server-restart.log)
Sep 15 00:05:02 fgb-dashboard CROND[2154137]: (tylarmurray) CMD (cd /home/tylarmurray/mbon-dashboard-server/ && docker-compose up --build -d > /home/tylarmurray/cronjob-server-restart.log)
Sep 16 00:05:01 fgb-dashboard CROND[2897401]: (tylarmurray) CMD (cd /home/tylarmurray/mbon-dashboard-server/ && docker-compose up --build -d > /home/tylarmurray/cronjob-server-restart.log)
Sep 17 00:05:01 fgb-dashboard CROND[3642680]: (tylarmurray) CMD (cd /home/tylarmurray/mbon-dashboard-server/ && docker-compose up --build -d > /home/tylarmurray/cronjob-server-restart.log)

I am stumped. For right now I set the crontab to run every hour at **:15. I don't know how that would help but let's see what happens. :shrug:

7yl4r commented 2 years ago

Things are still down. I am modifying the crontab to redirect stderr & stdout with &> instead of > and have ensured that there is a newline after the line. Maybe that will give back something useful after the next run on the hour.

7yl4r commented 2 years ago

Ahaaaa...

[tylarmurray@fgb-dashboard ~]$ cat /home/tylarmurray/cronjob-server-restart.log 
/bin/sh: docker-compose: command not found

So I modified the crontab to have the full path.
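
The "command not found" result can be reproduced without waiting for the next cron run by executing the job under a stripped-down environment. A rough sketch, assuming cron's PATH is /usr/bin:/bin (a common default, but the actual value on this host was not checked here); run_like_cron is a hypothetical helper:

```shell
# Hypothetical helper: run a command line roughly the way cron would,
# with an emptied environment, a minimal PATH, and /bin/sh as the shell.
# Both stdout and stderr are captured in the log file.
run_like_cron() {
    command_line="$1"
    logfile="$2"
    env -i PATH=/usr/bin:/bin HOME="$HOME" /bin/sh -c "$command_line" > "$logfile" 2>&1
}

# Example:
#   run_like_cron 'cd /home/tylarmurray/mbon-dashboard-server/ && docker-compose up --build -d' /tmp/cron-test.log
#   cat /tmp/cron-test.log
#   command -v docker-compose   # prints the full path to put in the crontab
```

The sketch uses > file 2>&1 rather than &> because &> is a bash extension and is not guaranteed to redirect stderr under other /bin/sh implementations. This also explains why the manual test run of the crontab worked: it ran in the interactive shell, which had docker-compose on its PATH.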

7yl4r commented 2 years ago

Things are still down, but on closer inspection I think I actually did fix the previous issue. There is now a new issue that has to do with the amount of free space on the VM:

[tylarmurray@fgb-dashboard ~]$ docker logs influxdb
ts=2021-10-12T08:28:06.149931Z lvl=warn msg="Error compacting TSM files" log_id=0X8df5fG000 engine=tsm1 tsm1_level=1 tsm1_strategy=level trace_id=0X8ucOyW000 op_name=tsm1_compact_group error="write /var/lib/influxdb/data/_internal/monitor/5212/000001020-000000002.tsm.tmp: no space left on device"
ts=2021-10-12T08:29:07.191418Z lvl=warn msg="Error compacting TSM files" log_id=0X8df5fG000 engine=tsm1 tsm1_level=1 tsm1_strategy=level trace_id=0X8uhwPW000 op_name=tsm1_compact_group error="write /var/lib/influxdb/data/_internal/monitor/5212/000001020-000000002.tsm.tmp: no space left on device"
ts=2021-10-12T18:43:37.595258Z lvl=warn msg="Error compacting TSM files" log_id=0X90hX~l000 engine=tsm1 tsm1_level=1 tsm1_strategy=level trace_id=0X9STsiW000 op_name=tsm1_compact_group error="write /var/lib/influxdb/data/_internal/monitor/5212/000001196-000000002.tsm.tmp: no space left on device"

There is ~10GB unused on the disk however, so I don't see why that is happening:

[tylarmurray@fgb-dashboard ~]$ docker exec -it influxdb /bin/bash
root@2c9ead08b05e:/# df -h
Filesystem      Size  Used Avail Use% Mounted on
overlay          50G   41G  9.5G  82% /
tmpfs            64M     0   64M   0% /dev
tmpfs           3.8G     0  3.8G   0% /sys/fs/cgroup
shm              64M     0   64M   0% /dev/shm
/dev/sda2        50G   41G  9.5G  82% /docker-entrypoint-initdb.d
tmpfs           3.8G     0  3.8G   0% /proc/acpi
tmpfs           3.8G     0  3.8G   0% /proc/scsi
tmpfs           3.8G     0  3.8G   0% /sys/firmware

One of these directories appears anomalously big though:

root@2c9ead08b05e:/# du -sh /var/lib/influxdb/data/_internal/monitor/*
98M     /var/lib/influxdb/data/_internal/monitor/5211
23G     /var/lib/influxdb/data/_internal/monitor/5212
125M    /var/lib/influxdb/data/_internal/monitor/5213

That directory is full of ~90M .tsm files.
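
To track which shard directory is growing between outages, the du check can be sorted so the largest entry comes first. A minimal sketch; largest_subdirs is a made-up helper name. Since "no space left on device" can also be reported when a filesystem runs out of inodes rather than blocks, running df -i next to df -h may be worth a try. That is a guess, not something confirmed in this thread.

```shell
# Hypothetical helper: list the N largest immediate subdirectories of a
# directory (sizes in KiB), biggest first. N defaults to 5.
largest_subdirs() {
    parent="$1"
    count="${2:-5}"
    # The trailing slash in the glob restricts matches to directories.
    du -sk "$parent"/*/ 2>/dev/null | sort -rn | head -n "$count"
}

# Example, run inside the influxdb container:
#   largest_subdirs /var/lib/influxdb/data/_internal/monitor 3
#   df -i /var/lib/influxdb
```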

I may need to add some more space or this may be a fluke. I am re-initializing the influx container and we will see if this happens again.

Worth noting here that the VM is crawling along; very slow. I did a reboot and things are running better, but this might just be because the grafana & influxdb containers are down.

[tylarmurray@fgb-dashboard ~]$ docker stop influxdb
influxdb
[tylarmurray@fgb-dashboard ~]$ docker container prune -f
Deleted Containers:
4d1c13001a1e7d71664dfb3568416a42f50268455c2443c1faddb2d1927533ee
2c9ead08b05e708d56c8380b849479a9e62e5319205d404d537f089ec9b9b94e
Total reclaimed space: 43.82MB
[tylarmurray@fgb-dashboard ~]$ docker image prune -f
Total reclaimed space: 0B
[tylarmurray@fgb-dashboard ~]$ docker volume prune -f
Total reclaimed space: 0B
[tylarmurray@fgb-dashboard mbon-dashboard-server]$ docker-compose up --build -d

Now things are up but the 502 error is still there and docker logs influxdb has nothing in it.

7yl4r commented 2 years ago

aaaaaaand it is working again.

😕

7yl4r commented 2 years ago

The images are still out and I think this is due to some NFS mounting weirdness. I have reset thing2 because https://github.com/USF-IMARS/server-status/issues/24 was happening. ERDDAP is reloading the dataset now and I am hoping that fixes things.

7yl4r commented 2 years ago

I am opening a new issue for the images being out. This thread is getting too long and it is clear now that it is an unrelated issue.

7yl4r commented 2 years ago

This issue is back!

Looks like it went down around 2021-11-13 00:00

image

New sort of incarnation though: this time the disk is fine and the logs are empty.

Restarting influx didn't fix it:

[tylarmurray@fgb-dashboard ~]$ docker container ls --all
CONTAINER ID   IMAGE                                      COMMAND                  CREATED        STATUS                      PORTS                                                                                  NAMES
9d179f516236   mbon-dashboard-server_airflow-init         "/usr/bin/dumb-init …"   5 weeks ago    Exited (0) 38 minutes ago                                                                                          mbon-dashboard-server_airflow-init_1
e96fe85bad74   influxdb:1.8                               "/entrypoint.sh infl…"   5 weeks ago    Up 7 days                   0.0.0.0:2003->2003/tcp, :::2003->2003/tcp, 0.0.0.0:8086->8086/tcp, :::8086->8086/tcp   influxdb
9e8ad2767b8f   mbon-dashboard-server_airflow-webserver    "/usr/bin/dumb-init …"   3 months ago   Up 7 days (unhealthy)       0.0.0.0:8888->8080/tcp, :::8888->8080/tcp                                              mbon-dashboard-server_airflow-webserver_1
df786467c4ae   mbon-dashboard-server_airflow-scheduler    "/usr/bin/dumb-init …"   3 months ago   Up 7 days                   8080/tcp                                                                               mbon-dashboard-server_airflow-scheduler_1
38e031bc8865   mbon-dashboard-server_airflow-worker       "/usr/bin/dumb-init …"   3 months ago   Up 5 weeks                  8080/tcp                                                                               mbon-dashboard-server_airflow-worker_1
4c3ffe371115   mbon-dashboard-server_flower               "/usr/bin/dumb-init …"   3 months ago   Up 5 weeks (unhealthy)      0.0.0.0:5555->5555/tcp, :::5555->5555/tcp, 8080/tcp                                    mbon-dashboard-server_flower_1
e6748b243614   grafana/grafana:6.7.3                      "/run.sh"                4 months ago   Up 5 weeks                  0.0.0.0:3000->3000/tcp, :::3000->3000/tcp                                              grafana
acd49f004cd4   mbon-dashboard-server_nginx                "/docker-entrypoint.…"   4 months ago   Up 5 weeks                  0.0.0.0:80->80/tcp, :::80->80/tcp                                                      nginx
25d78a967daf   mbon-dashboard-server_mbon_data_uploader   "waitress-serve --po…"   4 months ago   Up 5 weeks                  0.0.0.0:5000->5000/tcp, :::5000->5000/tcp                                              mbon_data_uploader
dea30dbe308a   postgres:13                                "docker-entrypoint.s…"   4 months ago   Up 5 weeks (healthy)        5432/tcp                                                                               mbon-dashboard-server_postgres_1
6f3ced987248   redis:latest                               "docker-entrypoint.s…"   4 months ago   Up 5 weeks (healthy)        0.0.0.0:6379->6379/tcp, :::6379->6379/tcp                                              mbon-dashboard-server_redis_1
[tylarmurray@fgb-dashboard ~]$ docker restart influxdb
influxdb
[tylarmurray@fgb-dashboard ~]$ docker container ls --all
CONTAINER ID   IMAGE                                      COMMAND                  CREATED        STATUS                      PORTS                                                                                  NAMES
9d179f516236   mbon-dashboard-server_airflow-init         "/usr/bin/dumb-init …"   5 weeks ago    Exited (0) 50 minutes ago                                                                                          mbon-dashboard-server_airflow-init_1
e96fe85bad74   influxdb:1.8                               "/entrypoint.sh infl…"   5 weeks ago    Up 16 seconds               0.0.0.0:2003->2003/tcp, :::2003->2003/tcp, 0.0.0.0:8086->8086/tcp, :::8086->8086/tcp   influxdb
9e8ad2767b8f   mbon-dashboard-server_airflow-webserver    "/usr/bin/dumb-init …"   3 months ago   Up 7 days (unhealthy)       0.0.0.0:8888->8080/tcp, :::8888->8080/tcp                                              mbon-dashboard-server_airflow-webserver_1
df786467c4ae   mbon-dashboard-server_airflow-scheduler    "/usr/bin/dumb-init …"   3 months ago   Up 7 days                   8080/tcp                                                                               mbon-dashboard-server_airflow-scheduler_1
38e031bc8865   mbon-dashboard-server_airflow-worker       "/usr/bin/dumb-init …"   3 months ago   Up 5 weeks                  8080/tcp                                                                               mbon-dashboard-server_airflow-worker_1
4c3ffe371115   mbon-dashboard-server_flower               "/usr/bin/dumb-init …"   3 months ago   Up 5 weeks (unhealthy)      0.0.0.0:5555->5555/tcp, :::5555->5555/tcp, 8080/tcp                                    mbon-dashboard-server_flower_1
e6748b243614   grafana/grafana:6.7.3                      "/run.sh"                4 months ago   Up 5 weeks                  0.0.0.0:3000->3000/tcp, :::3000->3000/tcp                                              grafana
acd49f004cd4   mbon-dashboard-server_nginx                "/docker-entrypoint.…"   4 months ago   Up 5 weeks                  0.0.0.0:80->80/tcp, :::80->80/tcp                                                      nginx
25d78a967daf   mbon-dashboard-server_mbon_data_uploader   "waitress-serve --po…"   4 months ago   Up 5 weeks                  0.0.0.0:5000->5000/tcp, :::5000->5000/tcp                                              mbon_data_uploader
dea30dbe308a   postgres:13                                "docker-entrypoint.s…"   4 months ago   Up 5 weeks (healthy)        5432/tcp                                                                               mbon-dashboard-server_postgres_1
6f3ced987248   redis:latest                               "docker-entrypoint.s…"   4 months ago   Up 5 weeks (healthy)        0.0.0.0:6379->6379/tcp, :::6379->6379/tcp                                              mbon-dashboard-server_redis_1

7yl4r commented 2 years ago

Working again after restart. Keeping fingers crossed; I will reopen this if it happens again.