useplunk / plunk

The Open-Source Email Platform
https://www.useplunk.com
GNU Affero General Public License v3.0
3.19k stars 149 forks

Plunk API Fails Periodically - Self Hosted #114

Open ejscheepers opened 1 month ago

ejscheepers commented 1 month ago

Every now and again the API fails and does not restart.

Server Logs:

node:internal/deps/undici/undici:13185
2024-10-16T04:29:01.190454265Z Error.captureStackTrace(err);
2024-10-16T04:29:01.190459825Z ^
2024-10-16T04:29:01.190463705Z
2024-10-16T04:29:01.190467185Z TypeError: fetch failed
2024-10-16T04:29:01.190470865Z at node:internal/deps/undici/undici:13185:13
2024-10-16T04:29:01.190474745Z at process.processTicksAndRejections (node:internal/process/task_queues:105:5) {
2024-10-16T04:29:01.190478825Z [cause]: AggregateError [ETIMEDOUT]:
2024-10-16T04:29:01.190482625Z at internalConnectMultiple (node:net:1122:18)
2024-10-16T04:29:01.190486185Z at internalConnectMultiple (node:net:1190:5)
2024-10-16T04:29:01.190489785Z at Timeout.internalConnectMultipleTimeout (node:net:1716:5)
2024-10-16T04:29:01.190493465Z at listOnTimeout (node:internal/timers:596:11)
2024-10-16T04:29:01.190498985Z at process.processTimers (node:internal/timers:529:7) {
2024-10-16T04:29:01.190502665Z code: 'ETIMEDOUT',
2024-10-16T04:29:01.190506065Z [errors]: [
2024-10-16T04:29:01.190509545Z Error: connect ETIMEDOUT 188.114.97.3:443
2024-10-16T04:29:01.190513105Z at createConnectionError (node:net:1652:14)
2024-10-16T04:29:01.190516705Z at Timeout.internalConnectMultipleTimeout (node:net:1711:38)
2024-10-16T04:29:01.190520425Z at listOnTimeout (node:internal/timers:596:11)
2024-10-16T04:29:01.190524025Z at process.processTimers (node:internal/timers:529:7) {
2024-10-16T04:29:01.190527745Z errno: -110,
2024-10-16T04:29:01.190531145Z code: 'ETIMEDOUT',
2024-10-16T04:29:01.190534585Z syscall: 'connect',
2024-10-16T04:29:01.190538065Z address: '188.114.97.3',
2024-10-16T04:29:01.190542545Z port: 443
2024-10-16T04:29:01.190545865Z },
2024-10-16T04:29:01.190549545Z Error: connect ENETUNREACH 2a06:98c1:3121::3:443 - Local (:::0)
2024-10-16T04:29:01.190553745Z at internalConnectMultiple (node:net:1186:16)
2024-10-16T04:29:01.190558345Z at Timeout.internalConnectMultipleTimeout (node:net:1716:5)
2024-10-16T04:29:01.190580945Z at listOnTimeout (node:internal/timers:596:11)
2024-10-16T04:29:01.190585225Z at process.processTimers (node:internal/timers:529:7) {
2024-10-16T04:29:01.190589025Z errno: -101,
2024-10-16T04:29:01.190594705Z code: 'ENETUNREACH',
2024-10-16T04:29:01.190598105Z syscall: 'connect',
2024-10-16T04:29:01.190601545Z address: '2a06:98c1:3121::3',
2024-10-16T04:29:01.190605065Z port: 443
2024-10-16T04:29:01.190608985Z },
2024-10-16T04:29:01.190612345Z Error: connect ETIMEDOUT 188.114.96.3:443
2024-10-16T04:29:01.190616065Z at createConnectionError (node:net:1652:14)
2024-10-16T04:29:01.190619665Z at Timeout.internalConnectMultipleTimeout (node:net:1711:38)
2024-10-16T04:29:01.190623585Z at listOnTimeout (node:internal/timers:596:11)
2024-10-16T04:29:01.190627105Z at process.processTimers (node:internal/timers:529:7) {
2024-10-16T04:29:01.190630745Z errno: -110,
2024-10-16T04:29:01.190634105Z code: 'ETIMEDOUT',
2024-10-16T04:29:01.190637505Z syscall: 'connect',
2024-10-16T04:29:01.190640905Z address: '188.114.96.3',
2024-10-16T04:29:01.190644305Z port: 443
2024-10-16T04:29:01.190647745Z },
2024-10-16T04:29:01.190651065Z Error: connect ENETUNREACH 2a06:98c1:3120::3:443 - Local (:::0)
2024-10-16T04:29:01.190655825Z at internalConnectMultiple (node:net:1186:16)
2024-10-16T04:29:01.190659665Z at Timeout.internalConnectMultipleTimeout (node:net:1716:5)
2024-10-16T04:29:01.190663545Z at listOnTimeout (node:internal/timers:596:11)
2024-10-16T04:29:01.190667145Z at process.processTimers (node:internal/timers:529:7) {
2024-10-16T04:29:01.190670745Z errno: -101,
2024-10-16T04:29:01.190674145Z code: 'ENETUNREACH',
2024-10-16T04:29:01.190677785Z syscall: 'connect',
2024-10-16T04:29:01.190681225Z address: '2a06:98c1:3120::3',
2024-10-16T04:29:01.190684665Z port: 443
2024-10-16T04:29:01.190687986Z }
2024-10-16T04:29:01.190691346Z ]
2024-10-16T04:29:01.190695146Z }
2024-10-16T04:29:01.190698506Z }
2024-10-16T04:29:01.190701826Z
2024-10-16T04:29:01.190705146Z Node.js v22.9.0

If I restart the container, it starts working again.

ejscheepers commented 1 month ago

Not sure if it might be related, but here are logs from Postgres DB:

2024-10-16T14:40:47.565962634Z 2024-10-16 14:40:47.565 UTC [884] FATAL: role "postgres" does not exist
2024-10-16T14:40:52.620820389Z 2024-10-16 14:40:52.618 UTC [891] FATAL: role "postgres" does not exist
2024-10-16T14:40:57.660249044Z 2024-10-16 14:40:57.660 UTC [898] FATAL: role "postgres" does not exist
2024-10-16T14:41:02.701285029Z 2024-10-16 14:41:02.701 UTC [906] FATAL: role "postgres" does not exist
2024-10-16T14:41:07.741504375Z 2024-10-16 14:41:07.741 UTC [913] FATAL: role "postgres" does not exist
2024-10-16T14:41:12.775925703Z 2024-10-16 14:41:12.775 UTC [920] FATAL: role "postgres" does not exist
2024-10-16T14:41:17.819197070Z 2024-10-16 14:41:17.817 UTC [928] FATAL: role "postgres" does not exist
2024-10-16T14:41:22.866831741Z 2024-10-16 14:41:22.866 UTC [935] FATAL: role "postgres" does not exist
2024-10-16T14:41:27.908494833Z 2024-10-16 14:41:27.908 UTC [942] FATAL: role "postgres" does not exist
2024-10-16T14:41:32.946391915Z 2024-10-16 14:41:32.946 UTC [949] FATAL: role "postgres" does not exist
2024-10-16T14:41:37.981018911Z 2024-10-16 14:41:37.980 UTC [956] FATAL: role "postgres" does not exist
2024-10-16T14:41:43.017404840Z 2024-10-16 14:41:43.017 UTC [963] FATAL: role "postgres" does not exist

ejscheepers commented 1 month ago

Just a bit more context:

version: '3'
services:
  plunk:
    image: driaug/plunk
    depends_on:
      postgresql:
        condition: service_healthy
      redis:
        condition: service_started
    environment:
      - SERVICE_FQDN_PLUNK_3000
      - 'REDIS_URL=redis://redis:6379'
      - 'DATABASE_URL=postgresql://${SERVICE_USER_POSTGRES}:${SERVICE_PASSWORD_POSTGRES}@postgresql/plunk?schema=public'
      - 'JWT_SECRET=${SERVICE_PASSWORD_JWT_SECRET}'
      - 'AWS_REGION=${AWS_REGION}'
      - 'AWS_ACCESS_KEY_ID=${AWS_ACCESS_KEY_ID}'
      - 'AWS_SECRET_ACCESS_KEY=${AWS_SECRET_ACCESS_KEY}'
      - 'AWS_SES_CONFIGURATION_SET=${AWS_SES_CONFIGURATION_SET}'
      - 'NEXT_PUBLIC_API_URI=${SERVICE_FQDN_PLUNK}/api'
      - 'APP_URI=${SERVICE_FQDN_PLUNK}'
      - 'API_URI=${SERVICE_FQDN_PLUNK}/api'
      - DISABLE_SIGNUPS=False
    entrypoint:
      - /app/entry.sh
    healthcheck:
      test:
        - CMD
        - wget
        - '-q'
        - '--spider'
        - 'http://127.0.0.1:3000'
      interval: 2s
      timeout: 10s
      retries: 15
  postgresql:
    image: 'postgres:16-alpine'
    environment:
      - POSTGRES_USER=$SERVICE_USER_POSTGRES
      - POSTGRES_PASSWORD=$SERVICE_PASSWORD_POSTGRES
      - 'POSTGRES_DB=${POSTGRES_DB:-plunk}'
    volumes:
      - 'postgresql-data:/var/lib/postgresql/data'
    healthcheck:
      test:
        - CMD-SHELL
        - 'pg_isready -U postgres -d postgres'
      interval: 5s
      timeout: 10s
      retries: 20
  redis:
    image: 'redis:7.4-alpine'
    volumes:
      - 'redis-data:/data'
    healthcheck:
      test:
        - CMD
        - redis-cli
        - PING
      interval: 5s
      timeout: 10s
      retries: 20
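
The Postgres FATAL lines posted earlier repeat roughly every five seconds, which matches the `interval: 5s` of the `postgresql` healthcheck in this file. That healthcheck runs `pg_isready -U postgres -d postgres`, while the database is created with `POSTGRES_USER=$SERVICE_USER_POSTGRES`. A plausible reading, not confirmed anywhere in this thread, is that the probe itself produces the `role "postgres" does not exist` messages, in which case they would be unrelated to the API crash. A minimal sketch of a probe that uses the configured role instead; `pg_healthcheck` is a hypothetical helper name, and the `postgres` fallbacks are assumptions:

```shell
# Hypothetical helper: probe Postgres with the role and database the
# container was actually initialised with, instead of a hard-coded "postgres".
pg_healthcheck() {
  # POSTGRES_USER and POSTGRES_DB are the variables the compose file sets;
  # the ":-postgres" fallbacks are an assumption for when they are unset.
  pg_isready -U "${POSTGRES_USER:-postgres}" -d "${POSTGRES_DB:-postgres}"
}
```

Inside a compose file the same command can go directly in the `CMD-SHELL` test, written with `$$` in place of `$` so that Compose leaves the variables for the container's shell to expand.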

ejscheepers commented 1 month ago

Not sure if it's possible, @driaug, but adding an API health route would be very useful in the meantime. If the API crashes, we could use the route to detect it and restart the container.

At the moment I am using:

 (wget -S --spider http://127.0.0.1:3000/api/users/@me 2>&1 | grep -q 'HTTP/1.1 [1-4]') 

Before, I was only checking http://127.0.0.1:3000, but that gave a false positive whenever the API had crashed and only the dashboard was still running.
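
The check above treats any 1xx to 4xx response from the API as "alive", so an unauthenticated request to `/api/users/@me` still counts as healthy as long as the API answers at all. A sketch of the same idea split into two small functions so the status matching can be tried without a running server; `status_means_up` and `api_is_up` are hypothetical names, and the stricter three-digit pattern is an assumption rather than something from the thread:

```shell
# Hypothetical helper: succeeds if stdin contains an HTTP status line
# with a 1xx-4xx code (the API answered, even if only with an error).
status_means_up() {
  grep -Eq 'HTTP/[0-9.]+ [1-4][0-9][0-9]'
}

# Hypothetical helper: probe a URL the way the healthcheck above does.
# wget -S writes the response headers to stderr, hence the 2>&1.
api_is_up() {
  wget -S --spider "$1" 2>&1 | status_means_up
}
```

A container healthcheck could then call `api_is_up http://127.0.0.1:3000/api/users/@me`. A 5xx response or a refused connection produces no matching status line, so the function returns non-zero.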

ardasevinc commented 5 days ago

I second adding a healthcheck route. I'm using CapRover to deploy - here's my captain-definition/one-click-app file for reference.

I think the issue stems from IPv6. I added the env var NODE_OPTIONS=--dns-result-order=ipv4first. Currently testing this, no crashes yet.

edit: I can verify that the node options above fixed this issue, please test. @ejscheepers @driaug

edit2: I have switched to the --no-network-family-autoselection node option; the DNS result order setting didn't work. This is probably an issue with the Node.js Happy Eyeballs implementation.
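
The stack trace in the first comment is consistent with this: the failures come from `internalConnectMultiple`, with `ENETUNREACH` on the IPv6 addresses and `ETIMEDOUT` on the IPv4 ones. A small sketch of setting the flag without overwriting options that are already present; `add_node_option` is a hypothetical helper name, and whether the flag resolves the crash in a given environment is something to test, as noted above:

```shell
# Hypothetical helper: append a flag to NODE_OPTIONS unless it is already there.
add_node_option() {
  case " ${NODE_OPTIONS:-} " in
    *" $1 "*) ;;  # already set, nothing to do
    *) NODE_OPTIONS="${NODE_OPTIONS:+$NODE_OPTIONS }$1" ;;
  esac
  export NODE_OPTIONS
}

# Disable Node's connection-family auto-selection (Happy Eyeballs).
add_node_option --no-network-family-autoselection

# To inspect the effective default inside the container, one option is:
#   node -p "require('node:net').getDefaultAutoSelectFamily()"
```

In a compose or CapRover setup the equivalent is a plain environment entry, `NODE_OPTIONS=--no-network-family-autoselection`, on the Plunk service.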

Code42Cate commented 23 hours ago

@ardasevinc does the new healthcheck route work for you? It should be available in the latest version.