useplunk / plunk

The Open-Source Email Platform
https://www.useplunk.com
GNU Affero General Public License v3.0

unknown fetch failed from backend #44

Closed lionep closed 1 month ago

lionep commented 1 month ago

Hey! Nice project here, thank you very much for open-sourcing it!

The backend crashes about 1 minute after it starts.

In the docker-compose logs:

backend-1  | ℹ  info      Running scheduled tasks
backend-1  | ℹ  info      Updating verified identities
backend-1  | node:internal/deps/undici/undici:13178
backend-1  |       Error.captureStackTrace(err);
backend-1  |             ^
backend-1  |
backend-1  | TypeError: fetch failed
backend-1  |     at node:internal/deps/undici/undici:13178:13
backend-1  |     at processTicksAndRejections (node:internal/process/task_queues:95:5)
backend-1  |     at runNextTicks (node:internal/process/task_queues:64:3)
backend-1  |     at process.processImmediate (node:internal/timers:454:9) {
backend-1  |   [cause]: ConnectTimeoutError: Connect Timeout Error
backend-1  |       at onConnectTimeout (node:internal/deps/undici/undici:2331:28)
backend-1  |       at node:internal/deps/undici/undici:2283:50
backend-1  |       at Immediate._onImmediate (node:internal/deps/undici/undici:2315:13)
backend-1  |       at process.processImmediate (node:internal/timers:483:21) {
backend-1  |     code: 'UND_ERR_CONNECT_TIMEOUT'
backend-1  |   }
backend-1  | }

Here is the docker-compose.yml (Plunk is hosted behind traefik):

version: '3'

services:
  backend:
    image: driaug/plunk:latest
    labels:
      - traefik.http.routers.${SERVICE_NAME}httpRouter.rule=Host(`${SERVICE_HOST}`)
      - traefik.http.routers.${SERVICE_NAME}httpRouter.entrypoints=http
      - traefik.http.routers.${SERVICE_NAME}httpRouter.middlewares=${SERVICE_NAME}redirect
      - traefik.http.middlewares.${SERVICE_NAME}redirect.redirectscheme.scheme=https
      - traefik.http.middlewares.${SERVICE_NAME}redirect.redirectscheme.port=443
      - traefik.http.routers.${SERVICE_NAME}httpsRouter.rule=Host(`${SERVICE_HOST}`)
      - traefik.http.routers.${SERVICE_NAME}httpsRouter.entrypoints=https
      - traefik.http.routers.${SERVICE_NAME}httpsRouter.tls=true
      - traefik.http.routers.${SERVICE_NAME}httpsRouter.tls.certresolver=letsEncryptResolver
      - traefik.enable=true
    environment:
      - AWS_REGION=XX-XXXXX-XX
      - NEXT_PUBLIC_AWS_REGION=XX-XXXXX-XX
      - AWS_ACCESS_KEY_ID=XXXXXXXXXXXXXXX
      - AWS_SECRET_ACCESS_KEY=XXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
      - AWS_SES_CONFIGURATION_SET=plunk-configuration-set
      - JWT_SECRET=SUPER_SECRET_KEY
      - APP_URI=plunk.mydomain.com
      - API_URI=https://plunk.mydomain.com/api
      - NEXT_PUBLIC_API_URI=plunk.mydomain.com
      - DISABLE_SIGNUPS=False
      - REDIS_URL=redis://redis:6379
      - DATABASE_URL=postgresql://postgres:SUPER_SECRET_PASSWORD@db:5432/postgres
    depends_on:
      db:
        condition: service_healthy
      redis:
        condition: service_started
    restart: always
    entrypoint: [ "/app/entry.sh" ]
    networks:
      - plunk
      - traefik

  db:
    image: postgres
    environment:
      POSTGRES_PASSWORD: SUPER_SECRET_PASSWORD
      POSTGRES_USER: postgres
      POSTGRES_DB: postgres
    volumes:
      - ./volumes/db:/var/lib/postgresql/data
    healthcheck:
      test: [ "CMD-SHELL", "pg_isready -U postgres -d postgres" ]
      interval: 10s
      retries: 5
      timeout: 10s
    networks:
      - plunk

  redis:
    image: redis
    networks:
      - plunk

networks:
  plunk:
  traefik:
    external:
      name: traefik_network

Thanks, let me know if I can provide more info on this.

driaug commented 1 month ago

This part does not seem fully correct.

- APP_URI=plunk.mydomain.com
- API_URI=https://plunk.mydomain.com/api
- NEXT_PUBLIC_API_URI=plunk.mydomain.com

You should always put https:// in front of all your URLs. Also, your NEXT_PUBLIC_API_URI does not match your API_URI. Is it possible that Plunk's API is not reachable on the URL you specified?

Can you try visiting the root of the API in your browser (/api)? It should show something like this:

{"code":404,"error":"Not Found","message":"Unknown route","time":1723209594711}
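
The two points above (every URI carries an explicit scheme, and NEXT_PUBLIC_API_URI matches API_URI) can be checked mechanically. This is a minimal sketch in Python, assuming the variables are available as a plain dict. `check_plunk_uris` is a hypothetical helper for illustration and not part of Plunk.

```python
from urllib.parse import urlparse


def check_plunk_uris(env: dict) -> list:
    """Return human-readable problems found in the URI settings (hypothetical helper)."""
    problems = []
    for name in ("APP_URI", "API_URI", "NEXT_PUBLIC_API_URI"):
        value = env.get(name, "")
        parsed = urlparse(value)
        # Without a scheme, urlparse leaves .netloc empty and puts the host in .path.
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            problems.append(f"{name}={value!r} has no http(s):// scheme")
    # Mirrors the advice in the comment above; treat it as a hint, not a hard rule.
    if env.get("API_URI") != env.get("NEXT_PUBLIC_API_URI"):
        problems.append("NEXT_PUBLIC_API_URI does not match API_URI")
    return problems
```

With the three values quoted above, this reports two missing schemes plus the mismatch. Once all three values carry https:// and the two API URIs are identical, it reports nothing.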

lionep commented 1 month ago

Thanks, here is what I've tried:

      - APP_URI=https://plunk.mydomain.com
      - API_URI=https://plunk.mydomain.com/api
      - NEXT_PUBLIC_API_URI=https://plunk.mydomain.com/api

And I get the same result: the log appears, and the container is unresponsive until restarted.

When accessing https://plunk.mydomain.com/api from my browser, it spins with no response from the server, and at some point I'm redirected to https://plunk.mydomain.com:3000/api/ (and of course, there is no response on this port).

driaug commented 1 month ago

Plunk tries to call the same externally facing URL. If you can't see it, neither can Plunk.

If you cannot access it right after starting up, so before the automated jobs start failing, then there is probably something wrong with your traefik setup.
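
One way to test this from the backend's side is to request the configured URL from a shell that shares the backend container's network, using a short timeout. This is a minimal sketch in Python, assuming Python is available wherever you run it (the Plunk image may not ship it). `probe` is a hypothetical helper, not something Plunk provides.

```python
import urllib.error
import urllib.request


def probe(url: str, timeout: float = 5.0) -> str:
    """Fetch url once and describe the outcome without raising (hypothetical helper)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return f"reachable: HTTP {resp.status}"
    except urllib.error.HTTPError as err:
        # An HTTP error status such as 404 "Unknown route" still proves the server answered.
        return f"reachable: HTTP {err.code}"
    except OSError as err:
        # URLError, connect timeouts and refused connections are all OSError subclasses.
        return f"unreachable: {err}"
```

For example, `probe("https://plunk.mydomain.com/api")` should report an HTTP status if the API can be reached from where the call is made. Note that `urlopen` follows redirects, so a redirect to a port nobody listens on shows up as unreachable.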

On another note, the automated jobs do make it harder to debug since you only get a minute of time to test. I will be adding a feature flag to disable those!

lionep commented 1 month ago

Which environment variable does the backend use for the fetch?

Shouldn't one of the env variables (I don't know which one) be http://localhost:3000 or http://localhost:3000/api, since the call is internal to the Docker container?

lionep commented 1 month ago

It seems the issue comes from the nginx config, which is redirecting my client:

> curl -v https://plunk.mydomain.com/api
* Host plunk.mydomain.com:443 was resolved.
* IPv6: (none)
* IPv4: 11.22.33.44
*   Trying 11.22.33.44:443...
* Connected to plunk.mydomain.com (11.22.33.44) port 443
* ALPN: curl offers h2,http/1.1
* (304) (OUT), TLS handshake, Client hello (1):
*  CAfile: /etc/ssl/cert.pem
*  CApath: none
* (304) (IN), TLS handshake, Server hello (2):
* (304) (IN), TLS handshake, Unknown (8):
* (304) (IN), TLS handshake, Certificate (11):
* (304) (IN), TLS handshake, CERT verify (15):
* (304) (IN), TLS handshake, Finished (20):
* (304) (OUT), TLS handshake, Finished (20):
* SSL connection using TLSv1.3 / AEAD-CHACHA20-POLY1305-SHA256 / [blank] / UNDEF
* ALPN: server accepted h2
* Server certificate:
*  subject: CN=plunk.mydomain.com
*  start date: Aug  6 09:50:53 2024 GMT
*  expire date: Nov  4 09:50:52 2024 GMT
*  subjectAltName: host "plunk.mydomain.com" matched cert's "plunk.mydomain.com"
*  issuer: C=US; O=Let's Encrypt; CN=R11
*  SSL certificate verify ok.
* using HTTP/2
* [HTTP/2] [1] OPENED stream for https://plunk.mydomain.com/api
* [HTTP/2] [1] [:method: GET]
* [HTTP/2] [1] [:scheme: https]
* [HTTP/2] [1] [:authority: plunk.mydomain.com]
* [HTTP/2] [1] [:path: /api]
* [HTTP/2] [1] [user-agent: curl/8.6.0]
* [HTTP/2] [1] [accept: */*]
> GET /api HTTP/2
> Host: plunk.mydomain.com
> User-Agent: curl/8.6.0
> Accept: */*
>
< HTTP/2 301
< content-type: text/html
< date: Fri, 09 Aug 2024 13:38:39 GMT
< location: http://plunk.mydomain.com:3000/api/
< server: nginx/1.26.1
< content-length: 169
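
The telling line in this output is the `location` header: the request went to https on the default port, but the redirect points at plain http on port 3000. The sketch below spells out that comparison. `describe_redirect` is a hypothetical helper, and its default-port table is an assumption limited to http and https.

```python
from urllib.parse import urlsplit


def describe_redirect(requested: str, location: str) -> list:
    """List how a redirect target differs from the requested URL (hypothetical helper)."""
    default_ports = {"http": 80, "https": 443}
    req, loc = urlsplit(requested), urlsplit(location)
    # Fall back to the scheme's default port when the URL does not name one.
    req_port = req.port or default_ports.get(req.scheme)
    loc_port = loc.port or default_ports.get(loc.scheme)
    changes = []
    if req.scheme != loc.scheme:
        changes.append(f"scheme {req.scheme} -> {loc.scheme}")
    if req.hostname != loc.hostname:
        changes.append(f"host {req.hostname} -> {loc.hostname}")
    if req_port != loc_port:
        changes.append(f"port {req_port} -> {loc_port}")
    if req.path != loc.path:
        changes.append(f"path {req.path} -> {loc.path}")
    return changes
```

Applied to the request and `location` value above, it lists a scheme change, a port change from 443 to 3000, and an added trailing slash. That would be consistent with nginx building the redirect from its own internal scheme and port, though the nginx config is not shown in this thread, so that part is a guess.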

lionep commented 1 month ago

Sorry for spamming, but I've made some progress:

I bypassed nginx with this traefik config:

    labels:
      - traefik.http.routers.${SERVICE_NAME}httpRouter.rule=Host(`${SERVICE_HOST}`)
      - traefik.http.routers.${SERVICE_NAME}httpRouter.entrypoints=http
      - traefik.http.routers.${SERVICE_NAME}httpRouter.middlewares=${SERVICE_NAME}redirect
      - traefik.http.middlewares.${SERVICE_NAME}redirect.redirectscheme.scheme=https
      - traefik.http.middlewares.${SERVICE_NAME}redirect.redirectscheme.port=443
      - traefik.http.routers.${SERVICE_NAME}ApiHttpsRouter.rule=Host(`${SERVICE_HOST}`) && PathPrefix(`/api`)
      - traefik.http.routers.${SERVICE_NAME}ApiHttpsRouter.entrypoints=https
      - traefik.http.routers.${SERVICE_NAME}ApiHttpsRouter.tls=true
      - traefik.http.routers.${SERVICE_NAME}ApiHttpsRouter.tls.certresolver=letsEncryptResolver
      - traefik.http.routers.${SERVICE_NAME}ApiHttpsRouter.service=plunk-api
      - traefik.http.routers.${SERVICE_NAME}ApiHttpsRouter.middlewares=plunk-removeApi-prefix
      - traefik.http.middlewares.plunk-removeApi-prefix.stripprefix.prefixes=/api
      - traefik.http.routers.${SERVICE_NAME}OtherHttpsRouter.rule=Host(`${SERVICE_HOST}`)
      - traefik.http.routers.${SERVICE_NAME}OtherHttpsRouter.entrypoints=https
      - traefik.http.routers.${SERVICE_NAME}OtherHttpsRouter.tls=true
      - traefik.http.routers.${SERVICE_NAME}OtherHttpsRouter.tls.certresolver=letsEncryptResolver
      - traefik.http.routers.${SERVICE_NAME}OtherHttpsRouter.service=plunk-other
      - traefik.http.services.plunk-api.loadbalancer.server.port=4000
      - traefik.http.services.plunk-other.loadbalancer.server.port=5000
      - traefik.enable=true
    expose:
      - 4000
      - 5000

This way, traefik routes traffic to port 4000 or 5000 depending on whether the path has the /api prefix.
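
The intent of these labels can be written as a small routing function. This is a simplified model in Python and not Traefik's actual matcher: it assumes a plain string-prefix match for `PathPrefix` and that the stripped path always keeps a leading slash. `route` and the sample paths in the usage note are hypothetical.

```python
def route(path: str) -> tuple:
    """Return (container port, path forwarded upstream) for a request path.

    Simplified model of the labels above, not Traefik's real implementation.
    """
    prefix = "/api"
    if path.startswith(prefix):
        # ApiHttpsRouter: strip the /api prefix and forward to the API on port 4000.
        rest = path[len(prefix):]
        if not rest.startswith("/"):
            rest = "/" + rest
        return 4000, rest
    # OtherHttpsRouter: everything else goes to the dashboard on port 5000 unchanged.
    return 5000, path
```

Under this model `route("/api")` gives `(4000, "/")` and `route("/some/page")` gives `(5000, "/some/page")`. Both routers match the host for `/api` requests, so the setup relies on the more specific rule winning.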

My curl request now gets a proper response:

curl https://plunk.mydomain.com/api
{"code":404,"error":"Not Found","message":"Unknown route","time":1723211643858}

But still, after one minute :

backend-1  | Prisma migrations completed.
backend-1  | Starting the API server...
backend-1  | API server started in the background.
backend-1  | Starting the Dashboard...
backend-1  | Baking Environment Variables...
backend-1  | Environment Variables Baked.
backend-1  |   ▲ Next.js 14.2.5
backend-1  |   - Local:        http://localhost:5000
backend-1  |   - Network:      http://0.0.0.0:5000
backend-1  |
backend-1  |  ✓ Starting...
backend-1  | (node:76) [DEP0040] DeprecationWarning: The `punycode` module is deprecated. Please use a userland alternative instead.
backend-1  | (Use `node --trace-deprecation ...` to show where the warning was created)
backend-1  | ✔  success   [HTTPS] Ready on 4000
backend-1  |  ✓ Ready in 358ms
backend-1  | ℹ  info      Running scheduled tasks
backend-1  | ℹ  info      Updating verified identities
backend-1  | node:internal/deps/undici/undici:13178
backend-1  |       Error.captureStackTrace(err);
backend-1  |             ^
backend-1  |
backend-1  | TypeError: fetch failed
backend-1  |     at node:internal/deps/undici/undici:13178:13
backend-1  |     at processTicksAndRejections (node:internal/process/task_queues:95:5)
backend-1  |     at runNextTicks (node:internal/process/task_queues:64:3)
backend-1  |     at process.processImmediate (node:internal/timers:454:9) {
backend-1  |   [cause]: ConnectTimeoutError: Connect Timeout Error
backend-1  |       at onConnectTimeout (node:internal/deps/undici/undici:2331:28)
backend-1  |       at node:internal/deps/undici/undici:2283:50
backend-1  |       at Immediate._onImmediate (node:internal/deps/undici/undici:2315:13)
backend-1  |       at process.processImmediate (node:internal/timers:483:21) {
backend-1  |     code: 'UND_ERR_CONNECT_TIMEOUT'
backend-1  |   }
backend-1  | }
backend-1  |
backend-1  | Node.js v22.5.1

Is there any file I can edit in the container to log the URL right before the call, so I can try it from the container's point of view?

lionep commented 1 month ago

Just found the fix!

API_URI=http://localhost:4000

It seems this variable is only used internally, so setting it to localhost does not seem to be an issue.

manhtruongwang commented 1 month ago

It is Cloudflare, if anyone is wondering. Just disable caching for that domain!

Lermatroid commented 1 month ago

None of these fixes seem to be working for me. The login is accessible, but the error in the original issue shows for me in the logs.