knadh / listmonk

High performance, self-hosted, newsletter and mailing list manager with a modern dashboard. Single binary app.
https://listmonk.app
GNU Affero General Public License v3.0
14.63k stars 1.34k forks source link

Possible bug with SMTP connection pool on AWS Lambda Container & AWS SES #1492

Closed symroe closed 9 months ago

symroe commented 1 year ago

Version:

Hi,

I've been working on this bug for some time and I've now run out of options for debugging.

The installation

I am using AWS SES as an SMTP server. Listmonk is deployed on AWS Lambda's container service with a slightly modified Dockerfile (below) and aws-lambda-web-adapter connect it to API Gateway.

I am not in the SES sandbox and I'm not rate limited. I have been using this SES account for some years, and I have 200 emails per second allowed, 50,000 a day.

SMTP Settings

Host: email-smtp.eu-west-1.amazonaws.com (this is the correct region) Port 465 Auth protocol: Plain Username / Password verified correct TLS: SSL/TLS Max connections: 10 Retries: 5 Idle timeout: 5s Wait timeout: 15s

What doesn't work

Sending campaigns, or test emails in the campaign view

What does work

Sending the test email on the SMTP settings page

Errors

I get a mix of errors, all (I think) relating to the SMTP connection pool:

What I've tried

I've tried a number of things in debugging this:

SES ports / protocols

I have attempted to use STARTTLS and TLS each with all the ports AWS listens on. I've also tried PLAIN and LOGIN auth protocols.

Max conections / Performance settings

I've set the max_connections to 1 and limited the concurrent worker threads to 1.

Running locally

Invoking ./listmonk directly on my computer with a config file that's reading from the remote database (to ensure the settings are identical).

This works as expected.

Running local Docker

Building and running the same Dockerfile as I'm using on AWS Lambda, with the same database connection also works as expected.

Debugging Lambda networking issues

Networking is an obvious thing to test, especially with the i/o timeout. I know something is working, due to the SMTP page's test email working when deployed, but I thought it was worth proving more.

I don't know enough Go to set up a language native test, but I did build a Python/Flask app that does two things:

  1. Establishes a telnet connection to the SES SMTP sever I'm using
  2. Sends an email using Python's built in SMTP library

I set up a new Dockerfile using the same base image as I'm using for Listmonk and can verify that Python is able to connect over telnet and send emails from within the same Docker set up. For this reason I don't think it's permissions/firewall/networking/etc related.

Recap

I'm sure that I am able to send emails using Listmonk and the settings I have. I can do this locally and within a Docker container running locally.

I can also send emails using the SMTP page test email feature and standalone Python scripts in the same hosting set up.

In a sort of "approach from both ends" model, the things left in the middle are something to do with the way the Listmonk server is working or something to do with the connection pool library.

I don't really know what I can usefully do to help debug this any more, as it's starting to stretch my knowledge of SMTP and Go.

What I'm fairly sure of is that it's not currently possible to run Listmonk on AWS Lambda containers...or at least that's the single situation that my install is failing in.

Any help with how I might debug this would be very useful!

Files

Listmonk Dockerfile ```Dockerfile FROM listmonk/listmonk COPY --from=public.ecr.aws/awsguru/aws-lambda-adapter:0.7.0 /lambda-adapter /opt/extensions/lambda-adapter ENV TZ=Europe/Dublin # Added as a test, no change ENV LISTMONK_app__address=0.0.0.0:8000 ENV PORT=8000 ENV RUST_LOG=debug COPY static* ./static CMD ["./listmonk", "--static-dir=./static"] ```
Python add DockerFile ```Dockerfile FROM alpine:latest RUN apk --no-cache add ca-certificates tzdata COPY --from=public.ecr.aws/awsguru/aws-lambda-adapter:0.7.1 /lambda-adapter /opt/extensions/lambda-adapter ENV PYTHONUNBUFFERED=1 RUN apk add --update --no-cache python3 && ln -sf python3 /usr/bin/python RUN python3 -m ensurepip RUN pip3 install --no-cache --upgrade pip setuptools ENV PORT=8000 ENV RUST_LOG=debug WORKDIR /var/task COPY app.py requirements.txt ./ RUN python -m pip install -r requirements.txt CMD ["gunicorn", "-b=:8000", "-w=1", "app:app"] ```

I can share the content of the Python app if useful, but it's basically just some simple Python that sends an email, so not super interesting.

symroe commented 1 year ago

In reading other issues, I thought I was on to something when I found this: https://github.com/knadh/smtppool/issues/4

However that seemed to go cold and get closed, and the error I saw wasn't exactly the same. Still, it feels a bit like it's related and to do with how the connection pool is working. Like I say, I'm stretching here!

knadh commented 1 year ago

dial tcp 34.254.18.116:465: i/o timeout dial tcp: lookup email-smtp.eu-west-1.amazonaws.com: i/o timeout

The first error above indicates that the DNS resolved to an IP, but the connection failed. The second error indicates that the DNS lookup itself failed. These errors arise from Go's underlying (stdlib) socket/TCP library, so it's highly unlikely that it's emanating from the higher levels in the smtppool lib or listmonk.

This definitely has to be something to do with the specific Docker environment (or something like SELinux?) getting in the way of the network connection. @mr-karan any insights?

However that seemed to go cold and get closed, and the error I saw wasn't exactly the same. @symroe the OP ended the thread saying the issue was to do with a full disk and clearing that fixed it. A mutex deadlock would cause the program to hang and wouldn't throw TCP errors.

mr-karan commented 1 year ago

Great investigation efforts @symroe

Firstly, I am not sure if Lambda is a suitable environment for running a newsletter manager. Since it has time restrictions of 15 minutes. In case email processing finishes within that time, then probably fine.

The first thing I'd din this case is to replace Alpine with ubuntu:latest to debug DNS issues. Alpine has musl based libc instead of glibc and that has been notorious for DNS issues. Would it be possible for you to build a custom image and try with that?

symroe commented 1 year ago

Thanks both!

issue was to do with a full disk and clearing that fixed it

Sorry I missed that from the thread :see_no_evil:

I am not sure if Lambda is a suitable environment for running a newsletter manager. Since it has time restrictions of 15 minutes. In case email processing finishes within that time, then probably fine.

I did wonder about this, but I have a known size email list of under 50,000 and a sending rate of 200 per second allowed. We send about 4 emails a year to the list, so I'm hoping Listmonk can mostly just do nothing the rest of the time :)

So, @mr-karan's suggestion to change to ubuntu:latest has changed something.

First off, I got my first email from the campaign test page :tada:. However, I can't get any more emails after that one. I send the first one almost as soon as the app started for the first time on Lambda, so I suspect it was using the initial connections in the connection pool, rather than refreshed connections?

I've then had some new errors, and some old. In order:

  1. EOF
  2. SMTP AUTH extension not found
  3. 34.254.18.116:465: i/o timeout
  4. lookup email-smtp.eu-west-1.amazonaws.com: i/o timeout

So, two new, two old.

Just to make sure I've not done something odd with my Dockerfile: I simply added ubuntu:latest to the top. Might this have created some odd mix of Alpine and Ubuntu?

Dockerfile ```dockerfile FROM ubuntu:latest FROM listmonk/listmonk COPY --from=public.ecr.aws/awsguru/aws-lambda-adapter:0.7.0 /lambda-adapter /opt/extensions/lambda-adapter ENV TZ=Europe/Dublin ENV LISTMONK_app__address=0.0.0.0:8000 ENV PORT=8000 ENV RUST_LOG=debug COPY static* ./static CMD ["./listmonk", "--static-dir=./static"] ```
symroe commented 1 year ago

After posting the above, I got another context deadline exceeded error on another test email.

mr-karan commented 1 year ago

That Dockerfile doesn't look correct to me. It should be something on the lines of (I've not tested it actually):

FROM listmonk/listmonk:latest AS listmonk_base
FROM ubuntu:latest
RUN apt-get update && apt-get install -y ca-certificates tzdata && rm -rf /var/lib/apt/lists/*
WORKDIR /listmonk
COPY --from=listmonk_base /listmonk/listmonk /listmonk/
COPY --from=listmonk_base /listmonk/config.toml /listmonk/
COPY --from=listmonk_base /listmonk/config-demo.toml /listmonk/
CMD ["./listmonk"]
EXPOSE 9000
symroe commented 1 year ago

Ok thanks for that. I have changed my Dockerfile to;

Dockerfile ```dockerfile FROM listmonk/listmonk:latest AS listmonk_base FROM ubuntu:latest COPY --from=public.ecr.aws/awsguru/aws-lambda-adapter:0.7.0 /lambda-adapter /opt/extensions/lambda-adapter RUN apt-get update && apt-get install -y ca-certificates tzdata && rm -rf /var/lib/apt/lists/* WORKDIR /listmonk COPY --from=listmonk_base /listmonk/listmonk /listmonk/ COPY --from=listmonk_base /listmonk/config.toml /listmonk/ ENV LISTMONK_app__address=0.0.0.0:8000 ENV PORT=8000 ENV RUST_LOG=debug COPY static* ./static CMD ["./listmonk", "--static-dir=./static"] ```

It works, in that the deployment doesn't fail and I can use the app, but I get the same errors when attempting to send:

mr-karan commented 1 year ago

Can you try installing netcat-openbsd package in that image and do these commands specifically instead of wrapping them in the Python env

nc -vz -w 3 34.242.32.157 465

It definitely looks like some networking configuration that's not setup properly (do the lambdas have access to outgoing SMTP ports, since AWS does restrict outbound/inbound common SMTP ports on EC2 as well).

I'm out of ideas as to what else could be happening.

symroe commented 1 year ago

Ok I've done some more debugging.

First off though, remember I can send emails using the SMTP tester on the settings page, and via Python. This, to me, suggests networking isn't the problem, as if it were I'd never be able to send email. I also have got a single email using the campaign tester, but only in the first couple of seconds after the image cold starts.

As per your suggestion, I've used by Python helper function to run nc.

This is using Python to drive the system process using subprocess.check_output, so it's nothing to do with Python's networking systems.

It returns:

34.242.32.157 (34.242.32.157:465) open

So, I think that shows that I can make a connection to that IP address.

To go a step further, I also wrote a script in Go (my first!) that does nothing but send an email. I can include the code if you like, but it's using crypto/tls and net/smtp. This also manages to send an email from the deployment.

So:

  1. Lambda can connect to SES over SMTP
  2. Python and Go scripts can send email
  3. Listmonk's SMTP tester can send email
  4. The campaign page sent an email once

It looks to me like there's a problem with the way the pool is connecting / reusing connections. I really can't see how or why, but I feel like I've isolated every other moving part.

symroe commented 1 year ago

I've also just added nc as part of the CMD in the actual Dockerfile I'm using to run listmonk:

Connection to 34.242.32.157 465 port [tcp/*] succeeded!

(CMD is now a sh script that runs nc && listmonk)

symroe commented 1 year ago

I've also added dig to the run.sh file and get the expected answer email-smtp.eu-west-1.amazonaws.com. 38 IN A 34.243.204.8 (plus other rows, point is I get a valid response so it's not DNS)

symroe commented 1 year ago

Here's the program I made. I've never written Go before and I didn't attempt to make it nice in anyway. I wanted the shortest route to sending an email :D Copy/paste from the internet. It works locally and on the same Lambda/Docker set up as I'm using for Listmonk.

Do let me know if there's anything else I could change in this script to be more like the way the SMTP pool works.

go_email file ```go package main import ( "crypto/tls" "fmt" "log" "net" "net/smtp" "time" ) func main() { //set the email email := "XXXXXX" subject := "Test from Go" body := "This is the body part you send whatever you want in the body." err := sendEmail(email, subject, body) if err != nil { //if you get password error please make sure your password is correct. //Or you have followed all the steps fmt.Println(err) return } } func sendEmail(email, subject, body string) error { from := "XXXXXX" username := "XXXXX" password := "XXXXXX" host := "email-smtp.eu-west-1.amazonaws.com" // port := 465 // Setup headers headers := make(map[string]string) headers["From"] = from headers["To"] = "XXXXX" headers["Subject"] = "Test from Google Go" // Setup message message := "" for k, v := range headers { message += fmt.Sprintf("%s: %s\r\n", k, v) } message += "\r\n" + body // TLS config tlsconfig := &tls.Config{ InsecureSkipVerify: true, ServerName: host, } conn, err := tls.DialWithDialer(&net.Dialer{Timeout: time.Second * 2}, "tcp", "email-smtp.eu-west-1.amazonaws.com:465", tlsconfig) if err != nil { log.Panic(err) } c, err := smtp.NewClient(conn, host) if err != nil { log.Panic(err) } auth := smtp.PlainAuth("", username, password, host) // Auth if err = c.Auth(auth); err != nil { log.Panic(err) } // To && From if err = c.Mail(from); err != nil { log.Panic(err) } if err = c.Rcpt("XXXXXX"); err != nil { log.Panic(err) } // Data w, err := c.Data() if err != nil { log.Panic(err) } _, err = w.Write([]byte(message)) if err != nil { log.Panic(err) } err = w.Close() if err != nil { log.Panic(err) } c.Quit() return err } ```
knadh commented 1 year ago

This is exceptionally strange ; ( The test e-mail from SMTP settings and the campaign e-mails both use the exact same underlying pool mechanism to e-mails.

symroe commented 1 year ago

Well in some ways I'm glad it's not just me missing something obvious!

I keep coming back to these lines:

https://github.com/knadh/smtppool/blob/master/pool.go#L166-L180

I wonder if there's a way that p.opt.TLSConfig != nil && p.opt.SSL is evaluating to False and I'm falling back to net.DialTimeout("tcp", addr, p.opt.PoolWaitTimeout). This might cause the connection to hang and timeout, if it's not using TLS?

knadh commented 1 year ago

That is highly unlikely. The config can't selectively disappear only in the campaign context.

Since you've written a test Go script, could you perhaps try it with the smtppool lib? The README has a simple program that you can copy and use as-is.

symroe commented 1 year ago

Ok I've written the following script (again, sorry if this is bad Go in some ways, I'm just hacking until something works!)

go_smtp_pool ```go package main import ( "crypto/tls" "fmt" "github.com/knadh/smtppool" "log" "net/smtp" "time" ) func main() { username := "xxxxx" password := "XXXXX" host := "email-smtp.eu-west-1.amazonaws.com" auth := smtp.PlainAuth("", username, password, host) // TLS config tlsconfig := &tls.Config{ InsecureSkipVerify: true, ServerName: host, } // Try https://github.com/mailhog/MailHog for running a local dummy SMTP server. // Create a new pool. pool, err := smtppool.New(smtppool.Opt{ Host: host, Port: 465, Auth: auth, SSL: true, TLSConfig: tlsconfig, MaxConns: 10, IdleTimeout: time.Second * 10, PoolWaitTimeout: time.Second * 3, }) if err != nil { log.Fatalf("error creating pool: %v", err) } for i := 0; i < 10; i++ { formatString := generateFormatString(i + 1) _, err := fmt.Printf(formatString) if err != nil { fmt.Printf("Error: %v\n", err) } e := smtppool.Email{ From: "Foo <[actual email]>", To: []string{"XXXX"}, Subject: formatString, Text: []byte("This is a test e-mail"), HTML: []byte("This is a test e-mail"), } if err := pool.Send(e); err != nil { log.Fatalf("error sending e-mail: %v", err) } time.Sleep(15 * time.Second) } } func generateFormatString(iteration int) string { return fmt.Sprintf("Iteration %d\n", iteration) } ```

So the point here is to simulate the pool sending emails over time. I set the time-out to 10 seconds and the iteration sleep time to 15 seconds to make sure that each connection was new.

Importantly, I get some emails every time I run the script. Locally I get all 10, on Lambda I get some and then I get an error.

In this case I got iteration 1, 2 and 3, but then it stopped:

Iteration 1
Iteration 2
[2023-09-01 09:57:10 +0000] [11] [CRITICAL] WORKER TIMEOUT (pid:13)
Iteration 3
DEBUG hyper::proto::h1::conn: parse error (connection closed before message completed) with 0 bytes
DEBUG hyper::proto::h1::dispatch: read_head error: connection closed before message completed
ERROR Lambda runtime invoke{requestId="d54afc58-ef3b-4514-8646-07aac9182408" xrayTraceId="Root=1-64f1b556-159d245651248ebd1ee39659;Parent=0bdebd3d7bd4357a;Sampled=0;Lineage=dd201e7f:0"}: lambda_runtime: hyper::Error(IncompleteMessage)
DEBUG Lambda runtime invoke{requestId="d54afc58-ef3b-4514-8646-07aac9182408" xrayTraceId="Root=1-64f1b556-159d245651248ebd1ee39659;Parent=0bdebd3d7bd4357a;Sampled=0;Lineage=dd201e7f:0"}: hyper::client::pool: reuse idle connection for ("http", 127.0.0.1:9001)
[2023-09-01 09:57:11 +0000] [11] [WARNING] Worker with pid 13 was terminated due to signal 9

The Lambda timeout is 4 minutes, so it's not that that's killing the process. Signal 13 looks like a broken pipe, and that suggests something to do with closed connections, but I really don't know. It's possible I'm just seeing a bug in my code rather than anything else.

Does this help at all? The main point is that the SMTP pool can send emails from my deployment.

knadh commented 1 year ago

The script looks fine. It seems like on lambda, "long" standing connections drop and further connections randomly fail for various random reasons (unreachable IP, timeout, DNS lookup failure).

mr-karan commented 1 year ago

Firstly, I am not sure if Lambda is a suitable environment for running a newsletter manager

Back to where it started :D

Lambda's typically even though on the face of it may seem like executing a function is similar to how it's run on a VM with virtual cores/memory, but behind the scenes a lot of different things could be running.

Also, just found this from the official docs, potentially related.

Use a keep-alive directive to maintain persistent connections. Lambda purges idle connections over time. Attempting to reuse an idle connection when invoking a function will result in a connection error. To maintain your persistent connection, use the keep-alive directive associated with your runtime. For an example, see Reusing Connections with Keep-Alive in Node.js.

knadh commented 1 year ago

Whoa, that keep-alive directive seems to explain exactly what's going on. Lambdas kill long standing connections and then don't allow re-connections.

symroe commented 1 year ago

So yes, I guess the tl;dr is that Lambda is doing something that makes this not work, however I'm not sure the keep-alive thing is quite right.

For example, we have a database connection that's long standing and isn't failing at all, and I don't see how that should be different to a STMP connection (both are going to the open internet, even though I'm using RDS as a database).

I'm going to try one last thing: I've modified smtppool to add loads of debugging prints in. I'm going to build listmonk with the modded smtppool and deploy that in order to get a bit more info on what's going on. I'll post the results of that here for info.

Unless that throws up anything obvious, I'll abandon ship and move to ECS / some other service. Hopefully this issue has been useful for others in future though!

mr-karan commented 1 year ago

I'll abandon ship and move to ECS / some other service. Hopefully this issue has been useful for others in future though!

Just a suggestion that might work for your usecase: Running EC2 spot instances. Since your emailing would finish <15mins, spot instances are ideal for this workload. You can spin up a small EC2 instances and write cloud-init scripts which can run listmonk, start the campaigning and trigger a shutdown/destroy of the instance after email processing is finished (would need some system to monitor the pending count from the DB, but doable).

symroe commented 1 year ago

I'll abandon ship

I'm not quite ready to do this yet, so I thought I'd document my findings here for others :laughing:

I realised the above error Worker with pid 13 was terminated due to signal 9 was unrelated. I'm running my python using gunicorn and Flask (quickest way to get a webserver I can talk to using aws-lambda-web-adapter, and I'm doing this to keep the set up as close to the listmink deploy as I can). Gunicorn was killing the Flask worker after 30 seconds: the default HTTP timeout.

I've done three things since:

  1. Disabled timeout in gunicorn, so the process will run for as long as the process takes
  2. Built a version of smtppool with a load of debugging statements in
  3. Pulled the email sending code in to a function and called that in a goroutine go send_email(pool). This was to make sure that there wasn't anything wrong with threads / goroutines, Docker and Lambda

My main function now looks like this:

func main() {

    if err != nil {
        log.Fatalf("error creating pool: %v", err)
    }
    println("sending emails from goroutine")
    go send_email(pool)
    println("Called send_email goroutine, sleeping")
    time.Sleep(time.Second * 100)
    println("Stop run")

}

I get emails as expected. As you can see, both calls to the funciton use the same pool, and I get emails using both methods (e.g 2 iteration 1, 2 iteration 2, etc).

This is the output I get, with my modded smtppool debugging statements:

Outout

```go Start the idle connection sweeper.sending emails from goroutine -- Called send_email goroutine, sleeping Iteration 1 Sweeping connections &{opt:{Host:email-smtp.eu-west-1.amazonaws.com Port:465 HelloHostname: MaxConns:10 MaxMessageRetries:2 IdleTimeout:10000000000 PoolWaitTimeout:3000000000 SSL:true Auth:0x7a44a0 TLSConfig:0x79ff60} conns:0xc000098120 createdConns:0 lastActivity:{wall:0 ext:0 loc:} mut:{state:0 sema:0} stopBorrow:0xc0000800c0 closed:false} Making a new connection and adding it to the pool Opening a new TLS connection Connect to the SMTP server Returning connection trying to get a connection connection: &{conn:0xc000104300 numErr:0 lastActivity:{wall:0 ext:0 loc:}} Closing conection: &{conn:0xc000104300 numErr:0 lastActivity:{wall:13921992918689173038 ext:218427286 loc:0x7a45c0}} Iteration 2 &{opt:{Host:email-smtp.eu-west-1.amazonaws.com Port:465 HelloHostname: MaxConns:10 MaxMessageRetries:2 IdleTimeout:10000000000 PoolWaitTimeout:3000000000 SSL:true Auth:0x7a44a0 TLSConfig:0x79ff60} conns:0xc000098120 createdConns:0 lastActivity:{wall:0 ext:0 loc:} mut:{state:0 sema:0} stopBorrow:0xc0000800c0 closed:false} Making a new connection and adding it to the pool Opening a new TLS connection Connect to the SMTP server Returning connection trying to get a connection connection: &{conn:0xc000070100 numErr:0 lastActivity:{wall:0 ext:0 loc:}} Closing conection: &{conn:0xc000070100 numErr:0 lastActivity:{wall:13921992934200210786 ext:14623337674 loc:0x7a45c0}} Iteration 3 &{opt:{Host:email-smtp.eu-west-1.amazonaws.com Port:465 HelloHostname: MaxConns:10 MaxMessageRetries:2 IdleTimeout:10000000000 PoolWaitTimeout:3000000000 SSL:true Auth:0x7a44a0 TLSConfig:0x79ff60} conns:0xc000098120 createdConns:0 lastActivity:{wall:0 ext:0 loc:} mut:{state:0 sema:0} stopBorrow:0xc0000800c0 closed:false} Making a new connection and adding it to the pool Opening a new TLS connection Connect to the SMTP server Returning connection trying to get a connection connection: &{conn:0xc000070180 numErr:0 lastActivity:{wall:0 ext:0 loc:}} Closing conection: &{conn:0xc000070180 numErr:0 lastActivity:{wall:13921992949520408587 ext:28911149938 loc:0x7a45c0}} Iteration 4 &{opt:{Host:email-smtp.eu-west-1.amazonaws.com Port:465 HelloHostname: MaxConns:10 MaxMessageRetries:2 IdleTimeout:10000000000 PoolWaitTimeout:3000000000 SSL:true Auth:0x7a44a0 TLSConfig:0x79ff60} conns:0xc000098120 createdConns:0 lastActivity:{wall:0 ext:0 loc:} mut:{state:0 sema:0} stopBorrow:0xc0000800c0 closed:false} Making a new connection and adding it to the pool Opening a new TLS connection Connect to the SMTP server Returning connection trying to get a connection connection: &{conn:0xc000070480 numErr:0 lastActivity:{wall:0 ext:0 loc:}} Closing conection: &{conn:0xc000070480 numErr:0 lastActivity:{wall:13921992964884737537 ext:43243093337 loc:0x7a45c0}} Iteration 5 &{opt:{Host:email-smtp.eu-west-1.amazonaws.com Port:465 HelloHostname: MaxConns:10 MaxMessageRetries:2 IdleTimeout:10000000000 PoolWaitTimeout:3000000000 SSL:true Auth:0x7a44a0 TLSConfig:0x79ff60} conns:0xc000098120 createdConns:0 lastActivity:{wall:0 ext:0 loc:} mut:{state:0 sema:0} stopBorrow:0xc0000800c0 closed:false} Making a new connection and adding it to the pool Opening a new TLS connection Connect to the SMTP server Returning connection trying to get a connection connection: &{conn:0xc000104c00 numErr:0 lastActivity:{wall:0 ext:0 loc:}} Closing conection: &{conn:0xc000104c00 numErr:0 lastActivity:{wall:13921992980344844021 ext:57596670749 loc:0x7a45c0}} Iteration 6 &{opt:{Host:email-smtp.eu-west-1.amazonaws.com Port:465 HelloHostname: MaxConns:10 MaxMessageRetries:2 IdleTimeout:10000000000 PoolWaitTimeout:3000000000 SSL:true Auth:0x7a44a0 TLSConfig:0x79ff60} conns:0xc000098120 createdConns:0 lastActivity:{wall:0 ext:0 loc:} mut:{state:0 sema:0} stopBorrow:0xc0000800c0 closed:false} Making a new connection and adding it to the pool Opening a new TLS connection Connect to the SMTP server Returning connection trying to get a connection connection: &{conn:0xc000104d00 numErr:0 lastActivity:{wall:0 ext:0 loc:}} Closing conection: &{conn:0xc000104d00 numErr:0 lastActivity:{wall:13921992995956728534 ext:72176169718 loc:0x7a45c0}} Iteration 7 &{opt:{Host:email-smtp.eu-west-1.amazonaws.com Port:465 HelloHostname: MaxConns:10 MaxMessageRetries:2 IdleTimeout:10000000000 PoolWaitTimeout:3000000000 SSL:true Auth:0x7a44a0 TLSConfig:0x79ff60} conns:0xc000098120 createdConns:0 lastActivity:{wall:0 ext:0 loc:} mut:{state:0 sema:0} stopBorrow:0xc0000800c0 closed:false} Making a new connection and adding it to the pool Opening a new TLS connection Connect to the SMTP server Returning connection trying to get a connection connection: &{conn:0xc000070a80 numErr:0 lastActivity:{wall:0 ext:0 loc:}} Closing conection: &{conn:0xc000070a80 numErr:0 lastActivity:{wall:13921993011368365347 ext:86481679172 loc:0x7a45c0}} Stop run ```

As you can see, the pool is dropping and re-making connections as I'd expect. I've not seen any timeout or dial errors.

SO...I still don't know what's going on, but I don't think it's as simple as "Lambda can't do this".

I'm going to keep reading through the listmonk code to see what else is going on that might allow me to re-create this. I'm not saying it's a bug in listmonk / smtppool, but I would like to understand how to re-create it as a standalone example.

I think my next step is to implement a Go server that can call the send_email function as a goroutine. I'll update here if I get anywhere.

knadh commented 11 months ago

Hi @symroe. Wondering if you were able to gain any further insights on this.

symroe commented 11 months ago

@knadh Sorry I've not been able to work on this any more.

I'm convinced that there is something wrong with the way SMTP pool works (or is being used in ListMonk) specifically on the AWS Lambda environment. The next steps I need to do is the basically keep adding to my test code until I find the thing that breaks, and hope I've not re-build ListMonk in the process! :laughing:

I think for the time being though, it's fair to say that ListMonk DOES NOT work on AWS Lambda, but for an unknown reason. There's no real reason why it couldn't work, though.

I think that working on properly pluggable backends (including for core emails) as per https://github.com/knadh/listmonk/issues/1486 is more important than fixing whatever this problem is.

If I could have used SES or some other service by default, and to send all messages, then I'd not needed to have attempted to fix SMTP issues.

In the end I used ECS. This isn't as good as also needed to set up an ALB etc that costs money to do nothing.

This project is so close to working on Lambda and providing an almost free solution. Sorry I've not been able to figure out the actual issue yet....hopefully the above will help you or someone else in future!

Thanks so much for all your help...I'll leave you to decide if you want to keep this issue open or not.

dwrolvink commented 9 months ago

Having the same issue, but not on AWS.

Below is our internal troubleshooting report, which might be useful to add context to this issue.

We run listmonk on Kubernetes using the dockerhub image.

Issue

Using the below two options results in a logged error such as:

kubectx cluster2
kubens listmonk
kubectl logs listmonk-58794bb499-kjzwj

...
2023/12/11 10:49:26 manager.go:422: error sending message 'Finished: testmail': dial tcp 188.114.97.0:25: i/o timeout
data: {"id":"","type":"error","message":"2023/12/11 10:49:26 manager.go:422: error sending message 'Finished: testmail': dial tcp 188.114.97.0:25: i/o timeout\n","data":null}

Root cause

It seems that Listmonk caches the DNS result of relay-1.foo.bar in the last two instances, but not in the first instance. This issue would arise when the DNS record changes and we try to send emails before the DNS cache expires.

Fix

Configuring relay-2.foo.bar and then clicking Save will switch Listmonk to the other relay, which is most probably not cached, and this will fix the issues mentioned. Interestingly, immediately switching back to relay-1.foo.bar will also work, so apparently the ip cache is overwritten whenever the SMTP host is set.

knadh commented 9 months ago

It seems that Listmonk caches the DNS result of relay-1.foo.bar in the last two instances, but not in the first instance.

Hi @dwrolvink. Go's stdlib net.Dial (which listmonk uses) does not cache DNS queries and instead relies on the operating system's resolver and caching behaviour.