Timeouts with large lists

mklahorst commented 3 years ago

We used the API to import our data into the system and create tags. About 5 tags, 1 to 3 tags per subscriber.

Loading the Subscriber List in the UI is slow (15 to 20 seconds) and getting timeouts setting up segments and viewing details and "population" ui gives me a 504.

Subscriber Size: 540k Digital Ocean Server: 2CPUs, 4GB Memory Installed as standalone via Command Line Mailcoach v3, Laravel 8

Wondering if this is solely a performance issue with the server meaning it needs to be upgraded, or something that can be done on the application side to improve performance.

We are planning on sending tomorrow and any advice on fixing the performance would be very helpful.

freekmurze commented 3 years ago

That's strange. We have customers that report that everything works well with lists of that size. It will take us some time to set up at test to investigate this.

If you can already know a fix, feel free to send a PR and we'll try to merge it in quickly.

mklahorst commented 3 years ago

Thanks for getting back.

Since I am on a deadline, I am going to try to resize the server resources first as an intermediate fix. I will post back here if that will server as an intermediate solution.

For you to test on your end, anything specifically I can provide (other than data) to assist?

mklahorst commented 3 years ago

I increased the server to 4 CPUs and 16GB memory as well as extended the max_execution_time to 60 seconds.

I still get a 504 Gateway Timeout accessing this endpoint below to get population:

domain.com/email-lists/2/segments/1/subscribers

Inspecting the network, it looks like that call gets cancelled: https://d.pr/i/4g0zY8

The population size of this segment is 26,078 of 550k+

Additionally, once I am in a list, paginating the page 2, or page 3, for example takes awhile.

My main concern is the sending routine tomorrow. We have tested small lists, but this would be a first large send and concerned about performance.

freekmurze commented 3 years ago

We probably don't have the time to look into this tomorrow. We do have several customers that have reported that they could send emails to lists of a size like yours.

mklahorst commented 3 years ago

Ok. I will keep tweaking on server settings to see if that is the issue. I might have to reduce my database size to that segment for now to hit the deadline tomorrow.

Long term, I would like to solve this as we plan on using this database size.

If anyone else has the same DB size as mine, please DM me to let me know if there are some settings I may be missing.

mklahorst commented 3 years ago

After spending the day on it, I decided to truncate my list size to (hopefully) something manageable at 280k instead of 500k for tomorrow's send without causing issues. Also not using tagging/segmentation and will send directly to that list.

Using Telescope, it appears that the count(*) calls run frequently probably due to metrics data and pagination. But at 5 secs a call, that adds up quick.

I increased my max execution time to 300 seconds, as I am unsure of the time the actual sending processes will take.

Would really like to figure out how to use the package long-term. Let me know what/how you can assist with debugging this.

mklahorst commented 3 years ago

@freekmurze

Here is what happened on my send today. Trying to figure out what the best move is to fix this and continue sending, without resending to people who have already received it.

List Size: 285,056 Tags/Segments: None (Send to entire list)

When first started sending it was showing "x/285056 sent"

Then it switched to the below in the progress bar and has been stuck there for some time.

Campaign Launch Day - List is sending to 12000/12478 subscribers of My List

In Horizon Batches, it shows the Batch of 12000 completed: https://d.pr/i/zzVdku

I do have this failed job: https://d.pr/i/appvup

With these details: https://d.pr/i/iyAO2W

What would my next step be to troubleshoot this and resume sending?

mklahorst commented 3 years ago

I read on another thread about just retrying the job and the system knows not to send someone an email that has already received one. It appears that more jobs got added and another batch created, which is great.

The campaign progress bar updated to:

...... to 12263/15209 .....

I'm guess there is some sort of batching mechanism as to now show the total list size progress.

Any insight you can provide on sending in the future with horizon settings and how to get an overall view of entire campaign progress would be great!

mklahorst commented 3 years ago

Another odd thing, on the second batch, it is showing 105% completed and the status field is empty without a status tag:

launch-day-list-6 | | 14000 | 105%

I went back in and hit "retry" failed job (the same one BTW) and it looks like it added more to the queue with the status message as:

'....14942/17647....'

and after a few minutes, changed to:

'...15003/29291...'

Here are the batch sizes that have been created up to this point:

BTW, the top batch size keeps going up. Is that intended behavior? It's now at 66,000

launch-day-list-6 | Pending | 48000 | 2% | 2020-12-01 13:14:54 launch-day-list-6 | | 14000 | 105% | 2020-12-01 12:09:54 launch-day-list-6 | Finished | 12000 | 100%

Again, just trying to make sure I understand expectations for sending.

Vardkin commented 3 years ago

@mklahorst You are not alone in this. I, too, am experiencing the same issues with sending email taking a long time, even with smaller lists (<8000). My Horizon jobs per hour when sending a campaign is about 2400 and "Max Wait Time (send-mail)" is an hour.

The only differences with my setup from what you have detailed above is that I am not seeing failed jobs and I use the Sendgrid driver in Mailcoach.

In Horizon, my Batches screen is just in a perpetual "Loading" state due to a 404 on the "/horizon/api/batches" endpoint. So, I cannot confirm if I see that same summary issue that you do. I'm currently looking into this.

My campaigns tend to finish with about 15-20% of all emails failing to send. I can usually successfully send the failed sends by using the "Try resending n emails" button in the Outbox tab of the campaign.

The only error that I am experiencing when sending a campaign is a Sendgrid timeout error :

Connection could not be established with host smtp.sendgrid.net :stream_socket_client(): unable to connect to smtp.sendgrid.net:587 (Connection timed out)

Which I have not found a solution for as of yet.

I'll follow up here with any progress I make as emails sending this slowly will be a major hurdle for the marketing team in any firm.

mklahorst commented 3 years ago

@Vardkin

Thanks for the share!

What I found in my use case seems to be very specific to indexing on my local vs production. For whatever reason, the indexing was not working on my production DB unless I updated the mysql command to force the index. I deleted and re-created the indexes multiple times on production but the query would still not pickup. At this point, I didn't want to hack the package code to force indexes on the used queries. The best advice I got at this point was to upgrade to MySQL 8 (even tho the package support Mysql 5.7x). Still not sure why the indexes weren't working on production! Really weird.

Anyways, that was the reason my queries were slow. I originally had 500k records, but truncated to half to get it to work.

The sendgrid.net error does seem weird. Did you find if that was intermittent? or does that always happen? Where there any outages mentioned at Sendgrid.net during that period?

In regards to getting your issue resolved, it possibly sounds like in my case (unknown why Mysql not indexing) and in your case (batches 404 and sendgrid timeouts) might be anomalies and why there hasn't been too many solutions/responses on this thread.

@freekmurze - is there a setting to do a "Dry Run" in Mailcoach? Meaning we can send prep and send a campaign to a list/segment and it goes through all routines EXCEPT actually sending the email. This would be a great way to debug and test bottlenecks.

Let us know!

Vardkin commented 3 years ago

@mklahorst Thank you for the update!

The Sendgrid timeout error shows up every campaign send and our latest send (29k emails) had just over 1100 fails with that message and took 24 hours to send the entire list. Sendgrid was not showing any outages but it may very well be something on their end. I'll keep troubleshooting it.

As far as the indexes in your production DB, are you saying that the mailcoach tables didn't have any indexes on the columns? For instance, the image attached shows the indexes on the mailcoach_subscribers table. Did your DB not have these? Subscriber-indexes

mklahorst commented 3 years ago

@Vardkin

That Sendgrid issue is weird! Are you able to switch to Mailhog for testing (not sure if Mailcoach supports this) and/or a different provider (not ideal for IP and reputation) to see if that solves it.

Yeah, the indexes ARE there. When I run the same query (direct sql testing) on local and use EXPLAIN, it uses the indexes. However, on production, no matter what I tried, it would not use the indexes unless I forced the index in the query.

So definitely more of an issue with the mysql server (innodb) than with Mailcoach (as far as I can tell). I worked with a DBA to try different optimize/repair methods, but at the end his advice was to upgrade to Mysql 8. haha

Vardkin commented 3 years ago

@mklahorst Ah, I see. Did upgrading help? My setup is not yet using MySQL 8 and I could upgrade. Though, it might be quite the hassle and I'd only do it if it's an almost guarantee help for the slow sending of emails.

freekmurze commented 3 years ago

We cannot reproduce this error and several users have reported that they could work with very large list without any performance issues. Closing this for now as it is likely that the problems are caused from a configuration issue on your end.

Feel free to keep the conversion going.

spatie / mailcoach-support

Timeouts with large lists #288