Huge Feedback Backlog (GitHub issue)

goaround commented 4 years ago

Hi,

I send a daily newsletter to around 40,000 subscribers. I really love Mailcoach because I can assemble my newsletter however I want. Before switching to Mailcoach I used MailChimp and ran an RSS campaign there. It could only pull content from a single RSS feed, with no way to sort or personalize the content.

But with Mailcoach I've run into a few problems:

I have an AWS SES sending rate of 100 emails/second. The first 10,000 emails go out quite fast, but then a lot of feedback comes in and that slows down the sending process considerably. I watched the process: the first 10,000 are sent in about 5 minutes, while the remaining 30,000 or so take more than 30 minutes.
The backlog on the mailcoach-feedback queue gets huge, and it takes hours to process all the jobs. Each job takes around 10 seconds to process, which seems quite long. The webhook_calls table has grown to over 500,000 rows. Could that be the problem?
I've already increased the number of processes for the mailcoach supervisor to 10, but I think it would be better to separate the queues or change the balance mode. 7 of the 10 processes are always busy with feedback jobs rather than send-mail jobs. Any suggestions here?
The mailcoach-feedback queue normally has a wait time of about 6-8 hours. Looking at the graph, it seems the timestamp of the incoming webhook is not used for the resulting click or open. Instead, the time when the job gets processed is used, which at the moment is usually 6-8 hours later.
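To put the reported sending numbers side by side, here is a back-of-the-envelope sketch in Python. It uses only the figures above; treating "more than 30 minutes" as exactly 30 minutes makes the second rate an upper bound.

```python
# Observed batches reported above: (emails sent, elapsed seconds).
# "More than 30 minutes" is treated as exactly 30 minutes, so the
# second rate is an upper bound.
SES_RATE_LIMIT = 100  # emails per second allowed by the SES account

batches = {
    "first 10,000": (10_000, 5 * 60),
    "remaining 30,000": (30_000, 30 * 60),
}


def rate(emails: int, seconds: float) -> float:
    """Average sending rate in emails per second."""
    return emails / seconds


for label, (emails, seconds) in batches.items():
    r = rate(emails, seconds)
    print(f"{label}: {r:.1f} emails/s ({r / SES_RATE_LIMIT:.0%} of the SES limit)")

# Best case: all 40,000 emails sent at the full SES rate.
best_case_seconds = 40_000 / SES_RATE_LIMIT
print(f"best case for 40,000 emails: {best_case_seconds / 60:.1f} minutes")
```

On these numbers even the fast phase runs at roughly a third of the SES limit, and the slow phase at about a sixth, which suggests SES itself is not the limiting factor.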

FYI: I run Mailcoach on a Standard / 4 GB / 2 vCPUs DigitalOcean VPS. I could throw more money at it to get more vCPUs, but I think it should work on a small VPS, too.

freekmurze commented 4 years ago

Thanks for your detailed report.

The rest, about 30,000 emails, need more than 30 minutes.

That's strange. I would expect sending to stay fast, because there's not a lot of DB activity in that job.

Each job needs around 10 seconds to process. That seems quite long?

That's indeed quite long. What we could improve here is to delete the webhook call if it was processed successfully. I'll look into this soon.

7 of 10 processes are always used to process the feedback jobs.

With the minProcesses option, you can configure Horizon to always have a number of workers reserved for a specific queue. Take a look in the Horizon docs to learn more: https://laravel.com/docs/7.x/horizon
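The reserved-workers idea can be illustrated with a small Python sketch. The allocation rule below (reserve each queue's minimum, then share the remaining workers in proportion to backlog) is a hypothetical stand-in, not Horizon's actual balancing algorithm; the queue name send-mail and the backlog sizes are assumptions as well.

```python
from __future__ import annotations


def allocate_workers(
    total: int, minimums: dict[str, int], backlog: dict[str, int]
) -> dict[str, int]:
    """Hypothetical allocation: reserve each queue's minimum, then split the
    spare workers in proportion to each queue's backlog."""
    reserved = sum(minimums.values())
    if reserved > total:
        raise ValueError("minimums exceed the total number of workers")
    allocation = dict(minimums)
    spare = total - reserved
    pending = sum(backlog.get(q, 0) for q in minimums)
    if spare == 0 or pending == 0:
        # Nothing to distribute: only the reserved workers are assigned.
        return allocation
    shares = {q: spare * backlog.get(q, 0) / pending for q in minimums}
    for q, share in shares.items():
        allocation[q] += int(share)
    # Largest-remainder rounding: hand leftover workers to the queues with
    # the biggest fractional share.
    leftover = spare - sum(int(s) for s in shares.values())
    by_fraction = sorted(shares, key=lambda q: shares[q] - int(shares[q]), reverse=True)
    for q in by_fraction[:leftover]:
        allocation[q] += 1
    return allocation


# Hypothetical backlog sizes, chosen only to illustrate the effect.
backlog = {"send-mail": 30_000, "mailcoach-feedback": 500_000}
print(allocate_workers(10, {"send-mail": 0, "mailcoach-feedback": 0}, backlog))
# {'send-mail': 1, 'mailcoach-feedback': 9}
print(allocate_workers(10, {"send-mail": 3, "mailcoach-feedback": 1}, backlog))
# {'send-mail': 3, 'mailcoach-feedback': 7}
```

Without a minimum, the large feedback backlog claims almost every worker; with a reservation for send-mail, sending keeps its three workers no matter how much feedback piles up.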

I normally have a waiting time of about 6-8 hours on the mailcoach-feedback queue.

You're right. I'll consider this a bug and will fix that soon.
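A minimal sketch of the fix in Python: stamp each event when the webhook arrives and carry that time through the queue. FeedbackJob, receive_webhook and process_next are hypothetical names; Mailcoach's real jobs and models are PHP and may be structured differently. The clock is injected so the delay between receipt and processing is easy to see.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import datetime, timezone
from queue import Queue
from typing import Callable


@dataclass
class FeedbackJob:
    """Hypothetical queued job carrying the raw webhook payload."""
    payload: dict
    received_at: datetime  # set when the webhook request arrives


def receive_webhook(payload: dict, jobs: Queue, now: Callable[[], datetime]) -> None:
    # Stamp the event on receipt, before it waits in the queue.
    jobs.put(FeedbackJob(payload=payload, received_at=now()))


def process_next(jobs: Queue, now: Callable[[], datetime]) -> dict:
    job = jobs.get()
    # Store the open/click with the receipt time; the processing time is
    # kept separately, e.g. for monitoring queue delay.
    return {
        "event": job.payload["event"],
        "occurred_at": job.received_at,
        "processed_at": now(),
    }


jobs: Queue = Queue()
receive_webhook({"event": "open"}, jobs, now=lambda: datetime(2020, 3, 27, 8, 0, tzinfo=timezone.utc))
result = process_next(jobs, now=lambda: datetime(2020, 3, 27, 14, 0, tzinfo=timezone.utc))
print(result["occurred_at"], result["processed_at"])
```

Even with a six-hour queue delay, the open is recorded at the moment it was reported.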

If you have any suggestions yourself on how to better handle this many mails, I'm open to them 👍

goaround commented 4 years ago

That's indeed quite long. What we could improve here is to delete the webhook call if it was processed successfully. I'll look into this soon.

I've deleted a lot of old data from the webhook_calls table. That seems to have brought the processing time down: it's now 1.4 seconds.

[Screenshot, 2020-03-27 at 12:55:28]

With the minProcesses option, you can configure Horizon to always have a number of workers reserved for a specific queue. Take a look in the Horizon docs to learn more: https://laravel.com/docs/7.x/horizon

I'll try it, but I think it's per supervisor, not per queue? I think it would be better to split the mailcoach queues into separate supervisors. Can I create a supervisor just for processing the feedback?

raulp commented 4 years ago

@goaround yes you can create separate supervisors per queue.

It seems very weird to me that the webhook processor takes so long for you. It almost seems like your DB is slow for some reason, or something else is going on.

I've checked the SES webhook processor and I see a few queries. First, it fetches the webhook call from the DB by ID, which should be very fast. Next, it looks up the Send model by transport message ID, which should have an index and should be fast. Please confirm that you have that index on the mailcoach_sends table. The index is called mailcoach_sends_transport_message_id_unique and should be unique.

Next, depending on the event that you receive, there's a different action that will happen.

Can you please look at your server while this is happening? What is the processor load? How about memory and disk I/O? Can you also watch the MySQL process list and see if there's a query that is stuck for a few seconds, which would point to a slowdown in your DB? Which DB version do you have, and where is it hosted?

I honestly can't see how deleting data from the webhook_calls table fixed your issue. That query is by ID and should be blazing fast.

goaround commented 4 years ago

The webhook_calls table has grown to 2.6 GB. I think that was the problem. I'll set up a cron job to delete old rows.
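The cron job could run something along these lines. This is a Python/SQLite sketch against a simplified, hypothetical webhook_calls schema (just id and created_at); prune_webhook_calls is a made-up name, and the seven-day retention and batch size of 1,000 are arbitrary example values. Deleting in small batches keeps each transaction short, so the cleanup should not hold locks that the feedback workers are waiting on for long.

```python
from __future__ import annotations

import sqlite3
from datetime import datetime, timedelta, timezone


def prune_webhook_calls(
    conn: sqlite3.Connection,
    older_than_days: int,
    batch_size: int = 1000,
    now: datetime | None = None,
) -> int:
    """Delete webhook_calls rows created before the cutoff, one batch per
    transaction, and return the total number of rows deleted."""
    now = now or datetime.now(timezone.utc)
    cutoff = (now - timedelta(days=older_than_days)).isoformat()
    deleted = 0
    while True:
        cur = conn.execute(
            "DELETE FROM webhook_calls WHERE id IN ("
            "SELECT id FROM webhook_calls WHERE created_at < ? LIMIT ?)",
            (cutoff, batch_size),
        )
        conn.commit()  # keep each transaction (and its locks) short
        deleted += cur.rowcount
        if cur.rowcount < batch_size:
            return deleted


# Demo with a simplified, hypothetical schema (timestamps as ISO-8601 text).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE webhook_calls (id INTEGER PRIMARY KEY, created_at TEXT)")
now = datetime(2020, 3, 27, 12, 0, tzinfo=timezone.utc)
old = (now - timedelta(days=10)).isoformat()
recent = (now - timedelta(hours=1)).isoformat()
conn.executemany(
    "INSERT INTO webhook_calls (created_at) VALUES (?)",
    [(old,)] * 2500 + [(recent,)] * 100,
)
conn.commit()
deleted = prune_webhook_calls(conn, older_than_days=7, now=now)
remaining = conn.execute("SELECT COUNT(*) FROM webhook_calls").fetchone()[0]
print(deleted, remaining)
```

On MySQL the batching can be written directly as DELETE FROM webhook_calls WHERE created_at < ? LIMIT 1000, repeated until it affects fewer rows than the limit; an index on created_at keeps each batch cheap.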

I manage my Mailcoach server with Laravel Forge.

AlexVanderbist commented 4 years ago

Hi, I'm currently looking into this and this is what we'll be tracking internally:

Scheduled cleanup for processed webhooks in the webhook_calls table
Change default queue config for standalone Mailcoach install (priority for sending over processing feedback)
Document splitting jobs over multiple queues for the Mailcoach package
Events should be timestamped based on when they are received, not when they are processed
Check demanding queries/jobs and optimize where possible
Run some load tests as a reference
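The "priority for sending over processing feedback" item can be modelled as a worker that always drains the sending queue before touching feedback. Laravel queue workers already behave this way when started with a comma-separated queue list (for example --queue=send-mail,mailcoach-feedback); the Python sketch below only makes that ordering explicit. The queue name send-mail and the run_worker helper are assumptions for illustration.

```python
from __future__ import annotations

from collections import deque


def run_worker(queues: dict[str, deque], order: list[str], max_jobs: int) -> list[str]:
    """Process up to max_jobs jobs, always taking the next job from the first
    non-empty queue in `order`."""
    processed: list[str] = []
    for _ in range(max_jobs):
        for name in order:
            if queues[name]:
                processed.append(queues[name].popleft())
                break
        else:
            break  # every queue is empty, so stop early
    return processed


queues = {
    "send-mail": deque(["send-1", "send-2"]),
    "mailcoach-feedback": deque(["feedback-1", "feedback-2", "feedback-3"]),
}
processed = run_worker(queues, order=["send-mail", "mailcoach-feedback"], max_jobs=4)
print(processed)
# ['send-1', 'send-2', 'feedback-1', 'feedback-2']
```

The trade-off is that feedback waits while a campaign is sending; because a send is finite, the feedback backlog gets worked off afterwards, which is why timestamping events at receipt matters.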

Hopefully this will fix most of the bottlenecks you pointed out.

I'll keep this issue open to post updates when there are any :)

goaround commented 4 years ago

@AlexVanderbist sounds great! Since I started regularly deleting old webhook_calls rows and separated the feedback and sending queues, everything works smoothly again.

The runtime for processing SES feedback is now down to a stable 0.35 seconds per job.
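The per-job runtimes reported in this thread (about 10 s, then 1.4 s, now 0.35 s) translate directly into feedback throughput. A rough Python sketch, assuming the 7 feedback workers mentioned earlier run fully in parallel and a hypothetical backlog of one feedback job per email in a 40,000-recipient send (opens and clicks would add more):

```python
WORKERS = 7  # processes busy with feedback, as reported earlier in the thread


def drain_hours(jobs: int, seconds_per_job: float, workers: int = WORKERS) -> float:
    """Hours to clear a backlog, assuming workers run in parallel with no other overhead."""
    return jobs * seconds_per_job / workers / 3600


# Hypothetical backlog: one feedback job per email for a 40,000-recipient send.
BACKLOG = 40_000
for seconds in (10, 1.4, 0.35):
    print(f"{seconds:>5} s/job -> {WORKERS / seconds:5.1f} jobs/s, "
          f"{drain_hours(BACKLOG, seconds):5.1f} h to drain {BACKLOG:,} jobs")
```

At 10 s per job this hypothetical backlog would take many hours to drain, the same order of magnitude as the 6-8 hour wait reported at the start; at 0.35 s it drops to roughly half an hour.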

freekmurze commented 4 years ago

I'm going to close this issue for now.

spatie / mailcoach-support

Huge Feedback Backlog #120