veda-consulting-company / uk.co.vedaconsulting.mosaico


Not sending mail to larger groups (over 400 contacts) #167

Closed stoporko closed 6 years ago

stoporko commented 7 years ago

Hi,

one problem with Drupal + CiviCRM + Mosaico:

Everything is working fine with Mosaico 2. Test e-mails are fine, and the system is sending mails to groups... but when the target group has more than 400 contacts, the mail never gets sent. When I click in CiviCRM on Administration > System > Send scheduled e-mails, the page throws an Error 500. When I run the CiviCRM cron job from our host, it runs fine, but no e-mail gets sent anyway. Where could the problem be? Nothing in the logs so far. Many thanks for your help!

mattwire commented 7 years ago

Was this happening before Mosaico was installed? What server/extension are you using to send email?

kngs commented 7 years ago

FYI, my test-drive mailing was bombing out between 600-750 recipients with a memory error. By setting my Mailer Batch Limit to 200 I was able to send (to log) to over 2000 recipients.

stoporko commented 7 years ago

mattwire: this happened only after Mosaico was installed.

kngs: I will give it a try, thanks for the idea :)

edit: kngs - your solution solved the problem. Many thanks. Strange that this happened only after Mosaico was installed.

mattwire commented 6 years ago

This has been happening to me recently. I've tracked it down to Mosaico and the flexmailer UrlFilter but have been unable to find a way of fixing it.

I believe the issue is actually in CRM_Mosaico_UrlFilter::filterHtml / the flexmailer SimpleFilter which calls that function. Basically, memory is not freed until ALL mailings have been parsed through the preg_replace_callback functions (you can see this by inserting a Civi::log()->debug(memory_get_usage()) into the callback function of preg_replace_callback in filterHtml). If you comment out those functions, the memory usage stays almost constant throughout the execution. For larger mailings (anything over around 200 recipients) this uses > 128 MB of memory. @totten Do you have any thoughts? I've tried running in batches etc., but nothing really seems to help other than actually removing the calls to preg_replace_callback.
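
A minimal sketch of that instrumentation, assuming the filterHtml() context above ($stdBase and $domainBase are the variables the real callback closes over; the URL-rewriting body is elided):

    $callback = function ($matches) use ($stdBase, $domainBase) {
      // Log current usage: with the stock code this number grows for every
      // message processed and is only released after the whole batch is done.
      \Civi::log()->debug('filterHtml memory: ' . memory_get_usage());
      // ... original URL-rewriting logic ...
      return $matches[1] . $matches[2] . $matches[3]; // identity: match returned unchanged
    };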

pbatroff commented 6 years ago

It seems I have the same problem, but apparently it only happens with larger mailings to a lot of recipients (> 30k) AND with a lot of images. It worked with the same number of recipients but a smaller mailing size.

Here is the cron output:

sudo -u www-data -g www-data /usr/local/bin/drush civicrm-api -u 1 job.execute

PHP Fatal error:  Allowed memory size of 1073741824 bytes exhausted (tried to allocate 279789 bytes) in /var/www/[...]/sites/default/files/civicrm/ext/uk.co.vedaconsulting.mosaico/CRM/Mosaico/UrlFilter.php on line 54
Fatal error: Allowed memory size of 1073741824 bytes exhausted (tried to allocate 279789 bytes) in /var/www/[...]/sites/default/files/civicrm/ext/uk.co.vedaconsulting.mosaico/CRM/Mosaico/UrlFilter.php on line 54
Drush command terminated abnormally due to an unrecoverable error. [error]
Error: Allowed memory size of 1073741824 bytes exhausted (tried to allocate 279789 bytes) in
/var/www/[...]/sites/default/files/civicrm/ext/uk.co.vedaconsulting.mosaico/CRM/Mosaico/UrlFilter.php,
line 54

Code uk.co.vedaconsulting.mosaico/CRM/Mosaico/UrlFilter.php (line 54):

$htmls = preg_replace_callback(';(\<img [^>]*src *= *\')([^">]+)(\');i', $callback, $htmls);
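
For reference, a quick illustration of what that pattern captures (hypothetical input; group 2 is the relative URL that the callback rewrites to an absolute URL):

    preg_match(';(\<img [^>]*src *= *\')([^">]+)(\');i', '<img alt="logo" src=\'img/header.png\'>', $m);
    // $m[2] === 'img/header.png' -- the URL the callback replaces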

Short-term help was to set a batch limit, but that's no real solution. The memory limit is already at 1024 MB; I tried with more as well.

My guess is that the image handling has something to do with this. It worked with mailing templates with fewer images, and the preg_replace_callback that fails is the one fiddling with the image tags.

mattwire commented 6 years ago

@pbatroff It seems that the multiple calls to preg_replace_callback don't release memory until the entire batch has been processed. For testing, you can comment those lines out and you will see a nice stable memory usage. I tried a few things, but I don't know how to fix it :-(

pbatroff commented 6 years ago

@mattwire OK, thanks for the update. Do I comment out ALL of the preg_replace_callback calls? And is the template corrupted then? I can test this week with another mailing, I suppose. Will get back with feedback!

mattwire commented 6 years ago

@pbatroff Yes, you must comment out ALL of the preg_replace_callback calls to see the stable memory usage. But please take care, because I don't know whether any of the checks in that callback may be required for your mailing!

I think the real solution is to find some way for the filter event in flexmailer to release memory between processing each email, because it will always be possible that someone will implement another filter in an extension that uses a lot of memory.
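
The difference boils down to something like this (a hand-rolled sketch, not flexmailer's actual API; $tasks, $pattern and $callback are illustrative):

    // Whole-batch filtering (roughly what happens today): every message's
    // HTML is copied into one temporary array and filtered in a single call,
    // so all of the rewritten copies are held in memory at the same time.
    $allHtml = preg_replace_callback($pattern, $callback, array_column($tasks, 'html'));

    // Per-message filtering: each temporary copy can be freed before the
    // next message is processed, so peak memory stays roughly flat.
    foreach ($tasks as $i => $task) {
      $tasks[$i]['html'] = preg_replace_callback($pattern, $callback, $task['html']);
    }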

mattwire commented 6 years ago

@pbatroff In fact, you can see this even with just a small number of emails (eg. 10) and a Civi::log()->debug(memory_get_usage()) in the callback. Every email processed increases the memory usage by a fixed amount.

totten commented 6 years ago

That sounds like a nice investigation tracking down memory usage around preg_replace_callback()!

In principle, either of these should improve the memory characteristics:

I think the real solution is to find some way that the filter event in flexmailer can release memory in between processing every email because it will always be possible that someone will implement another filter in an extension that uses a lot of memory.

For a general/systemic fix, lowering the CiviMail settings for batch/job size should have that effect. And perhaps the setting is more important in FlexMailer's batch-oriented pipeline than it was in the CRM_Mailing_BAO pipeline? It might make sense for FlexMailer to have a status-check which generates a warning if your batch settings have no limits.
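
A rough sketch of what such a status-check could look like (it follows CiviCRM's hook_civicrm_check; the function name and the mailerBatchLimit setting key are my assumptions, so treat this as a sketch rather than shipped code):

    function flexmailer_civicrm_check(&$messages) {
      // Assumed setting key; 0 / empty means "no limit" for the batch size.
      if (!\Civi::settings()->get('mailerBatchLimit')) {
        $messages[] = new CRM_Utils_Check_Message(
          'flexmailer_batch_limit',
          ts('No Mailer Batch Limit is set; large mailings may exhaust PHP memory.'),
          ts('FlexMailer batch settings'),
          \Psr\Log\LogLevel::WARNING
        );
      }
    }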

pbatroff commented 6 years ago

Hey @totten, thanks for the input. I knew about the batch settings, and while that makes sense, it doesn't really work well if you want to send out emails fast. I will test your patch this week, though, and get back to you with feedback.

In general, I think a combination of the patch and a reasonable batch setting for your setup makes the most sense!

totten commented 6 years ago

... if you want to send out emails fast

It's useful to describe this a bit more precisely -- because I've heard a few theories on what makes for faster delivery. Let me try to phrase the claims in similar terms (although this requires a little extrapolation from the arguments):

  1. Sending mailings with concurrent, multithreaded delivery allows for faster delivery (eg because you take advantage of more hardware/network resources in parallel). This incidentally requires more batching (so that each job gets a different batch).
  2. Sending mailings with batched I/O allows for faster delivery (eg because you don't need to open/close as many network connections). This incidentally requires batching in memory (so that you don't overload memory or I/O).
  3. Sending mailings with one thread allows for faster delivery (eg because you don't wait for multiple cron-runs; eg because you don't have to bootstrap multiple times). This incidentally requires a continuous stream.

Personally, I'd wager more on #1 than #2 or #3. But data is better than theory.

If your thinking is more along the lines of #3, then maybe try:

This should give you one thread (Mailer Cron Job Limit) which delivers all messages (Mailer Batch Limit) while still capping the number of messages processed concurrently by a flexmailer batch (Mailer Job Size). The flexmailer batches should be dictated by whichever limit is stricter (Mailer Batch Limit or Mailer Job Size).
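
For illustration, those three knobs correspond (I believe) to these CiviCRM setting keys; the values below are placeholders, not the specific recommendation:

    \Civi::settings()->set('mailerJobsMax', 1);     // "Mailer Cron Job Limit": one delivery thread
    \Civi::settings()->set('mailerBatchLimit', 0);  // "Mailer Batch Limit": 0 = deliver all messages
    \Civi::settings()->set('mailerJobSize', 1000);  // "Mailer Job Size": cap per flexmailer batch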

pbatroff commented 6 years ago

Sending mailings with concurrent, multithreaded delivery allows for faster delivery (eg because you take advantage of more hardware/network resources in parallel). This incidentally requires more batching (so that each job gets a different batch).

I agree that this would theoretically be the best way to go, but there are some caveats as well! It depends a bit on the rate at which cron jobs are triggered. Let's assume the following:

With those assumptions it's kinda hard to run concurrent jobs, unless I misunderstood something in the mail configuration, and delivery would only start 30 minutes into the sending. This would mean setting a batch limit high enough for 30 minutes of sending. Depending on the mail server, that would be at least ~15k mails, I would guess. Hence the memory problem in this thread.

What could be done is to deactivate the sendMail cron job in CiviCRM and trigger it manually, let's say once per minute. That way a reasonable batch limit and job size could be configured, and the memory problem could be avoided. But this brings a couple of problems of its own, mainly:

Nonetheless I'm going to test your patch now, and will post feedback here!

pbatroff commented 6 years ago

@totten back with some results and numbers! I made two measurements, both with the exact same mailing, containing six images and a small amount of text. I added debug memory-consumption code at two separate locations in the code:

$callback = function ($matches) use ($stdBase, $domainBase) {
      Civi::log()->debug("IN CALLBACK START > " . memory_get_usage());

and

    Civi::log()->debug("Pre Callback > " . memory_get_usage());
    $htmls = preg_replace_callback(';(\<img [^>]*src *= *")([^">]+)(");i', $callback, $htmls);
    $htmls = preg_replace_callback(';(\<img [^>]*src *= *\')([^">]+)(\');i', $callback, $htmls);
    $htmls = preg_replace_callback(';(\<table [^>]*background *= *")([^">]+)(");i', $callback, $htmls);
    $htmls = preg_replace_callback(';(\<table [^>]*background *= *")([^\'>]+)(\');i', $callback, $htmls);
    Civi::log()->debug("Post Callback > " . memory_get_usage());

I created 20k users in CiviCRM, a test group and said mailing. To emulate a real sending, I used FakeSMTP.

The results can be found in the charts attached to the original issue (memory usage plotted against #logs).

Clearly the memory consumption is better with the SimpleFilter::byValue function, though it seems to fluctuate a bit, and there still seems to be some memory accumulating, just not as much: the bottom of the sawtooth line is increasing.

php_memory_limit was 512 MB, with Mosaico 2.0.4beta and FlexMailer 0.2-alpha5. The CiviCRM config was all default: no batch limit, single thread. If needed I can share the detailed results as well, but I'm not sure this thread is the right platform for that. Feel free to contact me.

Also, the number of log entries (#logs, x-axis) differs because the new version produces much more log output, but the memory consumption is accurate, I guess.

In my opinion this patch makes a lot of sense, since otherwise there is always a chance that the mailing won't be sent, and that depends on a lot of factors:

If it fails, it's quite hard to debug, and trying to figure out the optimal sending rate without breaking CiviCRM is really hard.

totten commented 6 years ago

Very nice analysis, @pbatroff! I agree with merging the patch.

So, overall, it looks like the patch reduces peak memory usage by ~30% and doesn't seem to cause any hard breaks during execution. :) The overall saw-shape seems pretty plausible (in a hand-wavy sense, you'd expect garbage collectors to run periodically).

It is interesting how the bottom of the saw line increases slightly. FWIW, this could be because UrlFilter is actually making longer messages that need more RAM (i.e. absolute URLs are longer than relative URLs -- ballpark +10 to +50 extra chars per URL). At a volume of 20k messages with 1-2 URL conversions each, one gets into the same ballpark displayed in the chart: 20,000 messages × ~1.5 conversions × ~30 extra bytes ≈ 0.9 MB, i.e. the baseline grows by ~1 MB. That's just a theory, of course... but even if some other, reducible factor were at play, that 1 MB doesn't feel too important.

In addition, I still think we should add sign-posts to FlexMailer (eg hook_civicrm_check) emphasizing that batch limits and memory limits have added significance. In the long run, it's fairly easy for future patches/extensions/deployers to raise similar issues.

mattwire commented 6 years ago

@stoporko This has now been fixed by #248. Can you close this issue?