putyourlightson / craft-campaign

Send and manage email campaigns, contacts and mailing lists in Craft CMS.
https://putyourlightson.com/plugins/campaign

Over eager marking of contacts as bounced. #178

Closed bossanova808 closed 3 years ago

bossanova808 commented 3 years ago

Question

I've noticed that we're seeing a LOT more contacts marked as bounced in Campaign vs. other systems previously used. And some of these contacts have been with us 10+ years, and on manual checking their email is in fact working just fine. Indeed, I can't find these people in the Mailgun end of things either - in our bounced list on Mailgun, we have 109 permanent bounces recorded (after ~4 years use).

In Campaign, there are currently 489 after just a few months use (our mailing list is about 9000 in size). So something is a bit out of whack I think. (As ever, it might be my understanding...)

This doesn't make sense to me and it's a concern as these are people we do need to be able to contact. And it appears after a single bounce event of any type, that's it, they're permanently in the bounced state and we're unable to email them through Campaign (unless we then manually mark as unbounced).

Looking at it, I think maybe with Mailgun, Campaign is being over-keen and marking ANY short term deliverability issue as a permanent bounce.

        if ($eventType == 'bounced' || ($eventType == 'failed' && $severity == 'permanent')) {
            return $this->_callWebhook('bounced', $email);
        }

...I think maybe that needs a bit more subtlety. Indeed, I wonder if it should in fact just be the second part there, with 'bounced' events ignored, as many (most, even) are temporary bounces and should not result in permanent 'de-listing'.

Also, there doesn't seem to be a mass way to mark a group of contacts as unbounced. Is that right? Manually entering 489 contacts to edit their status would be a little tedious. I guess I can write some code to do it though, eh?

This also relates to the new 'contacts' tab in the mailing list (thanks for that!). In there, though, I can only see 'subscribed' and 'unsubscribed' - but not the actual status of the contact (enabled/bounced etc) which (seeing it now in action) is actually what I was after....

All of this relates back to us getting complaints from people (long term customers) that they are suddenly not getting our messages. It seems to start with a temporary deliverability issue - basically, there might be an issue with a Mailgun IP being temporarily on some spam cop list or whatever (they fix that and/or move us to another IP) - or it might be a 'mailbox full' or some other temp. server error at the customer end. Any of those things - none of which are really permanent - means that Campaign (not Mailgun) will mark them as permanently bounced. And the only way back is manually unmarking them, it seems.

So really there are two issues - from a mailing list, it is (still) a bit hard to see who will actually get the messages (and handle it there if need be), and more generally the bounce handling is, I think, overly aggressive.

Would love to get your thoughts on this (before I go writing a bunch of code to e.g. periodically un-mark the bounced contacts to try and avoid this happening!).

bencroker commented 3 years ago

Can you check whether any contacts marked as bounced in Campaign do not come up in the list of bounces in your Mailgun dashboard under Sending → Suppressions → Bounces?

The first conditional below (bounced) is there to deal with Mailgun's "legacy" webhooks. Can you confirm whether you are using the legacy webhooks in your domain? The second conditional deals with the newer webhooks and only marks a contact as bounced if they have failed permanently.

        if ($eventType == 'bounced' || ($eventType == 'failed' && $severity == 'permanent')) {
            return $this->_callWebhook('bounced', $email);
        }

Also, there doesn't seem to be a mass way to mark a group of contacts as unbounced. Is that right?

Correct, this is not something that you would normally want to do en masse. The quickest workaround for you might be to do it directly in the database with an SQL query (be sure to test locally before doing this in production and make a backup first).

This also relates to the new 'contacts' tab in the mailing list (thanks for that!). In there, though, I can only see 'subscribed' and 'unsubscribed' - but not the actual status of the contact (enabled/bounced etc) which (seeing it now in action) is actually what I was after.

If the contact has been marked as bounced on the mailing list then that status should appear as follows:

[Screenshot, 2020-10-07 10:35:02]

bossanova808 commented 3 years ago

hmmm, yeah, that's precisely the issue - I definitely have a whole bunch marked as bounced in Campaign and not marked as bounced in that area of Mailgun (109 in Mailgun vs. almost 500 bounced in Campaign). I exported ALL my Mailgun bounces to a CSV to double check - definitely just 109.

I am not using any legacy webhooks - I just have these:

[Image: screenshot of the webhooks configured in Mailgun]

There's definitely something not right somewhere...as way more are being marked bounced than should be. Is there anything I can do to help find out what is going on? Add some logging somewhere?

I suppose it might be the case that for one message these contacts have had a permanent failure (e.g. mailbox full). (It's not so easy to work out as Mailgun only has 5 days of logs on my plan, so I have to catch it in the act and I've only really just noticed this issue).

But (assuming I am correct) - that one failure, whilst permanent for that message, isn't enough for Mailgun to mark the contact as permanently bounced, clearly....which I think is the right approach. But I suppose via webhook you can't tell if the event is one of these scenarios (this message failed, but future ones might not...).

I could certainly write something that checks the mailgun API to see if a contact is on the 'real' bounced list - and if not, unmark that contact as bounced in Campaign (via SQL or a service, whatever) - but it feels a bit kludgy.

(And thanks for pointing out bounced do show in that new view - I checked a few lists but I didn't see any bounces in those particular ones).

bossanova808 commented 3 years ago

(Just for comparison data, we also did not have this issue with Mailchimp - also very few permanent bounces there - one failed message was not enough cause to permanently mark a contact there either).

bencroker commented 3 years ago

In my understanding of how webhooks work in Mailgun, a failed send should not result in a permanent fail event being sent. I'll look into this further and let you know what I come up with.

bossanova808 commented 3 years ago

No worries. Let me know if I can do anything to help.

bencroker commented 3 years ago

Can you search for "Webhook request:" in your Craft log files and see if any bounces come up?

bossanova808 commented 3 years ago

Looks like I can - will flick them to you on discord as they have customer addresses in them...

bossanova808 commented 3 years ago

So as per Discord, we are seeing a fair number of webhook requests in our logs....here's one example (sanitised for public display)...

2020-10-05 21:45:06 [-][-][-][warning][Campaign] Webhook request: {"signature": {"timestamp": "1601894705", "token": "xxx", "signature": "xxx"}, "event-data": {"severity": "permanent", "tags": [], "timestamp": 1601894705.087757, "storage": {"url": "https://sw.api.mailgun.net/v3/domains/xxx.com.au/messages/xxx", "key": "xxx"}, "recipient-domain": "iinet.net.au", "id": "xxx", "campaigns": [], "reason": "old", "user-variables": {}, "flags": {"is-routed": false, "is-authenticated": true, "is-system-test": false, "is-test-mode": false}, "log-level": "error", "envelope": {"sender": "xxx@xxx.com.au", "transport": "smtp", "targets": "xxx@iinet.net.au"}, "message": {"headers": {"to": "xxx@iinet.net.au", "message-id": "xxx@swift.generated", "from": "Us  <xxx@xxx.com.au>", "subject": "Order # CXXX - Order Documents"}, "attachments": [{"filename": "CXX_INV-13992.pdf", "content-type": "application/pdf", "size": 91051}], "size": 119821}, "recipient": "xxx@iinet.net.au", "event": "failed", "delivery-status": {"attempt-no": 31, "message": "Too old", "code": 602, "description": "", "session-seconds": 0.0}}}
2020-10-05 21:45:06 [-][-][-][info][application] $_GET = [
    'p' => 'actions/campaign/webhook/mailgun'
    'key' => '••••••••••••••••'
]

...some are legit bounces (people who typed emails wrongly etc) - but some are 'too old' which is a generic deliverability type message (according to Mailgun) - and when manually emailed their email is ok (I've checked). Ultimately it probably mostly comes down to Mailgun shared IPs and them not managing their deliverability enough (am considering moving providers as another part of solving this, but overall they've been ok so not sure on that yet). Perhaps worth noting that we have set up SPF/DKIM etc., have double opt-in for our newsletter, and done literally all we can to make sure deliverability is as high as possible.

But many in the logs are definitely classic 'one-time' failures - and mostly they are not even related to Campaign messages, but are the results of other transactional emails - but they are reported by Mailgun back to our domain as failed severity permanent as can be seen above...and this triggers Campaign to mark them as a permanent bounce.
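To see how the logged failures break down, the webhook log lines can be tallied by severity and reason. The following is only a minimal sketch: the function name `tallyWebhookFailures` is hypothetical, and it assumes each log line ends with the JSON payload after the "Webhook request: " marker, as in the example above. Lines that do not parse as JSON are skipped.

```php
<?php
// Hypothetical helper (not part of Campaign): counts Mailgun "failed" events
// found in Craft log lines, keyed by "severity/reason" (e.g. "permanent/old").
function tallyWebhookFailures(array $logLines): array
{
    $counts = [];
    $marker = 'Webhook request: ';

    foreach ($logLines as $line) {
        $pos = strpos($line, $marker);
        if ($pos === false) {
            continue; // not a webhook log line
        }

        // Assumption: everything after the marker is the JSON payload.
        $payload = json_decode(substr($line, $pos + strlen($marker)), true);
        $data = is_array($payload) ? ($payload['event-data'] ?? null) : null;
        if (!is_array($data) || ($data['event'] ?? '') !== 'failed') {
            continue; // unparseable payload or not a failure event
        }

        $key = ($data['severity'] ?? 'unknown') . '/' . ($data['reason'] ?? 'unknown');
        $counts[$key] = ($counts[$key] ?? 0) + 1;
    }

    arsort($counts); // most frequent combination first
    return $counts;
}
```

A large share of "permanent/old" entries relative to "permanent/bounce" would support the suspicion that many of the contacts marked as bounced never hard bounced.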

In effect I think this means that Campaign really is marking as bounced a whole lot of contacts (again, we're running at about 500 in Campaign marked as bounced across a few months, vs. 110 or so in Mailgun itself after several years)....that really should not be marked as bounced. I think that you would want to do a follow up check in the code here, and perhaps hit the Mailgun API to actually check if they have been put on their permanent suppression list, and only then mark them as bounced.

(Thinking about it more generally, with a transport like Mailgun that is already handling bounce/complaints at their end, I am not even sure I see much point in bounce handling in Campaign at all...not really sure it adds much. I suppose it is useful to know for a particular Campaign that a message did not get through to such and such a contact...but other than that, Mailgun is already preventing emails to complainers (which we've had 4 of in like 15+ years and fairly sure almost all have been accidental) and real, permanent fails...so that seems enough really?)

So in the short term I have two things to do I guess:

  1. Unmark the 500 or so that have been marked as bounced - you suggested SQL, which should not be hard.
  2. Work out how to stop this happening in future (in the short term I could simply disable the webhooks in Mailgun I suppose, so these don't get posted back?)

Thanks for looking at this!

bossanova808 commented 3 years ago

Further, have confirmed with some of these people that the emails Mailgun is labelling as 'too old' have in fact got through to them. I think it's some sort of anti-spam thing where the recipient's server does accept the email, but does not report that back to Mailgun, perhaps. In any case, definitely cases where they should not be marked as bounced...

bossanova808 commented 3 years ago

RE: the SQL it looks like just a very simple:

UPDATE craft_campaign_contacts set bounced = NULL;

...would just clear these out. I would probably do this before our next newsletter, at least, as otherwise a lot of folks won't be getting their messages.

Does that SQL seem ok? Any side effects I need to worry about?

bencroker commented 3 years ago

Thanks for following up. I will look into whether it is possible to reliably identify permanent failures in Mailgun as hard bounces.

in the short term I could simply disable the webhooks in Mailgun I suppose, so these don't get posted back

Yes you could disable the Mailgun webhooks altogether as Mailgun will never allow sending to hard bounced email addresses.

You'll want to execute a similar SQL query on the campaign_contacts_mailinglists table, see the subscriptionStatus and bounced columns.

bossanova808 commented 3 years ago

Ok, so I have disabled webhooks and this is the clean-up SQL I have come up with:

UPDATE craft_campaign_contacts
SET bounced = NULL;

UPDATE craft_campaign_contacts_mailinglists 
SET subscriptionStatus = 'subscribed', bounced = NULL 
WHERE subscriptionStatus = 'bounced';

Seems to work fine on development so unless you think that looks wrong, I will run it on production as well shortly.

Obviously it would be nice, longer term, if there was a more 'proper' fix here, as ideally we'd still like to see legitimate hard bounces in our results.

bencroker commented 3 years ago

The SQL queries look good.

Yes, we'll be looking into a better fix for this.

bossanova808 commented 3 years ago

Any progress on this Ben?

It's not great at the moment - we're essentially relying on Mailgun's suppression list for list control of bounces (and complaints, although we don't really get those) - which means our Campaign list increasingly gets a little more divorced from reality with each newsletter.

Be great to see a fix for this...

bencroker commented 3 years ago

Not much, but while scouring the docs, I did notice that a permanent failure can be caused for one of several reasons: a hard bounce, repeated soft bounces or a spam complaint.

Permanent Failure Webhook

There are a few reasons why Mailgun needs to stop attempting to deliver messages and drop them. The most common reason is that Mailgun received a Hard bounce or repeatedly received Soft bounces and continuing attempting to deliver may hurt your reputation with the receiving ESP. Also, if the address is on one of the ‘do not send lists’ because that recipient had previously bounced, unsubscribed, or complained of spam, we will not attempt delivery and drop the message. If one of these events occur we will POST the following webhooks payload to your permanent_fail URLs. You can specify webhook URLs programmatically using the Webhooks API.

Source: https://documentation.mailgun.com/en/latest/user_manual.html#tracking-failures

Ideally, the next time you find a permanent failure request to the webhook that is not in Mailgun's bounce logs, you can ask Mailgun support why that is the case.

In theory, we could use the Bounces API to manually fetch bounced email addresses and match them to contacts in the plugin, but this would require a cron job to do this on a recurring basis. The webhook approach is the simpler, more direct approach to detecting permanent failures (regardless of Mailgun's specific definition of them). If you're unsatisfied with the result of the webhooks then you can either disable them and let Mailgun handle determining which email addresses not to send to, or you could write a module to fetch and update bounced contacts from the Bounces API. If you need help with this then we can provide a quote.
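The matching step of such a module could look roughly like the sketch below. It is only an illustration under stated assumptions: `reconcileBounces` is a hypothetical name, fetching the addresses from the Bounces API and reading or updating Campaign contacts are deliberately left out, and it assumes PHP 7.4+ for the arrow function.

```php
<?php
// Hypothetical helper (not part of Campaign): given the addresses on Mailgun's
// bounce suppression list and the addresses Campaign has marked as bounced,
// work out which Campaign contacts to unmark and which to mark.
function reconcileBounces(array $mailgunBounced, array $campaignBounced): array
{
    // Compare addresses case-insensitively and ignore stray whitespace.
    $normalise = static fn(string $email): string => strtolower(trim($email));

    $mailgun = array_unique(array_map($normalise, $mailgunBounced));
    $campaign = array_unique(array_map($normalise, $campaignBounced));

    return [
        // Bounced in Campaign but not suppressed by Mailgun: likely false bounces.
        'unmark' => array_values(array_diff($campaign, $mailgun)),
        // Suppressed by Mailgun but not yet bounced in Campaign.
        'mark' => array_values(array_diff($mailgun, $campaign)),
    ];
}
```

A cron job could run this periodically and apply the two lists, which would keep the webhooks for per-sendout reporting while treating Mailgun's suppression list as the source of truth for contact status.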

bossanova808 commented 3 years ago

I could write a module for myself, sure, but really anyone using Mailgun is going to want this, aren't they?

The basic problem here, it seems, having dug into it, is sender reputation related. Mailgun, like all of these services at the more affordable levels, uses shared IPs for sending the mail. Unfortunately, as I have found over the last 3 or 4 years with them, it is not uncommon that one of their IPs gets blacklisted by SpamCop and similar services due to some other user of that IP being 'noisy' as Mailgun put it. They will change you to another IP, if you manually request such, but it's clear they are at best sluggish about managing this in general. So there are thus unpredictable periods where deliverability drops for a while. But these are not permanent failures - they ARE permanent for that message of course - but when they change your IP and you send another email to this address, it goes through fine, as it should. But in the meantime, Campaign will have marked the contact as a permanent bounce and won't send anything else to them from then on - they've effectively been removed from the list for no reason other than some other user of that IP being, possibly, a spammer.

This becomes, then, a thorny problem - these perfectly valid emails, that are affected by this temporary issue - are marked as permanently bounced in Campaign...when they really shouldn't be given this bigger picture. As you say one can let Mailgun handle it, but this does then result in the hard bounces NOT being recorded to the Campaign list, which gives one less opportunity to deal with such issues.

I have come to the conclusion that Mailgun is not very good at this and will personally likely look to another service, but Mailgun is very popular. And personally I don't see how Campaign can effectively work with lists and Mailgun, without having this double check against the 'real' hard bounced list. Again - we were seeing hundreds of these 'false bounces' over time...vs a very low number of real, permanent bounces. This is with very standard use - basically a once a month newsletter to about 9000 people.

Anyway, if you're definitely not going to add it, I'll either move service, or write the module, but without this check against hard bounces I don't think Campaign can reasonably claim to properly support bounce or complaint handling with Mailgun, personally. I don't mean that as criticism, and completely get why you wrote it the way you have, as this is not wonderfully clear/great from Mailgun's end (as they are effectively translating a temporary delivery issue into a message about a permanent contact failure) - but the net effect is pretty broken handling of temporary bounces against contacts, which just creates inaccuracies and a bit of a mess.

bencroker commented 3 years ago

I could write a module for myself, sure, but really anyone using Mailgun is going to want this, aren't they?

Possibly, but the same could then be said of the other webhooks that Campaign provides (Amazon SES, Mandrill, Postmark, SendGrid). Adding an extra layer of complexity to double check the results of the webhook requests for each of these services is far from ideal.

I agree that this is a thorny problem, but maintain that the issue is on Mailgun's end. In their Guide To Webhooks they write:

Webhooks are an extremely flexible way for developers to monitor the health of their email messaging in real time, analyze deliverability data and program apps to handle unsubscribes, spam reports and bounces instantly.

In the user guide they write:

Hard bounces (permanent failure): Recipient is not found and the recipient email server specifies the recipient does not exist. Mailgun stops attempting delivery to invalid recipients after one Hard Bounce.

So everything that they state implies that the Campaign plugin is handling things correctly. Campaign only treats "permanent" failures as bounces, not "temporary" failures, so again if Mailgun is reporting permanent failures when it shouldn't be, then the issue is on their end.

Have you contacted Mailgun support about this? It would be interesting to hear what they have to say and might help to confirm/debunk your theory.

bossanova808 commented 3 years ago

So I read that and agreed with all of it. But then I thought I'd just read through the Mailgun docs to be more sure, and the way I read it is Mailgun is, in fact, doing what it says it will, but the issue is really of interpreting the messages it is sending.

Here is what I find the most relevant section: (also from https://documentation.mailgun.com/en/latest/user_manual.html#tracking-failures)

With respect to failure persistence Mailgun classifies bounces into the following two groups:

Hard bounces (permanent failure): Recipient is not found and the recipient email server specifies the recipient does not exist. Mailgun stops attempting delivery to invalid recipients after one Hard Bounce. These addresses are added to the “Bounces” table in the Suppressions tab of your Control Panel and Mailgun will not attempt delivery in the future.
Soft bounces (temporary failure): Email is not delivered because the mailbox is full or for other reasons. These addresses are not added to the “Bounces” table in the Suppressions tab.

Mailgun, with its permanent failure webhook, is sending a message about a permanent failure of that specific message - it is Campaign that is then making a decision to translate this message, about just that one message, into a permanently bounced (suppressed) contact, and blocking all future emails to that contact - based on, what is clearly quite possibly just a temporary failure. It's really the distinction between a single message level (temporary) problem and a (permanent) contact level problem that is being lost with Campaign's current approach.

From that quote above, it is clear Mailgun recognise this issue themselves (the possibility of one-off soft bounces for a variety of reasons) and therefore do not add these contacts to their permanent bounce list - unless it's a true hard bounce. But they are rightly still alerting that the message in question has permanently failed to be delivered on this occasion.

Thus I believe that Campaign should use the permanent failure message in its reporting about a send-out, for sure, but that before marking the contact as bounced, Campaign should double check it was really a hard bounce that would affect future deliverability.
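The distinction between a message-level failure and a contact-level bounce can be sketched as follows. This is a hedged illustration only, not how Campaign works: `handlePermanentFailure` and the `$isSuppressed` callable are hypothetical, with the callable standing in for whatever lookup confirms that the address is on the provider's bounce suppression list.

```php
<?php
// Hypothetical sketch (not part of Campaign): a permanent failure is always
// recorded against the sendout, but the contact is only marked as bounced when
// the assumed $isSuppressed lookup confirms a contact-level hard bounce.
function handlePermanentFailure(string $email, callable $isSuppressed): array
{
    return [
        // Message-level fact: this particular email was not delivered.
        'recordFailure' => true,
        // Contact-level decision: suppress only confirmed hard bounces.
        'markBounced' => (bool) $isSuppressed($email),
    ];
}
```

Keeping the two outcomes separate means the sendout report stays accurate while a one-off failure no longer removes a working address from future mailings.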

Of course, I could be entirely wrong, it's happened before, but that's how I am reading their info. I can certainly try and get their input on this too, of course, if you feel I am misinterpreting their info.

bencroker commented 3 years ago

From that quote above, it is clear Mailgun recognise this issue themselves (the possibility of one-off soft bounces for a variety of reasons) and therefore do not add these contacts to their permanent bounce list - unless it's a true hard bounce. But they are rightly still alerting that the message in question has permanently failed to be delivered on this occasion.

Can you help me understand how you came to the conclusion that Mailgun sends a permanent failure in the case of a soft bounce? The way I read it, hard bounce → permanent failure, soft bounce → temporary failure.

bossanova808 commented 3 years ago

I think it's the repeated soft bounce scenario, mentioned here:

Permanent Failure Webhook

There are a few reasons why Mailgun needs to stop attempting to deliver messages and drop them. The most common reason is that Mailgun received a Hard bounce or repeatedly received Soft bounces and continuing attempting to deliver may hurt your reputation with the receiving ESP. Also, if the address is on one of the ‘do not send lists’ because that recipient had previously bounced, unsubscribed, or complained of spam, we will not attempt delivery and drop the message. If one of these events occur we will POST the following webhooks payload to your permanent_fail URLs. You can specify webhook URLs programmatically using the Webhooks API.

...but even repeated soft bounces is a message level event, not one that means there will never be an opportunity to deliver to this address again. Hence Mailgun itself not adding this to their permanent suppression list...but that implies, right, that they will send to the permanent failure hook in this case?

And that's precisely the behaviour we saw...

(I'm not at all saying I love what Mailgun is doing here, or how they do it, just that...they do say (albeit in a somewhat buried way) - this is what they will do, so without getting them to change the behaviour, this will remain a problem for Campaign's bounce handling with Mailgun...)

bencroker commented 3 years ago

Ironically, the deprecated legacy webhooks API has a bounce webhook, although the documentation for how that works in practice is even slimmer. https://documentation.mailgun.com/en/latest/api-webhooks-deprecated.html

So it seems that the most reliable method of marking contacts in Campaign as bounced is to periodically hit the Bounces API and compare its results against the contacts, in addition to using the webhooks. The webhooks provide the plugin with the context of which sendout actually caused the bounce, which is useful information to have. In saying that, I am not convinced that this should be added directly to the Campaign plugin. Mailgun is just one of 5 email service webhooks that Campaign supports. Adding this extra layer of complexity for up to 5 unique services is a lot of technical debt, especially considering how APIs tend to get deprecated (as the legacy Mailgun API was).

My recommendation, if you're going to stick with Mailgun and want to manage bounces within the plugin, is to build a plugin to access the Bounces API, which as I said before we can help with. Thanks for the discussion and sorry that there is no quick and easy solution for this.

bossanova808 commented 3 years ago

But it would only be Mailgun that has this particular issue, right? Not all 5 services? Or have you come to the conclusion they all behave in the same way in this regard - I think that is probably not right.

I think then, at the least, you'd have to note in your docs, that with Mailgun specifically, Campaign can't really handle bounces in an accurate way. (And if all 5 services do have this same issue, then I think you'd have to note that it can't handle bounces properly probably with any of them, to be honest. But that seems unlikely.)

I can see why you don't want the technical debt, but the reality is without this extra layer of handling, Campaign is not able to accurately deal with bounces with Mailgun, so I think in fairness folks need to know that. There is demonstrable risk of hundreds of contacts - and it would be more over time, and an ongoing issue - being inaccurately suppressed otherwise, and I think that's a fairly big sized bug/lack of accuracy/lack of functionality/whatever you want to term it - that folks need some pre-warning about, for this combination of Campaign with Mailgun.

I will most likely move services at some point next year or implement the cron/api hit myself, which does not look hard. Thanks for looking at it all.

bencroker commented 3 years ago

I did some further digging and found that there is a reason property that can be used to determine whether Mailgun added an email address to its bounce suppression list:

"event": "failed",
"severity": "permanent",
"reason": "suppress-bounce",

---

"event": "failed",
"severity": "permanent",
"reason": "bounce",

Source: https://documentation.mailgun.com/en/latest/api-events.html#event-structure

Adding a condition to ensure that a contact is marked as bounced only if the reason is one of the above, should hopefully resolve this issue. What do you think?
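As a rough illustration of such a condition (a sketch only, not the code that was released), the check could be written as a small predicate over the decoded "event-data" array. The function name `isHardBounce` is hypothetical; the field names and values are the ones shown above.

```php
<?php
// Hypothetical predicate (not the plugin's actual code): treat a Mailgun event
// as a contact-level bounce only when it is a permanent failure whose reason
// is a hard bounce or an existing bounce suppression.
function isHardBounce(array $eventData): bool
{
    $event = $eventData['event'] ?? '';
    $severity = $eventData['severity'] ?? '';
    $reason = $eventData['reason'] ?? '';

    return $event === 'failed'
        && $severity === 'permanent'
        && in_array($reason, ['bounce', 'suppress-bounce'], true);
}
```

With this check, the logged example earlier in the thread (severity "permanent", reason "old") would no longer mark the contact as bounced.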

bossanova808 commented 3 years ago

I think that sounds great - hopefully it means what it looks like it means. Will be very happy to test it out!

bencroker commented 3 years ago

Added the condition and released in 1.17.3.

kehers commented 2 years ago

Just to add that there is also reason: old. This happens when an email cannot be delivered after 8 hours. It should still be treated as a non-permanent bounce though.

bencroker commented 2 years ago

@kehers so are you suggesting that a "permanent" bounce might not be considered a hard bounce in some cases? Can you please provide an example?

kehers commented 2 years ago

@kehers so are you suggesting that a "permanent" bounce might not be considered a hard bounce in some cases? Can you please provide an example?

Yes. (Context: I am the founder of a marketing automation solution and we use Mailgun as our delivery partner). Certain email servers, Yahoo especially, throttle deliveries when multiple inbound messages are detected from the same IP. When this happens, Mailgun sends a "temporary" severity bounce. Mailgun will continue to retry over a period of time. If it can't deliver after 8 hours, the email will permanently fail with severity: permanent and reason: old.

bencroker commented 2 years ago

Ok, it is unfortunate that this is undocumented, but thanks for bringing it to our attention.

bencroker commented 2 years ago

I just checked the code and we already check for one of those 2 reasons before marking a contact as bounced.

https://github.com/putyourlightson/craft-campaign/blob/413c179be3048a104cbd99461439f1e78b7f5b6c/src/controllers/WebhookController.php#L173-L179

bossanova808 commented 2 years ago

(I'm unfollowing here as we moved on to Campaign + Postmark - where we're not experiencing such issues).