hippware / wocky

Server side (Erlang/Elixir) software
13 stars 3 forks source link

SNS registration errors in log files #736

Closed toland closed 7 years ago

toland commented 7 years ago

We are seeing repeated registration errors for a handful of users. It looks like it is the same few users who regularly get the error. And it happens after clearing the registrations.

2017-06-01 19:21:02.384 [warning] <0.5584.0>@mod_wocky_notifications:make_error_response:72 Error on notification IQ request: {xmlel,<<"error">>,[{<<"code">>,<<"503">>},{<<"type">>,<<"cancel">>}],[{xmlel,<<"service-unavailable">>,[{<<"xmlns">>,<<"urn:ietf:params:xml:ns:xmpp-stanzas">>}],[]},{xmlel,<<"text">>,[{<<"xmlns">>,<<"urn:ietf:params:xml:ns:xmpp-stanzas">>}],[{xmlcdata,<<"service unavailable">>}]}]}
2017-06-01 19:21:02.384 [error] <0.5584.0>@Elixir.Wocky.PushNotifier.SNSNotifier:handle_aws_error/1:82 SNS API error (400): %{code: "InvalidParameter", detail: "", message: "Invalid parameter: Token Reason: Endpoint arn:aws:sns:us-east-1:773488857071:endpoint/APNS/tinyrobot_staging/ca5edb30-bd37-3bf6-8a4a-8734a51a1a58 already exists with the same Token, but different attributes.", request_id: "ea7b9093-7e7b-5efb-81cb-c1391525d908", type: "Sender"}
toland commented 7 years ago

That endpoint (arn:aws:sns:us-east-1:773488857071:endpoint/APNS/tinyrobot_staging/ca5edb30-bd37-3bf6-8a4a-8734a51a1a58) is for becky.

bengtan commented 7 years ago

This potentially overlaps with #694.

toland commented 7 years ago

I don't think this is related to #694. I thought it might be at first, but now I know what was causing that issue and the two are orthogonal.

Creating endpoints is an idempotent action in SNS. We should be able to create the same endpoint over and over again as long as we provide the same attributes. The error here indicates that we are not providing the same attributes on a subsequent create call. We only provide two attributes when creating an endpoint: the device ID and "user data". In this case the user data is the full JID of the user (for reference purposes - AWS doesn't use the user data for anything).

The JID must be the thing that is different, but I can't figure out how or why. I have added a log statement in PR #739 that should help to shed some light on the issue.

bengtan commented 7 years ago

There's an 'enabled' column in the SNS gui. Is there a corresponding 'enabled' attribute we can supply?

Maybe we weren't supplying so it just used the default value (whatever that is), but the error occurs when the 'enabled' attribute has changed to a non-default value?

[speculation]

toland commented 7 years ago

You cannot set Enabled when creating an endpoint. I assume the reasoning is that you would never need to create a disabled endpoint. The only attributes set at creation are the token and the user data. The token is the "primary key," so that isn't the thing that is changing.

I have added some logging that should help nail down this issue. It will log all registration attempts in notifications.log with the parameters sent. Without that information, we are just speculating.

toland commented 7 years ago

I did some testing in the console, and it appears that the only way to get this error during registration is to send a different string in the user_data field. Since we are simply including the JID in that field, that suggests that we may be seeing multiple resource names for the same device token (since the user ID and server should remain constant).

A simple fix would be to not send anything in the user_data field. I wonder if this would just cover up a problem rather than fix it?

I think that the SNS bookkeeping is becoming more trouble than it is worth. The ultimate solution may be to send the notifications ourselves and cut out SNS as a middleman.

bengtan commented 7 years ago

Since we are simply including the JID in that field, that suggests that we may be seeing multiple resource names for the same device token (since the user ID and server should remain constant).

Including the bare JID or the full JID? I presume full JID.

While the 'multiple resource names' bug did exist on Staging for a while, it never manifested (ie. in the device table). I was keeping close watch on it throughout it's life cycle there.

The evidence is not reconcilable at the moment.

A simple fix would be to not send anything in the user_data field. I wonder if this would just cover up a problem rather than fix it?

Happy to try it, but ... this would mean you'd have to clear out all the existing SNS entries? All the existing entries have a non-empty user_data. If the server changes to send empty user_data, that's definitely a different value for user_data.

Definitely needs testing on Staging first.

I think that the SNS bookkeeping is becoming more trouble than it is worth. The ultimate solution may be to send the notifications ourselves and cut out SNS as a middleman.

Happy to consider this too.

Or we could consider Google Firebase. It also allows sending push notifications to both Android and iOS.

toland commented 7 years ago

Including the bare JID or the full JID? I presume full JID.

Full JID.

The evidence is not reconcilable at the moment.

I agree. That is why I am stumped ;)

Happy to try it, but ... this would mean you'd have to clear out all the existing SNS entries?

Yes.

Definitely needs testing on Staging first.

Agreed.

Or we could consider Google Firebase.

I am not sure that we wouldn't just be trading one set of problems for another. Elixir doesn't have a great Firebase library, whereas it does have good APNS/GCM libraries (like Pushex). At this point, I am not sold on needing a middleman to manage notifications.

Happy to consider this too.

I think Pushex is the way to go. So far, most of our problems with push notifications have been related to bookkeeping in SNS. I am going to take a whack at replacing our SNS usage with Pushex.

bengtan commented 7 years ago

I am going to take a whack at replacing our SNS usage with Pushex.

Go ahead.

but ...

I am not sold on needing a middleman to manage notifications.

(Reminder: I think we're using SNS for sms notifications to yourself and Bernard too? )

bernardd commented 7 years ago

(Reminder: I think we're using SNS for sms notifications to yourself and Bernard too? )

We are, but other than both systems using SNS, they're entirely separate. Ditching SNS for the app notifications would have no effect on our use of it for system monitoring.

toland commented 7 years ago

This is made moot by #767