nomad-cli / houston

Apple Push Notifications; No Dirigible Required
http://nomad-cli.com
MIT License

Invalid tokens, IO.select and unsent Notifications #72

Closed scottburton11 closed 3 years ago

scottburton11 commented 10 years ago

We're evaluating houston after inheriting a project that heavily uses grocer.

One of the persistent problems the project had with grocer was that failed notifications caused persistent connections to drop subsequent notifications, silently.

Pardon me if I'm retreading old ground here, but for reference, see https://github.com/grocer/grocer/issues/14 for discussion of grocer and this topic. In a nutshell, the APNs architecture notifies you of failures on the push socket but not of successes, so you're required to block on the connection until either a failure notification appears or a timeout has elapsed. The APNs feedback service can be used to determine, after the fact, which device tokens were bad, but it will not indicate which messages failed, so re-sending subsequent failures is complex.

It looks like the changes around e71b87dadc6c90eef92b463020e2e991d27f0700 address this issue by implementing the IO.select solution mentioned in the above issue, and implemented in a few forks (https://github.com/ping4/grocer/blob/ping4/lib/grocer/ssl_connection.rb has an example).

I'm new to IO.select, but as far as I can tell, this:

    ssl = connection.ssl
    ...
    read_socket, write_socket = IO.select([ssl], [ssl], [ssl], nil)

will block until either read_socket is ready for reading or write_socket is ready for writing. So it's possible that IO.select will return before the APNS has pushed an error code, indicating a false positive. In informal testing, this seems to be exactly what happens:

    note = Houston::Notification.new(token: "8762d160 3dad3f1b ccd84855 7576bc66 784b9a61 oh who am I kidding I'm totally invalid", alert: "You'll never see me")

    connection.write note.message
    ssl = connection.ssl

    read_socket, write_socket = IO.select([ssl], [ssl], [ssl], nil)
    if read_socket && read_socket[0]
      response = connection.read(6).unpack("ccN")
      puts "Response: 2 #{response}"
    end

The error never appears: IO.select returned because write_socket became ready for writing, not because read_socket became ready for reading.

Without the write array (the second argument), IO.select will block (forever) until read_socket is ready with an error to read. The read does, however, reliably surface error messages for failed device tokens.

Dropping the write array and passing some small-ish value for timeout (the fourth argument) accomplishes both goals, at the expense of waiting up to timeout seconds before the next message can be sent:

    read_socket, write_socket = IO.select([ssl], [], [ssl], 0.1)

The timeout tax might work in certain situations, so I'm not willing to discard it.
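A minimal sketch of that read-side-only check, assuming `connection` responds to `#ssl` and `#read` the way Houston's connection does (the 0.1-second default timeout is an arbitrary choice):

```ruby
APNS_ERROR_BYTES = 6 # command, status, identifier

# Returns [command, status, identifier] if APNs has reported an error,
# or nil if nothing becomes readable before the timeout expires.
def check_for_apns_error(connection, timeout = 0.1)
  ssl = connection.ssl
  # Watch only the read end; including ssl in the write array makes
  # IO.select return immediately, producing the false positive above.
  readable, = IO.select([ssl], [], [ssl], timeout)
  return nil unless readable && readable.first

  data = connection.read(APNS_ERROR_BYTES)
  data && data.unpack("ccN")
end
```

The timeout only costs you when there is nothing to read; when an error frame is already queued, select returns immediately.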

That said, I'm already out of my depth. My limited research indicates that this is the accepted solution area, but grocer has mostly abandoned this idea, and nobody is happy about the blocking. Is there a solution to this that doesn't involve opening and closing connections repeatedly?

kaneda commented 10 years ago

:+1: to this. I ran into an issue where device tokens became stale, at which point, if I am sending multiple notifications simultaneously, all of the subsequent notifications are dropped silently due to the stale token(s).

micred commented 10 years ago

Same problem here! How should this situation be managed? For now I send notifications individually, as a temporary fix.

kaneda commented 10 years ago

@micred I did the same thing, which isn't desirable due to the additional overhead. I'd like a stale-token error returned so I can pass that error to the client to force a token refresh, or remove the sender from the list.

I also don't like that APNs itself simply stops subsequent execution; I wish it would finish the job and return the errors.

jchambers commented 9 years ago

Hi! I'm the maintainer of Pushy, a Java APNs library. I realize this is something of an old issue and the specifics of our solutions may not apply directly, but I wanted to share what we've done in this area.

There really isn't any combination of waiting or blocking or reading you can do to guarantee that a notification has been sent and not rejected. Even if you attempt a blocking read after a write, there's absolutely no guarantee as to when the APNs gateway will get around to sending a "bad push notification" message down the line. Getting no data when you attempt to read doesn't necessarily mean the notification wasn't rejected; it may just mean that it hasn't been rejected yet.

The only time we know anything about the state of any notification is when we get a definitive "bad push notification" response from the APNs gateway. At that point, we know that everything sent before the rejected notification was accepted (TCP guarantees that everything happens in order), one specific notification was rejected and should not be resent, and everything sent after the rejected notification was dropped by the gateway and needs to be re-sent.

In the end, the design of APNs means that there are some problems we simply cannot avoid. Notifications winding up in an unknown state after a connection dies is one of them.

But! We can definitely make a best effort, and that brings us back to the issue at hand. I think the best thing you can do here is to just continue writing notifications until you want to close your connection (how and when is up to you), but keep a local copy of everything you send. Occasionally, you'll find out that something you sent a while ago got rejected, but you'll be able to re-send all of the notifications that followed.

We don't have a perfect solution to this problem, but in Pushy, we take a number of small steps to keep things as robust as possible:

  1. We keep a buffer of all sent notifications in memory. If we've sent 30 notifications and notification number 17 gets rejected, we know we need to send notifications 18-30 again.
  2. We shut down connections by sending a deliberately-malformed notification to the APNs gateway. At that point, we know that prior notifications were successful. If we send a "poisoned" notification to shut down and get no response, we know the connection is (was?) dead, but what that means for other notifications is unclear. If we DO get a response, we know that everything else succeeded and we can clear our sent notification buffer.
  3. We provide a number of ways to periodically close connections (after a period of inactivity or after N notifications have been sent) with known-bad notifications to get into a known state frequently. This reduces the probability of notifications flying off into limbo, but doesn't eliminate the possibility.
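A rough Ruby sketch of the sent-notification buffer in step 1 (Pushy itself is Java; `SentBuffer` is a hypothetical name for illustration, not Pushy's actual API):

```ruby
# Remembers notifications by their APNs identifier so that, when
# identifier N is rejected, everything sent after N can be resent.
class SentBuffer
  def initialize
    @sent = [] # [identifier, notification] pairs, in send order
  end

  def record(identifier, notification)
    @sent << [identifier, notification]
  end

  # APNs rejected `identifier`: everything sent after it was silently
  # dropped and must be resent. Returns those notifications and empties
  # the buffer so the caller can re-record them on a fresh connection.
  def resend_after(identifier)
    index = @sent.index { |id, _| id == identifier }
    return [] unless index
    to_resend = @sent[(index + 1)..].map { |_, n| n }
    @sent = []
    to_resend
  end

  # Call after a confirmed-clean shutdown (step 2): everything succeeded.
  def clear
    @sent = []
  end
end
```

TCP's ordering guarantee is what makes this sound: the rejected identifier cleanly partitions the buffer into accepted, rejected, and dropped.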

Does that make sense?

svewag commented 9 years ago

I also had an issue with invalid tokens and found a solution for that.

Just like Apple said:

If you send a notification that is accepted by APNs, nothing is returned. If you send a notification that is malformed or otherwise unintelligible, APNs returns an error-response packet and closes the connection. Any notifications that you sent after the malformed notification using the same connection are discarded, and must be resent.

https://developer.apple.com/library/ios/documentation/NetworkingInternet/Conceptual/RemoteNotificationsPG/Chapters/CommunicatingWIthAPS.html#//apple_ref/doc/uid/TP40008194-CH101-SW1

So after digging in the code, I found that houston already checks for an error after writing to the socket (see https://github.com/nomad/houston/blob/master/lib/houston/client.rb#L53-L60).

As you can see, the notification is marked as unsent when an error occurs. That's the point where you have to reopen your connection to be able to proceed with your batch and send the next notifications.

Does that make sense?

UPDATE: I noticed that the check of the read_socket is broken. If you skip this check and continue with connection.read(6) the error handling works. https://github.com/nomad/houston/blob/master/lib/houston/client.rb#L53-L54
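For anyone hitting this, the 6 bytes you get back from connection.read(6) decode as below. A sketch, with status-code names taken from Apple's binary-interface documentation:

```ruby
# Status codes for the legacy binary protocol's error-response packet,
# per Apple's documentation linked above.
APNS_STATUS_CODES = {
  0   => "No errors encountered",
  1   => "Processing error",
  2   => "Missing device token",
  3   => "Missing topic",
  4   => "Missing payload",
  5   => "Invalid token size",
  6   => "Invalid topic size",
  7   => "Invalid payload size",
  8   => "Invalid token",
  10  => "Shutdown",
  255 => "None (unknown)"
}.freeze

# The frame is 6 bytes: command (always 8), status, notification identifier.
def decode_apns_error(bytes)
  command, status, identifier = bytes.unpack("ccN")
  raise "unexpected command #{command}" unless command == 8
  { status: status,
    reason: APNS_STATUS_CODES.fetch(status, "Unrecognized status"),
    identifier: identifier }
end
```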

guyisra commented 9 years ago

@svewag you tried it with a fork? Is the error returned immediately or eventually?

svewag commented 9 years ago

No, I didn't fork it.

In my project we've had good experience so far with the workaround of checking, after every 20th sent push notification, whether there is an error on the read socket. So after the 20th sent notification, we wait 1 second and then read with connection.read(6) to obtain a possible error message from the service. You also get the identifier of the failed notification.

Because we are iterating through an array of notifications to be sent, we can determine by the index of the failed notification which notifications were sent successfully and which were not. Your next step is then to reconnect to APNs and resend the notifications in the array after the failed one, and so on.
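The index-as-identifier bookkeeping could be sketched like this. Everything here is a stand-in for illustration: `connection` is assumed to have a Houston-like `#write` plus a hypothetical `#reconnect`, `check_error` is assumed to return the failed identifier (or nil), and `note.id` assumes the notification exposes an identifier setter:

```ruby
CHECK_EVERY = 20 # the checkpoint interval described above

# Sends notifications in order, using each one's array index as its
# APNs identifier. At every checkpoint (and after the last message),
# asks check_error whether APNs rejected anything; on failure it
# reconnects and resumes with the first dropped notification.
def send_with_checkpoints(notifications, connection, check_error)
  i = 0
  while i < notifications.length
    note = notifications[i]
    note.id = i # array index doubles as the APNs identifier
    connection.write(note.message)
    if (i + 1) % CHECK_EVERY == 0 || i == notifications.length - 1
      failed = check_error.call(connection)
      if failed
        connection = connection.reconnect # hypothetical reopen
        i = failed # next loop iteration resends failed + 1 onward
      end
    end
    i += 1
  end
end
```

The error check itself is elided here; in practice it would be the wait-then-connection.read(6) step described above.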

kbrock commented 9 years ago

Please be aware that select doesn't always detect an issue; you really need to write to the socket to fully detect one.

Alternatively, pause and resend: pausing a single second (or half of one) seemed to catch it most of the time, it just caused too much of a lag for us. If I remember correctly, adding 1 second every 20 messages makes the batch take double or triple the time. Some other people would send a hundred messages (or run to the end of the message list), then intentionally send a bad message and read the response. If the error code was for the deliberately bad message, they knew definitively that the batch was a success.
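That "intentionally bad message" confirmation could be sketched like so; `poison_note` is assumed to carry a known-invalid token, the sentinel identifier is an arbitrary choice, and `connection` stands in for a Houston-like socket wrapper:

```ruby
POISON_ID = 0xFFFFFF # sentinel identifier reserved for the poison message

# Writes a deliberately bad notification after the real batch, then
# blocks until APNs replies. If the error frame names the poison's own
# identifier, every notification before it was accepted; any other
# identifier names a real failure in the batch.
def batch_confirmed?(connection, poison_note)
  poison_note.id = POISON_ID
  connection.write(poison_note.message)
  data = connection.read(6) # blocks until APNs replies or the socket closes
  return false if data.nil? # connection died: batch state unknown
  _command, _status, identifier = data.unpack("ccN")
  identifier == POISON_ID
end
```

The trade-off is one guaranteed connection teardown per batch, since APNs closes the connection after any error response.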

Best of luck.