Open TysonAndre opened 4 years ago
I have the final draft here https://misha.nasledov.com/uniqush.html
From a quick glance, it looks the same
The only change I could suggest is APNS -> APNs, per Apple's own documentation. I could probably link to that. The image link (singular) at https://misha.nasledov.com/uniqush.html is broken for me, though, because the blog that hosted it no longer exists.
<p class="block-img"><img src="https://d3gqbr1mr54afg.cloudfront.net/ifwe/0d1608dff6d5caf7dcd7bb4b44c45fc171a3d030_screen-shot-2016-04-27-at-5.49.52-pm.png" alt="" width="669" height="501" /></p>
I have an old draft (version 2) from @mishan. The blog that the post originally linked to no longer exists. The draft below was incompletely converted from ODT to markdown with pandoc.
How We Solved Push Notifications at if(we)
By Misha Nasledov (@mishan)
The Early Days of Push
In the early days of mobile app development, back when the first versions of the Tagged mobile application were being developed, a very scrappy mobile push notification system was put together. The original push code was written in PHP without using any sort of library. It supported GCM (Google Cloud Messaging), C2DM (prior to GCM’s existence), and APNs (Apple Push Notification Service). We had a very lame subscriber database -- only the most recently used device would receive push notifications. We did not handle every exception from the service properly, such as an unsubscribe after uninstalling the app, or follow certain best practices. In particular, APNs is a bit more involved to use as it requires calling a feedback service to get the result of the push.*1
*1 = Latest APNs HTTP/2 spec obviates this.
The Search For A Better Way
We looked at various solutions as we wanted to revamp our push notification system in order to get more out of it. We decided the best place to start was to actually improve the push notification engine and the interface to it. Not being particularly married to our in-house GCM and APNs push code, we looked at various, alternative, off-the-shelf solutions in lieu of trying to improve the old system.
We wanted a system that would let us better abstract away different push service provider APIs. The ability to push to more than one device per user was also something we desired. The PHP push code gave us enough trouble with the lack of persistent sockets -- there was already a lot of opening and closing of connections with APNs, and sending more notifications per person meant even more connection churn. The new system needed to use sockets efficiently, and handle errors more gracefully.
We didn’t particularly want to go with a vendor for sending push notifications. When dealing with users’ personal data, having one less party involved keeps our users’ information safer. A third-party service would not have offered us the same tightly integrated control and flexibility. We also had plenty of spare servers to run a service.
Uniqush
While researching potential solutions, we discovered an open source project known as Uniqush. It had some users, and the source looked simple enough that we could give it a shot and work with it. The only dependency was a persistent Redis server, which we had already set up for another, unrelated project before considering Uniqush. It’s noteworthy that the project was structured so that one could write a different database module for an RDBMS such as MySQL or PostgreSQL, but currently only Redis is supported.
In a nutshell, Uniqush keeps information about “push service providers” (PSPs). PSPs are the push notification endpoints (e.g., our Tagged mobile application GCM endpoint). Service names identify sets of one or more PSPs, each PSP having a unique push service type. We make these one to one so we can work on pushes for apps independently. Uniqush also keeps information about subscribers, which are associated with sets of one or more devices registered with a service endpoint, in Redis. In order for this to be useful, one has to set up their Redis server for persistence. So long as all of one’s data can be kept in RAM, persistence is pretty easy. We have millions of mobile users, many of whom have more than one device, and our subscriber database is (relatively speaking) pretty small -- about 15GB. This also solved the shortcomings of our subscriber database without us having to write a new way to store subscribers.
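To make the relationship between services, PSPs, subscribers, and devices concrete, here is a minimal in-memory sketch in Python. It is not Uniqush's actual storage schema or API; the class names, method names, and the `tagged-android` service name are all hypothetical, and a plain dictionary stands in for Redis.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class PushServiceProvider:
    """One push endpoint, e.g. the GCM endpoint for a single app (hypothetical type)."""
    service: str    # service name the PSP belongs to
    push_type: str  # e.g. "gcm", "apns", or "adm"


@dataclass
class SubscriberRegistry:
    """In-memory stand-in for the subscriber data that Uniqush keeps in Redis."""
    # service name -> {push type -> PSP}; each push type appears at most once per service
    providers: dict = field(default_factory=dict)
    # (service, subscriber) -> set of (push type, device token)
    devices: dict = field(default_factory=dict)

    def add_provider(self, psp):
        by_type = self.providers.setdefault(psp.service, {})
        if psp.push_type in by_type:
            raise ValueError(f"{psp.service} already has a {psp.push_type} PSP")
        by_type[psp.push_type] = psp

    def subscribe(self, service, subscriber, push_type, token):
        if push_type not in self.providers.get(service, {}):
            raise KeyError(f"no {push_type} PSP registered for {service}")
        self.devices.setdefault((service, subscriber), set()).add((push_type, token))

    def unsubscribe(self, service, subscriber, push_type, token):
        self.devices.get((service, subscriber), set()).discard((push_type, token))

    def devices_for(self, service, subscriber):
        # Sorted so that callers get a stable order.
        return sorted(self.devices.get((service, subscriber), set()))


registry = SubscriberRegistry()
registry.add_provider(PushServiceProvider("tagged-android", "gcm"))
# One subscriber with two devices: both receive pushes.
registry.subscribe("tagged-android", "user:42", "gcm", "token-phone")
registry.subscribe("tagged-android", "user:42", "gcm", "token-tablet")
```

Keying devices by subscriber is what allows a push to reach every device a user has, rather than only the most recently used one.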
Uniqush supports the services we use (GCM and APNs) as well as ADM (Amazon Device Messaging). The one shortcoming the project had was that it did not support passing JSON payloads directly through, but instead constructed the payloads from passed-in parameters. This was an issue as we pass custom push notification payloads to our clients that contain data about alert counters and, for Android, a profile picture URL. Changing the way the client processes the notifications would break older versions of clients. We ended up changing the code that constructs the payloads and created a way to pass raw JSON payloads (intended for a specific device type) directly to Uniqush.
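The difference between constructing a payload from parameters and passing a raw payload through can be sketched as follows. This is an illustration only, not the code that went into Uniqush; the function names and the `alert_count` and `photo_url` fields are hypothetical.

```python
import json


def build_payload(params):
    """Original style: construct the payload from flat key/value parameters."""
    return {"data": {str(k): str(v) for k, v in params.items()}}


def resolve_payload(params, raw_payloads, push_type):
    """Use the raw JSON payload for this device type if given, else build one."""
    raw = raw_payloads.get(push_type)
    if raw is None:
        return build_payload(params)
    payload = json.loads(raw)  # reject malformed JSON before it is sent anywhere
    if not isinstance(payload, dict):
        raise ValueError("raw payload must be a JSON object")
    return payload


# A raw payload is supplied for GCM only, so APNs falls back to the built payload.
raw = {"gcm": json.dumps({"data": {"alert_count": 3,
                                   "photo_url": "https://example.com/p.jpg"}})}
android = resolve_payload({"msg": "hi"}, raw, "gcm")
ios = resolve_payload({"msg": "hi"}, raw, "apns")
```

Because the raw payload is passed through unchanged, existing clients keep receiving the structure they already know how to parse.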
Giving Uniqush a Shot
About a year ago we first put Uniqush on a couple of VMs on production and changed our PHP push code to try sending through Uniqush when an experiment was enabled. If the service call to Uniqush failed, it would fall through to the old implementation, just in case. We first tried using Uniqush to send our GCM push notifications and it ended up working mostly without trouble, sending about 250 push notifications per second. There were a couple of small bugs that became evident once Uniqush was running at production load, but they were easily fixed.
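The experiment-gated rollout with a fallback to the old code path follows a common pattern, sketched here in Python rather than the PHP we actually used. The exception type, the function names, and the way the experiment flag is passed in are all assumptions made for the example.

```python
class UniqushCallFailed(Exception):
    """Raised by the (hypothetical) Uniqush client when the service call fails."""


def send_push(notification, experiment_enabled, send_via_uniqush, send_legacy):
    """Return the name of the path that delivered the notification."""
    if experiment_enabled:
        try:
            send_via_uniqush(notification)
            return "uniqush"
        except UniqushCallFailed:
            pass  # fall through to the old implementation, just in case
    send_legacy(notification)
    return "legacy"


def failing_uniqush(notification):
    """Simulates a Uniqush outage."""
    raise UniqushCallFailed("connection refused")


delivered = []  # stands in for the legacy sender
path = send_push({"msg": "hi"}, True, failing_uniqush, delivered.append)
```

Here the simulated outage means the notification is still delivered, just through the legacy sender.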
APNs proved to be a bit trickier. The protocol is more complex: it requires asynchronous writes and reads on TCP sockets and tracking of a 32-bit identifier, and Apple will close the socket immediately instead of giving an error code when a push fails. Uniqush’s APNs module turned out to not have a very reliable implementation and unfortunately fell over at production load. However, due to the pros of Uniqush, success with GCM, and overall simplicity of the code, we kept investing in the project. We rewrote the APNs module to use a worker pool implementation that didn’t have the race conditions of the existing implementation.
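The general shape of a worker pool can be sketched as below. This is not the rewritten Uniqush module (which is written in Go); it is a simplified Python illustration under the assumption that each worker owns exactly one connection, so no two workers ever touch the same socket. `FakeConnection` and `run_pool` are hypothetical names, and no real APNs traffic is involved.

```python
import itertools
import queue
import threading


class FakeConnection:
    """Stand-in for one persistent APNs socket; rejects tokens starting with 'bad'."""

    def send(self, identifier, token, payload):
        return not token.startswith("bad")


def run_pool(jobs, num_workers=3, connect=FakeConnection):
    """Send every (token, payload) job and return {identifier: (token, ok)}."""
    pending = queue.Queue()
    results = {}
    results_lock = threading.Lock()

    def worker():
        conn = connect()  # each worker owns its connection exclusively
        while True:
            job = pending.get()
            if job is None:  # sentinel: no more work
                return
            identifier, token, payload = job
            ok = conn.send(identifier, token, payload)
            with results_lock:  # the results dict is the only shared state
                results[identifier] = (token, ok)

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for thread in threads:
        thread.start()
    counter = itertools.count(1)
    for token, payload in jobs:
        # Tag each notification with a 32-bit identifier so a failure can be
        # matched back to the notification that caused it.
        pending.put((next(counter) & 0xFFFFFFFF, token, payload))
    for _ in threads:
        pending.put(None)
    for thread in threads:
        thread.join()
    return results


results = run_pool([("token-a", b"{}"), ("bad-token", b"{}"), ("token-b", b"{}")])
```

Identifiers are assigned before jobs are queued, so the mapping from identifier to notification does not depend on which worker picks up which job.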
Scaling
Currently we use Uniqush to send all of the mobile application GCM and APNs push notifications for Tagged and hi5 at if(we). That’s about 400-500 notifications per second. Because it’s a standalone service that has no internal knowledge of the Tagged application or any other business logic, we can easily use it for other apps we develop.
To reach this scale, we have four 4-core 4GB hosts running the uniqush-push instances. We currently run three Uniqush processes per VM, though, in reality, the tier is a bit over-provisioned to handle growth and any surge of activity. The Uniqush instances actually end up taking a lot more queries than just the 400-500 notifications per second. We query the Uniqush subscriber database before sending a push notification so that we can make more intelligent decisions about whether to push to a subscriber. The mobile clients, in aggregate, send about another 500 subscriptions per second. Overall, the tier is handling something around 1500 queries per second.
All of these queries end up hitting Redis to obtain, modify, and/or add subscriber information. Before embarking on this project, we had already built a large, general purpose persistent Redis “cluster.” It is not actually a Redis Cluster but, rather, it is a cluster of Redis shards with consistent hashing. Uniqush uses our fork of Twitter’s twemproxy in order to be able to utilize the cluster. Our fork contains a yet-to-be-merged patch by @andyqzb to add Redis Sentinel support so that failovers can be handled properly. We have two 32-core 256GB hosts to run the Redis master and slave shards.
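For readers unfamiliar with consistent hashing, the idea can be sketched in a few lines of Python. This is a simplified hash ring, not twemproxy's actual implementation; the `HashRing` class, the choice of SHA-256, the number of points per shard, and the `redis-N` shard names are assumptions made for the example.

```python
import bisect
import hashlib


class HashRing:
    """Maps keys to shards so that adding a shard moves only a fraction of keys."""

    def __init__(self, shards, points_per_shard=100):
        # Place several points per shard on the ring to even out the distribution.
        self._ring = sorted(
            (self._hash(f"{shard}#{i}"), shard)
            for shard in shards
            for i in range(points_per_shard)
        )
        self._points = [point for point, _ in self._ring]

    @staticmethod
    def _hash(value):
        return int.from_bytes(hashlib.sha256(value.encode()).digest()[:8], "big")

    def shard_for(self, key):
        # The first ring point after the key's hash owns the key (wrapping around).
        index = bisect.bisect(self._points, self._hash(key)) % len(self._ring)
        return self._ring[index][1]


keys = [f"subscriber:{n}" for n in range(1000)]
before = HashRing(["redis-1", "redis-2", "redis-3"])
after = HashRing(["redis-1", "redis-2", "redis-3", "redis-4"])
moved = [k for k in keys if before.shard_for(k) != after.shard_for(k)]
```

Every key in `moved` lands on the new shard; keys never shuffle between the existing shards, which is what keeps resharding cheap compared with hashing modulo the shard count.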
What’s Next?
We’ve contributed fixes and improvements we’ve made to the Uniqush project back upstream and continue to make improvements and contributions to the project. The ability to store other data with individual subscriber devices, such as client versions and subscription dates, has been developed but hasn’t been pushed back upstream yet, as we haven’t really started using these attributes ourselves. It will allow for much more intelligent application logic -- for instance, we could send some kind of new push notification only to the devices of subscribers with the latest application version on their device. Our fork, which may have experimental features under development that have not been pushed upstream yet, is located at http://github.com/ifwe/uniqush-push
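As a sketch of the kind of application logic per-device attributes would enable, the following Python snippet selects only the devices running at least a given client version. The `client_version` attribute name, the record layout, and the helper functions are hypothetical and do not reflect how our fork stores these attributes.

```python
def parse_version(text):
    """Turn '5.10.0' into (5, 10, 0) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def devices_on_latest(devices, latest):
    """Return the tokens of devices whose client version is at least `latest`."""
    target = parse_version(latest)
    return [
        device["token"]
        for device in devices
        # Devices with no recorded version are treated as being on version 0.
        if parse_version(device.get("client_version", "0")) >= target
    ]


devices = [
    {"token": "phone", "client_version": "5.10.0"},
    {"token": "tablet", "client_version": "5.9.2"},
    {"token": "old-phone"},  # subscribed before versions were recorded
]
eligible = devices_on_latest(devices, "5.10.0")
```

Versions are compared as tuples of integers because a plain string comparison would rank "5.9.2" above "5.10.0".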
Uniqush has been a resounding success at if(we). A few months ago we finally ripped out the old push notification code from our PHP (web) codebase. Uniqush was sending all of our APNs and GCM push notifications at full production load without issue. It made everything much simpler. The concern of implementing and maintaining the APNs and GCM implementations is gone. All our PHP code has to do now is deal with constructing push notifications (more specifically, the content of the notifications and any application-specific log) and relaying them to Uniqush as well as telling Uniqush to subscribe and unsubscribe devices of users. Uniqush takes care of maintaining the subscriber database, handling errors / exceptions, and actually sending the push notifications to Apple and Google’s servers. This ability to operate at a more abstract level has made it easy for us to then focus on things like creating an A/B experiment framework for push notification content and scheduling, smarter push notification scheduling, and more intelligent device routing for push notifications.
Acknowledgments
Thank you Nan Deng (@monnand) for creating Uniqush! It ended up working quite well at if(we). And a big shout-out to our colleague Tyson Andre (@TysonAndre) for making and driving many improvements to Uniqush.