mozilla-services / autopush

Python Web Push Server used by Mozilla
https://autopush.readthedocs.io/
Mozilla Public License 2.0

Is it possible to use consistent hashing for finding the CEP (communication end point)? #1391

Closed GingerMoon closed 4 years ago

GingerMoon commented 4 years ago

Now, every time we push a message, we need to query the DB to know which CEP (Communication End Point) to use to push the message to the device. Please correct me if my understanding is wrong.

I am wondering whether we can use UAID partitioning and consistent hashing so that we don't need to query the DB every time we push a message. For example, when a device's CEP is down, we can use the next CEP in the hash ring.

Any insights would be much appreciated!

jrconlin commented 4 years ago

We built the push system around two basic ideas: 1) we wanted to convey the least amount of info we could about the remote device, and 2) we needed to be able to route the message to the remote device.

Since the servers handling the websocket connections could only handle several hundred thousand connections each, we had to have a "routing system" to be able to find which node a given device was connected to. Since that "route" could be created or expire at any time, it didn't make sense to send the routing info to all the other nodes; instead we stored it in one central database.

The steps we imagined were:

If a node ever fails to deliver to a UAID (ping timeout, websocket failure, etc.), it clears the nodeID from the routing table info.

We also have to handle "bridged" mobile devices, which have their own set of data (e.g. APNs, FCM, and ADM each use a specific device token, which is similar to our UAID). This data would also need to be stored in some common database and looked up for message delivery.

I suppose you could create a hashed UAID component that might route a message to a node, but I'm not sure that would work the way you want. If a user disconnects and reconnects to a different node, the UAID would change and all their previous endpoints would become invalid (unless you do some masking or something). Also, it might be possible for a large client to determine information about the recipient (e.g. because the UAID starts with "1", this user is on Android using FCM*).

Since we needed so much shared data that was tied to a unique identifier, we decided to use that as a key for a database lookup.
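For illustration, a minimal sketch of that lookup pattern, with a plain dict standing in for the central database (all names here are hypothetical, not autopush's actual schema):

```python
# Hypothetical sketch of the central routing table described above: it maps
# a UAID to the node currently holding that device's websocket connection.
# A dict stands in for the real database; names are illustrative only.
routing_table = {}

def register_connection(uaid: str, node_id: str) -> None:
    """Record which connection node a device is attached to."""
    routing_table[uaid] = node_id

def clear_node(uaid: str) -> None:
    """Delivery failed (ping timeout, websocket failure, ...): drop the route."""
    routing_table.pop(uaid, None)

def route_message(uaid: str, message: str) -> None:
    """Every push requires a lookup to find the node to forward to."""
    node_id = routing_table.get(uaid)
    if node_id is None:
        print(f"{uaid}: not connected, store message for later")
    else:
        print(f"{uaid}: forward {message!r} to {node_id}")

register_connection("uaid-0001", "node-A123")
route_message("uaid-0001", "hello")     # forwarded to node-A123
clear_node("uaid-0001")                 # e.g. ping timeout
route_message("uaid-0001", "hello")     # stored until the device reconnects
```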

There were also a few other strong reasons why we wanted to make sure that individual device data was difficult to tie back to a given user. In short: Privacy scales remarkably well, particularly around potential costs.

Does this info help you?

[* Yeah, I know. This is just for simplicity. I have seen folks with lots of data become super clever about breaking hashes, though, no matter how complex the MAC, so just be aware that it may be possible]

GingerMoon commented 4 years ago

Thanks @jrconlin for sharing the valuable insights.

> Since the servers handling the websocket connections could only handle several hundred thousand connections each, we had to have a "routing system" to be able to find which node a given device was connected to. Since that "route" could be created or expire at any time, it didn't make sense to send the routing info to all the other nodes; instead we stored it in one central database.

The "routing system" is the topic I'd like to discuss to see if we can make any further improvement. If both assigining a device to a CEP and finding a CEP to push the message can use the say way/rule then we don't need to store/query the information of <deviceID, cepID> in the database. An naive way is, say we have 16 CEPs, we use the hash function of mod 16. Then we don't need to store/query the information "<device1, cep1>...<device17, cep1>" in the database.

Then what happens if cep1 is down? We can use "consistent hashing" instead of the simple "mod 16". Now we have a "hash ring", and on the ring cep2 is next to cep1, and cep1 is next to cep16. If cep1 is down, then device1 will be mapped to the next cep on the hash ring, which is cep2. Later, if cep1 comes back and starts working again, cep2 will terminate its connection to device1 and device1 will be bound to cep1 again. It's similar in the case of adding/removing CEPs.

It's similar to what Cassandra does: https://docs.datastax.com/en/ddac/doc/datastax_enterprise/dbArch/archDataDistributeHashing.html?hl=consisting%20hashing
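A minimal sketch of the trade-off (illustrative only; the hash choice and device IDs are arbitrary): with plain mod-N, losing one CEP and re-modding by 15 reassigns nearly every device, whereas consistent hashing would only move the devices that were on the failed CEP.

```python
# Illustration of why plain mod-N hurts when the CEP count changes:
# going from mod 16 to mod 15 reassigns almost every device, while a
# consistent hash ring would only move the ~1/16 of devices on the lost CEP.
import hashlib

def cep_for(device_id: str, num_ceps: int) -> int:
    digest = hashlib.md5(device_id.encode()).hexdigest()
    return int(digest, 16) % num_ceps

devices = [f"device{i}" for i in range(10_000)]
moved = sum(cep_for(d, 16) != cep_for(d, 15) for d in devices)
print(f"{moved / len(devices):.0%} of devices change CEP")  # roughly 94%
```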

jrconlin commented 4 years ago

Perhaps we're both thinking about this the wrong way...

The UAID and CHID are both values that are internal to Autopush. The UAID is merely a way to uniquely identify the recipient device, with the CHID being a way to uniquely identify the recipient web app. Both of those values are encoded into the CEP, which is opaque to anything outside of Autopush. The UAID and CEP are both generated by the Push server (autopush).

So, why not include your hash into the CEP data set?

We already do something similar with "signed" endpoints.

Basically, alter the endpoint generation code to include the UAID, CHID, and your device hash, then send that through the endpoint encryptor. When you decrypt the endpoint you'd get all the values back, so you can route on the hash that way.
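A rough sketch of that idea, with Fernet from the `cryptography` package standing in for the endpoint encryptor (the field layout below is invented for illustration and is not autopush's actual endpoint format):

```python
# Sketch: fold a routing hash into the opaque endpoint token alongside the
# UAID and CHID. Fernet stands in for autopush's endpoint encryptor; the
# byte layout here is invented for illustration.
import uuid
from cryptography.fernet import Fernet

fernet = Fernet(Fernet.generate_key())   # in practice a fixed server-side key

def make_endpoint_token(uaid: uuid.UUID, chid: uuid.UUID, node_hash: int) -> str:
    payload = uaid.bytes + chid.bytes + node_hash.to_bytes(2, "big")
    return fernet.encrypt(payload).decode()

def parse_endpoint_token(token: str):
    payload = fernet.decrypt(token.encode())
    uaid = uuid.UUID(bytes=payload[:16])
    chid = uuid.UUID(bytes=payload[16:32])
    node_hash = int.from_bytes(payload[32:], "big")
    return uaid, chid, node_hash

token = make_endpoint_token(uuid.uuid4(), uuid.uuid4(), node_hash=7)
print(parse_endpoint_token(token))       # the token itself stays opaque to callers
```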

One note of caution: Websockets do NOT handle redirects well, at all. They're not really HTTP calls so unless a connection is accepted, it's dropped by the client. I'm not sure how you're doing load management or TLS termination, but you may need to keep that in mind. Our load balancers are very simple and don't do any clever connection forwarding, so we have zero control over which machine a client connects to. If that's something you have control over, awesome!

GingerMoon commented 4 years ago

Thanks @jrconlin for replying to me again. Sorry if I didn't express it clearly.

Basically, I want to find a way to eliminate the DB query that is needed to find the server which pushes the message to the target device.

Currently, every device has an ID which is generated in some way (as you mentioned). I assume that way doesn't conflict with what I am proposing. (If it does, I believe we can find some other way to fix it.)

Say we have 16 servers which keep the long-lived connections (websockets) to their assigned devices in order to push messages. We can use consistent hashing to map [deviceID, serverID]. Here is a quick introduction: link. So we will have a hash ring, which is actually a "space" containing all the possible hashes of all the deviceIDs. The 16 servers also "sit on" the ring, which means every server is responsible for some part of the "space". For example, on the "ring", server1 is responsible for device1. So we don't need to store/query this information in the DB, because the related modules already know device1 is mapped to server1. The related modules are: the "assigning module", which is responsible for telling devices which push server to connect to, and the "routing module", which is responsible for routing the message to the correct push server, which maintains the long-lived connection (websocket) to the target device.

If server1 is down, then server2 (which is next to server1 on the "hash ring") will be responsible for device1. The process is: device1 finds server1 is down, so device1 asks the "assigning module" for another push server. The "assigning module" then finds that server1 is down and has been removed from the "hash ring", and returns server2 (which was next to server1 on the hash ring) to device1. Device1 then maintains a connection to server2.

At this time, if there is a message that needs to be pushed to device1, the "routing module" finds that server2 on the hash ring is responsible for device1, so the message is routed to server2. Server2 then pushes the message to device1.

There can be a problem because of the small time window while device1 switches from server1 to server2. I believe it can be handled if this case has already been taken into consideration in the current push design.

Later on, if server1 comes back, we will put it back on the hash ring and tell server2 to terminate its connection (websocket) to device1. Device1 finds its connection to server2 is closed, so it will turn to the "assigning module" for help again. The "assigning module" will return server1 to device1.

Consistent hashing can distribute devices across push servers evenly. We can even use the concept of "virtual push servers".
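A self-contained sketch of what such a ring could look like, including the "virtual push servers" idea (every name is illustrative; this is not a proposed patch to autopush). Both the "assigning module" and the "routing module" could run this same code and agree on the device-to-server mapping without a database lookup:

```python
# Sketch of the proposed hash ring with virtual nodes: each real server
# occupies several positions on the ring so devices spread evenly, and
# removing a server only remaps the devices that were on it.
import bisect
import hashlib

def _position(key: str) -> int:
    return int(hashlib.sha1(key.encode()).hexdigest(), 16)

class ConsistentHashRing:
    def __init__(self, servers, vnodes: int = 100):
        self.vnodes = vnodes
        self._ring = []                       # sorted (position, server) pairs
        for server in servers:
            self.add_server(server)

    def add_server(self, server: str) -> None:
        for i in range(self.vnodes):          # "virtual push servers"
            bisect.insort(self._ring, (_position(f"{server}#{i}"), server))

    def remove_server(self, server: str) -> None:
        self._ring = [e for e in self._ring if e[1] != server]

    def server_for(self, device_id: str) -> str:
        idx = bisect.bisect(self._ring, (_position(device_id), "")) % len(self._ring)
        return self._ring[idx][1]

ring = ConsistentHashRing([f"server{i}" for i in range(1, 17)])
home = ring.server_for("device1")      # assigning module: where device1 connects
ring.remove_server(home)               # home server goes down...
backup = ring.server_for("device1")    # ...device1 falls over to the next server
ring.add_server(home)                  # server comes back...
assert ring.server_for("device1") == home   # ...and device1 maps home again
```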

Do you think this proposal is going to work? Any insights would be much appreciated!

jrconlin commented 4 years ago

I apologize if this is frustrating for you, but I believe we both may be saying the same things.

First off, a summary: Yes, I think that might work for you, but there may be some additional costs you'll have to consider.

The long version:

I understand what you are proposing and agree that it should be possible to route incoming messages using a consistent hash. You could generate the routing hash locally using the UAID as a purely random seed and that should be good enough to ensure that all machines are balanced. So, from that point I don't think you'll have a great deal of difficulty modifying code to fit your desired design.

My concerns are more "larger scale" and may be irrelevant to your case. When we tested, we found that each server could maintain about 200,000 connections reliably (with enough CPU and memory to handle any unexpected surges, sudden machine losses, socket handles for system and incoming data, etc.). Another concern was the fact that Websocket connections could not be re-routed after the initial connection request. (Well, they could, but it would require substantial changes to the client side and protocols, and potentially exposing "raw" machines rather than ones behind TLS-terminating ELBs, which could reduce the number of possible connections even more, since handling TLS can be memory and CPU intensive.)

Each machine would still need to handle incoming messages from the endpoints, which will probably be a different set of machines for security reasons (You don't want to give potential attackers access to the unprotected Websocket connection if you can avoid doing that). You could use the same consistent hash technique to send the message to the "dispatch" machine.

I'm sure you've thought about this but allow me the following exercise. 😉

  1. [UAID 0001] = requests connection from Load Balancer and gets => [Node A123]
  2. [Node A123] does a consistent hash on "0001" and determines the dispatch node is [Node B456]
  3. [Node B456] passes whatever pending messages it has for [UAID 0001] for relay via [Node A123], and does all the other webpush protocol stuff it needs to.
  4. [Node A123] later determines that [UAID 0001] is no longer connected (either directly or indirectly via failed Ping response)
  5. [Node A123] notifies dispatch [Node B456] that [UAID 0001] is disconnected, and I presume returns a NAK for any pending relay messages and indicates that it is no longer a relay for [UAID 0001].

That should work, although I'd be concerned about the amount of cross node traffic that would be generated. It's a good number of moving parts, and as my Mom used to tell me, "Every extra part is something that will break". Still, if you expect load to be fairly low, there should be plenty of CPU and socket handles available.

The thought behind including the extra hash in the encrypted CEP data would be to simplify the consistent hashing. You could use a smaller random seed than 128 bits, which might be useful if you have a small set of nodes. It's not much of a savings, to be honest, but every bit can count sometimes. It also points out another tool in your workbox, so to speak, just in case you find it useful.

GingerMoon commented 4 years ago

Thanks a lot @jrconlin for your kindness! Sorry that I don't quite understand the "exercise". Below is my original thought, which is not the same as the example in the "exercise":

> [Node B456] passes whatever pending messages it has for [UAID 0001] for relay via [Node A123], and does all the other webpush protocol stuff it needs to.

[Node A123] is the "assigning module" I mentioned before. [Node B456] is the server which keeps the connection with [UAID 0001] and pushes messages directly, without [Node A123].

> [Node A123] later determines that [UAID 0001] is no longer connected (either directly or indirectly via failed Ping response)

The "assigning module" [Node A123] can never determines that [UAID 0001] is no longer connected.

Below is my understanding of the current push design (just in case my understanding is wrong):

The process of DeviceA creating a connection to push ServerA:

  1. DeviceA turns to the "Assigning module" asking for a push server. (The assigning module could be a special server, or just the Load Balancer.)
  2. "Assigning module" returns push serverA to deviceA.
  3. DeviceA creates a websocket connection with serverA, and serverA writes <DeviceA, ServerA> into the database.

(My proposal: the "Assigning module" returns ServerA according to consistent hashing, so <DeviceA, ServerA> doesn't need to be stored in the DB.)

The process of deviceB sending a message to deviceA:

  1. The application (e.g. Twitter) on deviceB sends the (notification) message to the application server (e.g. the Twitter server). The application server relays the message to the "routing module".
  2. The "Routing module" queries the database and gets <deviceA, serverA>.
  3. The "Routing module" relays the message to serverA.
  4. ServerA pushes the message to deviceA.

(My proposal: the "Routing module" does the same consistent hashing and doesn't need to query the DB.)

The process of deviceA connecting to another server, serverA2, when the previous serverA is down:

  1. DeviceA finds its connection to serverA is no longer working.
  2. DeviceA turns to the "assigning module" asking for another server.
  3. The "Assigning module" returns another server, serverA2, to deviceA.
  4. DeviceA creates a connection to serverA2.
  5. There may be pending messages which serverA hasn't been able to push to deviceA. If serverA2 is capable of finding these messages (for example, by reading from the DB or an MQ), serverA2 pushes these pending messages. If serverA2 is not capable of finding these messages, then these messages are lost. (Either way, it doesn't conflict with my proposal.)

Would you please confirm/correct my understanding?

jrconlin commented 4 years ago

No worries! An example like this helps us both understand each other.

So, my understanding is a bit different, but I can understand yours. For me, [UAID 0001] is the same thing as DeviceA. (A "UAID" is a User Agent IDentifier, a unique identifier tied to a specific User Agent instance. Each Firefox profile is a unique User Agent, whether it's a different profile on the same machine, a version of Firefox on a mobile device, etc.)

When [UAID 0001] wants to use WebPush, it has to connect to a server. Like I've mentioned before, a server doesn't really have a lot of control over which User Agents connect, nor does it have a lot of control over directing the User Agent to a different server. This first connection is what you're calling the Assigning Module, correct? In our case it's a Load Balancer which opens its own dedicated connection to a randomly picked Autopush server. We can't control which server it picks here either, so for us we ignore the fact that there's a Load Balancer and just imagine the User Agent connecting to a random Autopush server. In my example, this was what I meant by [Node A123]. It's also why I said that [Node A123] would know if [UAID 0001] was connected or not.

[excuse my terrible art:] image

We don't really have a lot of control over what's in the cloud of "wss://push.services.mozilla.org". My original example had [Node A123] talk with the assigned "home" node, [Node B456], to get [UA001]'s data. That's what the dotted line is for.

I believe in your case, your Assigning Node does have some control over which Autopush server it can connect to, yes? If I understand correctly, that's the "serverA" and "serverA2" you are talking about. If that's true, then yes, you can absolutely use the consistent hash mechanism to determine which "home" Autopush server the Assigning Node creates a tunnel to for the User Agent.

[again, terrible art] image

If not, you may have the same problems that we have, since the Assigning Node can't say to the User Agent: "Please reconnect to this machine". (Unless you do significant changes to the Firefox WebPush User Agent code, which might introduce a lot of other serious concerns.)

So in your case, if [Node A123] were to fail for some reason, you could easily roll in a replacement and the Assignment Node would just tunnel [UA001] to that new node.

If that is the case, then yes, you can absolutely have the Assignment Node use a consistent hash to determine which Node to create a tunnel to and skip doing a database lookup.

GingerMoon commented 4 years ago

Hi @jrconlin Thanks a lot as always!

Does [Node A123] push the message directly, without passing through the load balancer? From the doc, it seems that [Node A123] sends the message to push.service.m.c and then push.service.m.c sends the message to [UA001]. If that's true, then why do we need to care about which push server (or "connection node", as defined in the doc) serves which user agent? Routing nodes (or "endpoint nodes", as defined in the doc) can just pick one of the push servers, or can even send the message directly to the load balancer.

> Unless you do significant changes to the Firefox WebPush User Agent code, which might introduce a lot of other serious concerns.

I think my proposal will not change the interface between the device and wss://push.services.mozilla.org. Say we have an interface/API called GetPushServer, which is called by deviceA asking for a push server. We can add a new server (cluster) for assigning the push server, and the load balancer can just forward the GetPushServer call to this new server/cluster. Some examples of the "a lot of other serious concerns" would be much appreciated!

jrconlin commented 4 years ago

The part that is the cloud is the load balancer. For us, it's an Amazon ELB because that's the platform we're using. Each ELB handles accepting the websocket connection and terminating the incoming TLS connection. The ELB then opens another websocket connection to one of our 200+ endpoint servers (it's semi-random which server it connects to) which handles the actual WebPush protocol. This also lets us spin up or wind down servers as needed.

We use ELBs for TLS termination because TLS puts significant demands on a machine, both for memory and CPU, which drives up costs to offer the service dramatically. We would need a LOT more servers in order to handle the load we do, and frankly, it's fairly significant now.

I'll also note that we considered a great many different connection protocols and found that using a database was significantly cheaper than any other approach. Yes, we pay a tiny fraction of a cent for each lookup, but compared to the costs of running extra machines and having a wildly varying service demand, it saves us money. Your case may not be the same, of course, if you were to run your own autopush server.

As for changing the UA interface, each call out can introduce some risk of interception and redirection. Also, the additional calls can increase bandwidth and server costs. In essence, you're describing exactly what we're getting with ELBs, albeit there is no second connection to a different machine. (One problem we've seen with exposing a "real" machine address is that some traffic will attempt to connect directly to that one machine, which makes it "hot", causing the system to churn through nodes, which increases overall costs.) Plus, we have found that simplifying protocols as much as possible means that other teams don't make mistakes when implementing them. It's better to have just one entry point rather than a complex dance, particularly when you also have to service a long history of versions of the UA that are still actively in use.

We're quite happy with the setup we have. We always look for cost savings, of course, but not all costs are purely technical or easily determined. Offering WebPush is subtly difficult, particularly at the scale of several tens of millions of active users.

Of course, how you may decide to offer WebPush is up to you and you're always free to modify our code to suit your needs.

GingerMoon commented 4 years ago

Thanks a lot @jrconlin for sharing so much, and also for the nice art!

shauryarjain commented 3 years ago

@jrconlin I think this might be the right thread to continue the discussion because of the nice art.

How do you deal with TLS terminating ELBs hitting connection limits when distributing websocket connections in wss://push.services.mozilla.org?

Brian Pitts @sciurus mentioned on Hacker News [https://news.ycombinator.com/item?id=21222913] that in an ideal world you would distribute connections yourself, but it sounds like you guys are using standard ELBs, which may not allow you to scale to 100k+ concurrent connections per ELB.

Do you use multiple ELBs or configure the ELB in a special way to handle additional connections, and if so, how do you deal with DNS?

sciurus commented 3 years ago

Hi shauryarjain, as mentioned on Hacker News, AWS support has scaled up our ELB to handle our peak connections.

shauryarjain commented 3 years ago

@sciurus I see, that makes a lot of sense. Do you have to periodically check in with AWS with peak connection forecasts for your ELB, or do they handle autoscaling?

Relying on manual pre-warming by AWS support seems a little dangerous for highly available systems

sciurus commented 3 years ago

We've had trouble in the past but not in some time. Our load is pretty predictable now.

shauryarjain commented 3 years ago

I see, do you pay AWS a premium for the custom ELB by peak connections then?

sciurus commented 3 years ago

I can't discuss AWS pricing.