parse-community / parse-server-push-adapter

A push notification adapter for Parse Server
https://parseplatform.org
MIT License

Large number of targeted Installations makes the servers fail (499 errors) #123

Open SebC99 opened 5 years ago

SebC99 commented 5 years ago

Hello, we want to be able to send a push request to all our users in an area (more than 100k), but it overwhelms the servers with far too many connections (nginx's 1024 worker_connections are not enough) and all the standard requests to the servers end up with 499 errors.

Our Parse servers are on Elastic Beanstalk, and we use a simple query new Parse.Query("Parse.Installation").exists("deviceToken") in the Parse.Push.send method.
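For reference, the `.exists("deviceToken")` constraint serializes to the `where` clause that later shows up in the _PushStatus record. A minimal sketch of the REST-level body (buildPushRequest is a hypothetical helper; the alert text is placeholder):

```javascript
// Sketch: buildPushRequest is a hypothetical helper showing the REST-level
// body that a query like new Parse.Query(Parse.Installation).exists("deviceToken")
// produces when passed to Parse.Push.send.
function buildPushRequest(title, body) {
  return {
    // serialized form of the .exists("deviceToken") constraint
    where: { deviceToken: { $exists: true } },
    data: { alert: { title, body } },
  };
}

const req = buildPushRequest('Hello', 'World');
console.log(JSON.stringify(req.where));
// {"deviceToken":{"$exists":true}}
```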

SebC99 commented 5 years ago

anyone here?

acinader commented 5 years ago

Hi @SebC99

Not an issue I have run into personally.

Given the large number, can you use a queue and send them in smaller batches?

SebC99 commented 5 years ago

Why not, but I honestly don't know how to use this kind of queue ;) And I've tried the batchSize parameter in the query and push methods, but with no better results. What batch size would you recommend anyway? Even 10,000 pushes take a lot of time (more than an hour)
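A queue here can be as simple as sending fixed-size batches one at a time, so only one batch's worth of connections is open at any moment. A minimal sketch (chunk and sendInBatches are hypothetical helpers; sendBatch stands in for whatever actually delivers a batch to the adapter):

```javascript
// Split a list of recipients into fixed-size batches.
function chunk(items, size) {
  const batches = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// Send one batch at a time: each batch's connections are closed (the
// promise resolves) before the next batch opens its own.
async function sendInBatches(tokens, size, sendBatch) {
  for (const batch of chunk(tokens, size)) {
    await sendBatch(batch);
  }
}
```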

dplewis commented 5 years ago

Have you tried PARSE_SERVER_ENABLE_EXPERIMENTAL_DIRECT_ACCESS=1? I recently hit max open file (TCP connection) on a completely separate issue.
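For anyone trying this, the flag is an environment variable read at server start (if I recall correctly, the equivalent constructor option is directAccess):

```shell
# Enable experimental direct access before starting parse-server,
# so internal SDK calls skip the HTTP stack.
export PARSE_SERVER_ENABLE_EXPERIMENTAL_DIRECT_ACCESS=1
parse-server ./config.json
```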

Do you know where your connections are coming from / going to?

acinader commented 5 years ago

I don't know what the batch size should be. For large pushes, ideally, you could parallelize.

@flovilmart looked into this in the past, and I wrote https://github.com/parse-community/parse-server-sqs-mq-adapter, but I never used it.

SebC99 commented 5 years ago

@dplewis what do you mean? My guess is that the push adapter simply opens too many connections, leaving no room for any other requests. The adapter's batching feature for identical payloads doesn't seem to work... But the push and queue code is very hard to understand ;)

davimacedo commented 5 years ago

I understand that the problem is not about sending the pushes. It seems that the pushes are successfully sent, right @SebC99? Can you check in the push status whether they were all sent?

What I have seen sometimes is: the pushes are successfully sent but, as the clients receive them, they hit the Parse API back and that makes the server crash. Since you are noticing the worker_connections error in nginx, this might be the problem.

I see two possible solutions:

SebC99 commented 5 years ago

@davimacedo not at all!! Only a very small number are sent, like 5000

davimacedo commented 5 years ago

What is the status you see in your push status? Sending forever? How are you running your parse server process? Is it a docker container? A service? Have you noticed this process crashing when sending the pushes?

davimacedo commented 5 years ago

Have you tried batchSize < 5000?

SebC99 commented 5 years ago

Here's what is in the DB for the last try

{ 
    "_id" : "YsXd23ED", 
    "pushTime" : "2019-03-09T12:24:30.138Z", 
    "query" : "{\"deviceToken\":{\"$exists\":true}}", 
    "payload" : "{
        \"alert-fr\":{\"title\":\"XXXX\",\"body\":\"XXXX\"},
        \"alert\":{\"title\":\"XXXX\",\"body\":\"XXXX\"},
        \"category\":\"update\",
        \"channel\":\"remote_notifications\",
        \"campaign\":\"marketing\"
    }",
    "source" : "rest", 
    "status" : "running", 
    "numSent" : NumberInt(1496), 
    "pushHash" : "c4bf3a4c2e953169ead4d9c034576006", 
    "_wperm" : [],
    "_rperm" : [],
    "_acl" : {},
    "_created_at" : ISODate("2019-03-09T12:24:30.140+0000"), 
    "_updated_at" : ISODate("2019-03-09T12:28:00.645+0000"), 
    "count" : NumberInt(3595), 
    "failedPerType" : {
        "android" : NumberInt(327), 
        "ios" : NumberInt(40)
    }, 
    "numFailed" : NumberInt(367), 
    "sentPerType" : {
        "android" : NumberInt(537), 
        "ios" : NumberInt(959)
    }
}

dplewis commented 5 years ago

@SebC99 This is what I was talking about https://github.com/parse-community/parse-server/pull/4173

With direct access there isn't any overhead, but right now it goes through the HTTP interface, which opens another connection per request. I think that's where your issue is coming from.

SebC99 commented 5 years ago

Thanks, I hadn't noticed that one, I'll give it a try (direct access has failed me before for cloud functions, so I haven't tried it for push yet)

dplewis commented 5 years ago

Ignore my last comment; it looks like that has been updated. I don't know much about the push and queue code. I can try to run it locally and see what's causing the issue. I think it is similar to what @davimacedo mentioned: something might be hitting the Parse API.

SebC99 commented 5 years ago

I'll try to investigate too. If I remember correctly, a lot of beforeFind or beforeSave calls were appearing in the log, and I think it was about the _User class, but I'm not sure.

SebC99 commented 5 years ago

After some tests, PARSE_SERVER_ENABLE_EXPERIMENTAL_DIRECT_ACCESS seems to decrease the load on the server. But:

BTW, I understand numSent and numFailed values, but what is the count value?

SebC99 commented 5 years ago

And with VERBOSE logging enabled, I can clearly see: MaxListenersExceededWarning: Possible EventEmitter memory leak detected. 11 connect listeners added. Use emitter.setMaxListeners() to increase limit

I get the exact same error with batchSize set to 10 as with batchSize set to 5000. Even for a push with only 50 device tokens!

SebC99 commented 5 years ago

I also noticed this weird error from node-apn: https://github.com/node-apn/node-apn/issues/653#issue-439213015

SebC99 commented 5 years ago

If it helps, I keep testing things:

dplewis commented 5 years ago

If you look here, you'll see the promises are serialized.

Maybe do something similar to https://github.com/parse-community/parse-server/pull/5420 to prevent a bottleneck.

Enqueue by PushStatusId or pushStatus.objectId
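For what it's worth, the difference between a serialized chain and a bounded fan-out can be sketched like this (sendSerially and sendWithConcurrency are hypothetical names; sendBatch stands in for the adapter's per-batch send):

```javascript
// Serialized: each batch waits for the previous one to finish.
async function sendSerially(batches, sendBatch) {
  const results = [];
  for (const b of batches) results.push(await sendBatch(b));
  return results;
}

// Bounded fan-out: up to `limit` batches in flight at once, so large pushes
// neither open every connection at the same time nor bottleneck on one batch.
async function sendWithConcurrency(batches, sendBatch, limit) {
  const results = new Array(batches.length);
  let next = 0;
  async function worker() {
    while (next < batches.length) {
      const i = next++; // claim an index before awaiting
      results[i] = await sendBatch(batches[i]);
    }
  }
  const workers = Array.from(
    { length: Math.min(limit, batches.length) },
    worker
  );
  await Promise.all(workers);
  return results;
}
```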

davimacedo commented 5 years ago

@SebC99 You said that it is much better without a payload, and that's interesting. I am wondering if the problem is the whole payload. Can you please try sending without \"alert-fr\":{\"title\":\"XXXX\",\"body\":\"XXXX\"} in the payload? I am wondering if the problem is related to the locale feature.

BTW, count is the total number of pushes that should be sent, numSent is how many succeeded, and numFailed is how many failed. Ideally count should equal numSent + numFailed. In your case, the status stays "running" forever. That tends to happen when some of your batches fail to send due to a server crash (and will never be sent again). Because of this, numSent + numFailed stays below count and the status never changes. The 3 most common causes I've seen:

1. The one I mentioned before: something hitting the server back, crashing the server process and therefore stopping the remaining batches from being sent.
2. The query submitted to MongoDB for each batch times out: Parse Server uses skip/limit to build the batches, and that sometimes performs poorly.
3. When building the batches, the process running Parse Server hits its RAM limit and crashes.
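Using the field names from the _PushStatus record posted above, that bookkeeping check can be written as follows (isStuck is a hypothetical helper; the numbers are the ones from the record):

```javascript
// A push is "stuck" when it still reports running but numSent + numFailed
// fell short of count: the lost batches are never retried, so the gap
// never closes and the status never changes.
function isStuck(pushStatus) {
  const processed = pushStatus.numSent + pushStatus.numFailed;
  return pushStatus.status === 'running' && processed < pushStatus.count;
}

// Numbers from the record above: 1496 sent + 367 failed = 1863 < 3595.
console.log(isStuck({ status: 'running', numSent: 1496, numFailed: 367, count: 3595 })); // true
```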

Would you be able to observe if some of these are likely to be happening?

SebC99 commented 5 years ago

@davimacedo I tried with a simple "alert" payload (not localized) and I do have the exact same thing.

dplewis commented 5 years ago

@SebC99 Thank you for providing detailed feedback. We have a general idea and suggestions on where the issue may be coming from.

Would you like to take a look at the serialized promises I pointed out https://github.com/parse-community/parse-server-push-adapter/issues/123#issuecomment-488357508 and submit a fix?