wandenberg / nginx-push-stream-module

A pure stream http push technology for your Nginx setup. Comet made easy and really scalable.

latest pushstream.js has significant performance issues #213

Closed: flesner closed this issue 8 years ago

flesner commented 8 years ago

Last week we upgraded our nginx push stream module, and then on Friday we updated the pushstream.js file. When the new pushstream.js went out, the CPU load went from 20% to 100%. It appears from the nginx logs that the same IPs are hammering the server multiple times per second. We fell back to an old pushstream.js file and everything is back to normal. The old version of pushstream.js we were using appears to be: 029daba7a82000f005388309519b169a57e00892.

nginx module access log from a single IP/single second:

209.213.152.108 - - [23/Oct/2015:13:23:05 -0400] "GET /ws/buss?=1445620695410&tag=&time=&eventid= HTTP/1.1" 200 6 "-" ""
209.213.152.108 - - [23/Oct/2015:13:23:05 -0400] "GET /ws/buss?=1445620696143&tag=&time=&eventid= HTTP/1.1" 200 6 "-" ""
209.213.152.108 - - [23/Oct/2015:13:23:05 -0400] "GET /ws/buss?=1445620696209&tag=&time=&eventid= HTTP/1.1" 200 6 "-" ""
209.213.152.108 - - [23/Oct/2015:13:23:05 -0400] "GET /ws/buss?=1445620696273&tag=&time=&eventid= HTTP/1.1" 200 6 "-" ""
209.213.152.108 - - [23/Oct/2015:13:23:05 -0400] "GET /ws/buss?=1445620696205&tag=&time=&eventid= HTTP/1.1" 200 6 "-" ""
209.213.152.108 - - [23/Oct/2015:13:23:05 -0400] "GET /ws/buss?=1445620697483&tag=&time=&eventid= HTTP/1.1" 200 6 "-" ""
209.213.152.108 - - [23/Oct/2015:13:23:05 -0400] "GET /ws/buss?=1445620699143&tag=&time=&eventid= HTTP/1.1" 200 4 "-" ""
209.213.152.108 - - [23/Oct/2015:13:23:05 -0400] "GET /ws/buss?=1445620700739&tag=&time=&eventid= HTTP/1.1" 200 4 "-" ""
209.213.152.108 - - [23/Oct/2015:13:23:05 -0400] "GET /ws/buss?=1445620702195&tag=&time=&eventid= HTTP/1.1" 200 4 "-" ""
209.213.152.108 - - [23/Oct/2015:13:23:05 -0400] "GET /ws/buss?=1445620703144&tag=&time=&eventid= HTTP/1.1" 200 4 "-" ""
209.213.152.108 - - [23/Oct/2015:13:23:05 -0400] "GET /ws/buss?=1445620707126&tag=&time=&eventid= HTTP/1.1" 200 4 "-" ""
209.213.152.108 - - [23/Oct/2015:13:23:05 -0400] "GET /ws/buss?=1445620724571&tag=&time=&eventid= HTTP/1.1" 200 2 "-" ""
209.213.152.108 - - [23/Oct/2015:13:23:05 -0400] "GET /ws/buss?=1445620710132&tag=&time=&eventid= HTTP/1.1" 200 2 "-" ""
209.213.152.108 - - [23/Oct/2015:13:23:05 -0400] "GET /ws/buss?=1445620724575&tag=&time=&eventid= HTTP/1.1" 200 2 "-" ""
209.213.152.108 - - [23/Oct/2015:13:23:05 -0400] "GET /ws/buss?=1445620724590&tag=&time=&eventid= HTTP/1.1" 200 2 "-" ""
209.213.152.108 - - [23/Oct/2015:13:23:05 -0400] "GET /ws/buss?=1445620724585&tag=&time=&eventid= HTTP/1.1" 200 2 "-" ""
209.213.152.108 - - [23/Oct/2015:13:23:05 -0400] "GET /ws/buss?=1445620724610&tag=&time=&eventid= HTTP/1.1" 200 2 "-" ""
209.213.152.108 - - [23/Oct/2015:13:23:05 -0400] "GET /ws/buss?=1445620755176&tag=&time=&eventid= HTTP/1.1" 200 0 "-" ""
209.213.152.108 - - [23/Oct/2015:13:23:05 -0400] "GET /ws/buss?=1445620755186&tag=&time=&eventid= HTTP/1.1" 200 0 "-" ""
209.213.152.108 - - [23/Oct/2015:13:23:05 -0400] "GET /ws/buss?=1445620755204&tag=&time=&eventid= HTTP/1.1" 200 0 "-" ""
209.213.152.108 - - [23/Oct/2015:13:23:05 -0400] "GET /ws/buss?=1445620755304&tag=&time=&eventid= HTTP/1.1" 200 0 "-" ""
209.213.152.108 - - [23/Oct/2015:13:23:05 -0400] "GET /ws/buss?=1445620755300&tag=&time=&eventid= HTTP/1.1" 200 0 "-" ""
209.213.152.108 - - [23/Oct/2015:13:23:05 -0400] "GET /ws/buss?_=1445620755307&tag=&time=&eventid= HTTP/1.1" 200 0 "-" ""

wandenberg commented 8 years ago

Did you follow the changes in the changelog? There is a long distance between the version you were using and the latest. Some adjustments have to be made to the code and to the server configuration. The server and client should use compatible versions to work well, or you have to make some adjustments if you want to use an old server with a new client or vice versa. Just updating the files is not enough.

flesner commented 8 years ago

I read through the release notes and don't see any changes other than the server directive names, which we have updated. It still appears to be a performance issue to me: the old server responded to /pubs in 500ms with 22,000 connections, while the newer server is taking 1.1s with only 11,000 connections. Our server config:

push_stream_shared_memory_size              128m;
push_stream_max_channel_id_length           200;
# max messages to store in memory
push_stream_max_messages_stored_per_channel  20;
# message ttl
push_stream_message_ttl                      5m;
# ping frequency
push_stream_ping_message_interval           10s;
# connection ttl to enable recycle
push_stream_subscriber_connection_ttl        15m;
# connection ttl for long polling
push_stream_longpolling_connection_ttl        30s;
# message template
push_stream_message_template                "{\"id\":~id~,\"channel\":\"~channel~\",\"text\":\"~text~\"}";

# subscriber may create channels on demand or only authorized
# (publisher) may do it?
push_stream_authorized_channels_only        off;
wandenberg commented 8 years ago

There are also some changes in pushstream.js, just as an example:

Renamed some javascript client configurations (a sketch using the new names follows the list):
jsonDataKey → jsonTextKey
longPollingUseJSONP → useJSONP
longPollingTagArgument → tagArgument
longPollingTimeArgument → timeArgument
longPollingByHeaders → messagesControlByArgument (value must be changed appropriately)
reconnecttimeout → reconnectOnTimeoutInterval
checkChannelAvailabilityInterval → reconnectOnChannelUnavailableInterval
secondsAgo → messagesPublishedAfter
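
For illustration only, a constructor call using the new option names might look like the sketch below; the values are placeholders rather than recommendations, and the old names are noted in comments:

var pushstream = new PushStream({
    host: 'example.com',                          // placeholder host
    jsonTextKey: 'text',                          // was: jsonDataKey
    useJSONP: false,                              // was: longPollingUseJSONP
    tagArgument: 'tag',                           // was: longPollingTagArgument
    timeArgument: 'time',                         // was: longPollingTimeArgument
    messagesControlByArgument: true,              // was: longPollingByHeaders (adjust the value appropriately)
    reconnectOnTimeoutInterval: 5000,             // was: reconnecttimeout
    reconnectOnChannelUnavailableInterval: 10000, // was: checkChannelAvailabilityInterval
    messagesPublishedAfter: 30                    // was: secondsAgo
});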

The response time does not have much relation to the javascript code, which was what your issue was about.

Can you share your javascript code, the push stream commit you are currently using and the nginx -V result?

flesner commented 8 years ago
pushstreamInit: function() {
    // pick a random host
    var host = m2g.config.pushstream.hosts[Math.floor(Math.random() * m2g.config.pushstream.hosts.length)];
    var pushstream = new PushStream({
        host: host,
        port: m2g.config.pushstream.port,
        modes: m2g.config.pushstream.modes,
        useSSL: m2g.config.pushstream.useSSL
    });

    pushstream.onmessage = function(text, id, channel) {
        /*console.log('@@##@# onPushStreamMessage');
        console.log(text);
        console.log(id);
        console.log(channel);*/
        if (channel == m2g.config.callsign+'_ad') {
            //ad channel -- show sync display ad
            if (text.indexOf('http') == 0) {
                //okay have banner URL in event, go ahead and display
                tgmp.view.showPlayerSyncAd(text);
            }
        } else {
            setTimeout(function() {
                tgmp.view.hidePlayerSyncAd();
                m2g.controller.refresh_nowplaying(false);
            }, Math.floor(Math.random()*5000));
        }
    };

    pushstream.onerror = function(status) {
        console.log('pushstream error: '+status.type);
        if (status.type==='load') {
            this.disconnect();
            setTimeout(function() {m2g.controller.pushstreamInit();}, 5000+Math.floor(Math.random()*5000));
        }
    };

    pushstream.addChannel(m2g.config.pushstream.prefix+m2g.config.callsign);
    pushstream.addChannel(m2g.config.pushstream.prefix+m2g.config.callsign+'_ad');
    pushstream.connect();

    m2g.controller._pushstream = pushstream;

    console.log('pushstream listening to: '+host);
},

commit b5db1acb613995e924357f1bc566d055ada3b4a4

bin/nginx -V
nginx version: nginx/1.8.0
built by gcc 4.8.3 20140911 (Red Hat 4.8.3-9) (GCC)
built with OpenSSL 1.0.1k-fips 8 Jan 2015
TLS SNI support enabled

configure arguments: --with-http_ssl_module --add-module=../nginx-push-stream-module --with-http_stub_status_module

flesner commented 8 years ago

I am also getting a bunch of 400 response codes from the server:

63.116.198.141 - - [23/Oct/2015:09:23:10 -0400] "GET /ws/wnpm?=1445606589706&tag=&time=&eventid= HTTP/1.1" 400 0 "-" "Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; WOW64; Trident/6.0)"
63.116.198.141 - - [23/Oct/2015:09:23:10 -0400] "GET /ws/wnpm?=1445606589636&tag=&time=&eventid= HTTP/1.1" 400 0 "-" "Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; WOW64; Trident/6.0)"

flesner commented 8 years ago

And the pub thread is hitting its 10 second timeout under relatively low load, less than 25% CPU on the nginx server and around 11,000 connections:

ConnectTimeout: HTTPSConnectionPool(host='pusher', port=443): Max retries exceeded with url: /pub?id=wwkl (Caused by ConnectTimeoutError(<requests.packages.urllib3.connection.VerifiedHTTPSConnection object at 0x7f51e13370d0>, 'Connection to pusher timed out. (connect timeout=10)'))

wandenberg commented 8 years ago

@flesner regarding the 400 response codes you are getting, check the error log; it is probably a client which did not send the needed headers.

There is no apparent reason for you to be having performance issues after updating the javascript client, and you are the only one reporting this, so it is probably due to a configuration problem after the server update. If you want, I can try to review your server/application; just give me access to do it.

flesner commented 8 years ago

Sorry, it's a client's production server so I can't give you access to that. I could try to set up another server with the same config. Do you have a recommendation for load testing websockets other than writing a client myself? And is there a way to monitor memory usage parameters like push_stream_shared_memory_size?

thanks, daniel

flesner commented 8 years ago

All right, I've narrowed the problem down to IE 11 only, and it appears to be only on Windows 7. The server and client go into a tight request loop, with the server rejecting the client's request with a 400 error and the client making the same request over and over. I've seen 52 requests in a single second from a single browser.

request: 216.243.26.90 - - [02/Nov/2015:17:59:58 -0500] "GET /ws/buss?_=1446505198894 HTTP/1.1" 400 0 "-" "Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko"

error: 2015/11/02 17:59:58 [error] 2367#0: 3993947 push stream module: Don't have at least one of the mandatory headers: Connection, Upgrade, Sec-WebSocket-Key and Sec-WebSocket-Version, client: 216.243.26.90, server: .pusher, request: "GET /ws/buss?_=1446505198895 HTTP/1.1", host: "pusher"

I don't have any Windows platforms or any idea how to run the IE dev tools, but I will keep looking if this isn't enough information.

wandenberg commented 8 years ago

Try one of the Windows virtual machines available here. Either this Windows version does not use the latest websocket implementation, or the client is behind some kind of proxy that removes the necessary headers. In either case the client should fall back to one of the other modes (EventSource, stream, long polling). Check whether this is happening and whether your client and server are properly configured for that.
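
For illustration, the transports the client is allowed to try, and their order, are controlled by the modes option already used in the init code earlier in this thread; a minimal sketch, with a placeholder host and an example mode order:

var pushstream = new PushStream({
    host: 'example.com',                              // placeholder host
    modes: 'websocket|eventsource|stream|longpolling' // fallback order: websocket first, then the other transports
});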

flesner commented 8 years ago

I'm confused by this change to pushstream.js in 9d1e6226d247dd7676732b9a5d7a73ed033f41a5. It appears a normal close from the server is now considered a "load" error. That appears to do several things: first, it now waits 60 seconds to reconnect on a normal server timeout, losing 60 seconds worth of data; second, it calls the client's onerror with an incorrect "load" error state. Can you clarify why this was changed?

@@ -428,7 +438,7 @@
    return;
  }
  this._closeCurrentConnection();

wandenberg commented 8 years ago

The change was done to have the same behavior for a "channel not available", which can happen because a channel was deleted ("close" type) or because a channel was not created yet and the subscriber is not allowed to create it ("load" type). In any case, you can change the interval before a new connection attempt by setting the configurations reconnectOnTimeoutInterval or reconnectOnChannelUnavailableInterval instead of using their default values.
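
For example, a minimal sketch of overriding those intervals instead of relying on the defaults; the host, channel name, and values are placeholders, and the intervals are assumed to be in milliseconds:

var pushstream = new PushStream({
    host: 'example.com',                          // placeholder host
    reconnectOnTimeoutInterval: 5000,             // retry delay after a timeout-style disconnect
    reconnectOnChannelUnavailableInterval: 10000  // retry delay when the channel is unavailable ("load" type)
});

pushstream.addChannel('example_channel');         // placeholder channel name
pushstream.connect();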

Have you figured out what the performance problem was?

flesner commented 8 years ago

Yeah, the above is an issue for us. You've gotten rid of the close/"timeout" and converted everything to a "load" error, so now even a connection TTL from the server is considered a "load" error. We were assuming a timeout was a normal connection TTL disconnect and that the socket reconnected itself, while the "load" error was a more unrecoverable error, so we were doing a linear backoff on the websocket reconnect.

pushstream.js is now also doing a setTimeout reconnect 60 seconds later on every "load" error. Since every close is now a "load" error, this created a connection leak and excess connections. On top of that, we seem to have some users behind high-end security firewalls which were closing the websocket on connect, and that created a connection leak every 60 seconds, since it is no longer a close but a "load" error. So what may have been a small issue before with the linear backoff is now a huge issue, since every connection for every user generates a "load" error at least every 15 minutes, and many every 60 seconds.

I would consider the above a bug, since you've completely changed the behavior of the code: you've gotten rid of the close/timeout and now report a "load" error even on a simple TTL disconnect. We are also disconnected from the server for 1 out of every 16 minutes with the default settings, losing about 6% of our data. I can submit a patch, but we fixed it by simply taking out the type === 'close' check on line 538, as above.
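
For reference, a minimal sketch of the kind of linear backoff reconnect on "load" errors that was previously in place, following the same onerror pattern as the init code earlier in the thread; the attempt counter and delay values are illustrative only, not part of pushstream.js:

var reconnectAttempts = 0;

pushstream.onerror = function(status) {
    console.log('pushstream error: ' + status.type);
    if (status.type === 'load') {
        // back off linearly: wait longer after each consecutive "load" error
        reconnectAttempts += 1;
        pushstream.disconnect();
        setTimeout(function() {
            pushstream.connect();
        }, 5000 * reconnectAttempts + Math.floor(Math.random() * 5000));
    }
};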

wandenberg commented 8 years ago

I don't think the way you exposed your problem is the best way to work in an open source project.

The change you are complaining about was made two years ago. You updated the code without checking whether the software you built on top of pushstream.js would still work properly, and now you are putting the blame on me?!

pushstream.js is just a simple client meant to help, allowing easy usage of the module. But you can create your own client from scratch; feel free to do so.

Of course you can propose any change. I will also review my changes to see what can be improved.

Just keep in mind that this is free, unpaid software. I'm trying to make the best software I can, but I'm not your employee and I'm not developing software only for your needs.

I would say this isn't a bug, since you can simply adjust the configuration timeouts and remove the reconnect code you wrote. But I would prefer to improve the client code to give you the option of using the default retry policy or your own.

I strongly believe that the community can and should help each other to build better software.

flesner commented 8 years ago

Sorry, I didn't mean to blame you. You asked what the performance problem was, and I tried to explain what I found in my digging. I didn't actually implement this originally; I was just trying to fix a browser crash/memory leak. I believe the reconnect in the error handler was our problem, and whoever wrote it probably just didn't understand what was going on inside pushstream.js. I have removed it and now rely on your code to reconnect on a "load" error, but we never see load errors now. I just thought the === 'close' change altered the behavior in what seemed like unintentional ways; that's why I called it a bug. I can fork your repo and make a pull request if you like, but the only thing I ended up changing was removing the one boolean comparison from the expression.

Thank you for all your hard work; it is a nice solution to our problem.

thanks again, daniel