slact / nchan.js

NPM package for the Javasript client for Nchan
Other
91 stars 25 forks source link

Debugging connections issues #16

Open ivanovv opened 6 years ago

ivanovv commented 6 years ago

I'm trying to debug strange connections issues and I don't know what step to take next.

Here is all the info I have right now:

1) overview: the app is a collaboration app, with one (or more) presenter and several viewers. Presenters and viewers are connected to nginx with nchan via websockets using NchanSubscriber. Presenters can send simple POST requests to the Rails app, Rails will publish the events via nchan to all connected viewers.

2) There are several server configs (servers listening on different domains with their own SSL certs), all of them have the same nchan related nginx config

client_max_body_size 128M;

location ~ /sub/(.*)$ {
    nchan_subscriber;
    nchan_channel_id $1;
    #nchan_channel_group test;
    nchan_channel_events_channel_id $1;
    nchan_subscriber_first_message oldest;
    nchan_subscribe_request /upstream/sub;
    nchan_unsubscribe_request /upstream/unsub;
}

location = /upstream/unsub {
    proxy_pass http://main_server/ws/unsub;
    proxy_ignore_client_abort on;  #!!!important!!!!
    proxy_set_header X-Subscriber-Type $nchan_subscriber_type;
    proxy_set_header X-Channel-Id $nchan_channel_id;
    proxy_set_header X-Original-URI $request_uri;
}
location = /upstream/sub {
    proxy_pass http://main_server/ws/sub;
    proxy_set_header X-Subscriber-Type $nchan_subscriber_type;
    proxy_set_header X-Message-Id $nchan_message_id;
    proxy_set_header X-Channel-Id $nchan_channel_id;
    proxy_set_header X-Original-URI $request_uri;
}

And there is a single pub server, available only locally:

server {
  listen   10.0.0.90:80;
  server_name 10.0.0.90;

  access_log  /var/log/nginx/localhost.access.log;

  location ~ /pub/(.*)$ {
    nchan_publisher;
    nchan_channel_id $1;
    #nchan_channel_group test;
    #nchan_channel_events_channel_id $1;
    nchan_message_buffer_length 0;
    nchan_message_timeout 30s;
  }

  location ~ /channel_events/(.*)$ {
    #channel events subscriber location
    nchan_subscriber;
    nchan_channel_group meta; # "meta" is a SPECIAL channel group
    nchan_channel_id $1;
  }

  location /nchan_stub_status {
    nchan_stub_status;
  }

  location / {
    root   /var/www/nginx-default;
    index  index.html index.htm;
  }
}

3) nginx -V

nginx version: nginx/1.10.1
built with OpenSSL 1.0.1f 6 Jan 2014
TLS SNI support enabled
configure arguments: --with-cc-opt='-g -O2 -fstack-protector --param=ssp-buffer-size=4 -Wformat -Werror=format-security -D_FORTIFY_SOURCE=2' --with-ld-opt='-Wl,-Bsymbolic-functions -Wl,-z,relro -Wl,-z,now' --prefix=/usr/share/nginx --conf-path=/etc/nginx/nginx.conf --http-log-path=/var/log/nginx/access.log --error-log-path=/var/log/nginx/error.log --lock-path=/var/lock/nginx.lock --pid-path=/run/nginx.pid --modules-path=/usr/lib/nginx/modules --http-client-body-temp-path=/var/lib/nginx/body --http-fastcgi-temp-path=/var/lib/nginx/fastcgi --http-proxy-temp-path=/var/lib/nginx/proxy --http-scgi-temp-path=/var/lib/nginx/scgi --http-uwsgi-temp-path=/var/lib/nginx/uwsgi --with-debug --with-pcre-jit --with-ipv6 --with-http_ssl_module --with-http_stub_status_module --with-http_realip_module --with-http_auth_request_module --with-http_v2_module --with-http_dav_module --with-threads --with-http_addition_module --with-http_flv_module --with-http_geoip_module --with-http_gunzip_module --with-http_gzip_static_module --with-http_image_filter_module --with-http_mp4_module --with-http_random_index_module --with-http_secure_link_module --with-http_sub_module --with-http_xslt_module --with-mail --with-mail_ssl_module --with-stream --with-stream_ssl_module --add-module=/build/nginx/debian/modules/headers-more-nginx-module --add-module=/build/nginx/debian/modules/nginx-auth-pam --add-module=/build/nginx/debian/modules/nginx-cache-purge --add-module=/build/nginx/debian/modules/nginx-dav-ext-module --add-module=/build/nginx/debian/modules/nginx-development-kit --add-module=/build/nginx/debian/modules/nginx-echo --add-module=/build/nginx/debian/modules/ngx-fancyindex --add-module=/build/nginx/debian/modules/nchan --add-module=/build/nginx/debian/modules/nginx-lua --add-module=/build/nginx/debian/modules/nginx-upload-progress --add-module=/build/nginx/debian/modules/nginx-upstream-fair --add-module=/build/nginx/debian/modules/ngx_http_substitutions_filter_module

OK, to the problem itself.

Here are several logs from client side:

I, [2018-07-09T17:25:39.608787 #4071]  INFO -- : PAGE: First event after page (re)load
I, [2018-07-09T17:25:39.608825 #4071]  INFO -- : PAGE: Browser: Firefox 61.0 (Windows)
I, [2018-07-09T17:25:39.608877 #4071]  INFO -- : PAGE: UA: Mozilla/5.0 (Windows NT 10.0; rv:61.0) Gecko/20100101 Firefox/61.0
I, [2018-07-09T17:25:39.608903 #4071]  INFO -- : PAGE: Resolution: 1583x763
I, [2018-07-09T17:25:39.608925 #4071]  INFO -- : PAGE: Resize, mw: 1100px, md: true, fs: false
I, [2018-07-09T17:25:39.608947 #4071]  INFO -- : PAGE: Using NChan
I, [2018-07-09T17:25:39.608968 #4071]  INFO -- : PUBSUB: Connected
I, [2018-07-09T17:25:39.608989 #4071]  INFO -- : PUBSUB: Connected

and another example

I, [2018-07-09T17:28:14.835192 #25076]  INFO -- : PAGE: First event after page (re)load
I, [2018-07-09T17:28:14.835254 #25076]  INFO -- : PAGE: Browser: Safari 11.0 (iPhone)
I, [2018-07-09T17:28:14.835305 #25076]  INFO -- : PAGE: UA: Mozilla/5.0 (iPhone; CPU iPhone OS 11_3 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/11.0 Mobile/15E148 Safari/604.1
I, [2018-07-09T17:28:14.835360 #25076]  INFO -- : PAGE: Resolution: 320x454
I, [2018-07-09T17:28:14.835408 #25076]  INFO -- : PAGE: Resize, mw: 622px, md: false, fs: false
I, [2018-07-09T17:28:14.835457 #25076]  INFO -- : PAGE: Using NChan
I, [2018-07-09T17:28:14.835505 #25076]  INFO -- : PAGE: Mobile device
I, [2018-07-09T17:28:14.835556 #25076]  INFO -- : AUTOPLAY: Not allowed
I, [2018-07-09T17:28:14.835608 #25076]  INFO -- : STREAMS: Set method to JWplayer
I, [2018-07-09T17:28:14.835667 #25076]  INFO -- : PUBSUB: Connected
I, [2018-07-09T17:28:14.835718 #25076]  INFO -- : PUBSUB: Connected

PUBSUB: Connected is emitted by pubsub.js

Pubsub = {
  init: function () {
    self = this;
    self.online = false;
    for (var i = 0; i < nchanUrls.length; i++ ) {
      var nchanUrl = nchanUrls[i];
      var sub = new NchanSubscriber(nchanUrl, { subscriber: ['websocket', 'longpoll'] });

      sub.on('message', Pubsub.nchanMessage);

      sub.on('connect', function(evt) {
        self.online = true;
        ajaxLog('PUBSUB: Connected');
      });

      sub.on('disconnect', function(code, text){
        self.online = false;
        if (!text) {
          var evt = code;
          code = evt.code;
          text = evt.reason;
        }
        ajaxLog('PUBSUB: Disconnected: ' + code + ' desc: ' + text);
      });

      // sub.on('error', function(evt, desc){
      //   ajaxLog('PUBSUB: Error: ' + evt + ' desc: '+ desc);
      // });

      sub.start();
    }
  },

  nchanMessage: function(message, message_metadata) {
    // message is a string
    // message_metadata is a hash that may contain 'id' and 'content-type'
    message = JSON.parse(message);
    eval(message.data.eval);
  }
}

So connect event was triggered, but NchanSubscriber has received no events that were sent to /pub endpoint

Looking at the nginx access.log I see a lot of lines with 499 response code:

212.0.0.226 - - [09/Jul/2018:17:29:35 +0000] "GET /sub/event/62728 HTTP/1.1" 499 11 "-" "Mozilla/5.0 (iPhone; CPU iPhone OS 11_3 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/11.0 Mobile/15E148 Safari/604.1"
212.0.0.226 - - [09/Jul/2018:17:29:35 +0000] "GET /sub/event/62728/private/682916 HTTP/1.1" 499 768 "-" "Mozilla/5.0 (iPhone; CPU iPhone OS 11_3 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/11.0 Mobile/15E148 Safari/604.1"
125.0.0.216 - - [09/Jul/2018:17:39:38 +0000] "GET /sub/event/62728 HTTP/1.1" 101 2 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36"
125.0.0.216 - - [09/Jul/2018:17:39:38 +0000] "GET /sub/event/62728/admin HTTP/1.1" 101 3720 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36"
194.0.0.145 - - [09/Jul/2018:17:46:44 +0000] "GET /sub/event/62728 HTTP/1.1" 499 12787 "-" "Mozilla/5.0 (Windows NT 10.0; rv:61.0) Gecko/20100101 Firefox/61.0"
212.0.0.226 - - [09/Jul/2018:17:49:59 +0000] "GET /sub/event/62728 HTTP/1.1" 101 3660 "-" "Mozilla/5.0 (iPhone; CPU iPhone OS 11_3 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/11.0 Mobile/15E148 Safari/604.1"
212.0.0.226 - - [09/Jul/2018:17:49:59 +0000] "GET /sub/event/62728/private/682916 HTTP/1.1" 101 2 "-" "Mozilla/5.0 (iPhone; CPU iPhone OS 11_3 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/11.0 Mobile/15E148 Safari/604.1"
184.0.0.122 - - [09/Jul/2018:17:52:32 +0000] "GET /sub/event/62728 HTTP/1.1" 499 12440 "-" "Mozilla/5.0 (iPhone; CPU iPhone OS 11_4 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Mobile/15F79 [FBAN/FBIOS;FBAV/179.0.0.50.82;FBBV/116150041;FBDV/iPhone7,1;FBMD/iPhone;FBSN/iOS;FBSV/11.4;FBSS/3;FBCR/KPNNL;FBID/phone;FBLC/nl_NL;FBOP/5;FBRV/0]"
184.0.0.122 - - [09/Jul/2018:17:52:32 +0000] "GET /sub/event/62728/private/684856 HTTP/1.1" 499 1546 "-" "Mozilla/5.0 (iPhone; CPU iPhone OS 11_4 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Mobile/15F79 [FBAN/FBIOS;FBAV/179.0.0.50.82;FBBV/116150041;FBDV/iPhone7,1;FBMD/iPhone;FBSN/iOS;FBSV/11.4;FBSS/3;FBCR/KPNNL;FBID/phone;FBLC/nl_NL;FBOP/5;FBRV/0]"
194.0.0.145 - - [09/Jul/2018:17:58:09 +0000] "GET /sub/event/62728/private/674743 HTTP/1.1" 499 1590 "-" "Mozilla/5.0 (Windows NT 10.0; rv:61.0) Gecko/20100101 Firefox/61.0"
212.0.0.226 - - [09/Jul/2018:18:14:51 +0000] "GET /sub/event/62728 HTTP/1.1" 499 17823 "-" "Mozilla/5.0 (iPhone; CPU iPhone OS 11_3 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/11.0 Mobile/15E148 Safari/604.1"
194.0.0.145 - - [09/Jul/2018:18:15:01 +0000] "GET /sub/event/62728 HTTP/1.1" 499 23055 "-" "Mozilla/5.0 (Windows NT 10.0; rv:61.0) Gecko/20100101 Firefox/61.0"
185.0.0.148 - - [09/Jul/2018:18:18:29 +0000] "GET /sub/event/62728 HTTP/1.1" 101 34975 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36"
185.0.0.148 - - [09/Jul/2018:18:18:29 +0000] "GET /sub/event/62728/admin HTTP/1.1" 101 1312607 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36"
185.0.0.148 - - [09/Jul/2018:18:18:29 +0000] "GET /sub/event/62728/private/684815 HTTP/1.1" 101 2 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36"
212.0.0.226 - - [09/Jul/2018:18:33:50 +0000] "GET /sub/event/62728/private/682916 HTTP/1.1" 499 738 "-" "Mozilla/5.0 (iPhone; CPU iPhone OS 11_3 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/11.0 Mobile/15E148 Safari/604.1"
212.0.0.226 - - [09/Jul/2018:18:33:54 +0000] "GET /sub/event/62728/private/682916 HTTP/1.1" 499 738 "-" "Mozilla/5.0 (iPhone; CPU iPhone OS 11_3 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/11.0 Mobile/15E148 Safari/604.1"
212.0.0.226 - - [09/Jul/2018:18:34:06 +0000] "GET /sub/event/62728 HTTP/1.1" 499 4351 "-" "Mozilla/5.0 (iPhone; CPU iPhone OS 11_3 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/11.0 Mobile/15E148 Safari/604.1"
194.0.0.145 - - [09/Jul/2018:18:34:16 +0000] "GET /sub/event/62728 HTTP/1.1" 499 4351 "-" "Mozilla/5.0 (Windows NT 10.0; rv:61.0) Gecko/20100101 Firefox/61.0"

So, long story short: NchanSubscriber says we are connected, but won't receive anything from the server.

The questions I have: 1) it is a correct config when the pub server is a separate server that listen local IP only? And there is a number of sub locations in different servers 2) Is there a way to track connection errors within NchanSubscriber? 3) Is there a way to make NchanSubscriber reconnect when connection is lost?

gazugafan commented 5 years ago

Were you ever able to debug this?

ivanovv commented 5 years ago

Nope.

We are sort of satisfied with what we have now, though lack of observability and community pushes me to rewrite everything WS related with something that has all of the above. I chose Elixir/Phoenix

But back to Nchan (which is a great tool nonetheless and I think nchan's problem and blessing is that it built atop of nginx, so many things come for free, but debugging is really problematic)

I used a fork of NchanSubscriber that uses several transport one after another (https://github.com/interviewstreet/nchan.js), stopped using one connection for several tabs, but other than that – I don't remember what we changed exactly. Oh, there were changes to nchan itself since the issue was opened. My best guess would be that there is some firewall or something that prevents normal WS connection and longpoll fallback would be the best workaround in that case. But it's just a guess.

So long story short – like I said, we are sort of satisfied for now, everything works. We had issues with nchan segfaulting, but switching to single nginx worker mostly mitigates the crashes as nginx just restarts. And the last crash was almost half a year ago.

slact commented 5 years ago

nchan's problem and blessing is that it built atop of nginx

tell me about it...