warmcat / libwebsockets

canonical libwebsockets.org networking library
https://libwebsockets.org
Other
4.78k stars 1.49k forks source link

.NET Client gets disconnected with 1002 #1517

Closed GeorgeShagov closed 5 years ago

GeorgeShagov commented 5 years ago

Hello Team! You are doing a great work indeed! There is a small (I hope) problem here. It stands like this: There is a linux (ubuntu 18.04) web-socket server, based on minimal-ws-server-threads example If I connect to the ws server from Mozilla WebSocket addon it works no issues. But if my colleagues are trying to connect using .NET WebSocket Client submitting number of messages in a row (like 100 every 50ms) the client gets disconnected with 1002 WebSocket error I have verbosed the logging to maximum:

int logs = LLL_USER | LLL_ERR | LLL_WARN | LLL_NOTICE
           | LLL_INFO    | LLL_PARSER                
                         | LLL_HEADER                
           | LLL_EXT | LLL_CLIENT                    
           | LLL_LATENCY                             
           | LLL_DEBUG                               
;

Yet I do not observer any errors into the logs. What we did we have extended the buffer size for a protocol:

#define LWS_PLUGIN_PROTOCOL_MINIMAL             \
   {                                            \
      "lws-minimal",                            \
      callback_minimal,                         \
      sizeof(struct per_session_data__minimal), \
      32768*1024,                               \
      0, NULL, 4096                             \
   }

And it helped! It was purely experimental trick. But it hasn't resolved the prb completely. If the client starts sending messages with no delay the server gets disconnected after 15-20 messages.

It all looks odd. I can provide logging, but they are huge and no errors into these, therefore it does not seem to be a worthy enterprise.

Can you help with this?

Thanks in advance George

sample message snippet from Mozilla's addon:


CLNT >> SRV: {"rqst":"auth","apikey":"key-XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"}

SRV >> CLNT: {"rqst":"auth","apikey":"key-XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX","rspns":"ok"}

CLNT >> SRV: {"cliOrderId":"WSTestNew84","price":"1","type":"lmt","tradingPairId":29,"amount":"1","side":"sell","rqst":"onew"}

SRV >> CLNT: {"cliOrderId":"WSTestNew84","price":"1","type":"lmt","tradingPairId":29,"amount":"1","side":"sell","rqst":"onew","orderId":"1552393107.993768.wstg1@0029.2","ug":13,"action":1,"userId":51983,"rspns":"ok"}

SRV >> CLNT: {"id":"1552393107.993768.wstg1@0029.2","userId":51983,"currencyId":1,"amount":"10000","lockedByOrder":"140","lockedByWithdrawal":"0","fee":"0","result":"1"}

SRV >> CLNT: {"avgPrice":"0","cliOrderId":"WSTestNew84","execPrice":"1","initPrice":"1","message":"","oqty":"1","orderId":"1552393107.993768.wstg1@0029.2","qty":"1","side":"sell","stat":"accepted","tradingPairId":29,"trdId":"1552393109.603879.3.ug13","type":"lmt","ug":13,"userId":51983,"userIdMaker":0,"userIdTaker":0}
lws-team commented 5 years ago

Is this using permessage-deflate extension? Try disabling it if so.

What platform is the lws server side running on? What version of lws?

It'll be helpful if there's a way I can reproduce this that doesn't involve the c# server stuff (I don't have windows).

GeorgeShagov commented 5 years ago

Thank you for the response, Andy

Is this using permessage-deflate extension? Try disabling it if so.

This extension is not used.

What platform is the lws server side running on?

This is a GC VPS

cexprd@matching:~$ uname -a
Linux matching 4.15.0-1025-gcp #26-Ubuntu SMP Sat Nov 17 00:02:47 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
cexprd@matching:~$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 18.04.1 LTS
Release:        18.04
Codename:       bionic
cexprd@matching:~$ cat /etc/*release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=18.04
DISTRIB_CODENAME=bionic
DISTRIB_DESCRIPTION="Ubuntu 18.04.1 LTS"
NAME="Ubuntu"
VERSION="18.04.1 LTS (Bionic Beaver)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 18.04.1 LTS"
VERSION_ID="18.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=bionic
UBUNTU_CODENAME=bionic

What version of lws?

libwebsockets-3.1.0 cmake -DCMAKE_INSTALL_PREFIX:PATH=$HOME -DLWS_WITH_PLUGINS=1 -DCMAKE_BUILD_TYPE=DEBUG ./../ Hope it helps.

It'll be helpful if there's a way I can reproduce this

I am working on this.

that doesn't involve the c# server stuff (I don't have windows).

Neither do I :) It's pain indeed.

lws-team commented 5 years ago

Thanks for the info... well, good, nobody has windows. AFAIK firing lots of packets at that shouldn't upset anything on lws side. So a reproducer on lws pieces would be great. I can also run 18.04 in a local vm easily enough.

You might try lws master and see if that makes any difference. Although v3.1 is the last stable version, v3.1.0 has 19 fixes already, on v3.1-stable branch as well

https://github.com/warmcat/libwebsockets/commits/v3.1-stable

GeorgeShagov commented 5 years ago

This is easily reproducible in perl, using AnyEvent::WebSocket::Client 0.12;

Here is the perl code:

use 5.010;                                                                             
use strict;                                                                            
use warnings;                                                                          
use JSON::Parse 'parse_json';                                                          
use JSON::Parse 'valid_json';                                                          
use JSON;                                                                              
use Data::Dumper;                                                                      
use Getopt::Long;                                                                      
use Config::IniFiles;                                                                  
use Time::HiRes qw(time);                                                              
use POSIX qw(strftime);                                                                
use Log::Log4perl qw(:easy);                                                           
use Time::HiRes;                                                                       
use AnyEvent::WebSocket::Client 0.12;                                                  

Log::Log4perl->easy_init($DEBUG);                                                      
my $logger;                                                                            

my $client = AnyEvent::WebSocket::Client->new;                                         
$logger = get_logger();                                                                

$client->connect('ws://##.##.39.66', '##.##.39.66', '7681')->cb(sub {                

  # make $connection an our variable rather than                                       
  # my so that it will stick around.  Once the                                         
  # connection falls out of scope any callbacks                                        
  # tied to it will be destroyed.                                                      
  our $connection = eval { shift->recv };                                              
  if($@) {                                                                             
    # handle error...                                                                  
    warn $@;                                                                           
    return;                                                                            
  }                                                                                    

  $logger->info("sending APIKEY");                                                     
  # send a message through the websocket...                                            
  $connection->send('{"rqst":"auth","apikey":"key-XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"}');

  # recieve message from the websocket...                                              
  $connection->on(each_message => sub {                                                
    # $connection is the same connection object                                        
    # $message isa AnyEvent::WebSocket::Message                                        
    my($connection, $message) = @_;                                                    
    #say "Message got: $message";                                                      
    $logger->info($message->body);                                                     
    my $input = <STDIN>;                                                               
    $connection->send($input);                                                         
  });                                                                                  

  # handle a closed connection...                                                      
  $connection->on(finish => sub {                                                      
    # $connection is the same connection object                                        
    my($connection) = @_;                                                              
    $logger->info('Disconnected. '                                                     
        . 'Close code:' . $connection->close_code                                      
        . '; Close Reason: ' . $connection->close_reason                               
        . '; Close Error: ' . $connection->close_error                                 
    );                                                                                 
  });                                                                                  

  # close the connection (either inside or                                             
  # outside another callback)                                                          
  # $connection->close;                                                                

});                                                                                    

## uncomment to enter the event loop before exiting.                                   
## Note that calling recv on a condition variable before                               
## it has been triggered does not work on all event loops                              
AnyEvent->condvar->recv;                                                               

Output:

2019/03/13 10:27:36 {"cliOrderId":"perl-ws-cliordid-111","price":"1","type":"lmt","tradingPairI
d":29,"amount":"0.001","side":"sell","rqst":"onew","orderId":"1552472856.638446.wstg1@0029.2","
ug":13,"action":1,"userId":51983,"rspns":"ok"}
2019/03/13 10:27:36 null
2019/03/13 10:27:36 null
2019/03/13 10:27:36 null
2019/03/13 10:27:36 null
2019/03/13 10:27:36 null
2019/03/13 10:27:36 {"cliOrderId":"perl-ws-cliordid-112","price":"1","type":"lmt","tradingPairI
d":29,"amount":"0.001","side":"sell","rqst":"onew","orderId":"1552472856.688331.wstg1@0029.2","
ug":13,"action":1,"userId":51983,"rspns":"ok"}
2019/03/13 10:27:36 null
2019/03/13 10:27:36 null
2019/03/13 10:27:36 {"avgPrice":"0","cliOrderId":"perl-ws-cliordid-110","execPrice":"1","initPr
ice":"1","message":"","oqty":"0.001","orderId":"1552472856.588808.wstg1@0029.2","qty":"0.001","
side":"sell","stat":"accepted","tradingPairId":29,"trdId":"1552472856.708460.3.ug13","type":"lm
t","ug":13,"userId":51983,"userIdMaker":0,"userIdTaker":0}
2019/03/13 10:27:36 {"avgPrice":"0","cliOrderId":"perl-ws-cliordid-111","execPrice":"1","initPr
ice":"1","message":"","oqty":"0.001","orderId":"1552472856.638446.wstg1@0029.2","qty":"0.001","
side":"sell","stat":"accepted","tradingPairId":29,"trdId":"1552472856.708852.3.ug13","type":"lm
t","ug":13,"userId":51983,"userIdMaker":0,"userIdTaker":0}
2019/03/13 10:27:36 null
2019/03/13 10:27:36 {"avgPrice":"0","cliOrderId":"perl-ws-cliordid-112","execPrice":"1","initPr
ice":"1","message":"","oqty":"0.001","orderId":"1552472856.688331.wstg1@0029.2","qty":"0.001","
side":"sell","stat":"accepted","tradingPairId":29,"trdId":"1552472856.709223.3.ug13","type":"lm
t","ug":13,"userId":51983,"userIdMaker":0,"userIdTaker":0}
2019/03/13 10:27:36 null
2019/03/13 10:27:36 null
Use of uninitialized value in concatenation (.) or string at ./wss-cln.pl line 54, <STDIN> line
 22.
2019/03/13 10:27:36 Disconnected. Close code:1002; Close Reason: bad cont; Close Error:
^C

wstgw.r1.qa.0.stderr-20190313-095650.10.27.txt.gz

lws-team commented 5 years ago

Okay... so... what do I run that against?

GeorgeShagov commented 5 years ago

Hmm... We have taken this: minimal-ws-server-threads as a starting point, yet it was significantly refactored since that time :smile:

lws-team commented 5 years ago

I guess you don't actually have to do anything with what was read... is the problem still coming if you try the simplest ws server (and remove its RECEIVE callback from doing anything with the data)?

https://libwebsockets.org/git/libwebsockets/tree/minimal-examples/ws-server/minimal-ws-server

The problem is supposed to come from ws framing, so it should be enough.

GeorgeShagov commented 5 years ago

is the problem still coming if you try the simplest ws server (and remove its RECEIVE callback from doing anything with the data)?

Hm... this is true indeed. I have removed the messages from separate threads sending data to ws thread through the ring. And it is working fine, no issues. It only appears in case I am sending the message from the ring to ws client. The technique is taken from minimal-ws-server-threads using LWS_CALLBACK_EVENT_WAIT_CANCELLED -> LWS_CALLBACK_SERVER_WRITEABLE. What could be the prb I wonder?

lws-team commented 5 years ago

It only appears in case I am sending the message from the ring to ws client.

... is it that the ring overflows before it can transmit stuff out? Maybe try increasing the depth of the ring and see if that changes the behaviour.

If there's no tls, or it can be disabled, I would probably put tcpdump on it (redirect to a file in /tmp to improve chances of catching everything) and look at the last few packets on the wire myself.

GeorgeShagov commented 5 years ago

It is to be noted this is not a stable case. Sometimes it is working no prb.

... is it that the ring overflows before it can transmit stuff out?

Nope. I have reserved 8192 msgs into the ring:

vhd->ring = lws_ring_create(sizeof(struct SWSMsg), 8192,
                            __minimal_destroy_message);

But the issues happens after couple of messages only:

george@latoken-me2-pilot:~/cex/perl/r1$ cat wss-orders.1100-1200.txt | perl ./wss-cln.pl
2019/03/13 11:48:14 sending APIKEY
2019/03/13 11:48:14 {"rqst":"auth","apikey":"key-XXXXXXXXXXXXXXXXXXXXXX","rspns":"ok"
}
2019/03/13 11:48:14 {"cliOrderId":"perl-ws-cliordid-1100","price":"1","type":"lmt","tradingPair
Id":29,"amount":"0.001","side":"buy","rqst":"onew","orderId":"1552477694.303122.wstg1@0029.1","
ug":13,"action":1,"userId":51983,"rspns":"ok"}
2019/03/13 11:48:14 null
2019/03/13 11:48:14 null
2019/03/13 11:48:14 null
2019/03/13 11:48:14 null
2019/03/13 11:48:14 null
2019/03/13 11:48:14 {"cliOrderId":"perl-ws-cliordid-1101","price":"1","type":"lmt","tradingPair
Id":29,"amount":"0.001","side":"buy","rqst":"onew","orderId":"1552477694.354182.wstg1@0029.1","
ug":13,"action":1,"userId":51983,"rspns":"ok"}
2019/03/13 11:48:14 null
2019/03/13 11:48:14 null
2019/03/13 11:48:14 {"id":"1552477694.303122.wstg1@0029.1","userId":51983,"currencyId":2,"amoun
t":"10002.425","lockedByOrder":"42.08324866001","lockedByWithdrawal":"0","fee":"0","result":"1"
}
2019/03/13 11:48:14 {"id":"1552477694.354182.wstg1@0029.1","userId":51983,"currencyId":2,"amoun
t":"10002.425","lockedByOrder":"42.08424866001","lockedByWithdrawal":"0","fee":"0","result":"1"
}
2019/03/13 11:48:14 null
2019/03/13 11:48:14 null
Use of uninitialized value in concatenation (.) or string at ./wss-cln.pl line 54, <STDIN> line
 14.
2019/03/13 11:48:14 Disconnected. Close code:1002; Close Reason: bad cont; Close Error:
^C

george@latoken-me2-pilot:~/cex/perl/r1$ head -10 wss-orders.1100-1200.txt
{"cliOrderId":"perl-ws-cliordid-1100","price":"1","type":"lmt","tradingPairId":29,"amount":"0.0
01","side":"buy","rqst":"onew"}

{"cliOrderId":"perl-ws-cliordid-1101","price":"1","type":"lmt","tradingPairId":29,"amount":"0.0
01","side":"buy","rqst":"onew"}

I would probably put tcpdump on it (redirect to a file in /tmp to improve chances of catching everything) and look at the last few packets on the wire myself.

Sure

sudo tcpdump -nl -A -s 0 -i enp0s4 port 7681 -w ./tcpdump.10

tcpdump.10.gz

logs

wstgw.r1.qa.0.stderr-20190313-114400.11.48.14.txt.gz

GeorgeShagov commented 5 years ago

One thing to note (it was mentioned above, yet to confirm), If I do delay at least in 10ms in sending the messages it is working fine - no issues.

# recieve message from the websocket...      
$connection->on(each_message => sub {        
  # $connection is the same connection object
  # $message isa AnyEvent::WebSocket::Message
  my($connection, $message) = @_;            
  #say "Message got: $message";              
  $logger->info($message->body);             
  usleep(10000);                             
  my $input = <STDIN>;                       
  $connection->send($input);                 
});                                          
lws-team commented 5 years ago

So the 10.x.x.x is the lws server is it? And 167.x.x.x is the perl client?

Unfortunately I dunno what it's supposed to look like. Towards the end the server sends a text packet with a FIN which is OK.

19:48:14.374332 IP 10.132.0.3.7681 > 167.99.241.222.34096: Flags [P.], seq 876:1052, ack 569, win 246, options [nop,nop,TS val 122868196 ecr 2587967624], length 176
    0x0000:  4500 00e4 27c5 4000 4006 6e86 0a84 0003  E...'.@.@.n.....
    0x0010:  a763 f1de 1e01 8530 2e5b 904d 6ab2 db7f  .c.....0.[.Mj...
    0x0020:  8018 00f6 a49f 0000 0101 080a 0752 d1e4  .............R..
    0x0030:  9a41 4088 817e 00ac 7b22 6964 223a 2231  .A@..~..{"id":"1
    0x0040:  3535 3234 3737 3639 342e 3335 3431 3832  552477694.354182
    0x0050:  2e77 7374 6731 4030 3032 392e 3122 2c22  .wstg1@0029.1","
    0x0060:  7573 6572 4964 223a 3531 3938 332c 2263  userId":51983,"c
    0x0070:  7572 7265 6e63 7949 6422 3a32 2c22 616d  urrencyId":2,"am
    0x0080:  6f75 6e74 223a 2231 3030 3032 2e34 3235  ount":"10002.425
    0x0090:  222c 226c 6f63 6b65 6442 794f 7264 6572  ","lockedByOrder
    0x00a0:  223a 2234 322e 3038 3432 3438 3636 3030  ":"42.0842486600
    0x00b0:  3122 2c22 6c6f 636b 6564 4279 5769 7468  1","lockedByWith
    0x00c0:  6472 6177 616c 223a 2230 222c 2266 6565  drawal":"0","fee
    0x00d0:  223a 2230 222c 2272 6573 756c 7422 3a22  ":"0","result":"
    0x00e0:  3122 7d00                                1"}.

Then the client sends a text packet with a FIN and masking, a 1-byte payload 0xA (I think)

19:48:14.380648 IP 167.99.241.222.34096 > 10.132.0.3.7681: Flags [P.], seq 569:576, ack 700, win 254, options [nop,nop,TS val 2587967632 ecr 122868194], length 7
    0x0000:  4500 003b 6bb6 4000 3606 353e a763 f1de  E..;k.@.6.5>.c..
    0x0010:  0a84 0003 8530 1e01 6ab2 db7f 2e5b 8f9d  .....0..j....[..
    0x0020:  8018 00fe 0963 0000 0101 080a 9a41 4090  .....c.......A@.
    0x0030:  0752 d1e2 8181 9efd b8a1 94              .R.........

Server returns a 5-byte text packet with a FIN, "null\0'

19:48:14.380923 IP 10.132.0.3.7681 > 167.99.241.222.34096: Flags [P.], seq 1052:1059, ack 576, win 246, options [nop,nop,TS val 122868203 ecr 2587967632], length 7
    0x0000:  4500 003b 27c6 4000 4006 6f2e 0a84 0003  E..;'.@.@.o.....
    0x0010:  a763 f1de 1e01 8530 2e5b 90fd 6ab2 db86  .c.....0.[..j...
    0x0020:  8018 00f6 a3f6 0000 0101 080a 0752 d1eb  .............R..
    0x0030:  9a41 4090 8105 6e75 6c6c 00              .A@...null.

Client sends 2 x packets munged together... I think this coalescing is important for your problem...

1-byte TEXT, FIN + mask 0x0a again, followed by same again, another 1-byte 0x0a

19:48:14.388492 IP 167.99.241.222.34096 > 10.132.0.3.7681: Flags [P.], seq 576:590, ack 1059, win 270, options [nop,nop,TS val 2587967640 ecr 122868203], length 14
    0x0000:  4500 0042 6bb8 4000 3606 3535 a763 f1de  E..Bk.@.6.55.c..
    0x0010:  0a84 0003 8530 1e01 6ab2 db86 2e5b 9104  .....0..j....[..
    0x0020:  8018 010e 45f8 0000 0101 080a 9a41 4098  ....E........A@.
    0x0030:  0752 d1eb 8181 7c50 9c44 7681 81b7 a0e8  .R....|P.Dv.....
    0x0040:  fbbd                                     ..

Server replies to the first part of that packet with "null\0" again

19:48:14.388795 IP 10.132.0.3.7681 > 167.99.241.222.34096: Flags [P.], seq 1059:1066, ack 590, win 246, options [nop,nop,TS val 122868211 ecr 2587967640], length 7
    0x0000:  4500 003b 27c7 4000 4006 6f2d 0a84 0003  E..;'.@.@.o-....
    0x0010:  a763 f1de 1e01 8530 2e5b 9104 6ab2 db94  .c.....0.[..j...
    0x0020:  8018 00f6 a3f6 0000 0101 080a 0752 d1f3  .............R..
    0x0030:  9a41 4098 8105 6e75 6c6c 00              .A@...null.

Then when it thinks it's going to get the second part of the coalesced packet, he feels it's broken somehow

19:48:14.388992 IP 10.132.0.3.7681 > 167.99.241.222.34096: Flags [P.], seq 1066:1078, ack 590, win 246, options [nop,nop,TS val 122868211 ecr 2587967640], length 12
    0x0000:  4500 0040 27c8 4000 4006 6f27 0a84 0003  E..@'.@.@.o'....
    0x0010:  a763 f1de 1e01 8530 2e5b 910b 6ab2 db94  .c.....0.[..j...
    0x0020:  8018 00f6 a3fb 0000 0101 080a 0752 d1f3  .............R..
    0x0030:  9a41 4098 880a 03ea 6261 6420 636f 6e74  .A@.....bad.cont

Let me see what happens if I fire the coalesced packet at a test server here....

GeorgeShagov commented 5 years ago

I can try tshark if it helps

lws-team commented 5 years ago

No I think we have a good clue. I just need to set up the situation where I can provoke it with this coalesced packet artifically (it's not trivial).

GeorgeShagov commented 5 years ago

Then the client sends a text packet with a FIN and masking, a 1-byte payload 0xA (I think)

Yes, perl ws client is sending empty strings in order to fetch the messages from the ws server

george@latoken-me2-pilot:~/cex/perl/r1$ cat wss-orders.1100-1200.txt | perl ./wss-cln.pl
2019/03/13 11:48:14 sending APIKEY
2019/03/13 11:48:14 {"rqst":"auth","apikey":"key-XXXXXXXXXXXXXXXXXXXXXX","rspns":"ok"
}
2019/03/13 11:48:14 {"cliOrderId":"perl-ws-cliordid-1100","price":"1","type":"lmt","tradingPair
Id":29,"amount":"0.001","side":"buy","rqst":"onew","orderId":"1552477694.303122.wstg1@0029.1","
ug":13,"action":1,"userId":51983,"rspns":"ok"}
2019/03/13 11:48:14 null
2019/03/13 11:48:14 null
2019/03/13 11:48:14 null
2019/03/13 11:48:14 null
2019/03/13 11:48:14 null
2019/03/13 11:48:14 {"cliOrderId":"perl-ws-cliordid-1101","price":"1","type":"lmt","tradingPair
Id":29,"amount":"0.001","side":"buy","rqst":"onew","orderId":"1552477694.354182.wstg1@0029.1","
ug":13,"action":1,"userId":51983,"rspns":"ok"}
2019/03/13 11:48:14 null
2019/03/13 11:48:14 null
2019/03/13 11:48:14 {"id":"1552477694.303122.wstg1@0029.1","userId":51983,"currencyId":2,"amoun
t":"10002.425","lockedByOrder":"42.08324866001","lockedByWithdrawal":"0","fee":"0","result":"1"
}
2019/03/13 11:48:14 {"id":"1552477694.354182.wstg1@0029.1","userId":51983,"currencyId":2,"amoun
t":"10002.425","lockedByOrder":"42.08424866001","lockedByWithdrawal":"0","fee":"0","result":"1"
}
2019/03/13 11:48:14 null
2019/03/13 11:48:14 null
Use of uninitialized value in concatenation (.) or string at ./wss-cln.pl line 54, <STDIN> line
 14.
2019/03/13 11:48:14 Disconnected. Close code:1002; Close Reason: bad cont; Close Error:
^C

it's a bit hacky, but this is for testing only

lws-team commented 5 years ago

I can't casually reproduce it using the minimal ws server (echo variant) and your coalesced packet, this is on master. Can you try it on your server? nc is different on Ubuntu you might need to fiddle with it (this is on Fedora)

$ ./bin/lws-minimal-ws-server-echo 
[2019/03/13 20:34:04:3846] USER: LWS minimal ws client echo + permessage-deflate + multifragment bulk message
[2019/03/13 20:34:04:3846] USER:    lws-minimal-ws-client-echo [-n (no exts)] [-p port] [-o (once)]
[2019/03/13 20:34:07:9944] USER: LWS_CALLBACK_ESTABLISHED
[2019/03/13 20:34:07:9945] USER: LWS_CALLBACK_RECEIVE:    1 (rpp     0, first 1, last 1, bin 0, msglen 0 (+ 1 = 1))
[2019/03/13 20:34:07:9945] USER: LWS_CALLBACK_RECEIVE:    1 (rpp     0, first 1, last 1, bin 0, msglen 0 (+ 1 = 1))
[2019/03/13 20:34:07:9945] USER: LWS_CALLBACK_SERVER_WRITEABLE
[2019/03/13 20:34:07:9945] USER:  wrote 1: flags: 0x0 first: 1 final 1
[2019/03/13 20:34:07:9947] USER:  wrote 1: flags: 0x0 first: 1 final 1
[2019/03/13 20:34:07:9948] USER: LWS_CALLBACK_CLOSED
^C[2019/03/13 20:34:18:3978] USER: Completed failed

That's from connecting to the test server with this:

$ echo -n -e "GET / HTTP/1.1\r\nUpgrade: WebSocket\r\nConnection: Upgrade\r\nHost:127.0.0.1\r\nOrigin:http://127.0.0.1\r\nSec-WebSocket-Key: 0utCCN6fkRseaQM2OhH4OA==\r\nSec-Websocket-Version: 13\r\n\r\n\x81\x81\x7c\x50\x9c\x44\x76\x81\x81\xb7\xa0\xe8\xfb\xbd" | nc 127.0.0.1 7681
HTTP/1.1 101 Switching Protocols
Upgrade: WebSocket
Connection: Upgrade
Sec-WebSocket-Accept: 6jNViOzBhe2Ix0SYqLewDTM1Ofg=

�
�
GeorgeShagov commented 5 years ago

I am having some prbs in compiling: minimal-ws-server-echo

george@latoken-me2-pilot:~/prjs/libwebsockets-3.1.0/minimal-examples/ws-server/minimal-ws-server-echo$ cmake -DCMAKE_INCLUDE_DIRECTORIES_PROJECT_BEFORE=/home/george/prjs/libwebsockets-3.1.0/include . && make
CMake Error at CMakeLists.txt:56 (message):
  This project requires lws must have been configured with LWS_ROLE_WS
Call Stack (most recent call first):
  CMakeLists.txt:66 (require_lws_config)

-- Configuring incomplete, errors occurred!
See also "/home/george/prjs/libwebsockets-3.1.0/minimal-examples/ws-server/minimal-ws-server-echo/CMakeFiles/CMakeOutput.log".
See also "/home/george/prjs/libwebsockets-3.1.0/minimal-examples/ws-server/minimal-ws-server-echo/CMakeFiles/CMakeError.log".

Does it mean I should rebuild all the library with LWS_ROLE_WS option?

GeorgeShagov commented 5 years ago

I think I have pushed that through, yet I have encountered another issue:

george@latoken-me2-pilot:~/prjs/libwebsockets-3.1.0/minimal-examples/ws-server/minimal-ws-server-echo$ cmake -DLWS_WITHOUT_SERVER=0 -DLWS_WITHOUT_EXTENSIONS=0 . && make
-- The C compiler identification is GNU 7.3.0
-- The CXX compiler identification is GNU 7.3.0
-- Check for working C compiler: /usr/bin/cc
-- Check for working C compiler: /usr/bin/cc -- works
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Detecting C compile features
-- Detecting C compile features - done
-- Check for working CXX compiler: /usr/bin/c++
-- Check for working CXX compiler: /usr/bin/c++ -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Performing Test HAS_LWS_ROLE_WS
-- Performing Test HAS_LWS_ROLE_WS - Success
-- Performing Test HAS_LWS_WITHOUT_SERVER
-- Performing Test HAS_LWS_WITHOUT_SERVER - Failed
-- Performing Test HAS_LWS_WITHOUT_EXTENSIONS
-- Performing Test HAS_LWS_WITHOUT_EXTENSIONS - Success
CMake Error at CMakeLists.txt:58 (message):
  Lws configuration of LWS_WITHOUT_EXTENSIONS is incompatible with this
  project
Call Stack (most recent call first):
  CMakeLists.txt:68 (require_lws_config)

-- Configuring incomplete, errors occurred!
See also "/home/george/prjs/libwebsockets-3.1.0/minimal-examples/ws-server/minimal-ws-server-echo/CMakeFiles/CMakeOutput.log".
See also "/home/george/prjs/libwebsockets-3.1.0/minimal-examples/ws-server/minimal-ws-server-echo/CMakeFiles/CMakeError.log".
GeorgeShagov commented 5 years ago

I compiled, yet in a hacky way, I have modified CMakeLists.txt


set(requirements 1)
require_lws_config(LWS_ROLE_WS 1 requirements)
#require_lws_config(LWS_WITHOUT_SERVER 0 requirements)
#require_lws_config(LWS_WITHOUT_EXTENSIONS 0 requirements)

here are the results:

Server:

george@latoken-me2-pilot:~/prjs/libwebsockets-3.1.0/minimal-examples/ws-server/minimal-ws-server-echo$ ./lws-minimal-ws-server-echo
[2019/03/13 15:14:16:2475] USER: LWS minimal ws client echo + permessage-deflate + multifragment bulk message
[2019/03/13 15:14:16:2475] USER:    lws-minimal-ws-client-echo [-n (no exts)] [-p port] [-o (once)]
[2019/03/13 15:14:16:2476] NOTICE: Creating Vhost 'default' port 7681, 1 protocols, IPv6 off
[2019/03/13 15:15:15:2301] USER: LWS_CALLBACK_ESTABLISHED
[2019/03/13 15:15:15:2302] USER: LWS_CALLBACK_RECEIVE:    1 (rpp     0, first 1, last 1, bin 0, msglen 0 (+ 1 = 1))
[2019/03/13 15:15:15:2302] USER: LWS_CALLBACK_RECEIVE:    1 (rpp     0, first 1, last 1, bin 0, msglen 0 (+ 1 = 1))
[2019/03/13 15:15:15:2302] USER: LWS_CALLBACK_SERVER_WRITEABLE
[2019/03/13 15:15:15:2302] USER:  wrote 1: flags: 0x0 first: 1 final 1
[2019/03/13 15:15:15:2303] USER:  wrote 1: flags: 0x0 first: 1 final 1

Client

echo -n -e "GET / HTTP/1.1\r\nUpgrade: WebSocket\r\nConnection: Upgrade\r\nHost:127.0.0.1\r\nOrigin:http://127.0.0.1\r\nSec-WebSocket-Key: 0utCCN6fkRseaQM2OhH4OA==\r\nSec-Websocket-Version: 13\r\n\r\n\x81\x81\x7c\x50\x9c\x44\x76\x81\x81\xb7\xa0\xe8\xfb\xbd" | nc 127.0.0.1 7681
HTTP/1.1 101 Switching Protocols
Upgrade: WebSocket
Connection: Upgrade
Sec-WebSocket-Accept: 6jNViOzBhe2Ix0SYqLewDTM1Ofg=
GeorgeShagov commented 5 years ago

It is really odd.

lws-team commented 5 years ago

From your tcpdump the notable thing was one packet had coalesced two ws messages... it's legal, and lws handles it since it has a bytewise parsing state machine. But a) it's unusual...eg lws won't send like that since it (like most apps that care about latency) turns off nagle coalescing

https://en.wikipedia.org/wiki/Nagle%27s_algorithm

and b) Nagle is sensitive to time... after waiting a little bit, if another packet to the same destination isn't coming, it sends the first one as a singleton. So it's suspicious... of course it may also be unrelated and the problem elsewhere.

You can maybe hack your perl client to disable Nagle on the socket, using perl setsockopt and TCP_NODELAY and see if it makes any difference.

GeorgeShagov commented 5 years ago

I really appreciate your replies, Andy. It makes things clearer

My two cents:

... coalesced two ws messages... it's legal, and lws handles it

Yes, it works indeed, as I stated before if I put client <-> server in ping-pong like mode, i.e. one msg from client = 1 msg from srv - no issues, it is working fine w/o any delay, as I wrote before:

If I turn out the messages from the ring at the server side it forks fine no issues. No delay required

It seems like that the problem arises only in case ping-pong is broken due to the fact the server starts sending more then one message in reply. Keeping this in mind could it be that lws fails to proceed coalesced two ws messages in case server is responding (the same moment?) to the client. Could that be they to concur for the same block/buffer of memory I wonder?

You can maybe hack your perl client to disable Nagle on the socket, using perl setsockopt and TCP_NODELAY and see if it makes any difference

Thanks for the hint. It worth trying. I am not quite sure though I have a direct access to socket in AnyEvent::WebSocket::Client. I will check

One more quest, regarding: turns off nagle coalescing. Is this extension I wonder? Any chances in disabling the feature?

Thnx once again

lws-team commented 5 years ago

AFAIK in the various supported ways to serve, there's no contention possible.

There is a sort of "multi-use scratchpad buffer" in lws, but it's bound to a "per-thread", ie, per- service thread structure. Lws supports multiple service threads but each thread can only do done thing at a time in its own buffer.

That's why I have been trying to get something that can reproduce this just using the example code plus or minus some hacks. Because they are all only trying to do supported and reasonable things, and they don't have whatever code you have added that I can't see (I don't want to see it, I have enough going on...). In the case it can be reproduced, something is broken my side, but since I can reproduce it, I can surely fix it. If it can't be reproduced without your threading code, we have to wonder if that is doing something to cause it.

Maybe there's some way to hack that to remove any concurrency on your side, perhaps by removing these "forks" whatever they are so it still goes through the motions but does less stuff. Depending on if the problem goes away maybe you can get a clue.

Nagle coalescing or not doesn't change any data. It just changes when it's coming. But I believe lws is immune to receiving coalesced data, and the test yesterday shows that. So something else is needed to make your symptom.

lws-team commented 5 years ago

... btw you mentioned that the problem only comes when you echo something back... how are you doing that? It should only be writing anything from the service thread context's WRITABLE callback and after an on_writable callback from the service thread context.

GeorgeShagov commented 5 years ago

Ha! This is a hint indeed!

It should only be writing anything from the service thread context's WRITABLE callback

Nope, it is not. I am getting the message from the client at case LWS_CALLBACK_RECEIVE: and in this thread I am responding back with the reply using: n = lws_write(wsi_in, out + LWS_SEND_BUFFER_PRE_PADDING, len + 1, LWS_WRITE_TEXT);

But I have a number of separate threads that are getting msgs from Kafka and placing them into the ring, calling a cancel service, like this:

LOG_INFO << "Writing to vhd->ring: "                                   
         << amsg._jsonmsg;                                             
n = lws_ring_insert(vhd->ring, (const void *)&amsg, 1);                
if (n != 1)                                                            
{                                                                      
   LOG_ERROR << "Unable to insert into ring: "                         
             << amsg._jsonmsg;                                         
}                                                                      
else                                                                   
{                                                                      
   /*                                                                  
    * This will cause a LWS_CALLBACK_EVENT_WAIT_CANCELLED              
    * in the lws service thread context.                               
    */                                                                 
   lws_cancel_service(vhd->context);                                   
}                                                                      

pthread_mutex_unlock(&vhd->lock_ring);        /* } ring lock ------- */

Then these messages are processed in

case LWS_CALLBACK_EVENT_WAIT_CANCELLED:                           
   LOG_DEBUG << "LWS_CALLBACK_EVENT_WAIT_CANCELLED: WSI: " << wsi;
   if (!vhd)                                                      
   {                                                              
      break;                                                      
   }                                                              

   /*                                                             
    * When the "spam" threads add a message to the ringbuffer,    
    * they create this event in the lws service thread context    
    * using lws_cancel_service().                                 
    *                                                             
    * We respond by scheduling a writable callback for all        
    * connected clients.                                          
    */                                                            
   lws_start_foreach_llp(struct per_session_data__minimal **,     
                         ppss, vhd->pss_list)                     
   {                                                              
      lws_callback_on_writable((*ppss)->wsi);                     
   }                                                              
   lws_end_foreach_llp(ppss, pss_list);                           
   break;                                                         

And only after into:

case LWS_CALLBACK_SERVER_WRITEABLE:                                          
   {                                                                         
      LOG_INFO << "LWS_CALLBACK_SERVER_WRITEABLE: WSI: " << wsi;             
      pthread_mutex_lock(&vhd->lock_ring);        /* --------- ring lock { */

      pmsg = (SWSMsg *)lws_ring_get_element(vhd->ring, &pss->tail);          

and I am calling n = lws_write(wsi_in, out + LWS_SEND_BUFFER_PRE_PADDING, len + 1, LWS_WRITE_TEXT); to complete the message to the client.

Is this wrong?

lws-team commented 5 years ago

Nope, it is not. I am getting the message from the client at case LWS_CALLBACK_RECEIVE: and in this thread I am responding back with the reply using: n = lws_write(wsi_in, out + LWS_SEND_BUFFER_PRE_PADDING, len + 1, LWS_WRITE_TEXT);

First LWS_SEND_BUFFER_PRE_PADDING was shortened to LWS_PRE a long while ago, they are defined to the same thing but much easier to read LWS_PRE.

There are two sides to it, first you cannot assume that you can write something at any time, it's not true. Especially when you spammed it with packets. On master, it uses stretchy buflists now to hold accidental writes that couldn't be sent, but before that it only allowed for one accidental send in progress at a time.

What you should correctly do instead is ask for a callback on writable there, when the WRITABLE callback comes you

so with those guarantees you can confidently write your data.

It's not legal to directly do any of this from another thread. You can do what you like in threads in your code, but they must signal the service thead with cancel_service and call any other lws apis only from the service thread context.

GeorgeShagov commented 5 years ago

It's not legal to directly do any of this from another thread

No, I am convinced I am sending/writing to the socket only from one thread. No doubt in this:

2019-03-13 11:44:45.079 INFO  [6172] [g_websocket_write_back_impl@20] wsi: 0x55d473ec7830; msg: {"cliOrderId":"perl-ws-cliordid-120","price":"1","type":"lmt","tradingPairId":29,"amount":"0.001","side":"buy","rqst":"onew","orderId":"1552477485.079571.wstg1@0029.1","ug":13,"action":1,"userId":51983,"rspns":"ok"}
2019-03-13 11:44:45.511 INFO  [6172] [g_websocket_write_back_impl@20] wsi: 0x55d473ec7830; msg: {"id":"1552477484.017101.wstg1@0029.2","userId":51983,"currencyId":1,"amount":"10000.026","lockedByOrder":"351.037","lockedByWithdrawal":"0","fee":"0","result":"1"}

The first message is the response from ws server to client on LWS_CALLBACK_RECEIVE The second message is a message ws sends to client on LWS_CALLBACK_SERVER_WRITEABLE

From what you said it follows that there is only one place to send the message to the client and this is LWS_CALLBACK_SERVER_WRITEABLE. I will refactor the code, in addition this is a good chance to try thread pooling

Thank you for you hints, Andy I really appreciate this!

lws-team commented 5 years ago

Ok... please let me know how you get on here, but I'll close this for now since nothing for me to do.

GeorgeShagov commented 5 years ago

Hello I have changed the logic of ws server on LWS_CALLBACK_RECEIVE I am writing to the ring and then on LWS_CALLBACK_SERVER_WRITEABLE I am writing back to client. I am convinced now that there is only one piece of code that write back to the client and this is LWS_CALLBACK_SERVER_WRITEABLE It has changed nothing, I am still getting 1002 and bad cont error: 2019/03/27 14:37:08 Disconnected. Close code:1002; Close Reason: bad cont; Close Error:

lws-team commented 5 years ago

It has changed nothing,

It may not have impacted your symptom, but it has made your code robust against other problems.

My difficulty helping you is I want to reproduce it, preferably using minimal examples. Because in truth I only care about any bug or thing to improve in lws. Debugging user code just eats my time and I get nothing in return.

So if it feels to you like it's a problem in lws, let's focus on trying to deliver the problem to my doorstep using (mainly) my own pieces... I will certainly study something I can reproduce very closely.

GeorgeShagov commented 5 years ago

Yes, I am working on this, btw, does it mean anything to you:

[2019/03/28 10:58:55:2619] INFO: lws_cancel_service
[2019/03/28 10:58:55:2621] INFO: lws_cancel_service
[2019/03/28 10:58:55:2624] INFO: lws_cancel_service
[2019/03/28 10:58:55:2626] INFO: lws_cancel_service
[2019/03/28 10:58:55:2678] INFO: lws_cancel_service
[2019/03/28 10:58:55:2680] INFO: lws_cancel_service
[2019/03/28 10:58:55:2683] INFO: lws_cancel_service
[2019/03/28 10:58:55:2689] DEBUG: _lws_change_pollfd: wsi 0x559b33d95630: fd 88 events 5 -> 5
[2019/03/28 10:58:55:2690] DEBUG: _lws_change_pollfd: wsi 0x559b33d95630: fd 88 events 5 -> 1
[2019/03/28 10:58:55:2690] DEBUG: lws_handle_POLLOUT_event: 0x559b33d95630: non mux: wsistate 0x20000117, ops ws
[2019/03/28 10:58:55:2691] INFO: lws_issue_raw: ssl_capable_write (342) says 342
[2019/03/28 10:58:55:2692] DEBUG: _lws_change_pollfd: wsi 0x559b33d95630: fd 88 events 1 -> 5
[2019/03/28 10:58:55:2692] DEBUG: lws_read_h1: h1 path: wsi state 0x117
[2019/03/28 10:58:55:2692] DEBUG: lws_parse_ws: received 179 byte packet
[2019/03/28 10:58:55:2693] INFO: lws_ws_frame_rest_is_payload: using 2 of raw input (total 173 on offer)
[2019/03/28 10:58:55:2693] INFO: lws_cancel_service
[2019/03/28 10:58:55:2693] DEBUG: lws_ws_rx_sm: ws fragment length exhausted
[2019/03/28 10:58:55:2693] PARSER: spill on lws-minimal
[2019/03/28 10:58:55:2693] INFO: lws_parse_ws: lws_ws_rx_sm bailed 0
[2019/03/28 10:58:55:2694] INFO: lws_read_h1: lws_parse_ws bailed
[2019/03/28 10:58:55:2694] INFO: __lws_close_free_wsi: 0x559b33d95630: caller: lws_read_h1 bail
[2019/03/28 10:58:55:2694] DEBUG: rops_close_via_role_protocol_ws: sending close indication...
[2019/03/28 10:58:55:2694] DEBUG: lwsi_set_state(0x559b33d95630, 0x20000119)
[2019/03/28 10:58:55:2694] DEBUG: __lws_set_timeout: 0x559b33d95630: 5 secs (reason 24)
[2019/03/28 10:58:55:2695] DEBUG: _lws_change_pollfd: wsi 0x559b33d95630: fd 88 events 5 -> 5
[2019/03/28 10:58:55:2696] DEBUG: _lws_change_pollfd: wsi 0x559b33d95630: fd 88 events 5 -> 5
[2019/03/28 10:58:55:2696] DEBUG: sending close packet
[2019/03/28 10:58:55:2696] DEBUG:
[2019/03/28 10:58:55:2696] DEBUG: 0000: 03 EA 62 61 64 20 63 6F 6E 74                      ..bad cont
[2019/03/28 10:58:55:2696] DEBUG:

bad cont is 1002

2019/03/28 10:58:55 Disconnected. Close code:1002; Close Reason: bad cont; Close Error:

thx in advance

GeorgeShagov commented 5 years ago

it is here:

libwebsockets-3.1.0/lib/roles/ws/server-ws.c

                if (m < 0) {
                        lwsl_info("%s: lws_ws_rx_sm bailed %d\n", __func__,
                                  bulk);

                        return -1;
                }

may be there is a reason to recompile LWS_WITHOUT_EXTENSIONS

lws-team commented 5 years ago

I don't really understand, but...

LWS_WITHOUT_EXTENSIONS

this was literally the first thing I asked about, the only extension is permessage-deflate.

GeorgeShagov commented 5 years ago

I tried LWS_WITHOUT_EXTENSIONS - did not work.

I have managed to reproduce this, attaching the code. please find the attached file The detailed instruction how to reproduce is into README lws-1517-submitted_tar.zip

The flags I used to build libwebsockets: cmake -DCMAKE_INSTALL_PREFIX:PATH=$HOME -DLWS_WITH_PLUGINS=1 -DCMAKE_BUILD_TYPE=DEBUG ./../

Testing script included into the project run/perl

I really hope it will help

In my environment it is reproduced almost immediately after the start:

lws-1517/run/perl$ cat wss-orders.10000-20000.buys.txt | perl wss-cln.pl
2019/04/01 08:20:39 <={"cliOrderId":"perl-ws-cliordid-cxc-155-10036","price":"200","type":"lmt","tradingPairId":29,"amount":"0.0001","side":"buy","rqst":"onew","orderId":"1553859919.512375.wstg1@0029.1","ug":13,"action":1,"userId":51983,"rspns":"ok"}
2019/04/01 08:20:39 =>{"cliOrderId":"perl-ws-cliordid-cxc-155-10000","price":"200","type":"lmt","tradingPairId":29,"amount":"0.0001","side":"buy","rqst":"onew"}
2019/04/01 08:20:39 <={"cliOrderId":"perl-ws-cliordid-cxc-155-10036","price":"200","type":"lmt","tradingPairId":29,"amount":"0.0001","side":"buy","rqst":"onew","orderId":"1553859919.512375.wstg1@0029.1","ug":13,"action":1,"userId":51983,"rspns":"ok"}
2019/04/01 08:20:39 =>a
2019/04/01 08:20:39 <={"cliOrderId":"perl-ws-cliordid-cxc-155-10036","price":"200","type":"lmt","tradingPairId":29,"amount":"0.0001","side":"buy","rqst":"onew","orderId":"1553859919.512375.wstg1@0029.1","ug":13,"action":1,"userId":51983,"rspns":"ok"}
2019/04/01 08:20:39 =>a
2019/04/01 08:20:39 <={"cliOrderId":"perl-ws-cliordid-cxc-155-10036","price":"200","type":"lmt","tradingPairId":29,"amount":"0.0001","side":"buy","rqst":"onew","orderId":"1553859919.512375.wstg1@0029.1","ug":13,"action":1,"userId":51983,"rspns":"ok"}
2019/04/01 08:20:39 =>a
2019/04/01 08:20:39 <={"cliOrderId":"perl-ws-cliordid-cxc-155-10036","price":"200","type":"lmt","tradingPairId":29,"amount":"0.0001","side":"buy","rqst":"onew","orderId":"1553859919.512375.wstg1@0029.1","ug":13,"action":1,"userId":51983,"rspns":"ok"}
2019/04/01 08:20:39 =>a
2019/04/01 08:20:39 <={"cliOrderId":"perl-ws-cliordid-cxc-155-10036","price":"200","type":"lmt","tradingPairId":29,"amount":"0.0001","side":"buy","rqst":"onew","orderId":"1553859919.512375.wstg1@0029.1","ug":13,"action":1,"userId":51983,"rspns":"ok"}
2019/04/01 08:20:39 =>a
2019/04/01 08:20:39 <={"cliOrderId":"perl-ws-cliordid-cxc-155-10036","price":"200","type":"lmt","tradingPairId":29,"amount":"0.0001","side":"buy","rqst":"onew","orderId":"1553859919.512375.wstg1@0029.1","ug":13,"action":1,"userId":51983,"rspns":"ok"}
2019/04/01 08:20:39 =>{"cliOrderId":"perl-ws-cliordid-cxc-155-10001","price":"200","type":"lmt","tradingPairId":29,"amount":"0.0001","side":"buy","rqst":"onew"}
2019/04/01 08:20:39 <={"cliOrderId":"perl-ws-cliordid-cxc-155-10036","price":"200","type":"lmt","tradingPairId":29,"amount":"0.0001","side":"buy","rqst":"onew","orderId":"1553859919.512375.wstg1@0029.1","ug":13,"action":1,"userId":51983,"rspns":"ok"}
2019/04/01 08:20:39 =>a
2019/04/01 08:20:39 <={"cliOrderId":"perl-ws-cliordid-cxc-155-10036","price":"200","type":"lmt","tradingPairId":29,"amount":"0.0001","side":"buy","rqst":"onew","orderId":"1553859919.512375.wstg1@0029.1","ug":13,"action":1,"userId":51983,"rspns":"ok"}
2019/04/01 08:20:39 =>a
Use of uninitialized value in concatenation (.) or string at wss-cln.pl line 53, <STDIN> line 9.
2019/04/01 08:20:39 Disconnected. Close code:1002; Close Reason: bad cont; Close Error:
^C

Thnx in advance George

lws-team commented 5 years ago

Thanks... I don't really want to build, run and debug your app though.

What do I need to do to a minimal example to make it do enough of what your app does to demonstrate the problem?

GeorgeShagov commented 5 years ago

If you take a look at the code you will see it is a simplest ws example, yet re-factored in some way The main two protos I would say were:

  1. minimal-ws-server-threads and
  2. minimal-ws-server

there is no my code into the app. I assure you. The app is the simplicity itself

GeorgeShagov commented 5 years ago

I am inclined to think there is a prb with some buffer into libws. If I do it less aggressive in sending back responces to client it works for some seconds, then stuck, if I repeat the test it stuck immediately, if I restart the server it can work some seconds again. So I would guess some buffer gets full.

lws-team commented 5 years ago

there is no my code into the app.

Well, the archive is 1.6MB and it's

I appreciate you sent a detailed README along with it. But it's not what I really need to clearly understand if the problem is on lws side or not. And I don't have spare brain capacity to adopt that code right now.

AFAICS the point is the sending side spams a ringbuffer with some JSON every 3ms. The ringbuffer allows for 128K entries (!) so it's basically endless. Effectively, the ringbuffer is a NOP and it tries to spam its peer with a JSON packet every 3ms. It sounds like I can just hack minimal-ws-server-threads to spam as fast as it can write.

The perl end of it just echoes the ws it received AFAICS. It should be possible to replace that with one of the ws echo minimal examples.

If the problem is in lws, this setup I describe made entirely of lws pieces should probably show it, right?

lws-team commented 5 years ago

I am inclined to think there is a prb with some buffer into libws.

Is there any evidence that's incompatible with the problem actually being with the perl part?

lws-team commented 5 years ago

Here's a diff on master

lws-test-issue-1517.diff.txt

After building lws with the diff and -DLWS_WITH_MINIMAL_EXAMPLES=1, the minimal examples are built into ./bi in your build directory. You can run in one terminal (assuming the cwd is your build directory)

 $ ./bin/lws-minimal-ws-server

and connect to it from another terminal

 $ ./bin/lws-minimal-ws-client-echo

When it connects, the minimal ws server will spam your 228-byte JSON at the client as fast as it can endlessly.

The minimal-ws-client-echo will receive the JSON, and send it back to the server as fast as it can.

I didn't get any problems... can you try it on your setup, using the same diff and test procedure? Then maybe try your perl in place of the ws echo client?

GeorgeShagov commented 5 years ago

a bunch of c++-17

Nope: here is all I can see:

for (int i = 0; i < 5; i++)
{
   std::thread thrd = std::thread(g_kfk_thread, wsi, pss);
   thrd.detach();
}

a bunch of your code

My sincere apology, but it is not so.

  53  types.h
 894  module.h
1.7K  module.cpp - this is the code that puts msg into the ring
 760  main.cpp
 514  deps.h
 422  ws-loop.h
 11K  ws-loop.cpp - this is ws code
 135  kfk-thread.h
 774  kfk-thread.cpp

some complicated build system

it is too emotional I would guess,

some perl - which forms the peer for lws to talk to and seems necessary to get the symptom

Yes it is. we have got the same simptom with .NET client first

It sounds like I can just hack minimal-ws-server-threads to spam as fast as it can write. If the problem is in lws, this setup I describe made entirely of lws pieces should probably show it, right?

Not exactly, minimal-ws-server-threads - pure broadcast, the problem appears only in case client sends the message and receive the response from server. This is a bottle neck. I tried to do the contrary, I used echo-server sample and send not one but 5 messages - works fine, the same with my app.

GeorgeShagov commented 5 years ago

Once again I tried to spam with lws-minimal-ws-server doing 5 msgs, it forks fine The problem appears only in case there are two conditions:

lws-team commented 5 years ago

Whatever... I have provided the kind of test that is only lws stuff with only small changes to get the behaviour... this will get solved by either finding the problem in lws, or by being unable to get the problem any more without "the bad thing", where "the bad thing" is something outside of lws.

For the latter case, to see if it's the right path, we must try to remove everything outside of lws. If the problem is still coming, well, it looks like an lws problem then.

Not exactly, minimal-ws-server-threads - pure broadcast, the problem appears only in case client sends the message and receive the response from server.

Did I misunderstand something? The diff I sent produces the spamming action from the server side. The client side receives it and echoes it, but the server just drops the rx after parsing at ws layer. IIUI that should be enough to reproduce the problem where server's RX parsing of ws sees an illegal situation (the "cont" thing).

GeorgeShagov commented 5 years ago

the bad thing

I did not say anything about that. It looks emotional

The diff I sent produces the spamming action from the server side

It was sent while I was typing, let me check....

lws-team commented 5 years ago

echoing server spamming threads

In the diff posted, there is no threads or ringbuffer on the server... it just writes and writes again as soon as it is told the socket is writable.

That is a superset of the case where you were writing to echo something... it's accepting RX and spamming as fast as possible.

From the perspective of rx and tx actions, that is a more harsh test case, but I did not get any symptom.

GeorgeShagov commented 5 years ago

In the diff posted, there is no threads or ringbuffer on the server

Then I would guess it will be a waste of time in doing this, this case is working with no issues,

lws-team commented 5 years ago

I did not say anything about that. It looks emotional

Ehhh.... we can agree there is "a bad thing" somewhere, it's not an emotional observation. You shouldn't get the "cont" problem. But there's no clue right now if that bad thing is in lws or elsewhere, the reproducer has too many different parts. If there's only my code running, and bad things happen, we can be pretty sure the bad thing is in my code... that's also fine for me if we can show it, find it and fix it.

lws-team commented 5 years ago

Then I would guess it will be a waste of time in doing this, this case is working with no issues,

Logically, a thread feeding the ringbuffer with the same string over and over is the same as just sending the same string over and over.

Do we thing the problem is in the ringbuffer code then, and we send garbage?

How about try LWS_WRITE_BINARY in your code instead of LWS_WRITE_TEXT. Broadly they're the same, but actually clients can assume TEXT contains valid utf-8.

If the problem is actually on ringbuffer side, maybe it sends junk and the perl implementation objects somehow.

GeorgeShagov commented 5 years ago

How about try LWS_WRITE_BINARY in your code instead of LWS_WRITE_TEXT

Tried - the same

If the problem is actually on ringbuffer side, maybe it sends junk and the perl implementation objects somehow.

At the first. my apology, here is the valid code for LWS_CALLBACK_RECEIVE

case LWS_CALLBACK_RECEIVE:                          
   ((char *)in)[len] = '\0';                        
   LOG_INFO << "LWS_CALLBACK_RECEIVE: " << (char *)in  
            << "WSI: " << wsi                       
            << "pss: " << pss                       
   ;                                                
   g_lws_write_2ring(wsi, pss, std::string((char*)in));
   break;                                           

I had forgotten to g_lws_write_2ring(wsi, pss, std::string((char*)in)); this is the same function is being used for threads

If i do not instantiate threads on:

case LWS_CALLBACK_ESTABLISHED:                                       
   LOG_INFO << "LWS_CALLBACK_ESTABLISHED: wsi" << wsi                
            << "; pss: " << pss;                                     
   pss->ring = lws_ring_create(sizeof(struct SWSMsg), TGW_RING_DEPTH,
                               __minimal_destroy_message);           
   if (!pss->ring)                                                   
   {                                                                 
      LOG_ERROR << "Failed in ring creation";                        
      return 1;                                                      
   }                                                                 
   pss->tail = 0;                                                    
   pthread_mutex_init(&pss->lock_ring, NULL);                        
   LOG_INFO << "Pss lock Ring is initialized";                       

   /*                                                                
    * for each connection 5 threads                                  
    * are going to be started                                        
    *                                                                
    * formally we must close them                                    
    * on destroy, but this is just a test                            
    */                                                               
   for (int i = 0; i < 0; i++)                                       
   {                                                                 
      std::thread thrd = std::thread(g_kfk_thread, wsi, pss);        
      thrd.detach();                                                 
   }                                                                 
   LOG_INFO << "Threads are started";                                
   break;                                                            

Then everything forks fine - no issues, endlessly

GeorgeShagov commented 5 years ago

Hmm.... but for my code if I do:

case LWS_CALLBACK_RECEIVE:
   ((char *)in)[len] = '\0';                           
   LOG_INFO << "LWS_CALLBACK_RECEIVE: " << (char *)in  
            << "WSI: " << wsi                          
            << "pss: " << pss                          
   ;                  
   g_lws_write_2ring(wsi, pss, std::string((char*)in));
   g_lws_write_2ring(wsi, pss, std::string((char*)in));
   g_lws_write_2ring(wsi, pss, std::string((char*)in));
   g_lws_write_2ring(wsi, pss, std::string((char*)in));
   g_lws_write_2ring(wsi, pss, std::string((char*)in));
   break;

Them I am getting 1002 at the client.

Let me check ws-server-echo once again, I am convinced I did that, multiple messages in reply, it worked fine