.NET Client gets disconnected with 1002

GeorgeShagov commented 5 years ago

Hello Team! You are doing a great work indeed! There is a small (I hope) problem here. It stands like this: There is a linux (ubuntu 18.04) web-socket server, based on minimal-ws-server-threads example If I connect to the ws server from Mozilla WebSocket addon it works no issues. But if my colleagues are trying to connect using .NET WebSocket Client submitting number of messages in a row (like 100 every 50ms) the client gets disconnected with 1002 WebSocket error I have verbosed the logging to maximum:

int logs = LLL_USER | LLL_ERR | LLL_WARN | LLL_NOTICE
           | LLL_INFO    | LLL_PARSER                
                         | LLL_HEADER                
           | LLL_EXT | LLL_CLIENT                    
           | LLL_LATENCY                             
           | LLL_DEBUG                               
;

Yet I do not observer any errors into the logs. What we did we have extended the buffer size for a protocol:

#define LWS_PLUGIN_PROTOCOL_MINIMAL             \
   {                                            \
      "lws-minimal",                            \
      callback_minimal,                         \
      sizeof(struct per_session_data__minimal), \
      32768*1024,                               \
      0, NULL, 4096                             \
   }

And it helped! It was purely experimental trick. But it hasn't resolved the prb completely. If the client starts sending messages with no delay the server gets disconnected after 15-20 messages.

It all looks odd. I can provide logging, but they are huge and no errors into these, therefore it does not seem to be a worthy enterprise.

Can you help with this?

Thanks in advance George

sample message snippet from Mozilla's addon:


CLNT >> SRV: {"rqst":"auth","apikey":"key-XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"}

SRV >> CLNT: {"rqst":"auth","apikey":"key-XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX","rspns":"ok"}

CLNT >> SRV: {"cliOrderId":"WSTestNew84","price":"1","type":"lmt","tradingPairId":29,"amount":"1","side":"sell","rqst":"onew"}

SRV >> CLNT: {"cliOrderId":"WSTestNew84","price":"1","type":"lmt","tradingPairId":29,"amount":"1","side":"sell","rqst":"onew","orderId":"1552393107.993768.wstg1@0029.2","ug":13,"action":1,"userId":51983,"rspns":"ok"}

SRV >> CLNT: {"id":"1552393107.993768.wstg1@0029.2","userId":51983,"currencyId":1,"amount":"10000","lockedByOrder":"140","lockedByWithdrawal":"0","fee":"0","result":"1"}

SRV >> CLNT: {"avgPrice":"0","cliOrderId":"WSTestNew84","execPrice":"1","initPrice":"1","message":"","oqty":"1","orderId":"1552393107.993768.wstg1@0029.2","qty":"1","side":"sell","stat":"accepted","tradingPairId":29,"trdId":"1552393109.603879.3.ug13","type":"lmt","ug":13,"userId":51983,"userIdMaker":0,"userIdTaker":0}

lws-team commented 5 years ago

Well it's 9pm here, I won't do any more today.

You can't write more than once per WRITABLE callback in lws.

After the first write, there's no guarantee the socket can be written for the subsequent ones.

So that's literally forbidden to do... as soon as it meets the situation that it wants to write 288 but the socket will only accept less, writing again will become disordered.

GeorgeShagov commented 5 years ago

You can't write more than once per WRITABLE callback in lws.

Sure g_lws_write_2ring(wsi, pss, std::string((char*)in)); - places the msg into the ring

lws-team commented 5 years ago

Oh... right. Then dunno why it should make a difference.

lws-team commented 5 years ago

You seem to be storing a wsi in the objects and using that to decide which wsi to write to.

inline int
g_lws_write_2client(const SWSMsg& in_amsg)
{
   if ((in_amsg.m_jsonmsg == NULL) || (in_amsg.m_wsi == NULL))
   {
      LOG_ERROR << "Invalid params.";
      return -1;
   }
   return lws_write(in_amsg.m_wsi
                    , in_amsg.m_jsonmsg + LWS_PRE
                    , in_amsg.m_len
                    , LWS_WRITE_TEXT
                    );
}

that's broken in a couple of ways

the wsi may be closed and destroyed any time you return to the event loop. The other side may hang up or crash, or an intermediate router go down or whatever. You can't store wsi unless you are monitoring everyone's CLOSE callbacks and remove invalidated ones (usually you still don't store wsi, CLOSE processing just removes the wsi from a "live wsi" list which is used to request callbacks on writable only).
you are getting a WRITEABLE callback for some wsi, but then you are calling the function above telling lws to write on a potentially unrelated wsi that it fishes out from the ringbuffer.

I didn't follow your code, so maybe it's the case that the wsi in the ringbuffer is always the wsi from the WRITABLE callback and it can't go wrong... great, but no need to put it in the fifo then.

GeorgeShagov commented 5 years ago

Hello Andy

I appreciate your reply.

You seem to be storing a wsi in the objects and using that to decide which wsi to write to.

Yes, this is true, indeed. I am aware of the issue but this to be non-related to the topic. At this moment I am convinced wsi into msg is correct:

[callback_minimal@432] LWS_CALLBACK_SERVER_WRITEABLE: WSI: 0x55f72fceabf0
[callback_minimal@434] pss: 0x55f72fce3e70; pss->tail: 0; wsi: 0x55f72fceabf0
[callback_minimal@445] Ring Free Elements: 131070
[callback_minimal@455] ; pmsg: 0x7f1d282f7010; pmsg->m_wsi: 0x55f72fceabf0

I do not experience any issues on closing session, this is an off-topic as I mentioned. The case is that client gets disconnected with 1002

this is very interesting: https://github.com/warmcat/libwebsockets/issues/1517#issuecomment-478568981 This means that the threads are not related to the topic, it is possible to reproduce the case only duplicating msgs into the ring on LWS_CALLBACK_RECEIVE

GeorgeShagov commented 5 years ago

And... I wanted to thank you, Andy for your attention. I really appreciate that. I think, no I am convinced you are doing a great job, looking at other's vendor's libraries I think libwebsockets is the best one

GeorgeShagov commented 5 years ago

Hello

I tried to reproduce the case using the minimal samples, namely, minimal-ws-server-echo and I cannot do that, it forks fine

here is what I did on LWS_CALLBACK_RECEIVE

      for (i=0; i<5; i++) {
          amsg.len = len;
          /* notice we over-allocate by LWS_PRE */
          amsg.payload = malloc(LWS_PRE + len);
          if (!amsg.payload) {
              lwsl_user("OOM: dropping\n");
              break;
          }

          memcpy((char *)amsg.payload + LWS_PRE, in, len);
          if (!lws_ring_insert(pss->ring, &amsg, 1)) {
              __minimal_destroy_message(&amsg);
              lwsl_user("dropping!\n");
              break;
          }
      }
      lws_callback_on_writable(wsi);

I think that what could be due to different compiler and linked libraries the socket could have a different options

Is there any way to know and control what socket options are in use from libws?

I want to compare the socket options in minimal samples and that are being used in my app.

thx in advance

lws-team commented 5 years ago

The socket options are set by platform, for *nix it's here

https://libwebsockets.org/git/libwebsockets/tree/lib/plat/unix/unix-sockets.c#n76

Looking at what the perl code did, I am not sure but a major difference is it looked blocking. So it would wait for some rx and then try to send the tx. It can't deal with more rx while it is blocked waiting for the socket to become writeable, the whole thread is blocked.

In lws, it's all nonblocking. So the equivalent code in lws OK dealing with multiple rx inbetween being able to send, or multiple sends before any new rx comes.

GeorgeShagov commented 5 years ago

Thnx, noted Is there any chance to switch off flowcontrolled for wsi?

lws-team commented 5 years ago

I am not sure what you mean by it.... lws has a thing 'rx flow control' where basically you can turn off POLLIN waiting manually. But it's manual, there is no special 'flowcontrol' action to 'switch off' otherwise.

The use of disabling rx is that it produces 'tcp backpressure', in that the tcp window fills up and eventually the remote sender no longer gets POLLOUT indications and stops sending. Ie it makes the sender adapt to the reality it the receiver isn't acking the data already in flight.

If you mean 'how do I make it use rx flow control'...

https://libwebsockets.org/git/libwebsockets/tree/include/libwebsockets/lws-misc.h#n722-742

GeorgeShagov commented 5 years ago

Hello Andy

I appreciate your time. I have spent a lot of mine investigating the case. What I did I have moved the code from minimal-ws-server-echo to my project. Formally speaking I have build minimal-ws-server-echo using g++ and the headers I needed and.... problem immediately appeared. It is very easy to reproduce. But it is working fine if the code is compiled using gcc

Do you use some special gcc features like structure size alignments for instance? or anything else?

I can submit (once again) the code I am using.

lws-team commented 5 years ago

How about memory barriers?

https://libwebsockets.org/git/libwebsockets/tree/lib/core/private.h#n153-159

Lws only uses them to detect a situation that's not recommended any more, a different thread calls on_writable while the service loop is in poll().

https://libwebsockets.org/git/libwebsockets/tree/lib/plat/unix/unix-service.c#n92

... it knows how to do them in gcc and clang. They are necessary to make different cores agree on the same view of memory under those conditions. Under bsd type OS, changes to poll events from other threads while sleeping are wiped out on exit from poll()... this allows us to queue such changes and apply them on exit from poll().

For the last few versions it's recommended to use instead (and all the minmal examples use) lws_cancel_service() from the other thread, and do the pollfd changes from the callback in lws service thread context. This has a few advantages, it doesn't need the memory barriers, and it works on any event loop not just poll. Eg

https://libwebsockets.org/git/libwebsockets/tree/minimal-examples/ws-server/minimal-ws-server-threads-smp/protocol_lws_minimal.c#n120

GeorgeShagov commented 5 years ago

Andy I must apologize. I think I have found the issue and that was in our code that caused memory corruption in case of coalesced messages. The shame is all mine, I understand you have spent a time. My apology once again.

What I did I have moved the code from minimal-ws-server-echo to my project. Formally speaking I have build minimal-ws-server-echo using g++ and the headers I needed and.... problem immediately appeared. It is very easy to reproduce. But it is working fine if the code is compiled using gcc

Nope, the code is working fine for g++ and std=c++17

lws-team commented 5 years ago

No worries... over the years doing this I found if it's an lws issue, it should be coming if using only lws pieces... perhaps lightly hacked to test the right scenario. Usually in that case with the problem in lws, more lws makes the problem come worse. If it cannot be reproduced like that, despite effort to align it with the scenario, while it doesn't "prove" no lws issue, generally it will turn out the problem is coming from user code.

Fixing user code as I have done in the past (there are many thousands of users...) turns out not to buy me fame, fortune and code contributions, just "kthxbye". So unless there's some strong reason, the effective way to deal with questions about supposed lws bugs is by strictly using lws pieces and reject running / debugging user code.

GeorgeShagov commented 5 years ago

Yes, I do understand and do respect both your position and what you are doing. It is very likely we are going to deploy lws based trading gateway here in LATOKEN, this could be an addition record in your client's list

warmcat / libwebsockets

.NET Client gets disconnected with 1002 #1517