warmcat / libwebsockets

canonical libwebsockets.org networking library
https://libwebsockets.org

Official echo server example #533

Closed: ghost closed this issue 8 years ago

ghost commented 8 years ago

Hi Andy,

I need to make an "official" echo server in libwebsockets for use in benchmarks. Do you have any finished echo server that I can use, or should I make one myself?

ghost commented 8 years ago

Obviously there is libwebsockets-test-echo, but that server is not working for me: it times out, has problems, and uses poll, so it doesn't scale. Do you have an example targeted at performance testing that makes use of epoll?

ghost commented 8 years ago

I get this when connecting 1000 users using a WebSocket++ client that is known to work with other servers:

[alexhultman@localhost bin]$ ./libwebsockets-test-echo
[2016/04/19 19:51:00:0020] NOTICE: Built to support client operations
[2016/04/19 19:51:00:0020] NOTICE: Built to support server operations
lwsts[30433]: libwebsockets test server echo - license LGPL2.1+SLE
lwsts[30433]: (C) Copyright 2010-2016 Andy Green <andy@warmcat.com>
lwsts[30433]: Running in server mode
lwsts[30433]: Initial logging level 7
lwsts[30433]: Libwebsockets version: 2.0.0 alexhultman@localhost.localdomain-v2.0.0-49-g7c2d596
lwsts[30433]: IPV6 not compiled in
lwsts[30433]: libev support not compiled in
lwsts[30433]: libuv support not compiled in
lwsts[30433]:  Threads: 1 each 1500000 fds
lwsts[30433]:  mem: platform fd map: 12000000 bytes
lwsts[30433]:  Compiled with OpenSSL support
lwsts[30433]:  SSL disabled: no LWS_SERVER_OPTION_DO_SSL_GLOBAL_INIT
lwsts[30433]: Creating Vhost 'default' port 7681, 1 protocols
lwsts[30433]:  Listening on port 7681
lwsts[30433]:  mem: per-conn:          488 bytes + protocol rx buf
lwsts[30433]:  canonical_hostname = localhost.localdomain
lwsts[30433]: lws_protocol_init
lwsts[30433]: ERROR -1 writing to socket, hanging up
lwsts[30433]: b63 of length must be zero
lwsts[30433]: ERROR -1 writing to socket, hanging up
lwsts[30433]: ERROR -1 writing to socket, hanging up
lwsts[30433]: Control frame with xtended length is illegal
lwsts[30433]: ERROR -1 writing to socket, hanging up
lwsts[30433]: ERROR -1 writing to socket, hanging up
lwsts[30433]: ERROR -1 writing to socket, hanging up
lwsts[30433]: ERROR -1 writing to socket, hanging up
lwsts[30433]: ERROR -1 writing to socket, hanging up
lwsts[30433]: ERROR -1 writing to socket, hanging up
lwsts[30433]: ERROR -1 writing to socket, hanging up
lwsts[30433]: ERROR -1 writing to socket, hanging up
lwsts[30433]: ERROR -1 writing to socket, hanging up
lwsts[30433]: ERROR -1 writing to socket, hanging up
lwsts[30433]: ERROR -1 writing to socket, hanging up
lwsts[30433]: ERROR -1 writing to socket, hanging up
lwsts[30433]: ERROR -1 writing to socket, hanging up
lwsts[30433]: Control frame with xtended length is illegal
lwsts[30433]: ERROR -1 writing to socket, hanging up
lwsts[30433]: ERROR -1 writing to socket, hanging up
lwsts[30433]: ERROR -1 writing to socket, hanging up
lwsts[30433]: ERROR -1 writing to socket, hanging up
lwsts[30433]: ERROR -1 writing to socket, hanging up
lwsts[30433]: ERROR -1 writing to socket, hanging up
lwsts[30433]: ERROR -1 writing to socket, hanging up
lwsts[30433]: ERROR -1 writing to socket, hanging up
lwsts[30433]: ERROR -1 writing to socket, hanging up
lwsts[30433]: ERROR -1 writing to socket, hanging up
lwsts[30433]: Control frame with xtended length is illegal
lwsts[30433]: ERROR -1 writing to socket, hanging up
lwsts[30433]: ERROR -1 writing to socket, hanging up
lwsts[30433]: ERROR -1 writing to socket, hanging up
lwsts[30433]: ERROR -1 writing to socket, hanging up
lwsts[30433]: ERROR -1 writing to socket, hanging up
lwsts[30433]: ERROR -1 writing to socket, hanging up
lwsts[30433]: ERROR -1 writing to socket, hanging up
lwsts[30433]: ERROR -1 writing to socket, hanging up
lwsts[30433]: ERROR -1 writing to socket, hanging up
lwsts[30433]: ERROR -1 writing to socket, hanging up
lwsts[30433]: Control frame with xtended length is illegal
lwsts[30433]: ERROR -1 writing to socket, hanging up
lwsts[30433]: ERROR -1 writing to socket, hanging up
lwsts[30433]: error on reading from skt : 104
lwsts[30433]: Control frame with xtended length is illegal
lwsts[30433]: Control frame with xtended length is illegal
lwsts[30433]: Control frame with xtended length is illegal
lwsts[30433]: error on reading from skt : 104
lwsts[30433]: error on reading from skt : 104
lwsts[30433]: error on reading from skt : 104
lwsts[30433]: Control frame with xtended length is illegal
lwsts[30433]: Control frame with xtended length is illegal
ghost commented 8 years ago

I still cannot get the libwebsockets-test-echo server to work reliably. I have 10 connections sending small 20-byte messages and it only works for a couple of iterations, then it stops working. It would be nice if you could provide an echo server that is minimal and stable so that performance tests can be conducted. This is the 2.0 release.

It works reliably if I have 1 or 2 connections sending only one frame + payload per packet; as soon as I have 3 or more connections, or send anything other than one frame + payload per packet, it becomes unreliable and stops echoing back.

lws-team commented 8 years ago

The echo test got heavily changed to work with Autobahn as a client; I haven't used it for anything but casual tests on the server side since then.

It's a test app that just shows how to echo data; it isn't some kind of streamlined performance beast.
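
For reference, the shape of such an echo protocol in the lws callback API is roughly the sketch below. This is illustrative only, not the actual test app; the protocol name, the 4096-byte buffer and the port are just placeholders. The important part is that data is stashed in per-session storage on RECEIVE and only written from the WRITEABLE callback.

    #include <string.h>
    #include <libwebsockets.h>

    #define ECHO_BUF 4096   /* placeholder size, matches the default rx buffer */

    /* per-connection storage so concurrent connections don't clobber each other */
    struct echo_pss {
        unsigned char buf[LWS_PRE + ECHO_BUF];
        size_t len;
    };

    static int
    callback_echo(struct lws *wsi, enum lws_callback_reasons reason,
                  void *user, void *in, size_t len)
    {
        struct echo_pss *pss = (struct echo_pss *)user;

        switch (reason) {
        case LWS_CALLBACK_RECEIVE:
            /* stash the payload and ask lws to call us back when writable */
            if (len > ECHO_BUF)
                len = ECHO_BUF;
            memcpy(&pss->buf[LWS_PRE], in, len);
            pss->len = len;
            lws_callback_on_writable(wsi);
            break;

        case LWS_CALLBACK_SERVER_WRITEABLE:
            /* only write when the socket is writable, never from RECEIVE */
            lws_write(wsi, &pss->buf[LWS_PRE], pss->len, LWS_WRITE_TEXT);
            break;

        default:
            break;
        }

        return 0;
    }

    static struct lws_protocols protocols[] = {
        /* name,     callback,      per-session data size,   rx buffer size */
        { "default", callback_echo, sizeof(struct echo_pss), ECHO_BUF },
        { NULL, NULL, 0, 0 }   /* terminator */
    };

    int main(void)
    {
        struct lws_context_creation_info info;
        struct lws_context *context;

        memset(&info, 0, sizeof info);
        info.port = 7681;
        info.protocols = protocols;
        info.gid = -1;
        info.uid = -1;

        context = lws_create_context(&info);
        if (!context)
            return 1;

        while (lws_service(context, 50) >= 0)
            ;

        lws_context_destroy(context);

        return 0;
    }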

Atm I am working on other features and will look at this soon, but I get a pain in my heart when I think about "performance tests". A few weeks ago I looked at your project and learned that lws supposedly uses "16x (or some such huge number, I forget) as much memory" as you do, which evidently ignores node.js completely and, since lws is standalone, is therefore thoroughly misleading.

ghost commented 8 years ago

Yeah, it can be clarified a bit more, but I'm only trying to map the current WebSocket landscape, since there are some projects which go by marketing rather than reality (I recently found Kaazing extremely inefficient, yet they claim "unprecedented" scalability).

I just want to make sure I use server code that represents recommended usage of the lib in each test.

lws-team commented 8 years ago

I think if you want your numbers to be taken seriously, you cannot go around claiming 16x less memory than this and 50x less memory than that when it doesn't reflect reality. That is itself "going by marketing rather than reality".

ghost commented 8 years ago

What numbers should I put?

ghost commented 8 years ago

I'm just trying to map out which servers scale extremely badly and which servers scale reasonably. If we take ws as an example, that server will fill your entire memory at 500k connections, while for instance lws can reach about 2 million connections on the same memory. So clearly there are huge differences between projects. Kaazing scaled about the same as ws, or even worse.

lws-team commented 8 years ago

Those numbers seem to actually be comparing apples with apples, ie, how much you can do with the same system memory. So why not put those numbers? 4x is certainly more believable and defensible than 50x.

Even then there are many caveats, even with lws; eg, lws has extensions enabled by default, and some of these other solutions are not capable of extensions / permessage-deflate. lws supports ws client or cgi or ssl or extensions, all of which affect the memory footprint. But of course if you want mixed client / server, that does not represent a problem but a big benefit.

I have to do things IRL atm, sorry.

lws-team commented 8 years ago

Kernel usage per process is very difficult to measure... in a system with virtual memory + shared library objects, resident vm numbers become kind of uncertain. Your idea of seeing how many connections can fit in the same (real) memory should cut through all that and make other allocations reveal themselves in the way they reserve memory you can't have for the userland app. I think that's a useful number for people who want to squeeze maximum utility from their server, although even then they have to check carefully what features they are losing to get that bigger number, and whether they actually want to go down that road.

The problem is not kernel / user memory... anyone with a socket is going to largely make the same kernel footprint +/- socket options.

The problem so far, though, is that you just seem to measure memory usage of your little bit of the picture on top of node.js, and ignore what node.js costs underneath. Lws and maybe some of the other solutions are not designed to be a little bit on top of node.js; lws has a full ws http upgrade parser, all the ssl stuff duplicating what node.js does - and extensions support. The problem there is that until now you are reducing an apples-to-oranges comparison to one meaningless number that sounds unbelievably awesome until you look closer and see it actually is unbelievable.

Your other method - max conns in the same memory, if you make them as similar as possible functionality-wise so they can be compared - sounds much more useful.

ghost commented 8 years ago

The idea was: if you have 1 connection using 5 MB of kernel memory and 3 bytes of user memory, and compare this to 1 connection with the same kernel memory usage but 1 byte of user memory, then my benchmark will report a 3x difference, which is a little bit skewed since anyone can see they both require about the same IRL. This has not been an issue with the other servers since they use far more user memory than kernel memory, making the kernel memory negligible. So by including this, the difference with lws would go down a bit because of the ratio change.

Other than this, you don't need to think about Node.js, permessage-deflate or SSL. I made sure not to use any of these in any test and there is really no dependency on Node.js at all. The lib implements all these features by itself, so there is no "feature loss" behind the memory difference.

lws-team commented 8 years ago

If you use this scheme you came up with of "how many connections can I get with the physical memory I have", you are automatically measuring any kind of kernel allocation, since guys doing that will run out of VM for the user process quicker (it's tied up on kernel side). You don't have to explicitly measure it, it will reveal itself. And that should be reproducible even on different systems and different kernels, if it's caused by what you think it is. So that sounds like a very good, defensible choice if you are trying to capture memory usage. (Of course it just measures that, not all the other differences which may outweigh memory usage for a particular user).

If it's not related to what you were doing with node.js, I have no idea where you got the big multiple numbers like 16x less memory you have been showing until now. That has to be a problem of test methodology one way or the other.

ghost commented 8 years ago

Node.js is not involved in this; the lib is a C++ lib with (optional) Node.js bindings. I know for a fact that I can get at least 6x more connections on the same machine, so I don't see the reason for mud slinging. I'm only interested in asking you to provide me with a reliable echo server so that I can test it. There is absolutely room for improvements in the memory usage test, but things are not just made up out of thin air. I did not just wake up out of an LSD trip and shout "60x". I have tested these servers for the last 4 months so I know what is legit and what is not.

lws-team commented 8 years ago

Your "benchmarks" which you have published claim "16x less memory" than lws to do something, and 50x less memory for something else, IIRC.

Maybe I took the LSD instead of you, but for the same functionality in LWS, I don't see how you can arrive at that figure. Can you explain how you measured that? A struct lws might be, say, 320 bytes for serving; you're saying your equivalent is 20 bytes? It's possible, but it will not have equivalent functionality.

JoakimSoderberg commented 8 years ago

Randomly jumping into the conversation just because.

Regarding performance, the autobahn test suite has some kind of performance tests as well: http://autobahn.ws/testsuite/usage.html?highlight=wsperf#mode-wsperfcontrol

However, I have never tried them, or know the current state of them.

But that might not be relevant in your memory discussion :)

ghost commented 8 years ago

Autobahn is written in Python and cannot possibly stress lws or uws. I had to write my own super optimized client to even achieve 100% CPU usage in uws (and lws).

The biggest difference in memory use between lws and uws is the user space buffer of size rx_buffer_size, which is 4 KB by default. This adds up to 4 GB of RAM per million connections. Then we have the shared vector of FDs, which requires a couple of hundred MB of memory. I don't know if you can optimize these away, maybe you can, but currently these buffers are one of the biggest differences.

JoakimSoderberg commented 8 years ago

ok, but the performance tests, as I understand it, are written in C++ ... but yeah, might not be relevant in this discussion at all :)

Any work related to benchmarking different WebSocket implementations would probably be nicer if it somehow got tied to the autobahn testsuite, since that is used to verify a lot of the implementations. Something to consider.

but I'll let you go on :)

lws-team commented 8 years ago

The particular numbers alex published are about memory usage; autobahn will be like ab, measuring connection performance. Those two things are orthogonal tradeoffs: you can burn memory to get better throughput or vice versa.

In lws the per-connection rx_buffer_size is configurable per-protocol. If you don't set it, it's 4KB by default, but... if low memory was the target, you'd set it to some smaller size you could live with. Or if throughput was the goal, make it bigger. That's why it's configurable.
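
For illustration, that per-protocol knob is the rx_buffer_size member of struct lws_protocols. The protocol names, callbacks and the 256 / 16384 values below are made-up examples of a low-memory versus a high-throughput choice, not recommendations:

    #include <libwebsockets.h>

    /* callback_small and callback_bulk are hypothetical protocol callbacks,
     * declared here only so the table below is complete */
    extern int callback_small(struct lws *wsi, enum lws_callback_reasons reason,
                              void *user, void *in, size_t len);
    extern int callback_bulk(struct lws *wsi, enum lws_callback_reasons reason,
                             void *user, void *in, size_t len);

    static struct lws_protocols protocols[] = {
        /* name,           callback,       per-session data, rx_buffer_size */
        { "small-frames",  callback_small, 0,                256   },  /* favour memory */
        { "bulk-transfer", callback_bulk,  0,                16384 },  /* favour throughput */
        { NULL, NULL, 0, 0 }   /* terminator */
    };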

ghost commented 8 years ago

Anyways - I'm not in for the mud slinging. We have different use cases (I guess lws is more of an embedded lib) and yes, I throw away a lot of performance by touching Node.js with the addon, but I just want to paint a picture of the variations between different libraries.

There are many libraries which scale astronomically badly (jaw-dropping, not kidding) yet still claim "unprecedented" scalability and performance. I just want to show that this is not the case.

ghost commented 8 years ago

Yes, you could change the default; you can do that in most libraries. I test the default settings because I'm doing an overall check of the initial state. Most users just download the lib and run it, and even if you change the rx_buffer_size to 512 bytes it is still going to add up, and you lose performance which you don't have to lose in uws because I don't hold any user space buffer. That is the point I'm trying to make - I aim to create a well-rounded lib with the best scalability and the best performance out of the box.

ghost commented 8 years ago

This is also why I ask for an "official" echo server: then you, as the expert on this lib, get to decide what settings to use and how the echo server is written. Had I just wanted to come up with some crazy number, I would have just run the poll (not epoll) version of your echo server. My aim is to be transparent in this.

lws-team commented 8 years ago

"mud-slinging" looks quite different, it, you know, involves mud.

If you don't want your claims and methodologies to be challenged, it's best to go out of your way to make the tests equivalent, fair and reproducible. You had a very good idea earlier in the thread about capturing the limit of connections in the same memory, but even then, if you arbitrarily reduce your per-connection rx buffer, or don't offer per-connection rx / flow control, while leaving the other libraries configured with defaults, that is a big difference in functionality that is relevant to what is being tested.

Also... you know, if you go back 5 years in lws git, you get something similar: much more lightweight, with fewer features. Actually the features, like extension support, have their uses.

Anyway, it's a big field; even for just ws there are many libraries, including C++ ones, already available. Let's see if we can compete on merits, not by "mud slinging" at other projects.

lws-team commented 8 years ago

What should I set my default buffers to? Large for high throughput, and it looks like crap for memory. Small for low memory, and it looks like crap for throughput. Actually you must configure it appropriately to make the test fair, but you don't seem to want to do that.

Not having any user space buffer sounds awesome but it means you have no rx flow control. When you get into real applications, some things break without it.
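
Roughly, what that looks like in lws is pausing rx on a connection with lws_rx_flow_control() while a backlog drains. In the sketch below the queue helpers and watermark values are hypothetical application code; only the lws_* calls are the library API:

    #include <libwebsockets.h>

    /* echo_queue_* and the watermarks are hypothetical application code;
     * lws_rx_flow_control() and lws_callback_on_writable() are the lws calls */
    struct echo_queue;
    size_t echo_queue_depth(const struct echo_queue *q);
    void   echo_queue_push(struct echo_queue *q, const void *data, size_t len);
    void   echo_queue_write_some(struct echo_queue *q, struct lws *wsi);

    #define HIGH_WATER (64 * 1024)
    #define LOW_WATER  ( 8 * 1024)

    /* from LWS_CALLBACK_RECEIVE: buffer, and stop reading if we fall behind */
    static void echo_on_rx(struct lws *wsi, struct echo_queue *q,
                           const void *in, size_t len)
    {
        echo_queue_push(q, in, len);

        if (echo_queue_depth(q) > HIGH_WATER)
            lws_rx_flow_control(wsi, 0);     /* pause rx on this connection */

        lws_callback_on_writable(wsi);
    }

    /* from LWS_CALLBACK_SERVER_WRITEABLE: drain, and resume rx once caught up */
    static void echo_on_writable(struct lws *wsi, struct echo_queue *q)
    {
        echo_queue_write_some(q, wsi);       /* calls lws_write() internally */

        if (echo_queue_depth(q) < LOW_WATER)
            lws_rx_flow_control(wsi, 1);     /* resume rx */

        if (echo_queue_depth(q))
            lws_callback_on_writable(wsi);   /* more to send later */
    }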

ghost commented 8 years ago

Exactly, you are correct. By changing the settings you will only be shifting the performance from one dimension to another. If you lower the buffer size you will get better memory usage but lower throughput. Same goes for the opposite.

This is exactly my point with the benchmark. I want to measure 4 dimensions using the same settings so that I show how the server performs with the same deployment. The benchmark is one entity - all numbers are related, like in a linear system of equations. If you change one column you will affect the other columns. This is like an optimization in a 4-dimensional space, not an optimization in 4 separate 1-dimensional spaces.

You put too much emphasis on the memory test. The memory test is in relation to the performance test. In a real-world case you want good performance but also scalability. If I tested each category separately, one could easily get whatever values they wanted: you could set a 1-byte buffer and get good memory scalability, and then move on to the next test and change the settings to get better throughput.

If you were Google and you wanted to minimize memory usage and CPU usage, then you would need to find an optimum to this problem that satisfies both dimensions, not just one at a time.

lws-team commented 8 years ago

Lws is designed for standalone web + ws function in embedded cases; it's not designed to be Google. Actually I spent quite a while on the hybi list with Google when ws was being defined, and they cared about http / ws mux (which became http/2 via spdy) and extensions; by the way, features in http/2 like tx credit mandate rx flow control / rx buffers.

ws itself is designed to support different functionalities in the user-defined protocols.

When users design their protocol, there's a range of buffer sizes implied by what the protocol does, and users pick from within that range depending on what they want to optimize. So it fits well that you can set the buffer size per-protocol (and get robust rx flow-control that works with extensions).

I concentrate on memory because that's where I read your claims. As I point out with lws, the user can configure and trade off memory / performance per-protocol.

Anyway it's bedtime for me. Have a nice evening when it comes around.

ghost commented 8 years ago

I do not agree that having performance OR scalability on a per-protocol basis is good enough. I know that for my own usage of WebSockets it would not suffice.

Again - I'm just trying to measure different libraries; I do not want any flame, and I do not agree with the claims about "broken flow control" and such bogus nonsense. I'm just trying to be open about my process and show how I do things. You toss quite an amount of mud on my project and claim I'm not open about my benchmarks and am just making numbers up, yet you do not actually know about the internal differences (obviously you did not know about the rx_buffer_size, yet you started the mud slinging beforehand).

All my benchmarks are open source and are accompanied by documentation and the echo servers used (except for lws). I could certainly improve the memory measurement, but it still needs to be easy to run. Whether I get somewhat inaccurate numbers or not is not really a major concern, as you could get any kind of numbers you wanted by changing the kernel buffer size for sockets anyway. I measure the user space memory usage after creating hundreds of thousands of connections and divide by the number of connections. This can be improved, yes, but the main point of the benchmark still holds and is still true (even if the scale of the difference might be somewhat off).

lws-team commented 8 years ago

Well, I wanted to leave this open to track whatever's up with the echo server, but it's clearly a waste of time. Eventually, as you implement more things, you'll understand why rx flow control is important.

Good luck with your project, but you should maybe just work on your project.