telefonicaid / fiware-orion

Context Broker and CEF building block for context data management, providing NGSI interfaces.
https://fiware-orion.rtfd.io/
GNU Affero General Public License v3.0

Broker to use poll and not select #2724

Closed kzangeli closed 7 years ago

kzangeli commented 7 years ago

We need to modify the broker (restInit in rest.cpp) so that MHD uses poll and not select. Select only supports file descriptors below FD_SETSIZE (typically 1024), so it cannot handle more than about 1024 simultaneous connections, and some configurations might need more than that. Also, we may gain some performance by using poll/epoll instead of select.

A few unsuccessful attempts have been made, but select is still used instead of poll (checked with strace contextBroker -fg).

fgalan commented 7 years ago

We should take into account not only MHD but also libcurl.

fgalan commented 7 years ago

Maybe the usage of the -maxConnections CLI option should be changed after implementing this issue. The following thread provides extra information about the MHD_OPTION_CONNECTION_LIMIT parameter, on which -maxConnections is based (currently; let's see in the future with epoll()): http://lists.gnu.org/archive/html/libmicrohttpd/2016-11/msg00014.html

kzangeli commented 7 years ago

About libcurl, there seems to be a way to ask libcurl to use poll and not select. However, we would need to rewrite all the curl code, as we currently use 'easy_curl', which only supports select(). We need to use 'multi_curl':

https://curl.haxx.se/libcurl/c/libcurl-multi.html

mariolg commented 7 years ago

MHD_USE_POLL Use poll() instead of select(). This allows sockets with descriptors >= FD_SETSIZE. This option currently only works in conjunction with MHD_USE_THREAD_PER_CONNECTION or MHD_USE_INTERNAL_SELECT (at this point). If you specify MHD_USE_POLL and the local platform does not support it, MHD_start_daemon will return NULL.

MHD_USE_EPOLL_LINUX_ONLY Use epoll() instead of poll() or select(). This allows sockets with descriptors >= FD_SETSIZE. This option is only available on Linux systems and does not work in conjunction with MHD_USE_THREAD_PER_CONNECTION (at this point). If you specify MHD_USE_EPOLL_LINUX_ONLY and the local platform does not support it, MHD_start_daemon will return NULL. Using epoll() instead of select() or poll() can in some situations result in significantly higher performance as the system call has fundamentally lower complexity (O(1) for epoll() vs. O(n) for select()/poll() where n is the number of open connections).

Since MHD_USE_THREAD_PER_CONNECTION is mandatory with poll, the CLI option -reqPoolSize ("Size of thread pool for incoming connections. Default value is 0, meaning no thread pool") would not make sense, I guess.

But it will make sense with epoll, because epoll does not work in conjunction with MHD_USE_THREAD_PER_CONNECTION.
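As a rough sketch of the epoll() model described in the quoted MHD documentation, the following Linux-only code registers a descriptor once and then waits for readiness. It uses the plain epoll system calls, not MHD, and the function names watch_for_read and wait_for_ready are hypothetical. The point is that the kernel keeps the interest list, so each wait returns only the ready descriptors instead of rescanning all of them as select()/poll() do.

```cpp
#include <sys/epoll.h>
#include <unistd.h>

// Creates an epoll instance and registers fd for read readiness.
// Returns the epoll descriptor, or -1 on error.
int watch_for_read(int fd)
{
  int epfd = epoll_create1(0);

  if (epfd < 0)
  {
    return -1;
  }

  struct epoll_event ev = {};

  ev.events  = EPOLLIN;
  ev.data.fd = fd;

  if (epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev) != 0)
  {
    close(epfd);
    return -1;
  }

  return epfd;
}

// Waits up to timeout_ms for one ready descriptor.
// Returns that descriptor, or -1 on timeout or error.
int wait_for_ready(int epfd, int timeout_ms)
{
  struct epoll_event out = {};
  int                n   = epoll_wait(epfd, &out, 1, timeout_ms);

  return (n == 1) ? out.data.fd : -1;
}
```

A server would call watch_for_read (or just epoll_ctl on an existing epoll instance) once per accepted connection and then loop on epoll_wait, which is what makes this model independent of a thread per connection.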

fgalan commented 7 years ago

Relevant thread at libmicrohttpd mailing list: http://lists.gnu.org/archive/html/libmicrohttpd/2016-11/msg00020.html

fgalan commented 7 years ago

> About libcurl, there seems to be a way to ask libcurl to use poll and not select. However, we would need to rewrite all the curl code, as we currently use 'easy_curl', which only supports select(). We need to use 'multi_curl':
>
> https://curl.haxx.se/libcurl/c/libcurl-multi.html

Recent research on the topic has shown that libcurl (in the way we use it in CB) doesn't use select internally: we have reproduced a case with 4000 outgoing notification connections (and 4000 is greater than 1024, the limit with select).

Thus, the only limit seems to be at MHD.

fgalan commented 7 years ago

Implementation done in PR https://github.com/telefonicaid/fiware-orion/pull/2751. However, the documentation in perf_tuning.md should be improved based on what we have learnt and implemented, so this issue will remain open a little longer.

fgalan commented 7 years ago

Documentation completed in PR https://github.com/telefonicaid/fiware-orion/pull/2755

fgalan commented 7 years ago

Assigning to @iariasleon for QA validation.

fgalan commented 7 years ago

Doc still pending on this http://lists.gnu.org/archive/html/libmicrohttpd/2016-12/msg00022.html

iariasleon commented 7 years ago

Bug detected: if the CB is started as a service (service contextBroker start) and its configuration uses -notificationMode threadpool:60000:1022 (1022 or higher), the CB does not start, and the log shows this line:

time=2016-12-12T10:13:08.340Z | lvl=FATAL | corr=N/A | trans=N/A | from=N/A | srv=N/A | subsrv=N/A | comp=Orion | op=rest.cpp[1547]:restStart | msg=Fatal Error (error starting REST interface)

If the CB is started from the command line (/usr/bin/contextBroker ....), it starts successfully, whether or not -fg is used.

fgalan commented 7 years ago

My bet: the thread limit is not being honoured when CB runs as a service. If I'm right, then a procedure to set thread limit for processes running as services would solve this problem.

iariasleon commented 7 years ago

Comment related to https://github.com/telefonicaid/fiware-orion/issues/2724#issuecomment-266398438

The contextBroker should be started from the command line (/usr/bin/contextBroker ...) instead of as a service (service contextBroker start), because the VMs limit the number of threads (to 1024) for all users except root. See https://bugzilla.redhat.com/show_bug.cgi?id=919793
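A small sketch of how a process could inspect such a limit itself. It assumes the limit in question is the per-user RLIMIT_NPROC resource limit (the value reported by ulimit -u on Linux, which also counts threads). That is an assumption, not something confirmed in this thread. The helper names read_nproc_limit and limit_allows are hypothetical, and the check ignores threads the user already has running.

```cpp
#include <sys/resource.h>

// Reads the soft and hard RLIMIT_NPROC limits of the calling process.
// Returns true on success. RLIMIT_NPROC is available on Linux and BSD,
// but is not part of POSIX.
bool read_nproc_limit(rlim_t* soft, rlim_t* hard)
{
  struct rlimit lim;

  if (getrlimit(RLIMIT_NPROC, &lim) != 0)
  {
    return false;
  }

  *soft = lim.rlim_cur;
  *hard = lim.rlim_max;

  return true;
}

// Returns true if a soft limit would allow 'threads' threads in total.
// This is only an upper-bound check: it does not subtract the threads
// and processes the user already owns.
bool limit_allows(rlim_t soft, unsigned long threads)
{
  return soft == RLIM_INFINITY || threads <= soft;
}
```

Calling read_nproc_limit once from a service start and once from an interactive shell would show whether the two environments run under different limits.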

iariasleon commented 7 years ago

LGTM

Test: https://github.com/telefonicaid/fiware-orion/tree/master/test/loadTest/connections_stress_tests

Summary:

Test configuration:
   service: stablished_connections
   servicePath: /test
   CB endpoint: http://qa-orion-fe-01:1026
   notification URL: http://qa-orion-fe-02:8090/notify
   mongo host: qa-bigdata-sth-02
   test duration: 60 minutes (3600 seconds)
   version requests delay: 1 seconds
   max subcription: 5000
   noEstablished flag: False
   noQueueSize flag: False
  ***************************************************************************************
  *  verify if the listener has a delay in the response (10 minutes recommended)        *
  *  verify if these parameters are used in CB config:                                  *
  *           -httpTimeout 600000 -notificationMode threadpool:60000:5000               *
  ***************************************************************************************
  The database orion-stablished_connections has been erased
  creating 5000 subscriptions...
  5000 subscriptions have been created
Test init: 2016-12-12T16:22:15.851000Z 
Reports each second:
    counter     version      queue   established
                    request      size    connections
----------------------------------------------------------
-------- 1 -------- OK -------- 0 -------- 1014 ----------
-------- 2 -------- OK -------- 0 -------- 1014 ----------
-------- 3 -------- OK -------- 0 -------- 1014 ----------
-------- 4 -------- OK -------- 0 -------- 1014 ----------
...
-------- 326 -------- OK -------- 0 -------- 1014 --------
-------- 327 -------- OK -------- 0 -------- 1014 --------
-------- 328 -------- OK -------- 0 -------- 12 ----------
-------- 329 -------- OK -------- 0 -------- 10 ----------
-------- 330 -------- OK -------- 0 -------- 10 ----------
-------- 331 -------- OK -------- 0 -------- 10 ----------
-------- 332 -------- OK -------- 0 -------- 10 ----------
-------- 333 -------- OK -------- 0 -------- 10 ----------
...
-------- 2087 -------- OK -------- 0 -------- 10 ---------
-------- 2088 -------- OK -------- 0 -------- 10 ---------
-------- 2089 -------- OK -------- 0 -------- 10 ---------
-------- 2090 -------- OK -------- 0 -------- 10 ---------
ALL (2090) "/version" requests responded correctly...Bye.
Test end: 2016-12-12T17:22:16.226000Z