pjsip / pjproject

PJSIP project
http://www.pjsip.org
GNU General Public License v2.0

TCP + gnutls (SSL_SOCK) adds random 43ms delay during `pj_ioqueue_poll` #3703

Closed: AmarOk1412 closed this issue 1 year ago

AmarOk1412 commented 1 year ago

Describe the bug

Build pjsip 2.13 with GnuTLS for SSL sockets and use any app that sends/writes data over TCP.

Polling then hits random ~43 ms delays.

Steps to reproduce

Build pjsip 2.13 with GnuTLS for SSL sockets and use any app that sends/writes data over TCP.

Polling then hits random ~43 ms delays.

PJSIP version

2.13

Context

Latest versions of GnuTLS/pjsip. This is a log from a simple ping/pong:

35649668610565  pj_sock_select 0x7f952004a0b8 start 1023
Connected in 6.4 s
35649668637889 pj_ioqueue_send 0x7f951c0741d8
35649668639418 pj_sock_send 0x7f951c0741d8
35649668640234 pj_sock_send 31
35649712050699  pj_sock_select 0x7f952004a0b8 returns 1, time=43440 usec
35649712094687 ioqueue_add_to_set2 0x7f952004a0b8 0x7f9520048ee8 1
35649712099695 pj_ioqueue_poll 0x7f952004a0b8 select
35649712101929  pj_sock_select 0x7f952004a0b8 start 1023
35649712105943  pj_sock_select 0x7f952004a0b8 returns 0, time=4 usec
35649712109822 pj_ioqueue_poll 0x7f952004a0b8 select
35649712111300  pj_sock_select 0x7f952004a0b8 start 1023
35649712169388 pj_ioqueue_send 0x7f952004a0b8
35649712182834 pj_sock_send 0x7f952004a0b8
35649712184752 pj_sock_send 31
35649712212966  pj_sock_select 0x7f951c0741d8 returns 1, time=43659 usec
35649712226210 ioqueue_add_to_set2 0x7f951c0741d8 0x7f951c06eda8 1
35649712229626 pj_ioqueue_poll 0x7f951c0741d8 select
35649712231120  pj_sock_select 0x7f951c0741d8 start 1023
35649712233594  pj_sock_select 0x7f951c0741d8 returns 0, time=2 usec
35649712237552 pj_ioqueue_poll 0x7f951c0741d8 select
35649712239061  pj_sock_select 0x7f951c0741d8 start 1023
Streamed 1 bytes back and forth in 43.7 ms (0 kBps)

Log, call stack, etc

Patch:

https://github.com/savoirfairelinux/pjproject/commit/97f45c2040c2b0cf6f3349a365b0e900a2267333.patch

Similar discussion about CURL suffering the same issue: https://curl.se/mail/lib-2016-06/0167.html

For ICE/STUN/TURN we see no reason not to enable TCP_NODELAY.
sauwming commented 1 year ago

First, I have to note that there are some differences between your codebase and ours: https://github.com/savoirfairelinux/pjproject/blob/master/pjlib/src/pj/sock_bsd.c#L591 and https://github.com/pjsip/pjproject/blob/master/pjlib/src/pj/sock_bsd.c#L577

And here's what ChatGPT says about TCP_NODELAY: "Implications of Using TCP_NODELAY:

  1. Reduced Delay: The primary implication of using TCP_NODELAY is that it reduces the delay in sending data over a TCP connection. By disabling Nagle's algorithm, which is responsible for aggregating small outgoing packets into larger ones to improve network efficiency, TCP_NODELAY ensures that data is sent immediately without waiting for more data to accumulate. This can be beneficial for applications where low latency is crucial, such as real-time communication protocols (e.g., VoIP, online gaming) and interactive applications.

  2. Predictable Timing: With TCP_NODELAY, you have more control over when data is sent. This predictability can be essential in scenarios where precise timing is required, such as multimedia streaming or control systems.

Potential Downsides of Using TCP_NODELAY:

  1. Increased Network Traffic: One of the main downsides of disabling Nagle's algorithm is that it can lead to increased network traffic. By sending smaller packets more frequently, you may saturate the network with smaller messages, which can be less efficient than sending larger packets when appropriate. This can potentially impact overall network performance and consume more bandwidth.

  2. Higher CPU Utilization: Disabling Nagle's algorithm may lead to increased CPU utilization, as it can result in more frequent packet processing and context switches. This additional processing overhead may become noticeable in high-throughput scenarios, especially on systems with limited CPU resources.

  3. Potential for Packet Fragmentation: Sending small packets without aggregation can increase the chances of packet fragmentation, particularly when dealing with networks with a lower Maximum Transmission Unit (MTU). Fragmentation can lead to additional processing overhead and potentially introduce latency.

  4. Inefficient for Bulk Data Transfer: For applications that primarily transfer large amounts of data in a streaming fashion, disabling Nagle's algorithm and using TCP_NODELAY may not be beneficial. In such cases, the efficiency gained from aggregating data into larger packets can outweigh the latency reduction.

  5. Risk of Congestion: Sending data without considering network congestion control mechanisms can result in congestion-related problems. TCP congestion control algorithms are designed to prevent network congestion by adjusting the sending rate. By disabling Nagle's algorithm, you may bypass these congestion control mechanisms, potentially causing network congestion and packet loss.

In summary, using TCP_NODELAY can be advantageous when low latency is critical, but it should be employed judiciously and with a good understanding of the trade-offs."

Since it only seems to affect a very specific case, i.e. TLS, and more specifically GnuTLS, I wonder if the fix would be better put in pj_ssl_sock_param instead (i.e. add pj_ssl_sock_param.sockopt_nodelay, which defaults to PJ_TRUE #if (PJ_SSL_SOCK_IMP == PJ_SSL_SOCK_IMP_GNUTLS)).

AmarOk1412 commented 1 year ago

Our stack is indeed quite different (ICE over TCP is supported), but the minimal reproduction is just ioqueue send/select. The other flags may or may not interest you; I didn't open a ticket about them ;) (https://blog.cloudflare.com/when-tcp-sockets-refuse-to-die/ may interest you for other flags, but that's not related to this ticket). The point is that pjsip supports SSL sockets, so in this use case others may hit the same issue without our stack.

And it's indeed the case only with GnuTLS (as far as we tested), so I definitely agree with your last point.

sauwming commented 1 year ago

> And it's indeed the case only with GnuTLS (as far as we tested), so I definitely agree with your last point.

So will you create a patch for this or do you want us to do so?

Update: I have created a patch in #3708