metalefty opened 2 months ago
I haven't tested this, but the fix will be something like the following. Maybe the same change should be applied in other places.
@matt335672 What do you think?
```diff
diff --git a/common/os_calls.c b/common/os_calls.c
index afd5a958..1952d425 100644
--- a/common/os_calls.c
+++ b/common/os_calls.c
@@ -1567,7 +1567,7 @@ g_sck_can_send(int sck, int millis)
     pollfd.revents = 0;
     if (poll(&pollfd, 1, millis) > 0)
     {
-        if ((pollfd.revents & POLLOUT) != 0)
+        if ((pollfd.revents & (POLLOUT | POLLHUP | POLLERR)) != 0)
         {
             rv = 1;
         }
diff --git a/common/ssl_calls.c b/common/ssl_calls.c
index 70d2d7c8..63636770 100644
--- a/common/ssl_calls.c
+++ b/common/ssl_calls.c
@@ -1362,11 +1362,11 @@ ssl_tls_write(struct ssl_tls *tls, const char *data, int length)
              * SSL_ERROR_WANT_WRITE
              */
             case SSL_ERROR_WANT_READ:
-                g_sck_can_recv(tls->trans->sck, SSL_WANT_READ_WRITE_TIMEOUT);
-                continue;
+                break_flag = g_sck_can_recv(tls->trans->sck, SSL_WANT_READ_WRITE_TIMEOUT);
+                break;
             case SSL_ERROR_WANT_WRITE:
-                g_sck_can_send(tls->trans->sck, SSL_WANT_READ_WRITE_TIMEOUT);
-                continue;
+                break_flag = g_sck_can_send(tls->trans->sck, SSL_WANT_READ_WRITE_TIMEOUT);
+                break;
             /* socket closed */
             case SSL_ERROR_ZERO_RETURN:
```
I've had a look at the Linux/FreeBSD issue. The language used in both is more complicated than it needs to be. The POSIX manpage says:-

POLLERR: An error has occurred on the device or stream. This flag is only valid in the revents bitmask; it shall be ignored in the events member.

POLLHUP: A device has been disconnected, or a pipe or FIFO has been closed by the last process that had it open for writing. Once set, the hangup state of a FIFO shall persist until some process opens the FIFO for writing or until all read-only file descriptors for the FIFO are closed. This event and POLLOUT are mutually-exclusive; a stream can never be writable if a hangup has occurred. However, this event and POLLIN, POLLRDNORM, POLLRDBAND, or POLLPRI are not mutually-exclusive. This flag is only valid in the revents bitmask; it shall be ignored in the events member.

I think these are all saying the same thing, which is that we don't need to set POLLERR or POLLHUP in the events member, but they are checked (and returned) automatically.
As for the existing code, this looks OK to me. The poll() will return regardless of which flag is set. In ssl_calls.c this means we'll reattempt the write, and then we'll get the error, which we can log. That's where these two errors are coming from:-
[2024-08-28T17:48:35.369-0400] [ERROR] SSL_read: I/O error
...
[2024-08-28T17:49:23.596-0400] [ERROR] SSL_write: Failure in SSL library (protocol error?)
So I don't think this needs changing.
What does need changing is that something isn't checking the status of a read or write. It's not obvious looking at the code where this would be.
A couple of questions:-
1) Can the user reproduce this? It looks like it's resize-related.
2) Can the user take debug builds from us?
Thanks.
I think I saw something like this once during GFX testing.
My guess is we've sent the client something on the GFX channel that it doesn't like. The client is closing the GFX channel (but not the main channel), and then we're logging errors for everything else we send on that channel. We do do a lot of error checking in those code paths.
If that's right, we need to figure out what the client is objecting to.
@matt335672
Thanks for the info. I've been busy for a while. I haven't asked the end-user if the issue is GFX related but they've taken a stack trace.
The stack trace taken whilst the xrdp process is still running shows
#0 0x00007f87bae11a35 in write () from /lib64/libpthread.so.0
#1 0x00007f87bb5766a0 in g_file_write () from /usr/lib64/xrdp/libcommon.so.0
#2 0x00007f87bb57409a in internal_log_message () from /usr/lib64/xrdp/libcommon.so.0
#3 0x00007f87bb57445a in log_message () from /usr/lib64/xrdp/libcommon.so.0
#4 0x00007f87bb54bafb in xrdp_sec_send () from /usr/lib64/xrdp/libxrdp.so.0
#5 0x00007f87bb5486d6 in xrdp_rdp_send_data_from_channel () from /usr/lib64/xrdp/libxrdp.so.0
#6 0x00007f87bb52fa03 in libxrdp_send_bitmap () from /usr/lib64/xrdp/libxrdp.so.0
#7 0x00005645f41b1384 in xrdp_painter_end_update ()
#8 0x00005645f419fab7 in xrdp_bitmap_invalidate.part ()
#9 0x00005645f41a0f91 in xrdp_bitmap_invalidate ()
#10 0x00005645f41b968e in xrdp_wm_init ()
#11 0x00005645f41b9eb5 in xrdp_wm_check_wait_objs ()
#12 0x00005645f41b3724 in xrdp_process_main_loop ()
#13 0x00005645f41a6cfb in xrdp_listen_main_loop ()
#14 0x00005645f419e29c in main ()
strace output shows evidence that the socket is disconnected (POLLHUP):

poll([{fd=14, events=POLLOUT}], 1, 0) = 1 ([{fd=14, revents=POLLOUT|POLLERR|POLLHUP}])
That's useful - thanks.
That's the code that sends GFX updates when the backend isn't running in GFX mode.
My suspicion is one of these lines. Neither has error checking on it:-
If you're stuck for time at the moment, let me know if you'd like a patch.
@matt335672 Could you work on a patch? I'm still stuck for time for a while.
Will do. Should get to it Monday at the latest.
I think I can (sort of) reproduce this now with this patch:-
```diff
--- a/xrdp/xrdp_wm.c
+++ b/xrdp/xrdp_wm.c
@@ -833,6 +833,13 @@ xrdp_wm_init(struct xrdp_wm *self)
     {
         LOG(LOG_LEVEL_DEBUG, " xrdp_wm_init: no autologin / auto run detected, draw login window");
         xrdp_login_wnd_create(self);
+        static int first_time = 1;
+        if (first_time)
+        {
+            LOG(LOG_LEVEL_INFO, "Sleeping....");
+            g_sleep(10 * 1000);
+            LOG(LOG_LEVEL_INFO, "Done sleeping....");
+        }
         /* clear screen */
         xrdp_bitmap_invalidate(self->screen, 0);
         xrdp_wm_set_focused(self, self->login_window);
```
I then connect and kill the connection before the sleep finishes.
I don't get an infinite loop, but I get a lot of logging along these lines:-
[2024-09-16T11:19:53.833+0100] [ERROR] [xrdp_iso_send(xrdp_iso.c:639)] xrdp_iso_send: trans_write_copy_s failed
[2024-09-16T11:19:53.833+0100] [ERROR] [xrdp_mcs_send(xrdp_mcs.c:1475)] xrdp_mcs_send: xrdp_iso_send failed
[2024-09-16T11:19:53.833+0100] [ERROR] [xrdp_sec_send(xrdp_sec.c:1328)] xrdp_sec_send: xrdp_mcs_send failed
[2024-09-16T11:19:53.833+0100] [ERROR] [xrdp_rdp_send_data_from_channel(xrdp_rdp.c:680)] xrdp_rdp_send_data_from_channel: xrdp_sec_send failed
Despite my earlier comments, it's nothing to do with GFX - in fact in GFX mode things seem to terminate more quickly.
I'll step through this and figure out where the best place to detect the connection has failed is. Adding a check to each potential write is a lot of work and probably unnecessary. There's probably something that can be done at the top level.
After spending quite a lot of time looking into this, I think I'm chasing my own tail. I'm sure I'm missing something.
The very first comment contains this line:-

[2024-08-28T08:38:02.147-0400] [INFO ] Received memory_allocation_complete command. width: 7680, height: 4320

That has to be generated by xup.c. In turn that means the client has authenticated and is trying to connect to a session.
The stacktrace is related to the login screen being drawn. I didn't think that was possible after we've authenticated and are trying to connect to a session.
This looks like two separate problems. To stop even more flailing about on my part, I'm going to ignore the stacktrace for now.
Going back to the logging above, we've got this message:-

[2024-08-28T17:48:35.370-0400] [ERROR] Sending [ITU T.125] DisconnectProviderUltimatum failed

This is generated here in xrdp_mcs_disconnect():-
Following the call stack up:-

1) xrdp_mcs_disconnect() is called from one place only, in xrdp_sec_disconnect()
2) xrdp_sec_disconnect() is called from one place only, in xrdp_rdp_disconnect()
3) xrdp_rdp_disconnect() is called from one place only, in libxrdp_disconnect()
4) libxrdp_disconnect() is called from two places. These are both in xrdp_process_main_loop(), and could be replaced with a single call.

After libxrdp_disconnect() returns, I can't see how a separate xrdp process would fail to exit.
@metalefty - can you check a couple of things with the user:-

1) Are they running with fork=true in xrdp.ini?
2) Can you ask them to set EnableProcessId=true in the [Logging] section? That should let us tie up the logging with the actual xrdp processes being used. I'd like to know if these messages are coming from one xrdp process, or many xrdp processes.
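For reference, the relevant xrdp.ini fragments would look something like this (section placement for fork assumed from stock xrdp.ini; all other keys omitted):

```ini
; xrdp.ini (fragments)
[Globals]
fork=true

[Logging]
EnableProcessId=true
```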
> Are they running with fork=true in xrdp.ini?

Yes.

> Can you ask them to set EnableProcessId=true in the [Logging] section? That should let us tie up the logging with the actual xrdp processes being used. I'd like to know if these messages are coming from one xrdp process, or many xrdp processes.

Yes, trying to reproduce with EnableProcessId=true now.
An enterprise user reports that they're seeing a problem where xrdp 0.10.1 is stuck in a logging loop. I haven't reproduced the issue, but I guess there are 2 issues:

1) POLLERR/POLLHUP is not checked
2) g_sck_can_send/recv() is not checked

Regarding issue 1, POLLERR/POLLHUP is always checked on FreeBSD but not on GNU/Linux.

Linux: https://man7.org/linux/man-pages/man2/poll.2.html
FreeBSD: https://man.freebsd.org/cgi/man.cgi?query=poll&apropos=0&sektion=0&manpath=FreeBSD+14.1-RELEASE+and+Ports&arch=default&format=html