neutrinolabs / xrdp

xrdp: an open source RDP server
http://www.xrdp.org/
Apache License 2.0

Segmentation fault when writing lots of glyphs with RemoteFX codec #2068

Closed RolKau closed 6 months ago

RolKau commented 2 years ago

xrdp version: 0.9.17 (git sha 5808832) xorgxrdp version: 0.2.17 (git sha b943f0e) Client: FreeRDP 2.2.0+dfsg1-0ubuntu0.20.04.2

First I compile xrdp with:

./configure --enable-fuse --enable-jpeg --enable-pixman --enable-devel-debug --enable-devel-streamcheck

I connect to the server with these options:

xfreerdp /v:hostname /sec:rdp /gfx:rfx /rfx

This crashes immediately upon connect, with backtrace (from coredumpctl debug):

#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1  0x00007f838fe5f859 in __GI_abort () at abort.c:79
#2  0x00007f8390162b67 in parser_stream_overflow_check (s=0x7ffcae525800, n=2, is_out=1, file=0x7f839014528c "libxrdp.c", line=1678) at parse.c:52
#3  0x00007f83900fe8b9 in libxrdp_fastpath_send_surface (session=0x560ca6dda1e0, data_pad=0x7f8384000b60 "", pad_bytes=256, data_bytes=101759, destLeft=0, destTop=0, destRight=3840, destBottom=2160, bpp=32, codecID=3, width=3840, height=2160) at libxrdp.c:1678
#4  0x0000560ca5f4990a in xrdp_mm_process_enc_done (self=0x560ca6ddf7c0) at xrdp_mm.c:2805
#5  0x0000560ca5f49bba in xrdp_mm_check_wait_objs (self=0x560ca6ddf7c0) at xrdp_mm.c:2903
#6  0x0000560ca5f55c7a in xrdp_wm_check_wait_objs (self=0x560ca6e13a70) at xrdp_wm.c:2210
#7  0x0000560ca5f4f394 in xrdp_process_main_loop (self=0x560ca6dda090) at xrdp_process.c:287
#8  0x0000560ca5f3bf88 in xrdp_process_run (in_val=0x0) at xrdp_listen.c:152
#9  0x0000560ca5f3d3f8 in xrdp_listen_fork (self=0x560ca6dd44f0, server_trans=0x560ca6dd8380) at xrdp_listen.c:802
#10 0x0000560ca5f3d871 in xrdp_listen_main_loop (self=0x560ca6dd44f0) at xrdp_listen.c:954
#11 0x0000560ca5f31cd0 in main (argc=2, argv=0x7ffcae526408) at xrdp.c:705

and the line in /var/log/xrdp.log:

libxrdp.c:1678 Stream output buffer overflow. Size=0, pos=7, requested=2
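The check that fires here can be sketched in miniature. This is a hedged illustration with made-up struct and function names (`out_stream`, `out_check`, `out_uint16`), not xrdp's actual parse.c code:

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative output stream with an overflow guard; the names are
 * invented for this sketch, not xrdp's real structures. */
struct out_stream
{
    char *data;   /* start of the buffer */
    size_t size;  /* total capacity in bytes */
    size_t pos;   /* next write position */
};

/* Return nonzero if n more bytes fit. A streamcheck build logs
 * something like "Stream output buffer overflow. Size=.., pos=..,
 * requested=.." and aborts when a check like this fails. */
static int
out_check(const struct out_stream *s, size_t n)
{
    return s->pos + n <= s->size;
}

/* Write a 16-bit value little-endian, refusing to overflow. */
static int
out_uint16(struct out_stream *s, unsigned int v)
{
    if (!out_check(s, 2))
    {
        return 0;
    }
    s->data[s->pos++] = (char)(v & 0xff);
    s->data[s->pos++] = (char)((v >> 8) & 0xff);
    return 1;
}
```

The log line above corresponds to such a check failing: a 2-byte write was requested at pos=7 of an already exhausted buffer.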

If I however now recompile again without the range check:

./configure --enable-fuse --enable-jpeg --enable-pixman --enable-devel-debug

I can log on, even with more caching options enabled:

xfreerdp /v:hostname /sec:rdp /gfx:rfx /rfx /codec-cache:rfx +bitmap-cache +offscreen-cache +glyph-cache +clipboard /f

I can now start Firefox and browse a bunch of webpages without any problem. But, if I put two terminal windows sized to 1920x2160 side-by-side and run the command od -c /dev/random in each of them, the server coredumps rather quickly with this backtrace:

#0  0x0000561d8ac3f5e6 in loop1f ()
#1  0x0000561d8ac3f7f4 in rfxcodec_encode_dwt_shift_amd64_sse41 ()
#2  0x0000561d0f324928 in ?? ()
#3  0x00007fcb7af9b300 in ?? ()
#4  0x0000561d8c214460 in ?? ()
#5  0x0000561d8c216460 in ?? ()
#6  0x0000561d8c210410 in ?? ()
#7  0x0000561d8ac3fb27 in rfx_encode_component_rlgr3_amd64_sse41 (enc=0x561d8c210410, qtable=<optimized out>, data=<optimized out>, buffer=0x7fcb2460aad9 "", buffer_size=-39721, size=0x7fcb2bcb38cc) at rfxencode_tile_amd64.c:99
#8  0x0000561d8ac3972e in rfx_encode_yuv (enc=enc@entry=0x561d8c210410, yuv_data=0x7fcb7af9b300 <error: Cannot access memory at address 0x7fcb7af9b300>, width=<optimized out>, height=<optimized out>, stride_bytes=stride_bytes@entry=15360, y_quants=<optimized out>, u_quants=0x561d471df50f <error: Cannot access memory at address 0x561d471df50f>, v_quants=0x561d8dbf62b7 "\025\300\215&05\222\340d_Ix\006\253\b\200\300GA\264~Q\001dV\377\345\374\302GĤ{\260 \006*\001\346\222\374\377&\002\241\376\f\037ܸ\207|", data_out=0x7fcb2bcb3980, y_size=0x7fcb2bcb38cc, u_size=0x7fcb2bcb38d0, v_size=0x7fcb2bcb38d4) at rfxencode_tile.c:221
#9  0x0000561d8ac39343 in rfx_compose_message_tile_yuv (yIdx=<optimized out>, xIdx=<optimized out>, quantIdxCr=<optimized out>, quantIdxCb=<optimized out>, quantIdxY=<optimized out>, quantVals=0x561d8ac44e9c <g_rfx_default_quantization_values> "ffw\210\230"<error: Cannot access memory at address 0x561d8ac44ea1>, stride_bytes=15360, tile_height=<optimized out>, tile_width=<optimized out>, tile_data=<optimized out>, s=0x7fcb2bcb3980, enc=0x561d8c210410) at rfxencode_compose.c:232
#10 rfx_compose_message_tileset (width=-1966846308, height=2176, flags=0, num_quants=0, quants=0x0, num_tiles=1394, tiles=0x7fcb24600fb0, stride_bytes=15360, buf=0x7fcb294d4000 "", s=0x7fcb2bcb3980, enc=0x561d8c210410) at rfxencode_compose.c:488
#11 rfx_compose_message_data (enc=enc@entry=0x561d8c210410, s=s@entry=0x7fcb2bcb3980, regions=<optimized out>, num_regions=num_regions@entry=12, buf=buf@entry=0x7fcb294d4000 "", width=width@entry=3840, height=2176, stride_bytes=15360, tiles=0x7fcb24600fb0, num_tiles=1394, quants=0x0, num_quants=0, flags=0) at rfxencode_compose.c:589
#12 0x0000561d8ac385ba in rfxcodec_encode_ex (handle=0x561d8c210410, cdata=<optimized out>, cdata_bytes=0x7fcb2bcb3a8c, buf=0x7fcb294d4000 "", width=3840, height=2176, stride_bytes=15360, regions=0x7fcb2460a828, num_regions=12, tiles=0x7fcb24600fb0, num_tiles=1394, quants=0x0, num_quants=0, flags=0) at rfxencode.c:333
#13 0x0000561d8ac3865f in rfxcodec_encode (handle=<optimized out>, cdata=<optimized out>, cdata_bytes=<optimized out>, buf=<optimized out>, width=<optimized out>, height=<optimized out>, stride_bytes=15360, regions=0x7fcb2460a828, num_regions=12, tiles=0x7fcb24600fb0, num_tiles=1394, quants=0x0, num_quants=0) at rfxencode.c:352
#14 0x0000561d8ac1eccf in process_enc_rfx (self=0x561d8c20b870, enc=0x561d8c1cf640) at xrdp_encoder.c:384
#15 0x0000561d8ac1effa in proc_enc_msg (arg=0x561d8c20b870) at xrdp_encoder.c:507
#16 0x00007fcb2ef76609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#17 0x00007fcb2f0b2293 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

The buffer_size parameter in frame #7 looks suspicious. Unfortunately, if I build with CFLAGS="-O0", the backtrace gets cut off after frame #6 instead of giving more information, for some unknown reason.

It seems to me that this bug is triggered whenever a lot of text is written as characters to the display; programs that draw the text themselves as graphics don't elicit this behaviour.

matt335672 commented 2 years ago

Hi @RolKau

Thanks for raising these.

The first one is addressed in #2066 which I'm about to merge.

The second one clearly needs some investigation.

The way I'll do this (when I get round to it) is to build with optimisation disabled, and then attach a debugger to Xorg before running the od.

matt335672 commented 2 years ago

Having merged #2066, I'm now getting this one when I use your second command:-

[20211202-11:26:41] [CORE ] [parser_stream_overflow_check(parse.c:49)] xrdp_rdp.c:826 Stream output buffer overflow. Size=13, pos=13, requested=2

I'll start working through these. It's likely there's a buffer overflow involved in this.

matt335672 commented 2 years ago

I've fixed the stream overflow checks now (see #2070), but I've not gotten any further on your second problem. I don't have a 4K screen. I've tried reproducing your problem with panning enabled, but no luck so far.

Looking at your stack trace, something odd is happening at frame #10. The width -1966846308 should be 3840, as this value is passed straight through in the width variable in the code. I've not figured out yet how this could happen.

A couple of questions for you:-

1. Which desktop are you using?
2. What is the output of xrdp -v?

Thanks.

RolKau commented 2 years ago

I am using Xfce 4.16 as desktop environment (and thus the od commands were run in xfce4-terminal). Output of xrdp -v is:

xrdp 0.9.17
  A Remote Desktop Protocol Server.
  Copyright (C) 2004-2020 Jay Sorg, Neutrino Labs, and all contributors.
  See https://github.com/neutrinolabs/xrdp for more information.

  Configure options:
      --enable-fuse
      --enable-jpeg
      --enable-pixman
      --enable-devel-debug
      CFLAGS=-O0

  Compiled with OpenSSL 1.1.1f  31 Mar 2020

The original build with the stack trace was built without CFLAGS=-O0; this is the second build, attempting to get a better backtrace. Come to think of it, perhaps the problem is that the stack is corrupted? If I dump the other thread in the core, I get a sensible backtrace with files and line numbers.

matt335672 commented 2 years ago

I'll have a go with xfce - I've been trying GNOME up 'til now. It'll probably be next week now.

On your CFLAGS try adding -g -fvar-tracking. That might be more informative.

If you can get a coredump out of xrdp and you can post it somewhere, together with the xrdp executable, I'll be happy to have a look at it.

RolKau commented 2 years ago

@matt335672 I managed to reproduce this in a container, and I've posted an archive with the program file and core.

It is built with:

sed -i "s/\(AX_APPEND_COMPILE_FLAGS\)(\[-g -O0\])/\1([-ggdb3 -O0])/" configure.ac
./bootstrap
./configure --enable-fuse --enable-jpeg --enable-pixman --enable-devel-debug CFLAGS="-O0 -ggdb3 -fvar-tracking"
make
sudo make install
matt335672 commented 2 years ago

Brilliant - thanks for that.

I'd like to say I'll take a look right now, but I've been fairly snowed under with issue commenting since the weekend. I will get to it as soon as I can.

RolKau commented 2 years ago

Regarding reproducibility: it doesn't seem to depend on Xfce at all; I've been able to reproduce it by setting ~/.xsession to

/usr/bin/xterm -geometry 3840x2160

and then running the command od -w200 -c /dev/random in this window (it actually crashes faster).

matt335672 commented 2 years ago

xterm uses old-school fonts rather than Xft ones, so it's expecting the server to get involved a bit more. That's a useful observation.

matt335672 commented 2 years ago

Quick update - I couldn't get anywhere with the coredump, as the libraries weren't included. My bad - I should have thought of that.

On the plus side I've been able to reproduce the crash by setting up a panning display of your size, and running the Windows 10 RDP client in a window. With your rather marvellous xterm hack above, the crash is rather quick, so thanks for that. I'm out of time today, but I'll resume tomorrow.

RolKau commented 2 years ago

FWIW, you can get an even more automated setup by putting the entire command /usr/bin/xterm -geometry 3840x2160 -e /bin/sh -c "od -w200 -c /dev/random" in ~/.xsession.

I added an assert to function rfx_encode_component_rlgr3_amd64_sse41() in librfxcodec/src/amd64/rfxencode_tile_amd64.c, checking if buffer_size is < 0, but alas it didn't trigger, so I'm leaning towards thinking that (most of) the bogus parameter values we see are due to the stack being overwritten.

matt335672 commented 2 years ago

There seems to be something odd going on in the amd64 assembler code.

I've managed to build the assembler with debug, and when the error triggers I'm getting this partial stack trace:-

#0  loop1f () at rfxcodec_encode_dwt_shift_amd64_sse41.asm:1069
#1  0x0000558bd8d50744 in rfxcodec_encode_dwt_shift_amd64_sse41 () at rfxcodec_encode_dwt_shift_amd64_sse41.asm:1220
#2  0x0000558bbc599f66 in ?? ()
#3  0x00007ff6ce3e6600 in ?? ()
#4  0x0000558bd9ecdcb0 in ?? ()
#5  0x0000558bd9ecfcb0 in ?? ()
#6  0x0000000000000000 in ?? ()

Building librfxcodec without the assembler support doesn't trigger the error.

RolKau commented 2 years ago

My coredump is at the same location.

rsi contains an invalid memory address, but it seems to only be modified according to the loop counter. As ecx is 8, which is also the initial value, my hunch is that rsi was corrupt already when rfx_dwt_2d_encode_block_verti_8_64 was called.

From the second frame, in function rfxcodec_encode_dwt_shift_amd64_sse41, rsi is simply forwarded from a stack parameter, but the call to that again in rfx_encode_component_rlgr1_amd64_sse41 should be the same as when compiling without assembler support. Hm. Maybe the stack was corrupted by some other code, and this is just the poor guy ending up with the buck?

RolKau commented 2 years ago

I think I have made some progress. Here is a write-up so you can follow the reasoning. I'm going to skip some of the intermediate dead ends, so it will seem that I am a little more clairvoyant than I really am.

I started out by observing that some of the buffers that are passed into the frame that traps point to invalid memory, so I added this line to librfxcodec/src/amd64/rfxencode_tile_amd64.c:100

if ((*qtable ^ (char) *data ^ *(char *) enc->dwt_buffer1 ^ *(char *) enc->dwt_buffer) == 0x7F) __asm__ volatile("int $0x03");

It just accesses all the arrays so that it will trap if one of them points to invalid memory. The test and interrupt are just so that the compiler cannot optimize this code away. I'm taking my chances that the memory doesn't end up accidentally satisfying exactly the condition I put in here.

Running with this line in effect now traps with this backtrace:

#0  0x0000555555577959 in rfx_encode_component_rlgr3_amd64_sse41 (enc=0x5555555f1180, qtable=0x55558ffafaa5 <error: Cannot access memory at address 0x55558ffafaa5>, data=0x7fff90c08500 <error: Cannot access memory at address 0x7fff90c08500>, buffer=0x7fffec30bca7 "", buffer_size=-45127, size=0x7ffff52818ec) at rfxencode_tile_amd64.c:100
#1  0x000055555557153e in rfx_encode_yuv (enc=enc@entry=0x5555555f1180, yuv_data=0x7fff90c08500 <error: Cannot access memory at address 0x7fff90c08500>, width=<optimized out>, height=<optimized out>, stride_bytes=stride_bytes@entry=15360, y_quants=<optimized out>, u_quants=0x5555bf6fa587 <error: Cannot access memory at address 0x5555bf6fa587>, v_quants=0x555519be126e <error: Cannot access memory at address 0x555519be126e>, data_out=0x7ffff52819a0, y_size=0x7ffff52818ec, u_size=0x7ffff52818f0, v_size=0x7ffff52818f4) at rfxencode_tile.c:221
#2  0x0000555555571153 in rfx_compose_message_tile_yuv (yIdx=<optimized out>, xIdx=<optimized out>, quantIdxCr=<optimized out>, quantIdxCb=<optimized out>, quantIdxY=<optimized out>, quantVals=0x55555557cd8c <g_rfx_default_quantization_values> "ffw\210\230", stride_bytes=15360, tile_height=<optimized out>, tile_width=<optimized out>, tile_data=<optimized out>, s=0x7ffff52819a0, enc=0x5555555f1180) at rfxencode_compose.c:232
#3  rfx_compose_message_tileset (width=1431817612, height=2176, flags=0, num_quants=0, quants=0x0, num_tiles=2040, tiles=0x7fffec300c60, stride_bytes=15360, buf=0x7ffff2aa2000 "", s=0x7ffff52819a0, enc=0x5555555f1180) at rfxencode_compose.c:488
#4  rfx_compose_message_data (enc=enc@entry=0x5555555f1180, s=s@entry=0x7ffff52819a0, regions=<optimized out>, num_regions=num_regions@entry=1, buf=buf@entry=0x7ffff2aa2000 "", width=width@entry=3840, height=2176, stride_bytes=15360, tiles=0x7fffec300c60, num_tiles=2040, quants=0x0, num_quants=0, flags=0) at rfxencode_compose.c:589
#5  0x00005555555703ca in rfxcodec_encode_ex (handle=0x5555555f1180, cdata=cdata@entry=0x7fffec000c60 "\304\314\016", cdata_bytes=cdata_bytes@entry=0x7ffff5281ab4, buf=0x7ffff2aa2000 "", width=3840, height=2176, stride_bytes=15360, regions=0x7fffec30eb80, num_regions=1, tiles=0x7fffec300c60, num_tiles=2040, quants=0x0, num_quants=0, flags=0) at rfxencode.c:333
#6  0x000055555557046f in rfxcodec_encode (handle=<optimized out>, cdata=cdata@entry=0x7fffec000c60 "\304\314\016", cdata_bytes=cdata_bytes@entry=0x7ffff5281ab4, buf=<optimized out>, width=<optimized out>, height=<optimized out>, stride_bytes=15360, regions=0x7fffec30eb80, num_regions=1, tiles=0x7fffec300c60, num_tiles=2040, quants=0x0, num_quants=0) at rfxencode.c:352
#7  0x000055555555f224 in process_enc_rfx (self=0x5555555ae2c0, enc=0x5555556a9970) at xrdp_encoder.c:384
#8  0x000055555555f5db in proc_enc_msg (arg=0x5555555ae2c0) at xrdp_encoder.c:507
#9  0x000055555555f622 in proc_enc_msg (arg=<optimized out>) at xrdp_encoder.c:454
#10 0x00007ffff7cba609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#11 0x00007ffff7df6293 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Of interest is that the width parameter is OK in frame #4, but not in frame #3. The code in rfxencode_compose.c from line 581 to 589 doesn't alter it, so I reckoned it must have been corrupted somewhere in this region. This turned out to be a wild goose chase; it is probably altered later, by someone overwriting the stack.

Next attempt was walking back down again from frame #4 and noticing that the yuv_data parameter is corrupted in frame #1. Thus, I added this line at librfxcodec/src/rfxencode_tile.c:221

if (*yuv_data == 0x7F) __asm__ volatile("int $0x03");

This gave me another trap, where tile_data in the frame above the trap location was now invalid, so I repeated the trick at librfxcodec/src/rfxencode_compose.c:232. Repeating this yet again at librfxcodec/src/rfxencode_compose.c:489 (which is really 488 in the original source, since we added another line earlier), I now get another trap:

#0  0x000055555557f221 in rfx_compose_message_tileset (enc=enc@entry=0x555555601240, s=s@entry=0x7ffff5281980, buf=buf@entry=0x7ffff2aa2000 "", width=width@entry=3840, height=height@entry=2176, stride_bytes=stride_bytes@entry=15360, tiles=0x7fffec300c60, num_tiles=2040, quants=0x0, num_quants=0, flags=0) at rfxencode_compose.c:490
#1  0x000055555557f865 in rfx_compose_message_data (enc=enc@entry=0x555555601240, s=s@entry=0x7ffff5281980, regions=regions@entry=0x7fffec30eb80, num_regions=num_regions@entry=1, buf=buf@entry=0x7ffff2aa2000 "", width=width@entry=3840, height=2176, stride_bytes=15360, tiles=0x7fffec300c60, num_tiles=2040, quants=0x0, num_quants=0, flags=0) at rfxencode_compose.c:595
#2  0x000055555557d677 in rfxcodec_encode_ex (handle=handle@entry=0x555555601240, cdata=cdata@entry=0x7fffec000c60 "\304\314\016", cdata_bytes=cdata_bytes@entry=0x7ffff5281a8c, buf=buf@entry=0x7ffff2aa2000 "", width=width@entry=3840, height=height@entry=2176, stride_bytes=15360, regions=0x7fffec30eb80, num_regions=1, tiles=0x7fffec300c60, num_tiles=2040, quants=0x0, num_quants=0, flags=0) at rfxencode.c:333
#3  0x000055555557d716 in rfxcodec_encode (handle=0x555555601240, cdata=cdata@entry=0x7fffec000c60 "\304\314\016", cdata_bytes=cdata_bytes@entry=0x7ffff5281a8c, buf=0x7ffff2aa2000 "", width=3840, height=2176, stride_bytes=15360, regions=0x7fffec30eb80, num_regions=1, tiles=0x7fffec300c60, num_tiles=2040, quants=0x0, num_quants=0) at rfxencode.c:352
#4  0x0000555555563cdf in process_enc_rfx (self=0x5555555f7ab0, enc=0x5555556bbd60) at xrdp_encoder.c:384
#5  0x000055555556400a in proc_enc_msg (arg=0x5555555f7ab0) at xrdp_encoder.c:507
#6  0x00007ffff7cba609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#7  0x00007ffff7df6293 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

Now all parameters look fine, but it still traps trying to access tile_data. The line above the trap reads tile_data = buf + (y << 8) * (stride_bytes >> 8) + (x << 8); so, checking these variables, we see that buf and stride_bytes have the same values that the function was called with, but x and y look corrupted. index and num_tiles look sensible, so someone is probably overwriting the contents of the tiles array.

Setting a hardware watchpoint on this array at the start of that function results in traps at various places in rfx_encode_diff_rlgr3 in rfxencode_diff_rlgr3.c, sometimes at line 255, sometimes at line 236; that may be random, from uninitialized memory.

If I continue from this watchpoint, the next time it traps at calculating tile_data, the data in tiles is still the same as what rfx_encode_diff_rlgr3 changed it to.

I now think that this is the one that is overwriting other memory, but I haven't started looking at it.
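For anyone retracing this, the watchpoint step looks roughly like the following gdb session (illustrative commands; the exact watch expression depends on the tile struct in librfxcodec):

```
(gdb) break rfx_compose_message_tileset
(gdb) run
(gdb) watch -location tiles[0]   # hardware watchpoint on the start of the array
(gdb) continue                   # stops inside rfx_encode_diff_rlgr3 on the stray write
```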

matt335672 commented 2 years ago

That's excellent work - I like the __asm__ volatile("int $0x03") as a simple trap mechanism - I'll make a note of that!

I've got a bit of time to dive into this today, so with a bit of luck I'll make a bit more progress.

matt335672 commented 2 years ago

Hi @RolKau

First off and most importantly, thanks again for the splendid work you've done on this above. I've managed to make a bit of progress, but without your input I certainly wouldn't have made it this far in the time I had.

I reproduced what you'd done above. When I hit the trap, I examined the tiles array and discovered it had been completely overwritten from the start up to at least the erroneous tile.

I traced back, and the tiles array is allocated in xrdp/xrdp_encoder.c:-

https://github.com/neutrinolabs/xrdp/blob/a266d67fdfef886793df1e623eb05c95da71af37/xrdp/xrdp_encoder.c#L342-L353

The tiles array is part of a larger buffer, allocated with a single call which looks like this:-

              <----------Output areas----------->  <-----Input areas----->
            +----------------+-------------------+------------+-----------+
out_data -> | output_header  | compressed_output |  tile_data | rect data |
            +----------------+-------------------+------------+-----------+
                                                  ^            ^
tiles---------------------------------------------+            |
                                                               |
rects----------------------------------------------------------+

The whole lot is passed to the compressor. It reads the tile data and the rect data and from those two constructs the compressed_output. Finally the output header is filled in, and the output_header+compressed_output are sent to the client. Then the whole array is deallocated.
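To illustrate why an output overrun lands in the input tiles, here is a hedged sketch of the single-allocation layout above (the sizes, struct, and function names are invented for the example, not xrdp's real values):

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Illustrative sizes only -- not the real xrdp values. */
#define HDR_BYTES   64
#define OUT_BYTES   4096
#define TILE_BYTES  1024
#define RECT_BYTES  256

struct enc_buffers
{
    char *out_data;  /* the single allocation */
    char *header;    /* output header */
    char *comp_out;  /* compressed output */
    char *tiles;     /* tile descriptors (input) */
    char *rects;     /* rect data (input) */
};

/* Carve the four areas out of one allocation, as the diagram shows.
 * Returns 0 on success, 1 on allocation failure. */
static int
enc_buffers_init(struct enc_buffers *b)
{
    b->out_data = malloc(HDR_BYTES + OUT_BYTES + TILE_BYTES + RECT_BYTES);
    if (b->out_data == NULL)
    {
        return 1;
    }
    b->header   = b->out_data;
    b->comp_out = b->header + HDR_BYTES;
    b->tiles    = b->comp_out + OUT_BYTES;  /* starts right after the output area */
    b->rects    = b->tiles + TILE_BYTES;
    return 0;
}
```

Because the compressed output sits immediately before the tile data in the same allocation, an unchecked write past the end of the output area silently corrupts the tiles the encoder is still reading - which matches the symptom in the backtraces.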

I rejigged the code to allocate separate areas for the inputs and outputs. Then I ran the whole lot through valgrind. Since the output array was now visible as a single entity to valgrind I could see the output array violations being generated in rfx_encode_diff_rlgr3() in the places you mentioned.

In short, what seems to be happening is that for large screens and some particular output configurations, the compressed data buffer is larger than the compressed_output buffer. rfx_encode_diff_rlgr3() does not check for buffer overflows.

I've added some buffer checks and created a patch:-

issue2068.patch.txt

It seems to work OK on my setup, but given everything I've got is virtualised or panned it really needs checking on some decent hardware.

RolKau commented 2 years ago

I can confirm that the patch (on top of 0.9.17) works; not without dropping frames of course, but at least it doesn't crash, on "real" hardware, i.e. a FreeRDP client running on a monitor with 4k resolution (and the server, albeit containerized, running on a non-virtual kernel).

matt335672 commented 2 years ago

Thanks.

The problem seems to be that the codec is assuming all the compressed surface command data will fit in a fastpath segment, and for large screens this may not be the case. Looking at [MS-RDPBCGR], it seems that the correct solution is to split the surface command updates over multiple TS_FP_SURFCMDS PDUs. That's quite a lot of work however, given the current architecture in xrdp/xrdp_encoder.c.
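The splitting idea can be sketched generically (hypothetical names throughout; a real implementation would wrap each fragment in a proper TS_FP_SURFCMDS header per [MS-RDPBCGR]):

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for the real transmit path; first/last flag the fragment's
 * position in the sequence. Returns 0 on success. */
typedef int (*send_pdu_fn)(const char *data, size_t bytes, int first, int last);

/* Split one large encoded update into chunks no larger than max_frag
 * and hand each chunk to send_pdu. Returns 0 on success, 1 on error. */
static int
send_fragmented(const char *data, size_t bytes, size_t max_frag,
                send_pdu_fn send_pdu)
{
    size_t off = 0;
    if (max_frag == 0)
    {
        return 1;
    }
    while (off < bytes)
    {
        size_t chunk = bytes - off;
        if (chunk > max_frag)
        {
            chunk = max_frag;
        }
        if (send_pdu(data + off, chunk, off == 0, off + chunk == bytes) != 0)
        {
            return 1;
        }
        off += chunk;
    }
    return 0;
}
```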

The patch I've given you isn't even logging the buffer overflow, so although it illustrates that we've found the problem, more needs to be done to make even this approach workable. Without the logging, we'll just expend a lot of effort in the future trying to figure out why frames are being dropped.

The pragmatic approach is (I think) to log the overflow, and then figure out a way to limit the number of tiles that get passed to the encoder at once.

I'll start a discussion on this, which hopefully will gather some inputs from people who know more about this area than I do.

RolKau commented 2 years ago

I think that this may be more prevalent than we realise, too. I applied the patch to my actual workstation, thinking that missing frames are at least better than a (partially) lost session, and whereas before I had a crash around once a day, I now notice update glitches far more often than that, which makes me think that the overrun only hits invalid memory some of the time, not every time.

matt335672 commented 2 years ago

Yes, that wouldn't surprise me for large screens.

I've just been poking around in the fastpath code, which I'll admit I'm not totally familiar with and reading some specs. I have something else for you to try, which may give a more satisfactory result.

When the client connects to the server, both exchange 'capability sets' which detail what facilities are supported by the other side. The server sends its capabilities first, and the client replies.

The value which determines how big an RFX surface update can be is the 'fastpath max fragment size'. This is the total update size, regardless of any fragmentation which may be implemented.

We send our fastpath max fragment size here :-

https://github.com/neutrinolabs/xrdp/blob/5a11b698ef0a501d01daa2addf63b501d79935c5/libxrdp/xrdp_caps.c#L1140-L1147

It's currently hard-coded to 3MB on line 1144.

See also this code in freerdp :-

https://github.com/FreeRDP/FreeRDP/blob/1265114be7d93cd4e521b88d3e271ea84b1ae13c/libfreerdp/core/capabilities.c#L2493-L2511

And also (as indicated) the second bullet-point in this section of [MS-RDPRFX]. This is the RFX specification. In summary for RemoteFX the client has to support a 'fastpath max fragment size' at least as big as the one we send it. So we can control the size of the buffer by sending a larger value here, like freerdp does.

Your screen resolution is (IIRC) 3840 x 2160. The FreeRDP code would be sending a value of (60 * 34 * 16384 + 16384) for this, which is 33439744. This is quite a bit bigger than our value, by a factor of 10 or so.

Please, at this point, check both my arithmetic and my reasoning. Either could be faulty!

If you're in agreement with me, can you modify the indicated code in libxrdp/xrdp_caps.c to send the larger value? It will be interesting to see what happens.
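The arithmetic above can be checked with a small sketch (the function name is invented for this illustration; 16384 is one 64x64 tile at 4 bytes per pixel):

```c
#include <assert.h>

/* Sketch of the FreeRDP-style calculation discussed above: one
 * 16384-byte slot per 64x64 tile covering the screen, plus one extra
 * 16384-byte slot. Illustrative only -- not the actual xrdp code. */
static unsigned int
rfx_max_fragment_size(unsigned int width, unsigned int height)
{
    unsigned int tiles_x = (width + 63) / 64;   /* 3840 -> 60 */
    unsigned int tiles_y = (height + 63) / 64;  /* 2160 -> 34 */
    return tiles_x * tiles_y * 16384 + 16384;
}
```

For 3840x2160 this gives 60 * 34 * 16384 + 16384 = 33439744 bytes, against the 3MB currently hard-coded in libxrdp/xrdp_caps.c.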

matt335672 commented 2 years ago

Here's a patch to implement the freeRDP logic. It works OK for me with MSTSC and xfreerdp on my setup, but I'd really appreciate a second opinion on this one.

I still think the guard and logging are needed in the encoder issue2068.patch.txt .

Nexarian commented 2 years ago

@matt335672 I'd also suggest looking into how Ogon handles this as well.

matt335672 commented 2 years ago

Thanks @Nexarian

Ogon, of course, uses freerdp. Their routine to send an RFX update is here (if I'm reading things correctly):-

https://github.com/ogon-project/ogon/blob/24301079b1ff35f1118ce873b9e8dccf71fc6457/rdp-server/graphics.c#L560-L653

There's an interesting comment at the start of the routine regarding using OGON_USE_MULTIMESSAGE_RFX_ENCODER to split up the single RFX update message. This doesn't appear to be plumbed in however.

Even if it was, I can't see the point in using it - a large fastpath PDU will still be fragmented when we send it using xrdp_rdp_send_fastpath().

In short, at the moment the patch I've attached above seems to be the general way forward for both freerdp and Ogon, so there seems little reason for us not to use it too.

I'm still concerned by the buffer overrun in librfxcodec - it's a pig to find when it happens, as this thread will testify.

trishume commented 2 years ago

I discovered and patched this issue in mid-2020 but only emailed it to the xrdp-core list specified for security issues, since I wasn't sure if the buffer overflow with user-controlled data might be exploitable to gain root access from displaying something on screen, which would be approximately the worst possible vulnerability.

I implemented splitting of updates over multiple messages, so that it still works and correctly displays large updates like displaying lots of small glyphs on a 4k screen. I've attached the two patches, one for xrdp and one for librfxcodec. Unfortunately I don't have time right now to make them proper PRs.

patches.zip

matt335672 commented 2 years ago

Thanks @trishume - that's very useful indeed.

Firstly, I must apologise on behalf of the team for none of us picking up your patch from xrdp-core. I can confirm it arrived. For my own part I didn't address it as it's an area I didn't have any knowledge of back then, but I shouldn't have just assumed someone else would pick it up.

I've had a quick read of your patch, and this looks to me to be a decent way to solve the buffer overrun. There's also scope for logging when an overrun happens if we want to do that. At the moment the librfxcodec code can't easily use the logging functionality, but with this change this can be addressed in xrdp_encode.c.

I'd very much like to get something in for v0.9.18, and time is short at the moment. I'm minded to progress this with a combination of the two patches I've given @RolKau above for v0.9.18, and then merge in your functionality after v0.9.18, when I better understand this area and have had time to digest and reformat your patches. This is still an area I'm learning about.

@RolKau - please let me know when you've had a chance to check the last patch I've given you to see if it addresses your immediate problem.

@metalefty - are you happy with this general approach, and for us to get something in for v0.9.18 if we can manage it? If you are, I'll sort out a PR, and raise another issue to incorporate some of @trishume's patch post v0.9.18.

trishume commented 2 years ago

I would avoid adding logging; with my patch, splitting is just a normal thing that can happen if there's a lot of info. It can potentially happen very frequently under normal circumstances, like a fullscreen 4k terminal spewing a lot of small text, which you wouldn't want filling up your logs.

metalefty commented 2 years ago

@matt335672 Let's get it in v0.9.18 if you can work on the issue.

RolKau commented 2 years ago

I have now run a version consisting of 0.9.17 + #2066 + #2070 + buffer-check-20211209 + buffer-size-20211213 (all referred to from this thread) on my regular workstation. I haven't experienced any crashes yet, so that's very good, and in my opinion a reason to apply them as a first measure in any case.

It also seems to be a bit swifter, although that doesn't really make any sense and I'll attribute that to psychological effects :-) It does however still have some display glitches; a notable case seemed to be having this very page in a half-screen window (i.e. size 1920x2160) open in Firefox and then scrolling a bit up and down using the mouse wheel, but it is not immediately reproducible; it seems to be somewhat context dependent.

I'll try Tristan's patches next, but it'll take a couple of days: Due to mandatory work-from-home in my county at the moment, I have less opportunity to experiment with the main machine (but I'll see if I can give it a go in the container at least).

matt335672 commented 2 years ago

@RolKau - thank you very much for the update on this, particularly with the restrictions in place on your movements.

I'll give this a priority for v0.9.18 so hopefully there will be something available for the end of the month.

matt335672 commented 2 years ago

I've had a bit of a blitz on this, and re-read @trishume's patch. I've tested it, and think I understand it enough now to be able to support it, so I've rebased it and reformatted it according to the xrdp coding standard. I've combined it with other code from this thread to create #2087. It should be clear from the commits who is responsible for what.

@trishume - I know you're not a big fan of our coding standard(!), but if you can find time to add any comments on this we'd be grateful. As far as the librfxcodec changes go in neutrinolabs/librfxcodec#48, I had one question, and I also had to make a minor change to rfx_compose_message_tileset() to prevent a compiler warning (in a separate commit). The compiler's picking up on the pathological case of num_tiles being zero.

RolKau commented 2 years ago

I have now tried a combination of v0.9.17 + #2066 + #2070 + #2087 + neutrinolabs/librfxcodec#48 in a container setup (using an ~/.xsession that does od -w 200 < /dev/random), compiled with CFLAGS="-O2 -ggdb3 -fvar-tracking", and, while it doesn't show any visual glitches, it traps with this backtrace:

#0  0x0000555555577516 in loop1f ()
#1  0x0000555555577724 in rfxcodec_encode_dwt_shift_amd64_sse41 ()
#2  0x000055555557cd8c in ?? ()
#3  0x00007ffff39f4000 in ?? ()
#4  0x00005555555f3840 in ?? ()
#5  0x00005555555f5840 in ?? ()
#6  0x00005555555ef7f0 in ?? ()
#7  0x0000555555577a57 in rfx_encode_component_rlgr3_amd64_sse41 (enc=0x5555555ef7f0, qtable=<optimized out>, data=<optimized out>, buffer=0x7fffec1e3f15, buffer_size=1166667, size=0x7ffff52818e4) at rfxencode_tile_amd64.c:99
#8  0x00005555555716db in check_and_rfx_encode (size=0x7ffff52818e4, buffer_size=<optimized out>, buffer=<optimized out>, data=0x7ffff39f4000 <error: Cannot access memory at address 0x7ffff39f4000>, qtable=0x55555557cd8c <g_rfx_default_quantization_values> "ffw\210\230", enc=0x5555555ef7f0) at rfxencode_tile.c:106
#9  rfx_encode_yuv (enc=enc@entry=0x5555555ef7f0, yuv_data=0x7ffff39f2000 <error: Cannot access memory at address 0x7ffff39f2000>, width=<optimized out>, height=<optimized out>, stride_bytes=stride_bytes@entry=15360, y_quants=<optimized out>, u_quants=0x55555557cd8c <g_rfx_default_quantization_values> "ffw\210\230", v_quants=0x55555557cd8c <g_rfx_default_quantization_values> "ffw\210\230", data_out=0x7ffff5281990, y_size=0x7ffff52818dc, u_size=0x7ffff52818e0, v_size=0x7ffff52818e4) at rfxencode_tile.c:250
#10 0x000055555557120c in rfx_compose_message_tile_yuv (yIdx=<optimized out>, xIdx=<optimized out>, quantIdxCr=<optimized out>, quantIdxCb=<optimized out>, quantIdxY=<optimized out>, quantVals=0x55555557cd8c <g_rfx_default_quantization_values> "ffw\210\230", stride_bytes=15360, tile_height=<optimized out>, tile_width=<optimized out>, tile_data=<optimized out>, s=0x7ffff5281990, enc=0x5555555ef7f0) at rfxencode_compose.c:232
#11 rfx_compose_message_tileset (width=1977081, height=2176, flags=0, num_quants=0, quants=0x0, num_tiles=1740, tiles=0x7fffec300c60, stride_bytes=15360, buf=0x7ffff2aa2000 <error: Cannot access memory at address 0x7ffff2aa2000>, s=0x7ffff5281990, enc=0x5555555ef7f0) at rfxencode_compose.c:495
#12 rfx_compose_message_data (enc=enc@entry=0x5555555ef7f0, s=s@entry=0x7ffff5281990, regions=<optimized out>, num_regions=num_regions@entry=1, buf=buf@entry=0x7ffff2aa2000 <error: Cannot access memory at address 0x7ffff2aa2000>, width=width@entry=3840, height=2176, stride_bytes=15360, tiles=0x7fffec300c60, num_tiles=1740, quants=0x0, num_quants=0, flags=0) at rfxencode_compose.c:605
#13 0x0000555555570496 in rfxcodec_encode_ex (handle=0x5555555ef7f0, cdata=cdata@entry=0x7fffec000c60 "\304\314\016", cdata_bytes=cdata_bytes@entry=0x7ffff5281ab4, buf=0x7ffff2aa2000 <error: Cannot access memory at address 0x7ffff2aa2000>, width=3840, height=2176, stride_bytes=15360, regions=0x7fffec30cab0, num_regions=1, tiles=0x7fffec300c60, num_tiles=1740, quants=0x0, num_quants=0, flags=0) at rfxencode.c:334
#14 0x000055555557053f in rfxcodec_encode (handle=<optimized out>, cdata=cdata@entry=0x7fffec000c60 "\304\314\016", cdata_bytes=cdata_bytes@entry=0x7ffff5281ab4, buf=<optimized out>, width=<optimized out>, height=<optimized out>, stride_bytes=15360, regions=0x7fffec30cab0, num_regions=1, tiles=0x7fffec300c60, num_tiles=1740, quants=0x0, num_quants=0) at rfxencode.c:350
#15 0x000055555555f24d in process_enc_rfx (self=0x5555555ac140, enc=0x5555556a42e0) at xrdp_encoder.c:391
#16 0x000055555555f66b in proc_enc_msg (arg=0x5555555ac140) at xrdp_encoder.c:527
#17 0x000055555555f6b2 in proc_enc_msg (arg=<optimized out>) at xrdp_encoder.c:474
#18 0x00007ffff7cba609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#19 0x00007ffff7df6293 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

which unfortunately seems to have the same telltales of a buffer overrun as previously.

RolKau commented 2 years ago

If I additionally apply the buffer overrun checks from the 9th of December, then it could at least withstand running in "Matrix" mode while I went out and made myself a cup of tea, so it seems that some cases still slip through without them. A comment on why the check in check_and_rfx_encode from commit 887e87f was originally thought to be sufficient might expose the reason.
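For context, overrun checks of the general kind referred to here look something like this bounds-checked write. This is only a sketch: the cdata/cdata_end names echo pointers visible in rfxencode_diff_rlgr3.c, but the function itself is made up:

```c
#include <stddef.h>

/* Illustrative only: a bounds-checked byte write of the kind a buffer
 * overrun check adds to an encoder output loop. On overflow the caller
 * must abort encoding instead of scribbling past the buffer. */
int
put_byte_checked(unsigned char **cdata, const unsigned char *cdata_end,
                 unsigned char value)
{
    if (*cdata >= cdata_end)
    {
        return -1; /* would overrun the output buffer */
    }
    *(*cdata)++ = value;
    return 0;
}
```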

Running this version on my main workstation doing some regular tasks, I still get some display glitches (green frames, missing areas), but they seem to be farther apart than before.

Nexarian commented 2 years ago

I was able to reproduce the original bug with v0.9.17, and then tried Matt's patch on a native AMD system with a 3950X. It worked flawlessly, both with matrix mode and with playing a fullscreen YouTube video. I will also say the performance is MUCH faster. I used MSTSC to reproduce this; to turn on RemoteFX with that, I set the connection properties to LAN and then checked all the options.

I'm curious -- Why are you cherry picking commits? Why not test Matt's branch directly (and with the librfxcodec patch, which is tricky to install as you have to override the default submodule)? https://github.com/matt335672/xrdp/tree/issue2064

Another thing that might help is recompiling the latest devel of xorgxrdp with "--with-simd" -- the green artifacts can appear when the rfx codec is not built with the right CPU instruction set.

RolKau commented 2 years ago

@Nexarian What resolution did you test with? Neither Matt nor I have been able to reproduce this on medium resolutions, such as 2160x1440.

Matt's patch ... performance is MUCH faster

Which one of them? I reckon that any performance increase is due to the fastpath window being larger. If the buffer checks alone increase performance, I think that would indicate that there is a large number of small buffer overflows that have previously gone unnoticed.

Why are you cherry picking commits?

To make sure that no other fixes incidentally hide the issue. The first two seem to be necessary to avoid trapping in the debugger immediately, however. I don't find building a custom branch particularly hard: first I reset to v0.9.17, run ./bootstrap to get librfxcodec, and then git am the patches.

recompiling the latest devel of xorgxrdp with "--with-simd"

The ./configure screen seems to imply that this is the default, since it is listed as --without-simd.

RolKau commented 2 years ago

@matt335672 It seems that the last trap I reported (at 2021-12-17 20:01) is actually a new issue that has been introduced or uncovered: I tried to use the break-in-debugger trick in the buffer overrun check, i.e. replacing librfxcodec/src/rfxencode_diff_rlgr3.c line 61 with:

if (cdata >= cdata_end) __asm__ volatile("int $0x03");

(with the buffer overrun checks applied), but it never trapped here. I'm unsure why this patch seemed to help earlier; that may have been due to some randomness.

Instead, this problem happens upon session ending. I was able to create a fairly quick repro case by having this line in ~/.xsession:

/usr/bin/xterm -geometry 3840x2160 -e /bin/sh -c "dd if=/dev/random bs=8192 count=1 | od -w200 -c"

and it could then even be triggered by running the client with a smaller window size:

xfreerdp /sec:rdp /rfx /gfx:rfx /codec-cache:rfx +bitmap-cache +glyph-cache +offscreen-cache /size:2160x1440
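As an aside, the break-in-debugger trick above generalizes to a small macro. This is only a sketch, not part of the xrdp tree, and it is x86-only:

```c
/* Sketch of the break-in-debugger trick: raise a breakpoint trap
 * (SIGTRAP) at the exact point a condition first becomes true, so gdb
 * stops with the full context intact instead of the process crashing
 * somewhere downstream. Purely a debugging aid; x86 only. */
#define BREAK_IF(cond)                          \
    do                                          \
    {                                           \
        if (cond)                               \
        {                                       \
            __asm__ volatile("int $0x03");      \
        }                                       \
    } while (0)
```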
Nexarian commented 2 years ago

I built the branch here: https://github.com/matt335672/xrdp/tree/issue2064 and the corresponding change from librfxcodec: https://github.com/matt335672/librfxcodec/tree/issue2068

You're absolutely right that this issue needs 4K or greater to repro. I'll keep trying to reproduce this with those branches, but no luck so far.

I think the configure flag I was thinking of was --with-simd-amd64 -- try that!

matt335672 commented 2 years ago

@RolKau - thanks for looking at this in such detail.

The combination of patches you've mentioned above should indeed work as far as I can see. I'll try those on a v0.9.17 branch and see if I can reproduce the problem.

matt335672 commented 2 years ago

I've not managed to reproduce this. Maybe I've got the patching wrong.

I can use this command to get a checksum of all the source files in my build tree:-

cksum $(find . -name \*.[ch]) | sort -k3

Output is in this file file_checksums.txt

@RolKau - can you check with your system? Also, making this patch could give us more info on the buffer size, as we should get the request size output in the xrdp log:-

--- a/libxrdp/xrdp_caps.c
+++ b/libxrdp/xrdp_caps.c
@@ -1195,7 +1195,7 @@ xrdp_caps_send_demand_active(struct xrdp_rdp *self)
         out_uint16_le(s, CAPSSETTYPE_MULTIFRAGMENTUPDATE);
         out_uint16_le(s, CAPSSETTYPE_MULTIFRAGMENTUPDATE_LEN);
         out_uint32_le(s, max_request_size);
-        LOG_DEVEL(LOG_LEVEL_TRACE, "xrdp_caps_send_demand_active: Server Capability "
+        LOG(LOG_LEVEL_INFO, "xrdp_caps_send_demand_active: Server Capability "
                   "CAPSSETTYPE_MULTIFRAGMENTUPDATE = %d", max_request_size);

         /* frame acks */
RolKau commented 2 years ago

I reviewed the checksums. The only five files that were different were:

xrdp_configure_options.h
config_ac.h
librfxcodec/src/amd64/rfxencode_tile_amd64.c
librfxcodec/src/x86/rfxencode_tile_x86.c
librfxcodec/src/rfxencode_diff_rlgr3.c

The first two, I reckon, stem from me running with CFLAGS="-O2 -ggdb3 -fvar-tracking", which is probably different from your setup. The latter three differ from the issue2068 branch in that mine also have the buffer overrun checks from the 9th of Dec.; otherwise they're the same.

I realized that there was some difference in my testing, so we have another datapoint: the first time, when the buffer overrun checks had an effect, I was using a build of FreeRDP 2.4.0 on my workstation at home, but the second time, when they suddenly didn't anymore, I was using a stock Ubuntu build of FreeRDP 2.2.0 from my laptop. This means that the second problem is sensitive to the way the client communicates with the server, which can explain why this isn't immediately reproducible with MSTSC.

I am more and more convinced that this is really a second issue that is uncovered by, rather than caused by, these patches, so maybe this should be a separate issue and not hold back the patches, which I believe would generally do good. (I would still prefer a comment on the rationale of check_and_rfx_encode, though: where does the magic constant of 4096*2 come from?)

matt335672 commented 2 years ago

Thanks for that @RolKau

Here's a copy of my files that differ from yours so you can check the differences:-

files.tar.gz

I'm going to look into that magic constant today - I agree it needs explaining and I've just added a comment to my own PR to that effect. After that, as you say, I think we should merge these changes for v0.9.18, barring any other issue which people raise in this area.

RolKau commented 2 years ago

Here's a copy of my files that differ from yours so you can check the differences

I checked out the issue2068 branch and ran cksum on that to see that they were the same as the one you had; then I diff'ed that checked out branch to my own, so I'm certain that the only difference is the buffer overflow check, which I applied after I did the first test.

matt335672 commented 2 years ago

Thanks.

For what it's worth, I just instrumented the code and ran 10 minutes or so of the test in 'matrix mode', and the biggest tile I saw was 3037 bytes, well under the 8192 'magic number' limit. That suggests the compression is working rather well in this mode, perhaps unsurprisingly with just two colours.
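The instrumentation was presumably along these lines: track the largest compressed tile seen and log whenever the maximum grows. This is a sketch under assumed names; the real measurement point would be wherever the encoder reports each tile's compressed size:

```c
#include <stdio.h>

/* Sketch of tile-size instrumentation: remember the largest compressed
 * tile seen so far and log each time the maximum grows. Names are
 * illustrative, not from the xrdp/librfxcodec source. */
int g_max_tile_bytes = 0;

void
note_tile_size(int tile_bytes)
{
    if (tile_bytes > g_max_tile_bytes)
    {
        g_max_tile_bytes = tile_bytes;
        fprintf(stderr, "new max tile size: %d bytes\n", g_max_tile_bytes);
    }
}
```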

matt335672 commented 2 years ago

Tried a few videos and screen savers. Managed a tile size of 5774, so although the 8192 limit is working, I don't think it's necessarily correct.

trishume commented 2 years ago

Yeah, see https://github.com/neutrinolabs/librfxcodec/pull/48, where I comment on how I think I made up the limit based on an apparently mistaken impression of the largest possible tile size.

Nexarian commented 2 years ago

You can get a full readout of the diff between FreeRDP 2.2.0 and 2.4.0 by cloning into it and running this:

git diff 2.2.0..2.4.0

Time to dig into it to see what might matter!

RolKau commented 2 years ago

I added a stack frame prologue to the function rfxcodec_encode_dwt_shift_amd64_sse41 at line 1207 in rfxcodec_encode_dwt_shift_amd64_sse41.asm:

push rbp
mov rbp, rsp

and the corresponding epilogue at line 1354 before the ret instruction:

mov rsp, rbp
pop rbp

This allows gdb to generate a proper backtrace when the segmentation fault occurs inside this function, and I could then get a better trace of the logoff error:

#0  0x00005555555880b3 in rfxcodec_encode_dwt_shift_amd64_sse41 ()
#1  0x000055555558847c in rfx_encode_component_rlgr3_amd64_sse41 (enc=0x555555602240, qtable=0x55555558dedc <g_rfx_default_quantization_values> "ffw\210\230", data=0x7ffff5c8b000 <error: Cannot access memory at address 0x7ffff5c8b000>, buffer=0x7ffff4f141f4 "", buffer_size=3108636, size=0x7ffff664878c) at rfxencode_tile_amd64.c:107
#2  0x000055555557fbc4 in check_and_rfx_encode (enc=enc@entry=0x555555602240, qtable=qtable@entry=0x55555558dedc <g_rfx_default_quantization_values> "ffw\210\230", data=data@entry=0x7ffff5c8b000 <error: Cannot access memory at address 0x7ffff5c8b000>, buffer=0x7ffff4f141f4 "", buffer_size=3108636, size=size@entry=0x7ffff664878c) at rfxencode_tile.c:110
#3  0x000055555558011b in rfx_encode_yuv (enc=enc@entry=0x555555602240, yuv_data=yuv_data@entry=0x7ffff5c89000 <error: Cannot access memory at address 0x7ffff5c89000>, width=width@entry=64, height=height@entry=64, stride_bytes=stride_bytes@entry=8704, y_quants=y_quants@entry=0x55555558dedc <g_rfx_default_quantization_values> "ffw\210\230", u_quants=0x55555558dedc <g_rfx_default_quantization_values> "ffw\210\230", v_quants=0x55555558dedc <g_rfx_default_quantization_values> "ffw\210\230", data_out=0x7ffff6648980, y_size=0x7ffff6648784, u_size=0x7ffff6648788, v_size=0x7ffff664878c) at rfxencode_tile.c:250
#4  0x000055555557e2b0 in rfx_compose_message_tile_yuv (enc=enc@entry=0x555555602240, s=s@entry=0x7ffff6648980, tile_data=tile_data@entry=0x7ffff5c89000 <error: Cannot access memory at address 0x7ffff5c89000>, tile_width=tile_width@entry=64, tile_height=tile_height@entry=64, stride_bytes=stride_bytes@entry=8704, quantVals=0x55555558dedc <g_rfx_default_quantization_values> "ffw\210\230", quantIdxY=0, quantIdxCb=0, quantIdxCr=0, xIdx=24, yIdx=19) at rfxencode_compose.c:232
#5  0x000055555557f334 in rfx_compose_message_tileset (enc=enc@entry=0x555555602240, s=s@entry=0x7ffff6648980, buf=buf@entry=0x7ffff5211000 <error: Cannot access memory at address 0x7ffff5211000>, width=width@entry=2176, height=height@entry=1472, stride_bytes=stride_bytes@entry=8704, tiles=0x7ffff520b110, num_tiles=782, quants=0x0, num_quants=0, flags=0) at rfxencode_compose.c:497
#6  0x000055555557f9a0 in rfx_compose_message_data (enc=enc@entry=0x555555602240, s=s@entry=0x7ffff6648980, regions=regions@entry=0x7ffff5210698, num_regions=num_regions@entry=1, buf=buf@entry=0x7ffff5211000 <error: Cannot access memory at address 0x7ffff5211000>, width=width@entry=2176, height=1472, stride_bytes=8704, tiles=0x7ffff520b110, num_tiles=782, quants=0x0, num_quants=0, flags=0) at rfxencode_compose.c:609
#7  0x000055555557d6eb in rfxcodec_encode_ex (handle=handle@entry=0x555555602240, cdata=cdata@entry=0x7ffff4f0b110 "\300\314\f", cdata_bytes=cdata_bytes@entry=0x7ffff6648a80, buf=buf@entry=0x7ffff5211000 <error: Cannot access memory at address 0x7ffff5211000>, width=width@entry=2176, height=height@entry=1472, stride_bytes=8704, regions=0x7ffff5210698, num_regions=1, tiles=0x7ffff520b110, num_tiles=782, quants=0x0, num_quants=0, flags=0) at rfxencode.c:334
#8  0x000055555557d780 in rfxcodec_encode (handle=0x555555602240, cdata=cdata@entry=0x7ffff4f0b110 "\300\314\f", cdata_bytes=cdata_bytes@entry=0x7ffff6648a80, buf=0x7ffff5211000 <error: Cannot access memory at address 0x7ffff5211000>, width=2176, height=1472, stride_bytes=8704, regions=0x7ffff5210698, num_regions=1, tiles=0x7ffff520b110, num_tiles=782, quants=0x0, num_quants=0) at rfxencode.c:350
#9  0x0000555555563cf6 in process_enc_rfx (self=0x5555555f88b0, enc=0x5555556c07e0) at xrdp_encoder.c:391
#10 0x0000555555564068 in proc_enc_msg (arg=0x5555555f88b0) at xrdp_encoder.c:527
#11 0x00007ffff7cba609 in start_thread (arg=<optimized out>) at pthread_create.c:477
#12 0x00007ffff7df6293 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

(The line numbers here are probably a bit off due to various checks I have added.) The buf parameter is apparently the one that gets hit. If I break at frame #9 in process_enc_rfx upon startup, the address is valid at that point. But it still traps at line 487 in rfxencode_compose.c when I add an access to tile_data there, before entering rfx_compose_message_tile_yuv. The index is then in the high six hundreds (out of 782), but varies from time to time.

With x = 320, y = 1024 and stride_bytes = 8704 we get tile_data - buf = 262'144 * 34 + 81'920 = 8'994'816.

None of these numbers seems particularly off, but I don't know where to find the allocated size of enc->data (which process_enc_rfx passes as buf).
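For what it's worth, the sums above can be re-done mechanically: the first term is the 1024 rows times the 8704-byte stride. The sketch below only checks the arithmetic and assumes nothing about librfxcodec's actual tile layout:

```c
/* Re-doing the offset arithmetic quoted above. This only verifies the
 * sums; it assumes nothing about how librfxcodec actually lays out
 * tiles in the frame buffer. */
long
tile_offset(long y, long stride_bytes, long x_term)
{
    return y * stride_bytes + x_term;
}
```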

RolKau commented 2 years ago

By the way, this latest problem doesn't seem to be dependent on the resolution: I can reproduce it by having in ~/.xsession (on the server):

/usr/bin/xterm -geometry 1920x1080 -e /bin/sh -c "dd if=/dev/urandom bs=8192 count=1 | od -w100 -c"

and the connect (at the client) with:

xfreerdp /sec:rdp /rfx /gfx:rfx /codec-cache:rfx +bitmap-cache +glyph-cache +offscreen-cache /size:1920x1080

although it fairly consistently alternates between waiting until I close the client window and crashing with a segmentation fault, roughly every other attempt (?)

matt335672 commented 2 years ago

@RolKau - I'm out of time today, what with it being the time of year it is, but if I'm following you correctly, the memory will be allocated at the location identified in this comment above.

The buffer size is mostly controlled by self->max_compressed_bytes in that location, which is a value received from the client. As I commented here this should be at least as big as the size which we are sending, but maybe for freerdp 2.2 this is different.
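Since max_compressed_bytes comes straight from the client, one defensive option would be to clamp it before it is used to size buffers. This is only a sketch under assumed names and limits, not the actual xrdp logic:

```c
/* Sketch only: clamp a client-supplied maximum request size to sane
 * bounds before using it to size the output buffer, so a misbehaving
 * client cannot force an undersized allocation. The names and the
 * constants here are assumptions, not values from the xrdp source. */
int
clamp_max_compressed_bytes(int client_value)
{
    const int floor_bytes = 64 * 1024;          /* assumed lower bound */
    const int ceiling_bytes = 16 * 1024 * 1024; /* assumed upper bound */

    if (client_value < floor_bytes)
    {
        return floor_bytes;
    }
    if (client_value > ceiling_bytes)
    {
        return ceiling_bytes;
    }
    return client_value;
}
```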

Where are you getting your xfreerdp 2.2 client from? I can try to reproduce what you're seeing with the same client.

RolKau commented 2 years ago

I think we may be talking about different things here; I don't see the immediate relation between enc->data and alloc_bytes from line 349 to 352 in xrdp_encoder.c.

Where are you getting your xfreerdp 2.2 client from?

This is the stock xfreerdp client in Ubuntu Focal 20.04.2.

matt335672 commented 2 years ago

Quite right - there's no relation. That's what comes from commenting in a hurry - sorry.

The enc->data pointer gets added to a fifo entry in server_paint_rects in xrdp_mm.c. So I'd think this is either invalid at this point, or it gets written over while it's sitting in the fifo. Am I understanding where you are now?

If so, it might make sense to use one of your INT3 data read checks to check the data pointer when it's added to the fifo, with something like:-

--- a/xrdp/xrdp_mm.c
+++ b/xrdp/xrdp_mm.c
@@ -3223,6 +3223,7 @@ server_paint_rects(struct xrdp_mod *mod, int num_drects, short *drects,
         enc_data->mod = mod;
         enc_data->num_drects = num_drects;
         enc_data->num_crects = num_crects;
+ if (*data == 0x56 && *(data+1) == 0x45) __asm__ volatile("int $0x03");
         enc_data->data = data;
         enc_data->width = width;
         enc_data->height = height;

Thanks for the client update. I'll give this a go myself.