tukaani-project / xz

XZ Utils
https://tukaani.org/xz/

Mask control characters in filenames #118

Open Larhzu opened 2 months ago

Larhzu commented 2 months ago

The command line tools print filenames and some other user-specified strings to standard output or standard error. A malicious string, for example a crafted filename, could contain control characters that change the state of the terminal.

These commits add a function to replace the single-byte control characters with question marks. This is simple but hopefully good enough in practice.
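As a rough illustration of the single-byte approach, here is a minimal sketch. The function name `mask_control_bytes` and the exact byte ranges (0x00-0x1F and 0x7F) are assumptions made for this example; they are not taken from the commits.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

// Hypothetical helper for illustration only; not the function added
// by the commits. Returns a newly allocated copy of src in which every
// ASCII control byte (0x00-0x1F and 0x7F) is replaced with '?'.
// The caller frees the result. Returns NULL if allocation fails.
char *mask_control_bytes(const char *src)
{
	assert(src != NULL);

	const size_t len = strlen(src);
	char *dest = malloc(len + 1);
	if (dest == NULL)
		return NULL;

	for (size_t i = 0; i < len; ++i) {
		const unsigned char c = (unsigned char)src[i];
		dest[i] = (c < 0x20 || c == 0x7F) ? '?' : src[i];
	}

	dest[len] = '\0';
	return dest;
}
```

With this sketch, a filename containing ESC followed by `[31m` would be printed as `?[31m`, so the terminal no longer sees an escape sequence. Bytes 0x80 and above pass through unchanged.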

Larhzu commented 2 months ago

Single-byte control character masking isn't enough. At least Konsole and Xfce Terminal (but not uxterm) interpret C1 control codes and CSI sequences in the en_US.UTF-8 locale.

$ printf 'foo\u009b3Dbar\n'
bar

$ printf 'a\u0090 Can you see me? \u009cb\n'
ab

A proper masking method must decode multibyte characters. It must tolerate invalid multibyte sequences and restart decoding from the next byte.
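The decode-and-restart loop described above could look roughly like this. It is a sketch under assumptions: the name `mask_control_mb` is hypothetical, it relies on the standard `mbrtowc` and `iswcntrl` functions, and it expects the caller to have selected the locale already (for example with `setlocale(LC_ALL, "")`).

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>
#include <wctype.h>

// Hypothetical helper for illustration only. Decodes src using the
// current locale's multibyte encoding and returns a newly allocated
// copy in which every control character, single-byte or multibyte,
// is replaced with a single '?'. The caller frees the result.
char *mask_control_mb(const char *src)
{
	assert(src != NULL);

	const size_t len = strlen(src);
	// The output is never longer than the input because each
	// masked character shrinks to one byte.
	char *dest = malloc(len + 1);
	if (dest == NULL)
		return NULL;

	mbstate_t state;
	memset(&state, 0, sizeof(state));

	size_t in = 0;
	size_t out = 0;
	while (in < len) {
		wchar_t wc;
		const size_t n = mbrtowc(&wc, src + in, len - in, &state);

		if (n == (size_t)-1 || n == (size_t)-2 || n == 0) {
			// Invalid or truncated sequence: mask one byte,
			// reset the state, and restart from the next byte.
			memset(&state, 0, sizeof(state));
			dest[out++] = '?';
			++in;
		} else if (iswcntrl((wint_t)wc)) {
			// A control character of n bytes becomes one '?'.
			dest[out++] = '?';
			in += n;
		} else {
			memcpy(dest + out, src + in, n);
			out += n;
			in += n;
		}
	}

	dest[out] = '\0';
	return dest;
}
```

In a UTF-8 locale where the C library classifies U+009B as a control character, the first printf example above would come out as `foo?3Dbar` instead of letting the terminal act on the CSI.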

Larhzu commented 2 months ago

The new version should handle all relevant multibyte character sets, not just UTF-8.

Instead of looking for control characters, it now looks for non-printable characters, which is a much stricter check. A possible downside is that an old C library might not recognize newer printable Unicode characters even though the user might already be using them. I suppose it's not a real problem. :-) Gnulib's quotearg looks for printable characters as well.

tuklib_mask_nonprint isn't thread-safe, although it would be straightforward to add a variant that takes a char **mem argument. That wouldn't require C11's optional thread_local from <threads.h> or a compiler-specific thread-local extension (although those are quite widely available). With such a variant, more than one masked string could be available at the same time, in case one needs to print two filenames at once. But the simplest version is enough for now.
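To make the char **mem idea concrete, here is one way such a variant might be shaped. The name `mask_nonprint_r`, its return convention, and the use of `iswprint` are assumptions for this sketch; it is not the code from this pull request.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>
#include <wctype.h>

// Hypothetical reentrant variant for illustration only. The caller
// owns *mem (initially NULL) and frees it when done. Each call resizes
// *mem, stores the masked copy of str there, and returns it.
// str must not point into *mem. Returns NULL if allocation fails,
// in which case *mem is left unchanged.
const char *mask_nonprint_r(const char *str, char **mem)
{
	assert(str != NULL && mem != NULL);

	const size_t len = strlen(str);
	char *buf = realloc(*mem, len + 1);
	if (buf == NULL)
		return NULL;
	*mem = buf;

	mbstate_t state;
	memset(&state, 0, sizeof(state));

	size_t in = 0;
	size_t out = 0;
	while (in < len) {
		wchar_t wc;
		const size_t n = mbrtowc(&wc, str + in, len - in, &state);

		if (n == (size_t)-1 || n == (size_t)-2 || n == 0) {
			// Undecodable byte: mask it and restart decoding.
			memset(&state, 0, sizeof(state));
			buf[out++] = '?';
			++in;
		} else if (!iswprint((wint_t)wc)) {
			// Anything the C library doesn't consider printable.
			buf[out++] = '?';
			in += n;
		} else {
			memcpy(buf + out, str + in, n);
			out += n;
			in += n;
		}
	}

	buf[out] = '\0';
	return buf;
}
```

Because each caller passes its own buffer, two masked filenames can be alive at once: one call with `&mem1` for the source name and another with `&mem2` for the destination name, followed by a single message that prints both.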

I hope I didn't miss any string that should be masked.