sosreport / sos

A unified tool for collecting system logs and other debug information
http://sos.rtfd.org
GNU General Public License v2.0

[plugin, archive] do not read entire file at once when performing substitutions #1836

Open bmr-cymru opened 4 years ago

bmr-cymru commented 4 years ago

Currently the file and command output substitution methods in sos.plugins.Plugin read the entire file into memory before applying substitutions and re-writing the output:

This leads to memory pressure when processing large files or command output since the entire content is read into memory before processing.

pmoravec commented 4 years ago

+1 to having this implemented, I am just unsure how: if we split the output into chunks that are loaded into memory individually, how do we prevent a regexp match from being split across two or more chunks?

If we disallow multiline regexps, we can iterate over splitlines() and apply the regexp to each line separately. For certificate matches, we can buffer lines temporarily: once we identify the beginning of a certificate, we buffer until we find its end line. This can work for other types of multiline regexps as well; we just have to be able to match the first and last line of the regexp against an input line.

Does this approach and its limitation sound feasible?
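
The buffering idea above can be sketched roughly as follows (a hedged illustration only: the marker regexps, the function name, and the replacement string are placeholders, not existing sos code):

```python
import re

BEGIN = re.compile(r"-----BEGIN CERTIFICATE-----")
END = re.compile(r"-----END CERTIFICATE-----")

def scrub_certificates(lines):
    """Yield input lines, replacing whole certificate blocks.

    Lines are buffered only between the BEGIN and END markers, so
    memory use is bounded by the largest certificate, not the file.
    """
    buffering = False
    for line in lines:
        if not buffering:
            if BEGIN.search(line):
                buffering = True  # start of a multi-line match: stop emitting
            else:
                yield line
        elif END.search(line):
            buffering = False
            yield "-----SCRUBBED CERTIFICATE-----\n"

lines = [
    "key=value\n",
    "-----BEGIN CERTIFICATE-----\n",
    "MIIB...\n",
    "-----END CERTIFICATE-----\n",
    "other=1\n",
]
print("".join(scrub_certificates(lines)))
```

The same first-line/last-line pattern generalizes to any multiline regexp whose boundaries can each be matched against a single input line.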

bmr-cymru commented 4 years ago

That's basically it: ideally, we would read and write fixed-sized blocks of data, regardless of file content, but this is difficult with the standard Python regex modules. There's no way to "feed" a matcher a stream of input data and be notified when something in the stream matches (there are ways to do this in general, but they are non-trivial to implement and do not support the full set of regular expression notation).

Failing that, the next best option is to disallow multi-line regexes and to process the files line-by-line. This is fairly simple (except for the need to find and fix existing expressions in plugins) but since lines can be arbitrarily long it's still possible for us to read large quantities of data into memory in some circumstances (e.g. a large JSON file encoded without embedded newlines).

This will involve revisiting the methods for scrubbing certificate-type content, due to the problems you point out, but that should be fairly simple.

pmoravec commented 4 years ago

Can't we implement do_path_regex_sub via sos clean SoSKeywordParser (cf. https://github.com/sosreport/sos/pull/2093/commits/ff568f1c0ee88d352a68fa3e7f20ae13c23bf726)?

That allows line-by-line processing of data. But:

pmoravec commented 3 years ago

I think a sufficient resolution will be to apply line-by-line processing, with two "exceptions":

Ideally, we should do an in-place replacement instead of building a new file and copying the whole content back to the original location (though can't this make the implementation tricky?).

Two approaches: 1) replace code like https://github.com/sosreport/sos/blob/master/sos/report/plugins/__init__.py#L890-L894 with something like:

    with open(path) as infile:
        for line in infile:
            apply_regexp(line)

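A self-contained sketch of approach 1 (assuming single-line regexps only; the function name and the temp-file-plus-rename step are my additions, not existing sos code):

```python
import os
import re
import tempfile

def sub_file_by_line(path, pattern, repl):
    """Apply a single-line regexp substitution without loading the file.

    Writes to a temporary file in the same directory, then atomically
    replaces the original, so memory use is bounded by the longest line.
    (A real implementation would also preserve the original file's
    permissions; mkstemp creates the temp file as 0600.)
    """
    regex = re.compile(pattern)
    dirname = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=dirname)
    replacements = 0
    with os.fdopen(fd, "w") as outfile, open(path) as infile:
        for line in infile:
            line, n = regex.subn(repl, line)
            replacements += n
            outfile.write(line)
    os.replace(tmp, path)
    return replacements

# Demo on a throwaway file:
with tempfile.TemporaryDirectory() as d:
    p = os.path.join(d, "foo.conf")
    with open(p, "w") as f:
        f.write("password=someSecret\nother=1\n")
    n = sub_file_by_line(p, r"password=\S+", "password=********")
    with open(p) as f:
        print(n, f.read())  # 1 replacement; the password line is scrubbed
```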
2) Use SoSKeyWordParser and SoSKeywordMap, which basically mimic what we want. They would have to be enhanced in two ways:

Just for my reference: list of code snippets we need to change:

@TurboTurtle or @bmr-cymru or @slashdd , any preference in the "inline replacement or not" and in the approach?

TurboTurtle commented 3 years ago

I generally like the idea of using the keyword parser for this, but my concern is with this:

SoSKeyWordParser works on string search, we need a RE search.

The keyword parser deliberately does not use a regex match so that we don't obfuscate substrings that could then potentially leak what the obfuscated keyword is. So we'd either need to implement a new keyword-like parser (not a big fan of that idea), or be very careful in the way we craft the regex matches (potentially including behind-the-scenes modification of regexes, so non-trivial).

Replacing obfuscatedwordX with a standardized ******* or similar is a good enhancement for plugin-based scrubbing. We just need to ensure:

  1. This is only done for plugins, and cannot be done during an actual --clean pass.
  2. Nothing scrubbed by plugins ever finds its way into a recorded mapping. This shouldn't be difficult, as we just need to not call anything behind the keyword parser here. The benefit is that the 'persistent' checks for already-obfuscated items from other plugins would still happen.
pmoravec commented 3 years ago

I generally like the idea of using the keyword parser for this, but my concern is with this:

SoSKeyWordParser works on string search, we need a RE search.

The keyword parser deliberately does not use a regex match so that we don't obfuscate substrings that could then potentially leak what the obfuscated keyword is. So we'd either need to implement a new keyword-like parser (not a big fan of that idea), or be very careful in the way we craft the regex matches (potentially including behind-the-scenes modification of regexes, so non-trivial).

Could you please elaborate here? Assuming the KeyWordParser would be called by e.g. do_cmd_output_sub to replace a given R.E. (with asterisks or an obfuscatedWord001-like text), what leakage of substrings do you refer to?

Replacing obfuscatedwordX with a standardized ******* or similar is a good enhancement for plugin-based scrubbing. We just need to ensure:

No, I meant the opposite direction here :) Use the current scrubbed text obfuscatedwordX for the plugin-based obfuscations as well. E.g. we would forcefully replace calls like:

    self.do_cmd_output_sub("crm configure show", r"passw([^\s=]*)=\S+", r"passw\1=********")

by:

    self.do_cmd_output_sub("crm configure show", r"passw([^\s=]*)=(\S+)", r"\2")

telling do_cmd_output_sub to scrub exactly the string captured by the \2 backreference, not with asterisks but with an obfuscatedStringX (and the parser can keep a map of the strings we replaced, such that the same string, e.g. the same password occurring in multiple places, will always be replaced by the same obfuscatedStringX).

This can help to spot potential misconfigurations where the same password is supposed to be stored on two places, but they keep different values, like:

    /etc/foo.conf: password=someSecret
    /etc/bar.conf: fooPassword=somethingOther

Currently we replace both passwords with asterisks, so we can't check the password strings for equality. With proper usage of the keyword parser, we would instead see:

    /etc/foo.conf: password=obfuscatedString7
    /etc/bar.conf: fooPassword=obfuscatedString9

and we would know, even after the obfuscation, that there is a misconfiguration problem behind it.
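
A minimal sketch of the consistent-replacement idea (the class, method names, and in-memory map are mine for illustration, not sos cleaner APIs):

```python
import re

class ObfuscationMap:
    """Replace each distinct secret with a stable obfuscatedStringN token.

    The same password seen in two files maps to the same token, so
    equality of secrets survives obfuscation while their values do not.
    """
    def __init__(self):
        self._map = {}

    def token(self, secret):
        # Assign tokens in order of first appearance, then reuse them.
        if secret not in self._map:
            self._map[secret] = "obfuscatedString%d" % (len(self._map) + 1)
        return self._map[secret]

    def sub(self, pattern, text, group=1):
        # Replace only the captured group with its stable token,
        # leaving the rest of the match (e.g. "password=") intact.
        return re.sub(pattern,
                      lambda m: m.group(0).replace(m.group(group),
                                                   self.token(m.group(group))),
                      text)

omap = ObfuscationMap()
print(omap.sub(r"password=(\S+)", "password=someSecret"))        # password=obfuscatedString1
print(omap.sub(r"fooPassword=(\S+)", "fooPassword=someSecret"))  # fooPassword=obfuscatedString1
```

Because both occurrences of someSecret map to obfuscatedString1, the equality is visible in the scrubbed output, which is exactly the misconfiguration-spotting property described above.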

TurboTurtle commented 3 years ago

My problem with saving the plugin-matched strings is this:

  1. I'm not a fan of having /etc/sos/cleaner/default_mapping, or any individual private mapping files, store every password we come across in plain text. That then becomes a very attractive target for exfiltration, especially if a default_mapping is built up over time on the same system.
  2. If a password is used twice, then both will map to the same obfuscatedwordX string. Now, even in an obfuscated sos report, a malicious actor could at the very least determine that the same password is being re-used in multiple places. Even if those passwords are scrubbed, that's valuable data.

EDIT: I think one course we have here would be to loop in security teams/minds from Red Hat and Canonical for input on this.

pmoravec commented 3 years ago

My problem with saving the plugin-matched strings is this

1. I'm not a fan of having `/etc/sos/cleaner/default_mapping`, or any individual private mapping files, store every password we come across in plain text. That then becomes a very attractive target for exfiltration, especially if a `default_mapping` is built up over time on the same system.

This makes sense. I agree that if we used a direct cleaner/mapper for these purposes, we shouldn't write the mapping anywhere, for exactly that reason, and should generate it on the fly on every sos run. That makes subsequent sosreport archives inconsistent with respect to the mapping, hence my proposed feature is less usable.

2. If a password is used twice, then both will map to the same `obfuscatedwordX` string. Now, even in an obfuscated sos report, a malicious actor could at the very least determine that the same password is being re-used in multiple places. Even if those passwords are scrubbed, that's valuable data.

EDIT: I think one course we have here would be to loop in security teams/minds from Red Hat and Canonical for input on this.

Good point, if we went the obfuscatedwordX way. Now it rather seems that 1) the feature would have its limitations (inconsistent mapping across subsequent sos runs), and 2) we would have to customize/parametrize the cleaner in so many ways that I do not see much benefit in this approach: the "code deduplication" would require more amendments to the cleaner, for just this use case, than the duplicated/similar code is worth.

Until there is some simple enhancement of sos cleaner, I would prefer (not demand, just prefer) "in place" changes directly in the Plugin class.

pmoravec commented 3 years ago

A somewhat related point: re.subn (which we use nowadays) is tragically inefficient at processing long lines. There is a real use case behind this that I narrowed down to a re.subn performance problem.

A file with 100k characters on one line (like JSON without newlines) starts to affect overall execution times. See this testing script:

import codecs
import re
import sys

if len(sys.argv) < 2:
    print("missing argument with filename")
    sys.exit(1)

content = codecs.open(sys.argv[1], "r", encoding='utf-8', errors='ignore').read()
regexp = r"(.*password=)(\S*)"
subst = r"\1********"  # raw string, so \1 is a backreference, not a control char
_, replacements = re.subn(regexp, subst, content)
print("%s replacements" % replacements)

Then prepare a 100kB file, once split into 1000 lines and once on a single line:

for i in $(seq 1 1000); do echo "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"; done > aas.100k.txt
cat aas.100k.txt | tr '\n' ' ' > aas.100k.oneline.txt

Compare the script on both inputs:

# time python3 test_re_subn.py aas.100k.txt 
0 replacements

real    0m0.109s
user    0m0.050s
sys 0m0.040s
#
# time python3 test_re_subn.py aas.100k.oneline.txt 
0 replacements

real    0m7.942s
user    0m7.924s
sys 0m0.014s
# 

With a 200kB file instead of 100kB, the one-line input took 50 seconds(!), while the multi-line input still took about 0.1s.

So when we implement this issue, we should also take this huge inefficiency into consideration.
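
Much of the cost here appears to come from the leading greedy `(.*` in the pattern: on a non-matching line the engine retries from every starting position and backtracks across the rest of the line each time, which is roughly quadratic in line length. Dropping the prefix keeps the scrubbing behaviour (re.subn only rewrites the matched span anyway, and without the greedy prefix it replaces every occurrence on the line rather than just the last one, which is arguably better for scrubbing) while staying roughly linear. A small demonstration, using the same pattern as the script above:

```python
import re
import timeit

line = "a" * 20_000  # one long line with no match

slow = re.compile(r"(.*password=)(\S*)")  # pattern from the script above
fast = re.compile(r"(password=)(\S*)")    # same scrubbing effect, no leading ".*"

t_slow = timeit.timeit(lambda: slow.subn(r"\1********", line), number=1)
t_fast = timeit.timeit(lambda: fast.subn(r"\1********", line), number=1)
print("with .*: %.3fs, without: %.3fs" % (t_slow, t_fast))

# Both variants scrub a real match identically:
print(fast.subn(r"\1********", "x password=secret y")[0])  # x password=******** y
```

Auditing plugin patterns for such unanchored greedy prefixes might therefore recover most of the time even before the chunked-reading work lands.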

pmoravec commented 3 years ago

Taking it back, it was our fault in the virtwho plugin: #2527.