bmr-cymru opened this issue 5 years ago
+1 to have this implemented, just unsure how: if we split the output into chunks that are loaded into memory individually, how do we prevent a regexp match from being split across two or more chunks?
If we disallow multiline regexps, we can iterate over `splitlines()` and apply the regexp to each line separately. For certificate matches, we can buffer lines once we have identified the beginning of a certificate, until we find its end line. This can work for more types of multiline regexps - we just have to be able to match the first and the last line of the pattern against an input line.
Does this approach, with its limitations, sound feasible?
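A minimal sketch of that buffering idea, assuming PEM-style begin/end markers; `scrub_lines`, `line_sub`, and the marker patterns are illustrative, not existing sos code:

```python
import re

# Illustrative first/last-line markers for a certificate block; any other
# multiline rule would supply its own begin/end patterns.
BEGIN_RE = re.compile(r"-----BEGIN CERTIFICATE-----")
END_RE = re.compile(r"-----END CERTIFICATE-----")

def scrub_lines(lines, line_sub):
    """Apply a per-line substitution, buffering multiline blocks whole."""
    buffered = []
    for line in lines:
        if buffered:
            buffered.append(line)
            if END_RE.search(line):
                # Found the block's last line: emit one scrubbed placeholder.
                yield "-----SCRUBBED CERTIFICATE-----\n"
                buffered = []
        elif BEGIN_RE.search(line):
            buffered = [line]
        else:
            yield line_sub(line)
    # An unterminated block is passed through rather than silently dropped.
    yield from buffered
```

Only the lines between the begin and end markers are ever held in memory at once; everything else streams through line by line.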
That's basically it: ideally, we would read and write fixed-sized blocks of data, regardless of file content, but this is difficult with the standard Python regex modules. There's no way to "feed" a matcher a stream of input data and be notified when something in the stream matches (there are ways to do this in general, but they are non-trivial to implement and do not support the full set of regular expression notation).
Failing that, the next best option is to disallow multi-line regexes and to process the files line-by-line. This is fairly simple (except for the need to find and fix existing expressions in plugins) but since lines can be arbitrarily long it's still possible for us to read large quantities of data into memory in some circumstances (e.g. a large JSON file encoded without embedded newlines).
This will involve revisiting the methods for scrubbing certificate-type content, due to the problems you point out, but that should be fairly simple.
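If multi-line regexes are disallowed as suggested, the fixed-size-read goal becomes approachable with a small carry buffer: read a block, cut it at the last newline, substitute over the complete lines, and carry the unfinished line into the next read. A rough sketch under that assumption (the function name and sizes are made up; a file consisting of one huge line still degenerates to buffering that entire line, which is the JSON caveat above):

```python
import re

def stream_sub(pattern, repl, infile, outfile, chunk_size=1 << 20):
    """Apply a single-line regex substitution in fixed-size reads,
    assuming the pattern never matches across a newline."""
    regex = re.compile(pattern)
    carry = ""  # unfinished last line of the previous chunk
    while True:
        chunk = infile.read(chunk_size)
        if not chunk:
            outfile.write(regex.sub(repl, carry))
            return
        buf = carry + chunk
        cut = buf.rfind("\n") + 1  # only complete lines are safe to rewrite
        outfile.write(regex.sub(repl, buf[:cut]))
        carry = buf[cut:]
```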
Can't we implement `do_path_regex_sub` via the sos clean `SoSKeywordParser` (cf. https://github.com/sosreport/sos/pull/2093/commits/ff568f1c0ee88d352a68fa3e7f20ae13c23bf726)? That allows line-by-line processing of data. But:

- the replacements would be `obfuscatedword*` strings instead of asterisks (not a huge deal, it can be seen as an improvement)
- `sos clean` applies all parsers to each line, while we want just the keyword one (maybe `obfuscate_file` could take a new `parsers` argument to apply just those parsers?)

I think a sufficient resolution will be applying some line-by-line processing, with two "exceptions":

- files with huge lines without `\n` would be a known limitation - usually we don't collect such huge files
- ideally, we should do an inline replacement instead of building a new file and copying the whole content to the original location (can't this make the implementation tricky?)
Two approaches:

1) Replace the code like https://github.com/sosreport/sos/blob/master/sos/report/plugins/__init__.py#L890-L894 with something like:

```python
with open(path) as infile:
    for line in infile:
        apply_regexp(line)  # per-line substitution instead of whole-file re.subn
```
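Fleshed out slightly, approach 1 could write the substituted lines to a temporary file and atomically swap it over the original, which sidesteps the inline-replacement concern above (a sketch; `apply_regexp` stands in for whatever per-line substitution callable the plugin supplies):

```python
import os
import shutil
import tempfile

def sub_file_by_line(path, apply_regexp):
    """Rewrite 'path' line by line without holding the whole file in memory."""
    with open(path) as infile, tempfile.NamedTemporaryFile(
            "w", dir=os.path.dirname(path) or ".", delete=False) as outfile:
        for line in infile:
            outfile.write(apply_regexp(line))
    shutil.copystat(path, outfile.name)  # keep the original file's mode/times
    os.replace(outfile.name, path)       # atomic swap on POSIX
```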
2) Use `SoSKeyWordParser` and `SoSKeywordMap`, which basically mimic what we want. They would have to be enhanced in two ways:

- `obfuscatedword` in https://github.com/sosreport/sos/blob/master/sos/cleaner/mappings/keyword_map.py#L27 should be customized, as we currently offer a replacement string - but I think we can drop this "feature" and use one common scrubbed-text pattern every time. Note that using the `SoSKeywordMap` brings an extra benefit: the same password in multiple files will be replaced by the same scrubbed string, which can help in spotting misconfiguration issues in a sos archive that we can't spot now.
- `SoSKeyWordParser` works on string search, we need a RE search.

Just for my reference, the list of code snippets we need to change: …, `do_file_private_sub` (for multiline replacements).

@TurboTurtle or @bmr-cymru or @slashdd, any preference on "inline replacement or not" and on the approach?
I generally like the idea of using the keyword parser for this, but my concern is with this:
> SoSKeyWordParser works on string search, we need a RE search.
The keyword parser deliberately does not use a regex match so that we don't obfuscate substrings that could then potentially leak what the obfuscated keyword is. So we'd either need to implement a new keyword-like parser (not a big fan of that idea), or be very careful in the way we craft the regex matches (potentially including behind-the-scenes modification of regexes, so non-trivial).
Replacing `obfuscatedwordX` with a standardized `*******` or similar is a good enhancement for plugin-based scrubbing. We just need to ensure … the `--clean` pass.

> I generally like the idea of using the keyword parser for this, but my concern is with this:
>
> > SoSKeyWordParser works on string search, we need a RE search.
>
> The keyword parser deliberately does not use a regex match so that we don't obfuscate substrings that could then potentially leak what the obfuscated keyword is. So we'd either need to implement a new keyword-like parser (not a big fan of that idea), or be very careful in the way we craft the regex matches (potentially including behind-the-scenes modification of regexes, so non-trivial).
Could you please elaborate here? Expecting the `KeyWordParser` to be called by e.g. `do_cmd_output_sub` to replace a given R.E. (by asterisks or an `obfuscatedWord001`-like text), what leakage of substrings do you refer to?
> Replacing `obfuscatedwordX` with a standardized `*******` or similar is a good enhancement for plugin-based scrubbing. We just need to ensure:
No, I meant the opposite direction here :) - use the current scrubbed text `obfuscatedwordX` also for the plugin-based obfuscations. That is, we would forcefully replace calls like:
```python
self.do_cmd_output_sub("crm configure show",
                       r"passw([^\s=]*)=\S+",
                       r"passw\1=********")
```
by:
```python
# the pattern gains a second group so that \2 captures the password value
self.do_cmd_output_sub("crm configure show",
                       r"passw([^\s=]*)=(\S+)",
                       r"\2")
```
telling `do_cmd_output_sub` to scrub exactly the string in the `\2` backreference - not with asterisks, but with an `obfuscatedStringX` (and the parser can keep a map of the strings we replaced, such that the same string - e.g. the same password occurring in multiple places - will always be replaced by the same `obfuscatedStringX`).
This can help to spot potential misconfigurations where the same password is supposed to be stored in two places but the values differ, like:

```
/etc/foo.conf: password=someSecret
/etc/bar.conf: fooPassword=somethingOther
```

Currently we replace either password with asterisks, so we can't check the password strings for equivalence. With proper usage of the keyword parser, we would see:

```
/etc/foo.conf: password=obfuscatedString7
/etc/bar.conf: fooPassword=obfuscatedString9
```

and we would know - even after the obfuscation - that there is a misconfiguration problem behind it.
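The consistent-token behavior described above is essentially a memoized replacement map. A toy illustration of the idea, independent of the real `SoSKeywordMap` (every name below is made up for the example):

```python
import re

class TokenMap:
    """Hand each distinct matched string a stable obfuscatedStringX token."""

    def __init__(self):
        self.mapping = {}

    def token_for(self, secret):
        if secret not in self.mapping:
            self.mapping[secret] = "obfuscatedString%d" % (len(self.mapping) + 1)
        return self.mapping[secret]

tokens = TokenMap()

def sub_with_tokens(pattern, line):
    """Replace capture group 2 of each match with its stable token."""
    return re.sub(pattern, lambda m: m.group(1) + tokens.token_for(m.group(2)), line)

# The same secret maps to the same token, even across different files:
print(sub_with_tokens(r"(passw[^\s=]*=)(\S+)", "/etc/foo.conf: password=someSecret"))
print(sub_with_tokens(r"(fooPassw[^\s=]*=)(\S+)", "/etc/bar.conf: fooPassword=someSecret"))
# -> both lines end in 'obfuscatedString1'
```

Keeping `tokens` only in memory for the lifetime of one sos run avoids persisting secrets, at the cost of consistency across runs (see the discussion below).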
My problem with saving the plugin-matched strings is this:

1. I'm not a fan of having `/etc/sos/cleaner/default_mapping`, or any individual private mapping files, having every password we come across stored in plain text in it. That then becomes a very attractive target for ex-filtration, especially if a `default_mapping` is built up over time on the same system.
2. If a password is used twice, then both will map to the same `obfuscatedwordX` string. Now, even in an obfuscated sos report, a malicious actor could at the very least determine that the same password is being re-used in multiple places. Even if those passwords are scrubbed, that's valuable data.

EDIT: I think one course we have here would be to loop in security teams/minds from Red Hat and Canonical for input on this.
> My problem with saving the plugin-matched strings is this:
>
> 1. I'm not a fan of having `/etc/sos/cleaner/default_mapping`, or any individual private mapping files, having every password we come across stored in plain text in it. That then becomes a very attractive target for ex-filtration, especially if a `default_mapping` is built up over time on the same system.

This makes sense. I agree that if we used the cleaner/mapper directly for these purposes, we shouldn't write the mapping anywhere, for exactly that reason, and should generate it on the fly on every sos run. That makes subsequent sosreport archives inconsistent with respect to the mapping - hence my proposed feature is less usable.

> 2. If a password is used twice, then both will map to the same `obfuscatedwordX` string. Now, even in an obfuscated sos report, a malicious actor could at the very least determine that the same password is being re-used in multiple places. Even if those passwords are scrubbed, that's valuable data.
>
> EDIT: I think one course we have here would be to loop in security teams/minds from Red Hat and Canonical for input on this.
Good point, if we went the `obfuscatedwordX` way. Now it rather seems that 1) the feature would have its limitations (the inconsistent mapping across subsequent sos runs), and 2) we would have to customize/parametrize the cleaner in several ways, so I do not see much benefit in this approach: the "code deduplication" would require more amending of the cleaner for just this one use case than the duplicated/similar code it would remove is worth.
Until there is some simple enhancement of the sos cleaner, I would prefer (not demand, just prefer) "in place" changes directly in the `Plugin` class.
A somewhat related point: `re.subn` (which we use nowadays) is tragically inefficient at processing long lines. There is a real use case behind this that I reduced to a `re.subn` performance problem. With a file that has 100k chars on one line (like JSON without newlines), it starts to affect overall execution times. See this testing script:
```python
import codecs
import re
import sys

if len(sys.argv) < 2:
    print("missing argument with filename")
    sys.exit(1)

content = codecs.open(sys.argv[1], "r", encoding='utf-8', errors='ignore').read()
regexp = r"(.*password=)(\S*)"
subst = r"\1********"  # raw string, so \1 is a backreference and not a control char
_, replacements = re.subn(regexp, subst, content)
print("%s replacements" % replacements)
```
Then prepare a 100k file, once split into 1000 lines and once as a single line:

```sh
for i in $(seq 1 1000); do echo "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"; done > aas.100k.txt
tr '\n' ' ' < aas.100k.txt > aas.100k.oneline.txt
```
Compare the script on both inputs:

```
# time python3 test_re_subn.py aas.100k.txt
0 replacements

real	0m0.109s
user	0m0.050s
sys	0m0.040s
#
# time python3 test_re_subn.py aas.100k.oneline.txt
0 replacements

real	0m7.942s
user	0m7.924s
sys	0m0.014s
#
```
With a 200kB file instead of 100kB, the one-line input took 50 seconds(!), while the multi-line input stayed around 0.1s. So when we implement this issue, we should also take this huge inefficiency into consideration.
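For completeness, the blow-up can be pinned on the leading `(.*)` group rather than on `re.subn` itself: at every candidate start position the engine consumes the rest of the line and then backtracks looking for `password=`, so the work grows super-linearly with line length. A quick comparison (timings are machine-dependent, and the two patterns are not strictly equivalent when `password=` occurs more than once on a line):

```python
import re
import timeit

one_line = "a" * 100_000  # stands in for aas.100k.oneline.txt

# Leading .* forces a consume-and-backtrack pass from every start position.
slow = timeit.timeit(
    lambda: re.subn(r"(.*password=)(\S*)", r"\1********", one_line), number=1)

# Anchoring on the literal lets the engine skip ahead with a fast scan.
fast = timeit.timeit(
    lambda: re.subn(r"password=(\S*)", r"password=********", one_line), number=1)

print("with leading .*: %.3fs   without: %.5fs" % (slow, fast))
```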
Taking that back - it was our own fault, in the virtwho plugin: #2527.
Currently the file and command output substitution methods in `sos.plugins.Plugin` read the entire file into memory before applying substitutions and re-writing the output:

- `do_cmd_output_sub()`
- `do_cmd_private_sub()`
- `do_file_sub()`
- `do_file_private_sub()`
- `do_path_regex_sub()`

This leads to memory pressure when processing large files or command output, since the entire content is read into memory before processing.