Open pmoravec opened 3 months ago
MAC addresses have instilled in me a new level of loathing for regex.
I was also thinking "aren't we reinventing wheel?"
A bit, yes. The issue is largely that the stock examples commonly found assume that the string to be matched against is a discreet unit having a straight forward comparison to make. What I mean by this practically can be seen by looking at your example source - the test strings are nice and simple cases, but what we found originally when implementing cleaner many years ago, is that these regexes largely fall on their face when given log content.
Logs will have the nice and simple use case of a MAC address or similar separated by spaces, but they also have:
etc...etc... you get the idea.
Why the three formats..?
IPv6 mac addresses have 2 accepted formats, aa:bb:cc:dd:ee:ff::ab:cd
and aabb:ccdd:eeff:abcd
. IPv4 mac addresses (aa:bb:cc:dd:ee:ff
) can be found as substrings of IPv6 addresses of the first format, but we don't want to match on those.
But can't we 1) enforce just one type of separator is used, no mixed ones
We could, and then we'd likely have things slipping through the cracks and end users (rightfully) complaining that we're not obfuscating everything.
If the extended look-behinds don't interfere with our ability to obfuscate addresses that aren't simply bracketed by spaces, for example my log message has value=aa:bb:cc:dd:ee:ff:ab:cd
, then by all means let's improve this.
When we talk about the IPv6 mac addresses, do we mean the MAC address included in the IPv6 address? If this is the case (which it shouldn't be applied anymore because of https://datatracker.ietf.org/doc/html/rfc7217), then shouldn't the IPv6 regex take care of that?
I tried to come up with some regexp that will cover all my points (dont match UUIDs, force same separator at all positions) but I lack specification of the IPv6 mac addresses - or specification what we should obfuscate and what not.
So I wrote a simple script with test cases with strings that should (or not) be obfuscated:
#!/usr/bin/python3
import re, sys
IPV6_REG_4HEX_OLD = (
r'((?<!([0-9a-fA-F\'\"]:)|::)(([^:\-]?[0-9a-fA-F]{4}(:|-)){3}'
r'[0-9a-fA-F]{4}(\'|\")?(\/|\,|\-|\.|\s|$)))'
)
IPV6_REG_4HEX_NEW = (
r'((?<![[0-9a-fA-F:]:|\w])(([0-9a-fA-F]{4}:){3}[0-9a-fA-F]{4})(?![:|\w]))|'
r'((?<![[0-9a-fA-F-]-|\w])(([0-9a-fA-F]{4}-){3}[0-9a-fA-F]{4})(?![-|\w]))'
)
IPV6_REG_4HEX = IPV6_REG_4HEX_OLD if (len(sys.argv)>1 and sys.argv[1]=="old") else IPV6_REG_4HEX_NEW
test_strings = (
(False, '/pulp/api/v3/content/rpm/packages/0190fe17-d1ae-7091-a034-22ade2121dc0/'),
(False, '0190fe17-d1ae-7091-a034-22ade2121dc0'),
(True, 'aabb:ccee:ddee:ffaa'),
(True, 'line=aabb:ccee:ddee:ffaa'),
(True, 'line=aabb:ccee:ddee:ffaa, something else'),
(True, 'address:aabb:ccee:ddee:ffaa'),
(True, 'addr=[aabb:ccee:ddee:ffaa]'),
(True, 'addr="aabb:ccee:ddee:ffaa"'),
(True, "addr='aabb:ccee:ddee:ffaa'"),
(True, 'aabb:ccee:ddee:ffaa, but not aa:bb:cc:dd'),
(True, 'line=aabb:ccee:ddee:ffaa-should still match'),
(False, 'not aabb:ccee:ddee:ffaaaaaa'),
# (, ''),
# (, ''),
# (, ''),
)
intro=f"Matching against {IPV6_REG_4HEX}"
print(intro)
print("-"*len(intro))
print()
for match, line in test_strings:
search = re.search(IPV6_REG_4HEX, line)
print(f"{line}\t:\t\t{search}")
if (None!=search) != match:
print(f"\t\tERROR {line} should {'' if match else 'not '}match, but re.search={search}")
When running with argument old
, current RE is used. When running without an argument, the new RE:
IPV6_REG_4HEX_NEW = (
r'((?<![[0-9a-fA-F:]:|\w])(([0-9a-fA-F]{4}:){3}[0-9a-fA-F]{4})(?![:|\w]))|'
r'((?<![[0-9a-fA-F-]-|\w])(([0-9a-fA-F]{4}-){3}[0-9a-fA-F]{4})(?![-|\w]))'
)
is used.
Please show me examples where the new RE does not work as you expect - in either way.
logged_mac:aabb:ccee:ddee:ffaa=some_other_data : None
ERROR logged_mac:aabb:ccee:ddee:ffaa=some_other_data should match, but re.search=None
To be fair, it appears our current regex also fails here (which I thought we had previously fixed) - but this is the kind of logging that I'm referring to. Where there are no spaces breaking up the mac addr from the rest of the string.
Also,
logged_mac:aabb:ccee:ddee:ffaa some_other_data : None
ERROR logged_mac:aabb:ccee:ddee:ffaa some_other_data should match, but re.search=None
This fails when there is a trailing space.
Current cleaner fails to identify either MAC address, now. SO at least there is no regression :) In both cases, the key problem is the prefix before the address.
OK, I will come up with something better. But why the current R.E. even has the "MAC address can't follow after ::
or after [0-9a-fA-F\'\"]:
" condition? I just tried to respect this, to prevent regression. What strings (starting by e.g. ::
followed by "looks-like MAC address") do not contain a MAC address? Does e.g. ::aabb:ccee:ddee:ffaa
contain a MAC address? Current regexp does not think so.
So what can not be before the match?
What about simple:
IPV6_REG_4HEX_NEW = (
r'((?<!\w)(([0-9a-fA-F]{4}:){3}[0-9a-fA-F]{4})(?!\w))|'
r'((?<!\w)(([0-9a-fA-F]{4}-){3}[0-9a-fA-F]{4})(?!\w))'
)
? I.e. disallow just word characters prior and after the MAC address.
All these tests passed (I added a few at the end):
test_strings = (
(False, '/pulp/api/v3/content/rpm/packages/0190fe17-d1ae-7091-a034-22ade2121dc0/'),
(False, '0190fe17-d1ae-7091-a034-22ade2121dc0'),
(True, 'aabb:ccee:ddee:ffaa'),
(True, 'line=aabb:ccee:ddee:ffaa'),
(True, 'line=aabb:ccee:ddee:ffaa, something else'),
(True, 'address:aabb:ccee:ddee:ffaa'),
(True, 'addr=[aabb:ccee:ddee:ffaa]'),
(True, 'addr="aabb:ccee:ddee:ffaa"'),
(True, "addr='aabb:ccee:ddee:ffaa'"),
(True, 'aabb:ccee:ddee:ffaa, but not aa:bb:cc:dd'),
(True, 'line=aabb:ccee:ddee:ffaa-should still match'),
(False, 'not aabb:ccee:ddee:ffaaaaaa'),
(True, 'logged_mac:aabb:ccee:ddee:ffaa=some_other_data'),
(True, 'logged_mac:aabb:ccee:ddee:ffaa some_other_data'),
(True, '::aabb:ccee:ddee:ffaa'),
(True, 'we should obfuscate addr: aabb:ccee:ddee:ffaa: YES'),
)
LGTM, let's get that PR open :)
User story: cleaner runs for three days. The cause is it cleans postgres logs like:
with a list of >10k UUIDs - and MAC cleaner treats all the UUIDs as
IPV6_REG_4HEX
address (cf https://github.com/sosreport/sos/blob/main/sos/cleaner/parsers/mac_parser.py#L23-L26):There are three issues here:
(?<!([0-9a-fA-F\'\"]:)|::)
must be enhanced to e.g.(?<!([0-9a-fA-F\'\"]:)|::|[0-9a-fA-F])
(or to(?<!([0-9a-fA-F\'\"]:)|::|\w)
?22ade2121dc0
) - no big idea how to modify this (maybe remove the|\-
if-
is used as sub-address separator?)-
as sub-addresses separator (shouldnt we support dot also, per https://www.geeksforgeeks.org/mac-address-in-computer-network/#format-of-mac-address?) - but can't we 1) enforce just one type of separator is used, no mixed ones, 2) no separator just before or just after the match (cf previous point)?I was also thinking "aren't we reinventing wheel?" but I haven't found an ultimate regular expression for MAC address :-o (do I use DuckDuckGo wrong?).
EDIT: why not "just" extend https://www.geeksforgeeks.org/how-to-validate-mac-address-using-regular-expression/ :
"just" with negative lookbehind and negative "look forward"? Why the three formats..?