sosreport / sos

A unified tool for collecting system logs and other debug information
http://sos.rtfd.org
GNU General Public License v2.0
511 stars 543 forks source link

[MAC cleaner] too relaxed IPV6_REG_4HEX matching UUIDs #3736

Open pmoravec opened 3 months ago

pmoravec commented 3 months ago

User story: cleaner runs for three days. The cause is it cleans postgres logs like:

2024-07-29 13:00:23 CEST LOG:  duration: 1195.895 ms  execute <unnamed>: SELECT "katello_rpms"."id", "katello_rpms"."pulp_id" FROM "katello_rpms" WHERE (pulp_id in ('/pulp/api/v3/content/rpm/packages/0190fe17-d1a6-7066-aa33-e7422232031f/','/pulp/api/v3/content/rpm/packages/0190fe17-d1ae-7091-a034-22ade2121dc0/','/pulp/api/v3/content/rpm/packages/0190fe17-d1b2-7377-bb34-47a9850ecb44/',..

with a list of >10k UUIDs - and MAC cleaner treats all the UUIDs as IPV6_REG_4HEX address (cf https://github.com/sosreport/sos/blob/main/sos/cleaner/parsers/mac_parser.py#L23-L26):

>>> IPV6_REG_4HEX = (
...     r'((?<!([0-9a-fA-F\'\"]:)|::)(([^:\-]?[0-9a-fA-F]{4}(:|-)){3}'
...     r'[0-9a-fA-F]{4}(\'|\")?(\/|\,|\-|\.|\s|$)))'
... )
>>> import re
>>> re.search(IPV6_REG_4HEX, '/pulp/api/v3/content/rpm/packages/0190fe17-d1ae-7091-a034-22ade2121dc0/')
<re.Match object; span=(37, 58), match='0fe17-d1ae-7091-a034-'>
>>> 

There are three issues here:

I was also thinking "aren't we reinventing wheel?" but I haven't found an ultimate regular expression for MAC address :-o (do I use DuckDuckGo wrong?).

EDIT: why not "just" extend https://www.geeksforgeeks.org/how-to-validate-mac-address-using-regular-expression/ :

'^([0-9A-Fa-f]{2}[:-]){5}([0-9A-Fa-f]{2})|([0-9a-fA-F]{4}\\.[0-9a-fA-F]{4}\\.[0-9a-fA-F]{4})$'

"just" with negative lookbehind and negative "look forward"? Why the three formats..?

TurboTurtle commented 3 months ago

MAC addresses have instilled in me a new level of loathing for regex.

I was also thinking "aren't we reinventing wheel?"

A bit, yes. The issue is largely that the stock examples commonly found assume that the string to be matched against is a discreet unit having a straight forward comparison to make. What I mean by this practically can be seen by looking at your example source - the test strings are nice and simple cases, but what we found originally when implementing cleaner many years ago, is that these regexes largely fall on their face when given log content.

Logs will have the nice and simple use case of a MAC address or similar separated by spaces, but they also have:

etc...etc... you get the idea.

Why the three formats..?

IPv6 mac addresses have 2 accepted formats, aa:bb:cc:dd:ee:ff::ab:cd and aabb:ccdd:eeff:abcd. IPv4 mac addresses (aa:bb:cc:dd:ee:ff) can be found as substrings of IPv6 addresses of the first format, but we don't want to match on those.

But can't we 1) enforce just one type of separator is used, no mixed ones

We could, and then we'd likely have things slipping through the cracks and end users (rightfully) complaining that we're not obfuscating everything.

If the extended look-behinds don't interfere with our ability to obfuscate addresses that aren't simply bracketed by spaces, for example my log message has value=aa:bb:cc:dd:ee:ff:ab:cd, then by all means let's improve this.

pierg75 commented 3 months ago

When we talk about the IPv6 mac addresses, do we mean the MAC address included in the IPv6 address? If this is the case (which it shouldn't be applied anymore because of https://datatracker.ietf.org/doc/html/rfc7217), then shouldn't the IPv6 regex take care of that?

pmoravec commented 2 months ago

I tried to come up with some regexp that will cover all my points (dont match UUIDs, force same separator at all positions) but I lack specification of the IPv6 mac addresses - or specification what we should obfuscate and what not.

So I wrote a simple script with test cases with strings that should (or not) be obfuscated:

#!/usr/bin/python3
import re, sys

IPV6_REG_4HEX_OLD = (
    r'((?<!([0-9a-fA-F\'\"]:)|::)(([^:\-]?[0-9a-fA-F]{4}(:|-)){3}'
    r'[0-9a-fA-F]{4}(\'|\")?(\/|\,|\-|\.|\s|$)))'
)

IPV6_REG_4HEX_NEW = (
        r'((?<![[0-9a-fA-F:]:|\w])(([0-9a-fA-F]{4}:){3}[0-9a-fA-F]{4})(?![:|\w]))|'
        r'((?<![[0-9a-fA-F-]-|\w])(([0-9a-fA-F]{4}-){3}[0-9a-fA-F]{4})(?![-|\w]))'
)

IPV6_REG_4HEX = IPV6_REG_4HEX_OLD if (len(sys.argv)>1 and sys.argv[1]=="old") else IPV6_REG_4HEX_NEW

test_strings = (
        (False, '/pulp/api/v3/content/rpm/packages/0190fe17-d1ae-7091-a034-22ade2121dc0/'),
        (False, '0190fe17-d1ae-7091-a034-22ade2121dc0'),
        (True, 'aabb:ccee:ddee:ffaa'),
        (True, 'line=aabb:ccee:ddee:ffaa'),
        (True, 'line=aabb:ccee:ddee:ffaa, something else'),
        (True, 'address:aabb:ccee:ddee:ffaa'),
        (True, 'addr=[aabb:ccee:ddee:ffaa]'),
        (True, 'addr="aabb:ccee:ddee:ffaa"'),
        (True, "addr='aabb:ccee:ddee:ffaa'"),
        (True, 'aabb:ccee:ddee:ffaa, but not aa:bb:cc:dd'),
        (True, 'line=aabb:ccee:ddee:ffaa-should still match'),
        (False, 'not aabb:ccee:ddee:ffaaaaaa'),
#        (, ''),
#        (, ''),
#        (, ''),
)

intro=f"Matching against {IPV6_REG_4HEX}"
print(intro)
print("-"*len(intro))
print()

for match, line in test_strings:
    search = re.search(IPV6_REG_4HEX, line)
    print(f"{line}\t:\t\t{search}")
    if (None!=search) != match:
        print(f"\t\tERROR {line} should {'' if match else 'not '}match, but re.search={search}")

When running with argument old, current RE is used. When running without an argument, the new RE:

IPV6_REG_4HEX_NEW = (
        r'((?<![[0-9a-fA-F:]:|\w])(([0-9a-fA-F]{4}:){3}[0-9a-fA-F]{4})(?![:|\w]))|'
        r'((?<![[0-9a-fA-F-]-|\w])(([0-9a-fA-F]{4}-){3}[0-9a-fA-F]{4})(?![-|\w]))'
)

is used.

Please show me examples where the new RE does not work as you expect - in either way.

TurboTurtle commented 2 months ago
logged_mac:aabb:ccee:ddee:ffaa=some_other_data  :       None
        ERROR logged_mac:aabb:ccee:ddee:ffaa=some_other_data should match, but re.search=None

To be fair, it appears our current regex also fails here (which I thought we had previously fixed) - but this is the kind of logging that I'm referring to. Where there are no spaces breaking up the mac addr from the rest of the string.

Also,

logged_mac:aabb:ccee:ddee:ffaa some_other_data  :       None
        ERROR logged_mac:aabb:ccee:ddee:ffaa some_other_data should match, but re.search=None

This fails when there is a trailing space.

pmoravec commented 2 months ago

Current cleaner fails to identify either MAC address, now. SO at least there is no regression :) In both cases, the key problem is the prefix before the address.

OK, I will come up with something better. But why the current R.E. even has the "MAC address can't follow after :: or after [0-9a-fA-F\'\"]:" condition? I just tried to respect this, to prevent regression. What strings (starting by e.g. :: followed by "looks-like MAC address") do not contain a MAC address? Does e.g. ::aabb:ccee:ddee:ffaa contain a MAC address? Current regexp does not think so.

So what can not be before the match?

pmoravec commented 2 months ago

What about simple:

IPV6_REG_4HEX_NEW = (
        r'((?<!\w)(([0-9a-fA-F]{4}:){3}[0-9a-fA-F]{4})(?!\w))|'
        r'((?<!\w)(([0-9a-fA-F]{4}-){3}[0-9a-fA-F]{4})(?!\w))'
)

? I.e. disallow just word characters prior and after the MAC address.

All these tests passed (I added a few at the end):

test_strings = (
    (False, '/pulp/api/v3/content/rpm/packages/0190fe17-d1ae-7091-a034-22ade2121dc0/'),
    (False, '0190fe17-d1ae-7091-a034-22ade2121dc0'),
    (True, 'aabb:ccee:ddee:ffaa'),
    (True, 'line=aabb:ccee:ddee:ffaa'),
    (True, 'line=aabb:ccee:ddee:ffaa, something else'),
    (True, 'address:aabb:ccee:ddee:ffaa'),
    (True, 'addr=[aabb:ccee:ddee:ffaa]'),
    (True, 'addr="aabb:ccee:ddee:ffaa"'),
    (True, "addr='aabb:ccee:ddee:ffaa'"),
    (True, 'aabb:ccee:ddee:ffaa, but not aa:bb:cc:dd'),
    (True, 'line=aabb:ccee:ddee:ffaa-should still match'),
    (False, 'not aabb:ccee:ddee:ffaaaaaa'),
    (True, 'logged_mac:aabb:ccee:ddee:ffaa=some_other_data'),
    (True, 'logged_mac:aabb:ccee:ddee:ffaa some_other_data'),
    (True, '::aabb:ccee:ddee:ffaa'),
    (True, 'we should obfuscate addr: aabb:ccee:ddee:ffaa: YES'),
)
TurboTurtle commented 2 months ago

LGTM, let's get that PR open :)