Added support for accurate log message params extraction

logpai / Drain3

A robust streaming log template miner based on the Drain algorithm

Other

470 stars 134 forks source link

Added support for accurate log message params extraction #52

Closed Impelon closed 2 years ago

Impelon commented 2 years ago

This is a solution for #49 that will work in most cases. It works by replacing a mask in the template with the pattern corresponding to that mask. If there are multiple MaskingInstruction with the same mask_with-value, their patterns will be seen as alternatives, joined using |. In this case, the order of the patterns will be the same as the order of the MaskingInstruction-objects, preserving the same machting-priorities.

Because named groups may only be present once in a regex-pattern, the code is a bit more complicated, as named groups need to be renamed to be made unique.

The new method will work best if mask_with is unique across all MaskingInstruction-objects, because the patterns cannot be confused. Additionally not using * as the value of any mask_with, will ensure that only the exact pattern of that MaskingInstruction is being matched. For better performance, any masking pattern should avoid using named groups.

Because the new method may be more performance-heavy, I've introduced a boolean-option to skip some computation steps, which hopefully means it remains comparatively fast. Deactivating exact_macthing will be faster, but suffer from the problems laid out in #49. (Note: In my use-case it slowed down the overall process of parsing and fetching the parameters by a factor of around 2.) There may be some optimizations one could try to further reduce the performance-impact of the new method I will point out as PR-comments.

This PR includes the extended tests from #51 and adds some of its own. Feel free to make any changes!

Bonus: As it is useful in my use-case and the overhead for returning the masks is low, I've also added a get_parameter_and_mask_list method, that returns pairs of parameters and corresponding masks.

Impelon commented 2 years ago

Thanks for the feedback, I'm glad we found a workable solution! I'll do my best to incorporate your suggestions.

Impelon commented 2 years ago

@davidohana I tried to consider all suggestions, hopefully I haven't missed much.

The doc of get_parameter_details_list is still not great; I'm not sure how to describe the differences between get_parameter_details_list and get_parameter_listwithout just pointing out the different return values.

mask_name_to_spec and those respective spec-tuples themselves are not necessary anymore, since they were just storing the information needed from MaskingInstruction. Instead I've refactored the masking-module a bit and made LogMasker keep a map of mask-names to MaskingInstruction-objects. I hope you approve of the refactoring made to masking, even if it is not essential to the goal of this PR.

Impelon commented 2 years ago

As for performance; The changes made seem to have helped quite a bit with performance at least for my configuration.

When using caching on _template_to_regex, the parameter extraction was at least as fast as in v0.9.8, if not faster. And this is easily implemented with the @cached-decorator from cacheutils (which you are already using) or @lru_cache from functools.

@cached(LRUCache(1000))
def _template_to_regex(self, log_template: str, exact_matching: bool):

I did not include this small change in the PR (yet), because I expect that a config-option should be introduced to specify the maximum size of this if wished for, but I'm unsure how to call it and how to update the config without breaking backwards compatibility.

Impelon commented 2 years ago

@davidohana Thanks for the in-depth review! If possible, could you go through my answers to your feedback before I make another revision, as some things are still unclear to me. (Sorry for the messy review-history. Apparently I started a review yesterday and never closed it, so my comments were all over the place in the timeline. Things should be more clear now, since I closed it.)

davidohana commented 2 years ago

@Impelon

I worked on simplification of _template_to_regex(). I think that the version below is simpler and more elegant

Less escaping of strings.
Param names get sequential numbers
Param names in masks are now just removed It passes all tests. Please check on your use case, and if you agree, please merge it in your fork. Sorry if you get any merge conflicts, due to our simultaneous work.

    def _template_to_regex(self, log_template: str, exact_matching: bool):
        param_group_name_to_mask_name = dict()
        param_name_counter = [0]

        # Create a named group with the respective patterns for the given mask-name.
        def create_capture_regex(_mask_name):
            allowed_patterns = []
            if exact_matching and _mask_name in self.masker.mask_names:
                for mi in self.masker.instructions_by_mask_name(_mask_name):
                    # MaskingInstruction may already contain named groups.
                    # we convert named groups into unnamed groups to avoid unnecessary captures
                    # and duplicate groups names
                    mi_groups = mi.regex.groupindex.keys()
                    pattern = mi.pattern
                    for group_name in mi_groups:
                        _search_str = r"(?P=<{}>".format(group_name)
                        pattern = pattern.replace(_search_str, "(")
                        _search_str = r"(?P<{}>".format(group_name)
                        pattern = pattern.replace(_search_str, "(")
                    allowed_patterns.append(pattern)
            if _mask_name == "*" or not exact_matching:
                allowed_patterns.append(r"\S+")
            # give each capture group a unique name to avoid conflicts
            param_group_name = "p_" + str(param_name_counter[0])
            param_name_counter[0] += 1
            param_group_name_to_mask_name[param_group_name] = _mask_name
            joined_patterns = "|".join(allowed_patterns)
            capture_regex = "(?P<{}>{})".format(param_group_name, joined_patterns)
            return capture_regex

        # For every mask in the template, replace it with a named group of all
        # possible masking-patterns it could represent (in order).
        mask_names = set(self.masker.mask_names)
        mask_names.add("*")
        escaped_prefix = re.escape(self.masker.mask_prefix)
        escaped_suffix = re.escape(self.masker.mask_suffix)
        template_regex = re.escape(log_template)
        for mask_name in mask_names:
            search_str = escaped_prefix + re.escape(mask_name) + escaped_suffix
            while True:
                rep_str = create_capture_regex(mask_name)
                # replace one-by-one to get a new param group name for each replacement
                template_regex_new = template_regex.replace(search_str, rep_str, 1)
                # break when all replaces for this mask are done
                if template_regex_new == template_regex:
                    break
                template_regex = template_regex_new

        template_regex = re.sub(r"\\ ", r"\\s+", template_regex)
        template_regex = "^" + template_regex + "$"
        return template_regex, param_group_name_to_mask_name

davidohana commented 2 years ago

Please tell me when you are ready for another review iteration.

Impelon commented 2 years ago

@davidohana Regarding your simplification of _template_to_regex:

I agree, using str.replace in stead of re.sub is a better approach in this case and requires less escaping.
Named groups cannot be removed without breaking some patterns. More specifically, patterns using (?P=some-name) will break, as that represents a backreference to a previously matched named group!
Because named groups cannot be removed, parameter-groups cannot be numbered sequentially, as a masking-pattern might include a named group with the same name.

As an example (?P<quote>[\"'`]).*?(?P=quote) will match correctly quoted strings. As I was aware of this, I should have added a test for this, and I will do that now. Technically, one could replace a named back-reference (?P=some-name) with an unnamed one \3 where 3 is the index of the group with name some-name, but this will only work if the pattern has less than 100 capturing groups overall, as \100 does not represent a backreference to the 100th capturing group. re.sub supports higher numbered groups using \g<some-number>, but this only works in the replacement value anyways, so it does not matter.

I hope noone actually uses patterns that have over 99 capturing groups. And while this would work inside one singular pattern, this will not work in our case because we combine multiple patterns inside a single pattern.

Thinking about this, I immediately notice another problem, namely that a masking-pattern like ([\"'`]).*?\1 will not work in the overall template_regex, and that actually all groups need to be named, so that backreferences like \1 can be replaced with a named backreference.

Will add some more tests to check this specifically.

davidohana commented 2 years ago

Fine, in this case we re-add the replacement of named captures in masks to the simpler _template_to_regex() impl.

davidohana commented 2 years ago

here is a fix:

    def _template_to_regex(self, log_template: str, exact_matching: bool):
        param_group_name_to_mask_name = dict()
        param_name_counter = [0]

        def get_next_param_name():
            param_group_name = "p_" + str(param_name_counter[0])
            param_name_counter[0] += 1
            return param_group_name

        # Create a named group with the respective patterns for the given mask-name.
        def create_capture_regex(_mask_name):
            allowed_patterns = []
            if exact_matching and _mask_name in self.masker.mask_names:
                for mi in self.masker.instructions_by_mask_name(_mask_name):
                    # MaskingInstruction may already contain named groups.
                    # we replace group names in those named groups, to avoid duplicate groups names
                    mi_groups = mi.regex.groupindex.keys()
                    pattern = mi.pattern

                    for group_name in mi_groups:
                        param_group_name = get_next_param_name()

                        def replace_captured_param_name(param_pattern):
                            _search_str = param_pattern.format(group_name)
                            _replace_str = param_pattern.format(param_group_name)
                            return pattern.replace(_search_str, _replace_str)

                        pattern = replace_captured_param_name("(?P={}")
                        pattern = replace_captured_param_name("(?P<{}>")
                    allowed_patterns.append(pattern)
            if _mask_name == "*" or not exact_matching:
                allowed_patterns.append(r"\S+")
            # give each capture group a unique name to avoid conflicts
            param_group_name = get_next_param_name()
            param_group_name_to_mask_name[param_group_name] = _mask_name
            joined_patterns = "|".join(allowed_patterns)
            capture_regex = "(?P<{}>{})".format(param_group_name, joined_patterns)
            return capture_regex

        # For every mask in the template, replace it with a named group of all
        # possible masking-patterns it could represent (in order).
        mask_names = set(self.masker.mask_names)
        mask_names.add("*")
        escaped_prefix = re.escape(self.masker.mask_prefix)
        escaped_suffix = re.escape(self.masker.mask_suffix)
        template_regex = re.escape(log_template)
        for mask_name in mask_names:
            search_str = escaped_prefix + re.escape(mask_name) + escaped_suffix
            while True:
                rep_str = create_capture_regex(mask_name)
                # replace one-by-one to get a new param group name for each replacement
                template_regex_new = template_regex.replace(search_str, rep_str, 1)
                # break when all replaces for this mask are done
                if template_regex_new == template_regex:
                    break
                template_regex = template_regex_new

        template_regex = re.sub(r"\\ ", r"\\s+", template_regex)
        template_regex = "^" + template_regex + "$"
        return template_regex, param_group_name_to_mask_name

Impelon commented 2 years ago

@davidohana This should address all the remaining feedback. If you do not like the name of the cache-config-option you can edit the PR directly! It's certainly not the most creative name.

Otherwise I really like the approach to rename every named-group; I did not think about that, but it really does simplify things.

For the problem of unnamed backreferences \1 - \99, I decided to not provide a full solution, but just an approximation using (?:.?+) again. Giving every capturing group a name just so unnamed backreferences work here seems overkill. (I have added a test using a masking-pattern that includes an unnamed backreference, and it works; but this will not produce correct results for more complex backreferences.) As the same can be archived using named backreferences, the function-doc simply suggests to use those instead.

davidohana commented 2 years ago

OK, I am satisfied with the code as it is.. at last :-) If you agree, I can merge.

Impelon commented 2 years ago

Looks good! Thank you for all the feedback, I think we had a very constructive review-process :)

davidohana commented 2 years ago

It was a pleasure to work on this with you. Many thanks for your great contribution!