Closed Impelon closed 2 years ago
Thanks for the feedback, I'm glad we found a workable solution! I'll do my best to incorporate your suggestions.
@davidohana I tried to consider all suggestions, hopefully I haven't missed much.
The doc of get_parameter_details_list
is still not great;
I'm not sure how to describe the differences between get_parameter_details_list
and get_parameter_list
without just pointing out the different return values.
mask_name_to_spec
and those respective spec-tuples themselves are not necessary anymore, since they were just storing the information needed from MaskingInstruction
.
Instead I've refactored the masking
-module a bit and made LogMasker
keep a map of mask-names to MaskingInstruction
-objects.
I hope you approve of the refactoring made to masking
, even if it is not essential to the goal of this PR.
As for performance; The changes made seem to have helped quite a bit with performance at least for my configuration.
When using caching on _template_to_regex
, the parameter extraction was at least as fast as in v0.9.8, if not faster.
And this is easily implemented with the @cached
-decorator from cacheutils
(which you are already using) or @lru_cache
from functools
.
@cached(LRUCache(1000))
def _template_to_regex(self, log_template: str, exact_matching: bool):
I did not include this small change in the PR (yet), because I expect that a config-option should be introduced to specify the maximum size of this if wished for, but I'm unsure how to call it and how to update the config without breaking backwards compatibility.
@davidohana Thanks for the in-depth review! If possible, could you go through my answers to your feedback before I make another revision, as some things are still unclear to me. (Sorry for the messy review-history. Apparently I started a review yesterday and never closed it, so my comments were all over the place in the timeline. Things should be more clear now, since I closed it.)
@Impelon
I worked on simplification of _template_to_regex()
.
I think that the version below is simpler and more elegant
def _template_to_regex(self, log_template: str, exact_matching: bool):
param_group_name_to_mask_name = dict()
param_name_counter = [0]
# Create a named group with the respective patterns for the given mask-name.
def create_capture_regex(_mask_name):
allowed_patterns = []
if exact_matching and _mask_name in self.masker.mask_names:
for mi in self.masker.instructions_by_mask_name(_mask_name):
# MaskingInstruction may already contain named groups.
# we convert named groups into unnamed groups to avoid unnecessary captures
# and duplicate groups names
mi_groups = mi.regex.groupindex.keys()
pattern = mi.pattern
for group_name in mi_groups:
_search_str = r"(?P=<{}>".format(group_name)
pattern = pattern.replace(_search_str, "(")
_search_str = r"(?P<{}>".format(group_name)
pattern = pattern.replace(_search_str, "(")
allowed_patterns.append(pattern)
if _mask_name == "*" or not exact_matching:
allowed_patterns.append(r"\S+")
# give each capture group a unique name to avoid conflicts
param_group_name = "p_" + str(param_name_counter[0])
param_name_counter[0] += 1
param_group_name_to_mask_name[param_group_name] = _mask_name
joined_patterns = "|".join(allowed_patterns)
capture_regex = "(?P<{}>{})".format(param_group_name, joined_patterns)
return capture_regex
# For every mask in the template, replace it with a named group of all
# possible masking-patterns it could represent (in order).
mask_names = set(self.masker.mask_names)
mask_names.add("*")
escaped_prefix = re.escape(self.masker.mask_prefix)
escaped_suffix = re.escape(self.masker.mask_suffix)
template_regex = re.escape(log_template)
for mask_name in mask_names:
search_str = escaped_prefix + re.escape(mask_name) + escaped_suffix
while True:
rep_str = create_capture_regex(mask_name)
# replace one-by-one to get a new param group name for each replacement
template_regex_new = template_regex.replace(search_str, rep_str, 1)
# break when all replaces for this mask are done
if template_regex_new == template_regex:
break
template_regex = template_regex_new
template_regex = re.sub(r"\\ ", r"\\s+", template_regex)
template_regex = "^" + template_regex + "$"
return template_regex, param_group_name_to_mask_name
Please tell me when you are ready for another review iteration.
@davidohana Regarding your simplification of _template_to_regex
:
str.replace
in stead of re.sub
is a better approach in this case and requires less escaping.(?P=some-name)
will break, as that represents a backreference to a previously matched named group!As an example (?P<quote>[\"'`]).*?(?P=quote)
will match correctly quoted strings.
As I was aware of this, I should have added a test for this, and I will do that now.
Technically, one could replace a named back-reference (?P=some-name)
with an unnamed one \3
where 3
is the index of the group with name some-name
, but this will only work if the pattern has less than 100 capturing groups overall, as \100
does not represent a backreference to the 100th capturing group.
re.sub
supports higher numbered groups using \g<some-number>
, but this only works in the replacement value anyways, so it does not matter.
I hope noone actually uses patterns that have over 99 capturing groups. And while this would work inside one singular pattern, this will not work in our case because we combine multiple patterns inside a single pattern.
Thinking about this, I immediately notice another problem, namely that a masking-pattern like ([\"'`]).*?\1
will not work in the overall template_regex
, and that actually all groups need to be named, so that backreferences like \1
can be replaced with a named backreference.
Will add some more tests to check this specifically.
Fine, in this case we re-add the replacement of named captures in masks to the simpler _template_to_regex()
impl.
here is a fix:
def _template_to_regex(self, log_template: str, exact_matching: bool):
param_group_name_to_mask_name = dict()
param_name_counter = [0]
def get_next_param_name():
param_group_name = "p_" + str(param_name_counter[0])
param_name_counter[0] += 1
return param_group_name
# Create a named group with the respective patterns for the given mask-name.
def create_capture_regex(_mask_name):
allowed_patterns = []
if exact_matching and _mask_name in self.masker.mask_names:
for mi in self.masker.instructions_by_mask_name(_mask_name):
# MaskingInstruction may already contain named groups.
# we replace group names in those named groups, to avoid duplicate groups names
mi_groups = mi.regex.groupindex.keys()
pattern = mi.pattern
for group_name in mi_groups:
param_group_name = get_next_param_name()
def replace_captured_param_name(param_pattern):
_search_str = param_pattern.format(group_name)
_replace_str = param_pattern.format(param_group_name)
return pattern.replace(_search_str, _replace_str)
pattern = replace_captured_param_name("(?P={}")
pattern = replace_captured_param_name("(?P<{}>")
allowed_patterns.append(pattern)
if _mask_name == "*" or not exact_matching:
allowed_patterns.append(r"\S+")
# give each capture group a unique name to avoid conflicts
param_group_name = get_next_param_name()
param_group_name_to_mask_name[param_group_name] = _mask_name
joined_patterns = "|".join(allowed_patterns)
capture_regex = "(?P<{}>{})".format(param_group_name, joined_patterns)
return capture_regex
# For every mask in the template, replace it with a named group of all
# possible masking-patterns it could represent (in order).
mask_names = set(self.masker.mask_names)
mask_names.add("*")
escaped_prefix = re.escape(self.masker.mask_prefix)
escaped_suffix = re.escape(self.masker.mask_suffix)
template_regex = re.escape(log_template)
for mask_name in mask_names:
search_str = escaped_prefix + re.escape(mask_name) + escaped_suffix
while True:
rep_str = create_capture_regex(mask_name)
# replace one-by-one to get a new param group name for each replacement
template_regex_new = template_regex.replace(search_str, rep_str, 1)
# break when all replaces for this mask are done
if template_regex_new == template_regex:
break
template_regex = template_regex_new
template_regex = re.sub(r"\\ ", r"\\s+", template_regex)
template_regex = "^" + template_regex + "$"
return template_regex, param_group_name_to_mask_name
@davidohana This should address all the remaining feedback. If you do not like the name of the cache-config-option you can edit the PR directly! It's certainly not the most creative name.
Otherwise I really like the approach to rename every named-group; I did not think about that, but it really does simplify things.
For the problem of unnamed backreferences \1
- \99
, I decided to not provide a full solution, but just an approximation using (?:.?+)
again. Giving every capturing group a name just so unnamed backreferences work here seems overkill.
(I have added a test using a masking-pattern that includes an unnamed backreference, and it works; but this will not produce correct results for more complex backreferences.)
As the same can be archived using named backreferences, the function-doc simply suggests to use those instead.
OK, I am satisfied with the code as it is.. at last :-) If you agree, I can merge.
Looks good! Thank you for all the feedback, I think we had a very constructive review-process :)
It was a pleasure to work on this with you. Many thanks for your great contribution!
This is a solution for #49 that will work in most cases. It works by replacing a mask in the template with the pattern corresponding to that mask. If there are multiple
MaskingInstruction
with the samemask_with
-value, their patterns will be seen as alternatives, joined using|
. In this case, the order of the patterns will be the same as the order of theMaskingInstruction
-objects, preserving the same machting-priorities.Because named groups may only be present once in a regex-pattern, the code is a bit more complicated, as named groups need to be renamed to be made unique.
The new method will work best if
mask_with
is unique across allMaskingInstruction
-objects, because the patterns cannot be confused. Additionally not using*
as the value of anymask_with
, will ensure that only the exact pattern of thatMaskingInstruction
is being matched. For better performance, any masking pattern should avoid using named groups.Because the new method may be more performance-heavy, I've introduced a boolean-option to skip some computation steps, which hopefully means it remains comparatively fast. Deactivating
exact_macthing
will be faster, but suffer from the problems laid out in #49. (Note: In my use-case it slowed down the overall process of parsing and fetching the parameters by a factor of around 2.) There may be some optimizations one could try to further reduce the performance-impact of the new method I will point out as PR-comments.This PR includes the extended tests from #51 and adds some of its own. Feel free to make any changes!
Bonus: As it is useful in my use-case and the overhead for returning the masks is low, I've also added a
get_parameter_and_mask_list
method, that returns pairs of parameters and corresponding masks.