Ignore Regex: Neither finding a suiting regex nor an "invert" option

georg-d commented 1 year ago

I want to provide documentation on the RegEx feature, so I want to create a patch for Help.md. One example is ready, but the second one raised questions. I managed to make the compare ignore some parts of a line, but I did not find an inverse regex that ignores exactly the other part. I would have reached the goal quickly if ComparePlus had an option to "negate the provided regex" (like -v for grep), or an option "use following pattern for compare" that accepts the common substitution patterns like $1 and feeds the result to the compare. The latter would also allow quite sophisticated things, e.g. re-ordering characters to handle regional differences in date formats. But maybe such options are not required and someone can come up with a regex pattern that ignores everything except the leading 3 digits?

This example is inspired by https://github.com/pnedev/comparePlus/issues/313 but with modified input to show all possible cases. File 1 is

123 same
234 differs in the texts!
345 differs in number
456 differs in text + number, 1st file
56 differs in number
foo

and file 2 is

123 same
234 changes in characters
648 differs in number
789 differs in text + number, 2nd file
9 differs in number
bar

For the help file, I'd like to show a) how to compare only the text after the leading numbers and b) how to compare only the leading numbers (which is what was requested in the issue mentioned above). My attempts and the results:

RegEx example first 3 decimals

1. Comparing both files with default settings is clear to me and satisfies neither of the two goals. It illustrates the motivation, i.e. why one may want to use a regex. Details (you may skip them if the result is clear to you): It highlights lines 4 and 5 as different because the default value of 30% for Min line resemblance to mark as changed is reached (first 6 of total 11 characters are identical), and the last line as new (added/deleted) because 0% of the characters are identical.
2. Opening Ignore Regex..., typing ^\d{3} and clicking Enable is clear to me. It makes the compare ignore the first 3 digits and thus reaches goal a. Details on how it works (you may skip this if the regex is clear to you): This regex makes the compare ignore the first 3 characters if they are decimal digits. So if a line does not start with 3 digits, the whole line is compared; if it does, the compare ignores the first 3 characters and only looks at the remainder of the line, i.e. column 4 and beyond is the relevant part. In line 2, this relevant part has 21 characters of which only 3 (space and "in") don't differ, and because 3/21=14% is below the default of 30% for Min line resemblance to mark as changed, the line is not considered changed but new (added/deleted). In line 3, the relevant part is completely identical, so the line is considered to have no relevant difference and is not marked at all. Line 4 has only a slight difference of 3/34=9% of the characters in the relevant part; hence, it is not considered new (added/deleted) but changed. As lines 5 and 6 do not start with 3 digits, nothing is ignored and they are compared as in the 1st case.
3. Opening Ignore Regex..., typing (?:^\d{3})(.*) and clicking Enable is clear to me and is only nearly a solution for goal b. Details on how it works (you may skip this if the regex is clear to you): 1st parentheses form a non-capturing group thus not "consuming" any characters but just defining the rest of the regex shall only be considered if first 3 characters are decimals (this causes lines 5 and 6 to be highlighted), and the 2nd parentheses fetch any number of any characters, so the whole line is matched and thus ignored by the compare, causing lines 2, 3 and 4 to be considered unchanged. I have not yet found a way to restrict the 2nd parentheses to the part after the three digits, or to leave the 3 digits out completely:
- I tried forbidding a digit at the start of the capturing group, but (?:^\d{3})([^\d].*) does not produce a different result. I do not understand this, as the capturing group cannot include the leading digits, so at least lines 3 and 4 should be considered not equal. You can try out the regex in regex101 or regexr and substitute by $1 to see that it returns exactly the requested content (in rows 1 to 4 the part after column 3, and in rows 5 and 6 the complete row because they do not start with 3 digits).
- I thought it might not be greedy enough and tried (?:^\d{3})([^\d].*+), but that is rejected because of the last + (unsupported syntax).
- A positive lookbehind like (?<=\d{3})([^\d].+) also gives a syntax error; it would need some fine-tuning to work for lines 5 and 6 anyway.
- Atomic groups, conditional statements, control verbs like (*COMMIT) and some other advanced regex syntax also give a syntax error; hence, I did not try many more of them.
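
The results of the attempts above can be reproduced with a small model. This is only a sketch: it assumes that ComparePlus drops the entire text matched by the ignore regex (not just a capture group) before comparing, which is consistent with the results reported here but is not taken from the plugin source. Python's re module stands in for the plugin's regex engine, and strip_ignored is a hypothetical helper name.

```python
import re

FILE1 = [
    "123 same",
    "234 differs in the texts!",
    "345 differs in number",
    "456 differs in text + number, 1st file",
    "56 differs in number",
    "foo",
]

def strip_ignored(line, ignore_regex):
    """Return what is left of `line` after removing every regex match.

    Assumption: the whole match is ignored; capture groups play no role.
    """
    return re.sub(ignore_regex, "", line)

ATTEMPTS = [r"^\d{3}", r"(?:^\d{3})(.*)", r"(?:^\d{3})([^\d].*)"]

for pattern in ATTEMPTS:
    print(pattern)
    for line in FILE1:
        print(f"  {line!r} -> {strip_ignored(line, pattern)!r}")
```

In this model the second and third pattern leave nothing of any line that starts with three digits, so those lines compare as equal in both files. The capture group changes what $1 returns in a substitution, not what is matched.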

I would welcome a regex reaching goal b as much as an announcement that one or both of the options mentioned at the beginning will be added 🙂

pnedev commented 1 year ago

Hello @georg-d ,

Good suggestions, thank you. I do not know when I will have time to work on that, and I'm also not very good with regexes. The shortcomings you describe are directly related to the regex engine used by the ComparePlus plugin, which is the C++ standard library regex implementation. The Notepad++ forum's regex guru @guy038 can help with sophisticated regexes, but as he mentioned in this forum thread, if very advanced regexes are required it would be better for the plugin to use the Boost library Regex engine (as Notepad++ itself does). I have not decided to do so yet; I'll consider it.

I would appreciate a PR with regex examples and help on the Ignore Regex feature in Help.md :+1: Thanks.

BR

guy038 commented 1 year ago

Hello, @georg-d, @pnedev,

First of all, let's imagine a theoretical comparison of only the first line of the two files, then only the second line of the two files and so on, without any regex restriction. Then, normally, depending on the Min line resemblance to mark as changed (%) option, in the settings, which is set to 30 by default, all the lines that differ, except the last, should be considered as changed !

Indeed :

Line 1 : 8 identical characters out of 8 => 8/8 = 100 % => identical lines not marked
Line 2 : 8 identical characters out of 25 => 8/25 = 32 %, which is > 30 %
Line 3 : 18 identical characters out of 21 => 18/21 = 86 %, which is > 30 %
Line 4 : 32 identical characters out of 38 => 32/38 = 84 %, which is > 30 %
Line 5 : 18 identical characters out of 20 => 18/20 = 90 %, which is > 30 %
Line 6 : 0 identical characters out of 3 => 0/3 = 0 %, which is < 30 %
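
The arithmetic above can be checked with a few lines. This sketch takes the counts of identical characters as stated in the list (it does not reimplement the plugin's diff) and uses the default threshold of 30 for Min line resemblance to mark as changed. The name classify is hypothetical, and treating exactly 30 % as "changed" is an assumption.

```python
# (identical characters, total characters) per line pair, as counted above
COUNTS = {1: (8, 8), 2: (8, 25), 3: (18, 21), 4: (32, 38), 5: (18, 20), 6: (0, 3)}
THRESHOLD = 30  # default "Min line resemblance to mark as changed (%)"

def classify(identical, total, threshold=THRESHOLD):
    """Classify a line pair from its share of identical characters."""
    if identical == total:
        return "identical"
    percent = 100 * identical / total
    # Assumption: a pair at exactly the threshold still counts as changed.
    return "changed" if percent >= threshold else "new/removed"

for line, (identical, total) in COUNTS.items():
    percent = round(100 * identical / total)
    print(f"Line {line}: {identical}/{total} = {percent} % -> {classify(identical, total)}")
```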

@georg-d, you can verify my assertion: if you add a blank line between each line in the two files, every line is displayed with the right highlighting !

In addition, let's imagine that you replace the i character of the word in, in line 2, with a digit, for instance. Then, 7 identical chars remain out of 25 => 7/25 = 28 %. As it is lower than 30, the lines 2 are, as expected, considered as new and removed lines !

However, when using your test files, without any blank line in between, it happens that lines 2 and 3 are considered as new/removed lines ?

So it seems, @pnedev, that, in this specific case, it breaks down the min line resemblance rule. I won't consider it as a bug but I prefer that you are aware of it ;-))

Now, @georg-d, your regex explanations are not exact in some cases. For instance, you said, in point 3, concerning the regex (?:^\d{3})(.*) :

1st parentheses form a non-capturing group thus not "consuming" any characters but just defining the rest of the regex shall only be considered if first 3 characters are decimals

No ! Of course, the part (?:^\d{3}) defines a non-capturing group but it does consume characters, as does the remainder of the regex ! The constructions that do not consume characters are called look-arounds. These are, mainly, the (?<=...), (?<!...), (?=...) and (?!...) forms.
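
The difference between consuming and non-consuming constructs shows up in the span of the match. Here is a minimal sketch that uses Python's re module as a stand-in for the plugin's engine; only the look-ahead form is used, because the look-behind forms are not accepted by ComparePlus.

```python
import re

line = "345 differs in number"

# Non-capturing group: nothing is captured, but the three digits are consumed.
consumed = re.match(r"(?:^\d{3})", line)
print(consumed.span())    # (0, 3)

# Look-ahead: only asserts that three digits follow; the match stays empty.
asserted = re.match(r"(?=\d{3})", line)
print(asserted.span())    # (0, 0)

# Non-capturing group followed by (.*): the match covers the whole line,
# while group 1 holds only the text after the digits.
whole = re.match(r"(?:^\d{3})(.*)", line)
print(whole.span(), repr(whole.group(1)))    # (0, 21) ' differs in number'
```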

So, the end of your sentence is correct if we consider either the (?:^\d{3})\K(.*) or the ^\d{3}\K(.*) regexes, which, indeed, grab the rest of the line ONLY IF the first three characters of the line are digits. Unfortunately, this syntax is not allowed by the ECMAScript regex implementation of the ComparePlus plugin :-((

See the solution, below !

Now, let's consider @georg-d's test files with blank lines in between ( my version )

File 1 is :

123 same

234 differs in the texts!

345 differs in number

456 differs in text + number, 1st file

56 differs in number

foo

And file 2 is :

123 same

234 changes in characters

648 differs in number

789 differs in text + number, 2nd file

9 differs in number

bar

Starting with these two texts :

In order to ignore all the leading digits, during comparison, enter the simple regex ^\d{1,3} in the Line portions ignore regex zone and click on the Enable button. Thus, we simply consider the text after the leading digits, for the comparison :
- The even non-blank lines are highlighted. As most of the characters are completely different in the second non-blank line, it is logically seen as new / removed lines
- All odd non-blank lines, which have the same text after the leading digits, are not taken into account !
In order to ignore all text after the leading digits, during comparison, enter the regex [^\d\r\n].*$ in the Line portions ignore regex zone and click on the Enable button. Thus, this time, we simply consider the leading digits, for the comparison :
- As the digits are almost/completely different between the two files, the 3 lines, above the foo/bar line, are logically considered as new / removed lines !
- In all the other lines, the leading digits are identical between the two files, even in the last line foo / bar ( no digits at all in either file ! ). Thus, these lines are logically not highlighted at all !
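
Both regexes can be tried on the test lines with a small model. This is a sketch under the assumption that the plugin removes every regex match from a line before comparing. The second regex is read here with .* before the $. The helper differing_lines is hypothetical, and Python's re module stands in for the plugin's engine. The blank lines are left out because they do not change the outcome of this per-line model.

```python
import re

FILE1 = [
    "123 same",
    "234 differs in the texts!",
    "345 differs in number",
    "456 differs in text + number, 1st file",
    "56 differs in number",
    "foo",
]
FILE2 = [
    "123 same",
    "234 changes in characters",
    "648 differs in number",
    "789 differs in text + number, 2nd file",
    "9 differs in number",
    "bar",
]

def differing_lines(ignore_regex):
    """Return the 1-based numbers of the line pairs that still differ
    once every match of `ignore_regex` has been removed from both lines."""
    pattern = re.compile(ignore_regex)
    return [
        number
        for number, (a, b) in enumerate(zip(FILE1, FILE2), start=1)
        if pattern.sub("", a) != pattern.sub("", b)
    ]

print(differing_lines(r"^\d{1,3}"))      # compares only the text after the digits
print(differing_lines(r"[^\d\r\n].*$"))  # compares only the leading digits
```

In this model the first regex leaves lines 2, 4 and 6 different and the second leaves lines 3, 4 and 5 different, which agrees with the highlighting described above.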

Now, @pnedev, I suppose, that the exclude regex region feature could easily be changed into an include regex region part :

The Ignore Regex... option, in Plugins > ComparePlus, would be renamed Ignore Regex Regions in Lines...
The title of the dialog would be also changed as Ignore Regex Regions in Lines...
The dialog would have two fields, with a radio button management :
- Regex Regions to Exclude or Regex Regions to be Excluded from the Comparison
- Regex Regions to Include or Regex Regions to be Included in the Comparison

And, of course, if Regex Regions to Include were selected, with an appropriate regex, you would invert the present logic of ComparePlus and consider that the regex matches should NOT be ignored !

The Enable button would, as before, validate one of the two new regex limitations for comparison
The Disable button would, as before, ignore any regex limitations and use the default all-line contents for comparison

@pnedev, I also suppose that this Ignore Regex Regions in Lines... step must be done BEFORE verifying the other restrictions regarding the Ignore Spaces, Ignore Empty Lines and Ignore Case options ?

Note that this would lead to strange situations in some cases : for instance, if you tick the Ignore Empty Lines option and simultaneously choose the regex ^.+ in the Line portions ignore regex zone. This means that you want to ignore both empty lines and non-empty lines ! Thus, you get the file ... and ... match message, which is the logical result !
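
This interaction can be pictured with a small model. It is a sketch only, assuming (as supposed above) that the regex step runs first and that Ignore Empty Lines is then applied to what remains; visible_lines is a hypothetical helper, not plugin code, and Python's re module stands in for the plugin's engine.

```python
import re

def visible_lines(lines, ignore_regex, ignore_empty_lines=True):
    """Strip regex matches first, then optionally drop lines that became empty."""
    stripped = [re.sub(ignore_regex, "", line) for line in lines]
    if ignore_empty_lines:
        stripped = [line for line in stripped if line != ""]
    return stripped

file1 = ["123 same", "", "foo"]
file2 = ["789 other", "", "bar"]

# ^.+ matches every non-empty line completely, so nothing is left to compare.
print(visible_lines(file1, r"^.+"))
print(visible_lines(file1, r"^.+") == visible_lines(file2, r"^.+"))
```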

Best Regards,

guy038

Reminders :

There is no need to use any line-end char ( \r\n, \n, \r ), nor the \R syntax, which is not allowed, because the ComparePlus plugin works on a line-by-line basis
If anchors are needed, use the common ^ and $ boundaries
With the ECMAScript regex implementation of the ComparePlus plugin, note that the following features are not allowed :
- The look-behind assertions like (?<=....) and (?<!....)
- The \K feature
- The atomic groups, like (....)*+, (....)++ and (....)?+
- The conditional structures, such as (?1....) or (?2....:....)
- The backtrack control verbs, such as (*FAIL), (*SKIP), (*COMMIT), (*PRUNE), ...

However, in most cases, it's easy enough to find out an alternative to these limitations !

pnedev commented 1 year ago

Hello @guy038 ,

Thank you very much for your thorough analysis and regex help and for the suggestions too!

I'll fully try your examples later but let me add a few comments now:

So it seems, @pnedev, that, in this specific case, it breaks down the min line resemblance rule. I won't consider it as a bug but I prefer that you are aware of it ;-))

Yes, the min line resemblance rule is sometimes not that strict, I am aware of it and as you mentioned it is not a bug actually. The min resemblance rule in some cases is exact and in others is more an approximation or a guide (at least that is what it looks like to the user). Thank you for pointing that out.

Now, @pnedev, I suppose, that the exclude regex region feature could easily be changed into an include regex region part :

I would prefer to avoid such long entries in the plugins menu as the suggested Ignore Regex Regions in Lines... but that's why for user information I have named the ignore regex dialog entry Line portions ignore regex. Don't you think it is informative enough?

About adding the invert option (consider only the regex matching ranges in the line comparisons) - I would prefer to stick to the ignore option only for consistency on one hand and also because that will complicate the ignore logic on the other. I will consider adding it if I receive enough user feedback that it is needed just to justify the effort. Thank you for the suggestion anyway - if I am to implement it I would surely do something like that.

@pnedev, I also suppose that this Ignore Regex Regions in Lines... step must be done BEFORE verifying the other restrictions regarding the Ignore Spaces, Ignore Empty Lines and Ignore Case options ?

Yes, that's right. And on some occasions it might lead to strange/unexpected behavior (as in your example perhaps).

Thanks once again for the help :+1:

Happy holidays!

pnedev / comparePlus

Ignore Regex: Neither finding a suiting regex nor an "invert" option #341