regular match still error

YLXDXX commented 2 years ago

Environment

TeXstudio: git 4.3.0beta2-23-g0ad5bc7b4
Qt: 5.15.2
OS: Debian 11

Expected behavior

在 #2448 中修复了正则表达式中非贪婪匹配的问题，但引入了贪婪匹配的问题。

对于字符串123(666)009(hi)，在自定义的宏（Macros）中设置触发条件（Trigger），正则表达式为

\((.+?)\)\t

匹配到的字符串为(hi)，这个没有问题。但是将正则表达式改为

\((.+)\)\t

匹配到的字符串依然为(hi)，出现了问题，正确的匹配结果应为(666)009(hi)。

这里，通配符+，没有尽可能的匹配最多，反而是匹配最少。

Goole Translate

Fixed non-greedy matching in regular expressions in #2448, but introduced greedy matching.

For the string 123(666)009(hi), set the trigger condition (Trigger) in the custom macro (Macros), the regular expression is

\((.+?)\)\t

The matched string is (hi), which is no problem. But change the regex to

\((.+)\)\t

The matched string is still (hi), there is a problem, the correct matching result should be (666)009(hi).

Here, the wildcard + does not match as much as possible, but matches the least.

sunderme commented 2 years ago

that will not work as greedy/non-greedy is defined from the left-hand side. As usually the trigger is more intended to be used from the right hand side, the code looks for the last possible match on the text line until the cursor. So, this will not be fixed (unless someone provides some code)

schtandard commented 2 years ago

I don't have a build environment set up so I didn't test it myself but I believe what OP is saying is that the + operator is now always non-greedy, which is indeed wrong. It should be greedy unless followed by ?.

However, I think that this whole fix might have been a misunderstanding. As the regex engine is not particular to TeXstudio but rather a part of Qt, the issue should be raised there. Then again, this shortcoming is well documented in their documentation, so I don't think they will consider it a bug.

Non-greedy matching cannot be applied to individual quantifiers

It goes on to say that an entire regex can be made non-greedy using setMinimal(), so maybe an additional switch to the search dialogue exposing that functionality may be worthwile. (It would at least work for the OP's regex.)

sunderme commented 2 years ago

this is not really an issue of greedy/non-greedy. greediness applies for for regex starting from the left hand side of a text line. the outer brackets are always matched first with any regex engine, greedy or non-greedy because the regex engine always starts from the left.

schtandard commented 2 years ago

Exactly. So the regex $(.+)$\t used on the text 123(666)009(hi) should match (666)009(hi), right? But OP reports that it only matches (hi) (after the #2448 fix, in the current release it behaves as expected).

(I assume that every text here has an omitted tab at the end, since otherwise the regex shouldn't match at all.)

sunderme commented 2 years ago

yes, because OP complained previously that he wanted to match (hi). Greediness of regex plays no role which of the two variants is matched, it is matched to (hi) now, as it makes probably more sense in the context of trigger detection (the current cursor being at the end of the line)

schtandard commented 2 years ago

I feel like I'm missing something here. I'll try to clarify with an examples. Consider the file

before
(xxx)yyy(zzz)
after

with the cursor at the start of the second line. Let's now search this using the regex $.+$.

With a greedy + this will match (xxx)yyy(zzz) and then continue searching in after.
With a lazy + it will match (xxx), then (zzz) and then continue searching in after.
If I understand correctly, the "fix" will lead to only (zzz) being matched before searching continues in after.

Note that, had the cursor been in the middle of the second line, (zzz) would be the first match in all three cases.

Now, sometimes the greedy behavior may not be what the user wants, but sometimes it is. More to the point though, it is the well documented behavior of regular expressions. Not only +, other quantifiers like * and ? also behave greedily by default. Breaking that seems like a very bad idea to me. (I am suprised that it is even being considered.)

I also think that OP wasn't complaining that + wasn't lazy, they were complaining that +? wasn't. That is the lazy variant of + but it is not supported by Qt.

Again, maybe I'm missing something super obvious here, maybe I'm misinterpreting the change. But if this really just truncates the left edge of every match as far as possible (which is not even the same as laziness, but that doesn't really matter now, I think), it will break a lot of regexes in unexpected ways. For example, if I want to remove all non-empty comments from my file by searching for #.+, it will only partially remove comments that contain a #.

antshar commented 2 years ago

I completely agree that it should definitely be fixed. @sunderme was the last explanation from @schtandard clear enough for you now?

sunderme commented 2 years ago

apparently it is not really understood why this can't be fixed. Maybe a complete example may help for this. You probably should try the example in an regexp engine to understand it, e.g. here https://regexr.com. The context is macro trigger. The string presented to the trigger is all text left of the cursor and it needs to match until string end,i.e. $ is automatically added to the regexp. case 1) greedy regexp: $(.+)$\t$ test string: (abc)def(ghi)\t match: (abc)def(ghi)\t

case 2) non-greedy regexp: $(.+?)$\t$ test string: (abc)def(ghi)\t match: (abc)def(ghi)\t

If you find a regexp which can deliver (ghi)\t or (abc)def(ghi)\t depending on the regexp string, make a proposal. Greediness does not change the match.

txs approach: apply regexp several times with start offset and take the last valid result as it is way more likely that you want to trigger on (ghi)\t or similar.

schtandard commented 2 years ago

Ok, this context was not clear to me and makes this a bit less problematic. However, I would suggest using the regex $[^()]+$\t instead. If I am not mistaken, it will only fail when there are nested or unbalanced parentheses, which is the same restriction as for the alternative approach.

sophie-sydney commented 1 year ago

I think this should be escalated to a bug.

The imposition of non-greedy regular expressions impacts the ability to meaningfully use the macro scripts function to capture text. This includes any regular expressions using +, * or {2,}, as well as any of the type <?=, which together seems to really limit the users' ability to create macros!

The error has only started since I updated to the latest version of texstudio, a few versions back it was working fine. I'm not sure what's changed here?

I've provided a simple worked example below, for a simple autocorrect function, showing regular expressions being automatically non-greedy instead of greedy. The below should print that it has found the preceding word after pressing '|', but instead returns the preceding character!

trigger:([A-Za-z]+)\|

%script
editor.write("Found '" + triggerMatches[1] +"'")

PS: One hint that may be useful if someone is investigating this issue. I've noticed is that there's been a shift in how regular expressions pick up empty captures. Whereas before triggerMatches would return an empty string "", it now picks up a null variable.

For instance for the trigger "h(i?)p", when "hp" is written, we will have that triggerMatches[1]==null will return true rather than triggerMatches[1]=="" returning true, which is what happened previously.

sunderme commented 1 year ago

I have explained the issue several times. This is not a bug but a logical issue. Please read carefully the explanation.

schtandard commented 1 year ago

@sunderme You are correct that greediness is not the issue here. However, the current behavior is clearly confusing some users. (Even OP's original request (#2448 and here) was based on a misunderstanding, though not due to the current behavior.)

If you find a regexp which can deliver (ghi)\t or (abc)def(ghi)\t depending on the regexp string, make a proposal.

Wouldn't my suggestion of $[^()]+$\t for (ghi)\t and $.+$\t for (abc)def(ghi)\t do this? (Given the more intuitive operation of just matching from the start of the line and not the current approach of taking the last possible starting position for a match.) I feel like this would have been the "correct" response to #2448 (i.e. "Greediness is not the issue, use a different regex.").

If you want to rather keep the current last-possible-match behavior, it would be helpful to document it in the help popup to the Trigger field. After all, one can work around this by using a different regex in most cases (e.g. \b(\w+)\| for @sophie-sydney) but one has to know what one is working around. (The online documentation could probably also use a clarification, as it is unclear if "the last written characters" includes as many or as few as possible. Most people will click on the question mark in the dialog before looking online, though.)

schtandard commented 1 year ago

I had a thought: What might be more intuitive (in the behavior, not in the syntax) and probably even computationally cheaper would be to match the line up to the cursor in reverse (i.e. if one types abcde the regex sees edcba), with a ^ automatically prepended to the regex. This would of course require the user to input the pattern in reverse, but it would do the "matching from the right" and the engine could stop looking for a match as soon as the first character (i.e. the last one before the cursor) does not match instead of trying every position in the line.

Don't really know if it is worth implementing, probably only if the computational benefits are substantial. Would have to be opt-in in order not to break existing triggers. Matches would have to be reversed before serving them to the macro code. A switch to use the non-greedy quantifiers would then (this time really) do what OP was expecting: \t$.+$ would match (abc)def(ghi)\t with greedy + and (ghi)\t with lazy +.

sunderme commented 1 year ago

that sounds like a bit too much mental gymnastics for most users, though it would solve the problem. Unless it can't be done automatically , it is not worth implementing.

texstudio-org / texstudio