microsoft / vscode

Visual Studio Code
https://code.visualstudio.com
MIT License
160.47k stars 28.1k forks source link

Global regex search with "Not matching character" doesn't match newline #75265

Open WORMSS opened 5 years ago

WORMSS commented 5 years ago

Issue Type: Bug

I was trying to search for

Promise\.all\(\[[^\]]*await[^\]]*\]\)

but I had no way to turn on multiline support so that ^\] would match newlines too. So searching for below would not be found.

return await Promise.all([
  await this.someFunction1(),
  await this.someFunction2(),
]);

I later put in the below and was happy to find that it suddely worked, but knew that it would only find if await was on the first line after the bracket, and also if there was no chars other than newline space or tab.

Promise\.all\(\[[^\]]*[\n\r \t]await[^\]]*\]\)

But, I then started to find others that actually SHOULDN'T have worked.. It was at this point, I realsed that ^\] was working as expected the first time, and was matching on newline chars.. (Perfect, I removed my tab/space hack).. To find, it stopped working again..

It seems I need atleast 1 \n in the regex to make it work.

My quick fix was to add this at the very end as such

Promise\.all\(\[[^\]]*await[^\]]*\]\)\n{0}

But this is obviously a dirty hack and not an obvious one, be far better to test for Not Matching and turn multiline back on, similar to what ever test you are doing for searching for \n and turning on multiline

VS Code version: Code 1.35.0 (553cfb2c2205db5f15f3ee8395bbd5cf066d357d, 2019-06-04T01:17:12.481Z) OS version: Windows_NT x64 10.0.10240

System Info |Item|Value| |---|---| |CPUs|Intel(R) Xeon(R) CPU E5-1620 v3 @ 3.50GHz (8 x 3492)| |GPU Status|2d_canvas: enabled
checker_imaging: disabled_off
flash_3d: enabled
flash_stage3d: enabled
flash_stage3d_baseline: enabled
gpu_compositing: enabled
multiple_raster_threads: enabled_on
native_gpu_memory_buffers: disabled_software
rasterization: enabled
surface_synchronization: enabled_on
video_decode: enabled
webgl: enabled
webgl2: enabled| |Load (avg)|undefined| |Memory (System)|15.92GB (5.33GB free)| |Process Argv|| |Screen Reader|no| |VM|0%|
Extensions (19) Extension|Author (truncated)|Version ---|---|--- markdown-preview-github-styles|bie|0.1.6 npm-intellisense|chr|1.3.0 ssh|chr|0.0.4 vue-peek|dar|1.0.2 vscode-eslint|dba|1.9.0 vscode-ts-auto-return-type|ebr|1.0.1 tslint|eg2|1.0.43 RunOnSave|eme|0.0.18 prettier-vscode|esb|1.9.0 todo-tree|Gru|0.0.134 node-module-intellisense|lei|1.5.0 camelcasenavigation|map|1.0.1 rainbow-csv|mec|1.1.1 node-modules-resolve|nau|1.0.2 uuid-generator|net|0.0.4 vetur|oct|0.21.0 gitconfig|sid|2.0.0 open-in-browser|tec|2.0.0 sort-lines|Tyr|1.8.0
roblourens commented 5 years ago

Yeah, currently multiline mode is only turned on when a literal newline is in the regex, for perf.

WORMSS commented 5 years ago

I think the inability to force multiline string or not accounting for for all the possibilities for needing it turned on should be reviewed.

Newline is indeed not a literal bracket, so should be matched. And multiline needs to be on for that. Even if it's just showing an extra button on the right, nest to the regex search" when someone does a not match [^ only. You could even just silently bolt on \n{0} at the end internally after the input value is read. You wouldn't need to do any additional search code alterations. All edits would be at the UI component level, and multiline can be turned on by the user only if needed to keep perf by default. Though, searching for only Not Matching doesn't help those who want . to match newline also.

roblourens commented 5 years ago

The main reason we haven't done this is that we want to keep the number of possible switches under control, but it's a fair request.

dwelle commented 4 years ago

Wanted to create a new issue that \s does not match newlines, but found this.

@roblourens Yeah, currently multiline mode is only turned on when a literal newline is in the regex, for perf.

How about better newline-regex detection? Something like this:

const { assert } = require("chai");

// Just came up with this, thus requires more tests. Doesn't match some edge cases.
// Requires a variable-length negative-lookbehind support, which is supported by latest ES
const IS_MULTILINE = /(?<!(?<!\\)\[[^\]]*)(?<!\\)\\[ns]|(?<!\\)\[\^(.(?<!(?<!\\)\\[ns]))*(?<!\\)\]|(?<!\\)\[(?!\^)[^\]]*(?<!\\)\\[ns]/;

assert.notOk(IS_MULTILINE.test(/[^\n]/));
assert.notOk(IS_MULTILINE.test(/[^\s]/));
assert.notOk(IS_MULTILINE.test(/\[^a]/));
assert.notOk(IS_MULTILINE.test(/[\\s]/));

assert.ok(IS_MULTILINE.test(/[^\\s]/));
assert.ok(IS_MULTILINE.test(/[^\\n]/));
assert.ok(IS_MULTILINE.test(/[^s]/));
assert.ok(IS_MULTILINE.test(/[\s]/));
assert.ok(IS_MULTILINE.test(/\[\s/));
assert.ok(IS_MULTILINE.test(/[^\s]\s\]/));
assert.ok(IS_MULTILINE.test(/\[\s\]/));
WORMSS commented 4 years ago

I am assuming in this test below, s is just a substitute of any writeable char

assert.ok(IS_MULTILINE.test(/[^s]/));
dwelle commented 4 years ago

@WORMSS in this case it's anything other than literal s, which for this purpose seems an irrelevant test case which I must have mistakenly kept in the test suite when I was fuzzing all manner of cases.

WORMSS commented 4 years ago

image

Regex Buddy confirms. Though, I still had to do my hack to get it to match newline as a replaceable char. (ignoring the fact its treating \n\r as a single char. image

Windows obviously.

dwelle commented 4 years ago

I'm probably missing the point. /[^s]/ should match newline, hence:

assert.ok(IS_MULTILINE.test(/[^s]/));

is passing.

`s\ns`.match(/[^s]/)
// ["↵", index: 1, input: "s↵s", groups: undefined]
WORMSS commented 4 years ago

Sorry, I read the sentence back and it makes little sense to myself.

Regex Buddy agrees would have been a better way for me to say it.

But I wanted to point out that RegexBuddy seems to count \n as a "not an s" even with multiline turned off. I don't know if that makes much of a difference.

And strangeness that \r\n is counted as a single char in vsc. (which I actually prefer to be honest)

dwelle commented 4 years ago

\n is "not an s". Am I still missing the point? What are you suggesting? Where do you see a problem in the IS_MULTILINE regex behavior?

WORMSS commented 4 years ago

I think there might be a bigger issue with "multiline" term in the first place if we are trying to agree with each other but seem to be missing the mark.

I've just re-re looked up my understanding multiline is for ^ and $ usage. It seems javascript agrees with me, that "not an s" should match new line by default and not need multilined turned on. image

s flag would be . matches newline.

I am guessing IS_MULTILINE is just a completely different type of allow multiline that is not directly related to a regex multiline?

dwelle commented 4 years ago

I see --- yes, we're using the term to mean different things. When I say multiline in this context, I'm talking about ripgrep --multiline switch, which is what vscode uses to enable a regexp to match across multiple lines. From ripgrep's manual:

-U, --multiline
    Enable matching across multiple lines.

    When multiline mode is enabled, ripgrep will lift the restriction that a match
    cannot include a line terminator. For example, when multiline mode is not
    enabled (the default), then the regex '\p{any}' will match any Unicode
    codepoint other than '\n'. Similarly, the regex '\n' is explicitly forbidden,
    and if you try to use it, ripgrep will return an error. However, when multiline
    mode is enabled, '\p{any}' will match any Unicode codepoint, including '\n',
    and regexes like '\n' are permitted.

    An important caveat is that multiline mode does not change the match semantics
    of '.'. Namely, in most regex matchers, a '.' will by default match any
    character other than '\n', and this is true in ripgrep as well. In order to
    make '.' match '\n', you must enable the "dot all" flag inside the regex.
    For example, both '(?s).' and '(?s:.)' have the same semantics, where '.' will
    match any character, including '\n'. Alternatively, the '--multiline-dotall'
    flag may be passed to make the "dot all" behavior the default. This flag only
    applies when multiline search is enabled.
WORMSS commented 4 years ago

Then 👍 Big thumbs up that [^s] should indeed turn that mode on.. Ok.. I am up to date and agreeing. Good job everyone.. take a 5 minute break and enjoy your evening.

roblourens commented 4 years ago

It seems like you are just saying that "if a regex can match \n then the multiline flag should be turned on". Isn't this the same as having the multiline flag turned on all the time?

Anyway to me the whole point is that the majority of regexes are not intended to match multiple lines, it's quite common to include \s when you really are looking for spaces and tabs within a line. And including \s should not blow up your search results by matching a bunch of \r and \n when you did not explicitly ask for it.

This heuristic of using \n is my alternative to actually having a button or checkbox to enable multiline mode.

WORMSS commented 4 years ago

I explicitly asked it to match any char that was not a ] To the best of my knowledge \n and \r are not ] so therefore should have been matched. There is no visible indication anywhere that multiline is turned on or off. It is just some magic that goes on behind the scenes. Only way to get consistent results would be to add \n{0} to every regex I do.

You may feel that \s should match only spaces and tabs but can you guarantee that every single user of VSCode would (1) know that that is indeed what is going on, and (2) know how to bypass the limitation if they needed to?

roblourens commented 4 years ago

All I can do is what I think is right for the majority of users and hope that the rest can google their way to a solution. Because of the perf impact and possibility for unexpected behavior I think multiline search should always be an explicit opt in. My guess is that we would add a button for this some day but it will take careful UX thinking because we have too many options crammed into a small space already.

dwelle commented 4 years ago

This heuristic of using \n is my alternative to actually having a button or checkbox to enable multiline mode.

IMO it's much better to have slower search than to potentially cause bugs (among other things), e.g. when someone refactors code and the search doesn't match all it should, in which case you better hope your CI catches it.

Nothing worse than a regex engine that doesn't comply to spec in such a glaring way as redefining \s to exclude \n, which is pretty unintuitive. I'd at least show a warning message when the regex contains \s.

Also, having an option to opt-in for \s to include \n seems reasonable to me.

WORMSS commented 4 years ago

That doesn't help in my case when I am not searching for \s but searching for not ]

dwelle commented 4 years ago

That doesn't help in my case when I am not searching for \s but searching for not ]

True. I'd say your case is even more insidious, because it contains no whitespace characters in any way.

More generally, how should a user know that in vscode alone, you need to write (?:[^\]]|\n) instead of [^\]]?

Problem is, the above solution doesn't generalize because apparently you can't use (?:a|\n) in negative-lookbehind because the vscode regex parser thinks it's variable length (which it isn't -- works in rg).

We should definitely tell user in some way that vscode search redefines how newline matching works.

Mitko-Kerezov commented 1 year ago

Just ran into the same issue with the following use case: I have a large JSON file in the following format:

[
    {
        "key": []
    },
    {
        "key": []
    },
    ...
    {
        "key": [
            1
        ]
    },
    ...
    {
        "key": []
    },
    {
        "key": []
    },
]

I wanted to find the key which contained a non-empty array. Naturally I tried to search for "key": \[[^\]] in the file but was rewarded with 0 matches. Luckily I thought of searching for "key": (X matches) and then for "key": \[\] (X-1 matches) and figured out there's newline shenanigans. Took me the good part of an hour though. Would definitely like to see an option to enable multiline.

jayu commented 1 year ago

Hey mates @WORMSS Why not using structural search for your case?

I've created an extension that make it super easy to search such cases without having to fight with regexes!

Check CodeQue

The query would as simple as that:

Promise.all([
  await $$$(),  // $$$ is any statement wildcard
]);

I'm using next.js repo as test case here

https://user-images.githubusercontent.com/11561585/204633410-d022d47e-4219-4d3a-8d3a-654ad0f7b31d.mp4


@Mitko-Kerezov in your case, the query could bee like this

({
   "key": [$$$]
 })

You can give it a try in CodeQue Playground

WORMSS commented 1 year ago

I just use \n{0} a the end of all my searches.. been working fine so far..

krisutofu commented 5 months ago

How VSCode developers dared to violate the common standards among regex flavours? Any regex engine (except for the old shell command sed) finds newline when I search for whitespace \s or negative character classes which don't include a newline. This is the definition.

negated character class reference not matching newline in negated character classes is not a comparison criterion among important flavours

My problem, I want to search for multiline comments. Any javascript engine would match multiple lines with this regex: /\/\*[^*]*\*\//g

Only in VSCode, I need this: /\*(?:[^*]|\n)*\*/

(The \n{0} trick does not work for me.) It is less convenient to type.

In this case, /\*[.\n]*?\*/ would be even better (and works for VSCode and JavaScript) but it has different semantics in the general case.

It's a common advice to use [\s\S] for matching all characters but it doesn't work with VSCode.

VSCode's regex interpretation is unintuitive regarding newlines because no other relevant regex engine behaves this way. Most languages consist of multi-line structures where users will need to search for multi-line structures at some time (when matching code pieces).

Usability is always more important than program speed. Because usability is the bottle neck. In the opposite case, the work that the code would do is imposed on the user who is slower than the code. Sometimes, VSCode cannot even find multiline matches in the way I described above. (I found a case that led to catastrophic backtracking in VSCode.) Then I copy the text to another editor or to the browser to find/replace with their regex engine.

If VSCode cannot do it "because of the perf impact" then it is not the newline to blame but the implementation of VSCode. I never had any speed problems with other regex engines which allow me to match many lines at once. The actual performance impact is catastrophic backtracking which is easy to bypass with a good regex style.