Can't replace with regex containing lookaround that matches across lines

WORMSS commented 3 years ago

Issue Type: Bug

Find:

import type \{ (\w+) \}

Replace

import { $1 }

Does what you expect, and turns import type { Test } into import { Test } but if you use the regex which includes back references

import type \{ (\w+) \}(?=[\s\S]*(new \1\(|\b\1\.|instanceof \1\b|\b\1\())

And replace with the same

import { $1 }

import type { Test } into import { $1 }

I cannot seem to find a way to cheese the system

Test example

import { A } from './A';
import type { B } from './B';
import { C } from './C';
import type { D } from './D';
import { E } from './E';
import type { F } from './F';
import { G } from './G';
import type { H } from './H';

B();
D.hello();
instanceof F()

In-file find and replace works perfectly fine. 👍

But the 'find in files' one breaks 👎

VS Code version: Code 1.57.1 (507ce72a4466fbb27b715c3722558bb15afa9f48, 2021-06-17T13:28:07.755Z) OS version: Windows_NT x64 10.0.19043 Restricted Mode: No

System Info

|Item|Value| |---|---| |CPUs|Intel(R) Xeon(R) CPU E5-1620 v3 @ 3.50GHz (8 x 3492)| |GPU Status|2d_canvas: unavailable_software
gpu_compositing: disabled_software
multiple_raster_threads: enabled_on
oop_rasterization: disabled_off
opengl: disabled_off
rasterization: disabled_software
skia_renderer: enabled_on
video_decode: disabled_software
vulkan: disabled_off
webgl: unavailable_software
webgl2: unavailable_software| |Load (avg)|undefined| |Memory (System)|15.92GB (1.88GB free)| |Process Argv|--crash-reporter-id 21d722df-78c7-4ac1-bbfa-5b7156602bd2| |Screen Reader|no| |VM|0%|

Extensions (32)

Extension|Author (truncated)|Version ---|---|--- alloy-vscode|cmc|0.0.1 ascii-tree-generator|apr|1.2.4 markdown-preview-github-styles|bie|0.2.0 npm-intellisense|chr|1.3.1 bracket-pair-colorizer|Coe|1.0.61 vscode-svgviewer|css|2.0.0 vue-peek|dar|1.0.2 vscode-eslint|dba|2.1.23 java-decompiler|dgi|0.0.2 vscode-ts-auto-return-type|ebr|1.1.0 RunOnSave|eme|0.2.0 prettier-vscode|esb|8.0.1 todo-tree|Gru|0.0.213 jbockle-format-files|jbo|3.3.0 vscode-js-annotations|lan|0.11.0 node-module-intellisense|lei|1.5.0 camelcasenavigation|map|1.1.3 rainbow-csv|mec|1.9.1 vscode-typescript-tslint-plugin|ms-|1.3.3 node-modules-resolve|nau|1.0.2 uuid-generator|net|0.0.4 printcode|nob|3.0.0 vetur|oct|0.34.1 vscode-versionlens|pfl|1.0.9 java|red|0.80.0 vscode-xml|red|0.17.0 vscode-sort-json|ric|1.20.0 gitconfig|sid|2.0.1 vscode-stylelint|sty|0.86.0 open-in-browser|tec|2.0.0 sort-lines|Tyr|1.9.0 vscode-icons|vsc|11.5.0

A/B Experiments

``` vsliv368:30146709 vsreu685:30147344 python383cf:30185419 pythonvspyt602:30300191 vspor879:30202332 vspor708:30202333 vspor363:30204092 pythonvspyt639:30300192 pythontb:30283811 pythonvspyt551:30311712 vspre833:30321513 pythonptprofiler:30281270 vshan820:30294714 pythondataviewer:30285071 vscus158:30321503 pythonvsuse255:30323308 vscorehov:30309549 vscod805:30301674 pythonvspyt200:30329803 binariesv615:30325510 vsccppwtct:30329789 ```

vscodebot[bot] commented 3 years ago

(Experimental duplicate detection) Thanks for submitting this issue. Please also check if it is already covered by an existing one, like:

roblourens commented 3 years ago

I don't even follow how that regex is supposed to match. A simpler example works fine for me. Please check your regex

WORMSS commented 3 years ago

I have already shown the regex works above. It works fine in "in-file" search and replace But global it doesn't. Hence the bug.

WORMSS commented 3 years ago

I've removed the new \1\( part as it was extraneous and fulfilled by \b\1\( but the problem still remains.

import type            // Match the characters “import type ” literally
\{                    // Match the character “{” literally
                       // Match the character “ ” literally
(                      // Match the regular expression below and capture its match into backreference number 1
   \w                 // Match a single character that is a “word character” (letters, digits, and underscores)
      +                // Between one and unlimited times, as many times as possible, giving back as needed (greedy)
)
                       // Match the character “ ” literally
\}                    // Match the character “}” literally
(?=                    // Assert that the regex below can be matched, starting at this position (positive lookahead)
   [\s\S]            // Match a single character present in the list below
                         // A whitespace character (spaces, tabs, and line breaks)
                         // Any character that is NOT a whitespace character
      *                // Between zero and unlimited times, as many times as possible, giving back as needed (greedy)
   (                   // Match the regular expression below and capture its match into backreference number 2
                         // Match either the regular expression below (attempting the next alternative only if this one fails)
         \b           // Assert position at a word boundary
         \1           // Match the same text as most recently matched by capturing group number 1
         \.           // Match the character “.” literally
      |                // Or match regular expression number 2 below (attempting the next alternative only if this one fails)
         instanceof    // Match the characters “instanceof ” literally
         \1           // Match the same text as most recently matched by capturing group number 1
         \b           // Assert position at a word boundary
      |                // Or match regular expression number 3 below (the entire group fails if this one fails to match)
         \b           // Assert position at a word boundary
         \1           // Match the same text as most recently matched by capturing group number 1
         \(           // Match the character “(” literally
   )
)

WORMSS commented 3 years ago

A simpler example works fine for me. Please check your regex

Your regex doesn't even contain a positive lookahead. Please try to make your "simpler example" more complex

roblourens commented 3 years ago

The problem is that rg only gives us the lines that actually matched, not lines that match with a lookahead, so we can't compute the replace with the group reference just on the preview.

I can't find an issue, but I think we've discussed it before. We would have to either open the file to do this with the full file contents, or strip the lookahead from the regex (probably not safe)

jrittenh commented 2 years ago

I believe I've run into the same issue.

Regex: (?<=\n)((!+-?){2,}\n)(?!\n)

Should match:

!!!!!!!-!!!!-!!!!!!!!!!-!-!!!!!!!!
Thisisa samp lelinethat w illmatch

But it should not match:

!!!!!!!-!!!!-!!!!!!!!!!-!-!!!!!!!!!!!

Thisisa samp lelinethat w illnotmatch

Nor should it match:

! This is text
!
! this is more text

My replacement ($1\n) is doing a literal replace instead of replacing with the capture group.

maxsu commented 2 years ago

We would have to either open the file to do this with the full file contents, or strip the lookahead from the regex (probably not safe)

@roblourens stripping the lookahead and re-matching against the match string would work in most cases, but here's a small example where it would break:

The regex /(.*)(?=.$)o?/ matches the hello part of strings hello1 and hello. However, match group $1 is hello for the first string, but hell for the second string. Stripping the lookahead gives us the regex /(.*)o?/, where $1 = hell for both match strings.

We could analyze the regex for this form of context sensitivity, but that sounds complex and still fails in corner cases.

I would like to investigate your other suggestion - interpret the file contents for each match string.

Can you point me to the appropriate method? I'm looking for the method that receives the match string from ripgrep, and would populate the match groups if it could, but instead defaults the match groups to the literal strings'$1', '$2', '$3', etc.

roblourens commented 2 years ago

This would not be a small refactoring, and I don't think I want to take a PR for this, thanks though.

maxsu commented 2 years ago

@roblourens, I hear you and now agree with you - reparsing the file is unwieldy and would not be worth the debugging pain.

A proposal - if I find a way to use the ripgrep --json submatch info, could we move forward with this issue? This would be a much easier and smaller PR.

I'm familiar with the vscode ripgrep search engine, but not how it integrates into file replace. Could you please point me to some relevant code?

roblourens commented 2 years ago

I think that might have the info we need, but using that instead would not be a small refactoring either, especially because it will touch extension API which complicates the change.

maxsu commented 2 years ago

@roblourens can you sketch what you'd like the solution to be like?