golang / go

The Go programming language
https://go.dev
BSD 3-Clause "New" or "Revised" License

regexp: no way to replace submatches with a function #5690

Open gopherbot opened 11 years ago

gopherbot commented 11 years ago

by denys.seguret:

ReplaceAllStringFunc is useful when you need to process the match to compute the
replacement, but sometimes you need to match a bigger string than the one you want to
replace. A similar function able to replace submatch(es) seems necessary.

Let's say you have strings like

    input := `bla b:foo="hop" blabla b:bar="hu?"`

and you want to replace the part between quotes in b:foo="hop" and
b:bar="hu?" using a function.

It's easy to build a regular expression to get the match and submatch, for example

    r := regexp.MustCompile(`\bb:\w+="([^"]+)"`)

but when you use ReplaceAllStringFunc, the callback is only provided the whole match,
not the submatch, and must return the whole string. Practically this means you need to
execute the regexp (or another one) in the callback, for example like this:

        input := `bla bla b:foo="hop" blablabla b:bar="hu?"`
        r := regexp.MustCompile(`(\bb:\w+=")([^"]+)`)
        fmt.Println(r.ReplaceAllStringFunc(input, func(m string) string {
                parts := r.FindStringSubmatch(m)
                return parts[1] + complexFunc(parts[2])
        }))

I think a function ReplaceAllStringSubmatchFunc would be useful and would avoid the
second pass. The callback would receive the submatch and return the replacement of the
submatch. The last example would be rewritten as

        input := `bla bla b:foo="hop" blablabla b:bar="hu?"`
        r := regexp.MustCompile(`\bb:\w+="([^"]+)"`)
        fmt.Println(r.ReplaceAllStringSubmatchFunc(input, complexFunc))

A similar function (ReplaceAllStringSubmatchSliceFunc ?) could be designed to give the
callback an array of strings that the callback would change. In fact it could be decided
that only this last function is really necessary.

Links :

 - "How-to" question on Stack-Overflow : http://stackoverflow.com/q/17065465/263525
 - Playground link : http://play.golang.org/p/I6Pg8OUeTj

robpike commented 11 years ago

Comment 1:

Labels changed: added priority-later, packagechange, removed priority-triage.

Owner changed to @rsc.

Status changed to Accepted.

rsc commented 11 years ago

Comment 3:

Labels changed: added go1.3maybe.

robpike commented 11 years ago

Comment 4:

Labels changed: removed go1.3maybe.

rsc commented 10 years ago

Comment 5:

Labels changed: added go1.3maybe.

rsc commented 10 years ago

Comment 6:

Labels changed: added release-none, removed go1.3maybe.

rsc commented 10 years ago

Comment 7:

Labels changed: added repo-main.

gopherbot commented 10 years ago

Comment 8:

CL https://golang.org/cl/106360043 mentions this issue.

gopherbot commented 10 years ago

Comment 9 by denys.seguret:

Small comment: the whole thing could be cleaner than what I initially proposed by
accepting a callback with submatches passed as variadic instead of an explicit array.

victorhooi commented 9 years ago

I just hit this issue as well. Does "Unplanned" mean this is unlikely to get worked on?

I'm also including some information on my use-case, in case that helps.

I'm trying to transform log lines containing key-value pairs, to redact any string values. So for example:

name: "Joe", last_name: "Bloggs", age: 5, nickname: "Jogs" }

might become:

name: "SOME_HASH", last_name: "SOME_HASH", age: 5, $comment: "do not redact me", nickname: "SOME_HASH" }

I only want to target quoted strings that are followed by either , (comma) or } (closing curly-braces), and I also want to ignore any $comment fields.

I know that Go's regexp doesn't have lookaheads/lookbehinds, which means I can't check for the above using those. That restricts me somewhat. However, I figured I'd just capture everything using a regex like this:

quoted_string_regex, _ := regexp.Compile(`(\$comment: )?"([^"]*)"[,| }]`)

and then check the actual subgroups to see if $comment was there, and also grab out the comma or curly-brace, and put that back on at the end.

However, I'm using ReplaceAllStringFunc, which only gives you the entire match, so it seems like I either need to do a second regex inside my callback function, or I need to do a bunch of contains/splits/ends-with etc.

(Obviously, if I've missed something obvious that is available in Go, please feel free to correct the above).
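One way to handle this use case in a single pass is to inspect the submatch indexes directly. The sketch below is an illustration under assumptions: the pattern is adapted from the one above (the literal `|` is dropped from the character class), `redactLine` and `redact` are hypothetical names, and `redact` returns the fixed placeholder `SOME_HASH` rather than a real hash.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// quoted adapts the pattern from the comment above: group 1 is the optional
// `$comment: ` prefix, group 2 is the string contents. The trailing class
// accepts a comma, a space, or a closing brace.
var quoted = regexp.MustCompile(`(\$comment: )?"([^"]*)"[, }]`)

// redact is a placeholder; a real version might return a hash of s.
func redact(s string) string { return "SOME_HASH" }

// redactLine replaces the contents of quoted strings, skipping $comment fields.
func redactLine(line string) string {
	var b strings.Builder
	last := 0
	for _, loc := range quoted.FindAllStringSubmatchIndex(line, -1) {
		if loc[2] >= 0 {
			continue // the $comment prefix matched: leave this field untouched
		}
		b.WriteString(line[last:loc[4]]) // everything up to the string contents
		b.WriteString(redact(line[loc[4]:loc[5]]))
		last = loc[5] // the closing quote and terminator are copied later
	}
	b.WriteString(line[last:])
	return b.String()
}

func main() {
	line := `{ name: "Joe", last_name: "Bloggs", age: 5, $comment: "do not redact me", nickname: "Jogs" }`
	fmt.Println(redactLine(line))
}
```

Because only the span of group 2 is rewritten, the comma or closing brace never has to be captured and put back.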

josharian commented 9 years ago

Does "Unplanned" mean this is unlikely to get worked on?

Unplanned just means that this won't potentially block a release. I know that @michaelmatloob has been looking at regexp stuff recently; perhaps he is interested.

crenz commented 7 years ago

Just wanted to add that I hit the very same issue today. I was trying to implement a simple tag replacement, e.g.

Name: {name}
First name: {firstname}

becomes

Name: Doe
First name: Jon

I'm coming from a Perl background; my first intuition was using a regexp like /{([^}]+)}/. Note the submatch in parentheses: In Perl, it would be possible to use replace (and call a function on the submatch) or use split (and get the submatches returned). In Go, Split never returns the part that matches, and ReplaceAllStringFunc passes the complete match to the callback instead of just the submatch.
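For this particular case the submatch is just the match without its braces, so it can be recovered by slicing inside the ReplaceAllStringFunc callback, with no second regexp pass. A minimal sketch, assuming a hypothetical `fillTemplate` helper and a plain map of values:

```go
package main

import (
	"fmt"
	"regexp"
)

// tagRe matches tags such as {name}; the braces are escaped to be explicit.
var tagRe = regexp.MustCompile(`\{[^}]+\}`)

// fillTemplate is a hypothetical helper that replaces each {key} tag with
// values[key], leaving unknown tags as they are.
func fillTemplate(tmpl string, values map[string]string) string {
	return tagRe.ReplaceAllStringFunc(tmpl, func(m string) string {
		key := m[1 : len(m)-1] // strip the surrounding braces
		if v, ok := values[key]; ok {
			return v
		}
		return m
	})
}

func main() {
	tmpl := "Name: {name}\nFirst name: {firstname}"
	fmt.Println(fillTemplate(tmpl, map[string]string{
		"name":      "Doe",
		"firstname": "Jon",
	}))
}
```

This shortcut only works when the submatch can be derived from the match by fixed offsets; patterns with variable-length context around the group still need the submatch indexes.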

matloob commented 7 years ago

I'm not planning on working on this. If you're interested in contributing this, feel free to do so, but note that the freeze will start in a few days.

AlekSi commented 7 years ago

Is this issue solved by Regexp.Expand and Regexp.ExpandString?

ghost commented 7 years ago

@AlekSi I guess not, at least not in a straightforward way. The number of variables in the expand template is limited, whereas the number of matches in a string isn't.
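To make the difference concrete, here is a small sketch of what the template-based functions can do: they rearrange submatches through `$1`, `$2`, and so on, but the template is fixed text, so no function is applied to a submatch. The helper names `swapPairs` and `expandPairs` are made up for this illustration.

```go
package main

import (
	"fmt"
	"regexp"
)

var pairRe = regexp.MustCompile(`(\w+): (\w+)`)

// swapPairs uses a replacement template to swap the two submatches.
// Text outside the matches is preserved.
func swapPairs(src string) string {
	return pairRe.ReplaceAllString(src, "$2: $1")
}

// expandPairs uses ExpandString, which appends only the expanded template
// for each match; text between matches is not copied.
func expandPairs(src string) string {
	var dst []byte
	for _, loc := range pairRe.FindAllStringSubmatchIndex(src, -1) {
		dst = pairRe.ExpandString(dst, "$2=$1;", src, loc)
	}
	return string(dst)
}

func main() {
	src := "a: b, c: d"
	fmt.Println(swapPairs(src))
	fmt.Println(expandPairs(src))
}
```

Neither call could, for example, upper-case or hash a submatch, which is the case this issue is about.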

srackham commented 6 years ago

I came across this post by Elliot Chance; it solved a JavaScript to Go porting problem I was having (for consistency it would be nice if it was incorporated as a new method in the Go regexp package):

http://elliot.land/post/go-replace-string-with-regular-expression-callback

Gist here: https://gist.github.com/elliotchance/d419395aa776d632d897

alisonatwork commented 5 years ago

Thanks for the link @srackham - I hit exactly the same problem with trying to port something from JavaScript to Go. It would definitely be nice to see this functionality inside the standard regexp package.

I also found another project which appears to implement similar functionality in perhaps a cleaner way because it replaces the default regexp: https://github.com/agext/regexp

This gives some idea of how the solution could look: https://github.com/agext/regexp/blob/master/agext.go#L105

slimsag commented 4 years ago

Here is a snippet for anyone else looking for a way to replace submatches with a function using bytes (not strings) and without having to deal with intermediate (non-captured) data: https://gist.github.com/slimsag/14c66b88633bd52b7fa710349e4c6749

inliquid commented 3 years ago

I have the same problem.

  1. There are text posts which may include specific links to files stored in a directory structure
  2. I need to parse these posts, find links to files, and then
  3. Move these files to different directory structure,
  4. Manipulate the original path, and
  5. Return new path as a replacement (and replace at the same time if possible).

I would use ReplaceAllStringFunc, but I also need submatches, which leads to making an additional call to the same regexp within the repl function with FindAllStringSubmatch.

entonio commented 10 months ago

I've met this issue today. I'm sure I've met it before, but I've probably used some tedious, bug-prone workaround.

volodymyrprokopyuk commented 8 months ago

Hi,

A solution I use to solve this problem does two regexp matches, one for Replace and another for Find, which is inefficient:

    package main

    import (
      "fmt"
      "regexp"
      "strings"
    )

    func main() {
      str := "a: b, c: d"
      re := regexp.MustCompile(`(\w+): (\w+)`)
      // The callback only receives the whole match, so it has to run
      // the regexp a second time to recover the submatches.
      transformString := func(s string) string {
        m := re.FindStringSubmatch(s) // inefficiency: match again
        k, v := m[1], m[2]
        return fmt.Sprintf("%v: %v", strings.ToUpper(v), strings.ToUpper(k))
      }
      rpl := re.ReplaceAllStringFunc(str, transformString) // first match
      fmt.Println(rpl) // B: A, D: C
    }

The function ReplaceAllStringSubmatchFunc() is missing from the regexp package. With this function the code would look like:

    func main() {
      str := "a: b, c: d"
      re := regexp.MustCompile(`(\w+): (\w+)`)
      transformSubmatch := func(m []string) string {
        k, v := m[1], m[2]
        return fmt.Sprintf("%v: %v", strings.ToUpper(v), strings.ToUpper(k))
      }
      rpl := re.ReplaceAllStringSubmatchFunc(str, transformSubmatch) // new function
      fmt.Println(rpl) // B: A, D: C
    }

I'm looking forward to ReplaceAllStringSubmatchFunc() being included in the regexp package, as this situation comes up quite often.

Thank you!