golang / go

The Go programming language
https://go.dev
BSD 3-Clause "New" or "Revised" License

regexp/syntax: named capture groups don't support non-latin alphabets #64678

Open igorzhilianin opened 9 months ago

igorzhilianin commented 9 months ago

Go version

go version go1.21.4 linux/amd64

What operating system and processor architecture are you using (go env)?

GO111MODULE=''
GOARCH='amd64'
GOBIN=''
GOCACHE='/root/.cache/go-build'
GOENV='/root/.config/go/env'
GOEXE=''
GOEXPERIMENT=''
GOFLAGS=''
GOHOSTARCH='amd64'
GOHOSTOS='linux'
GOINSECURE=''
GOMODCACHE='/root/go/pkg/mod'
GONOPROXY=''
GONOSUMDB=''
GOOS='linux'
GOPATH='/root/go'
GOPRIVATE=''
GOPROXY='https://proxy.golang.org,direct'
GOROOT='/usr/lib/go-1.21'
GOSUMDB='sum.golang.org'
GOTMPDIR=''
GOTOOLCHAIN='auto'
GOTOOLDIR='/usr/lib/go-1.21/pkg/tool/linux_amd64'
GOVCS=''
GOVERSION='go1.21.4'
GCCGO='gccgo'
GOAMD64='v1'
AR='ar'
CC='gcc'
CXX='g++'
CGO_ENABLED='1'
GOMOD='/dev/null'
GOWORK=''
CGO_CFLAGS='-O2 -g'
CGO_CPPFLAGS=''
CGO_CXXFLAGS='-O2 -g'
CGO_FFLAGS='-O2 -g'
CGO_LDFLAGS='-O2 -g'
PKG_CONFIG='pkg-config'
GOGCCFLAGS='-fPIC -m64 -pthread -Wl,--no-gc-sections -fmessage-length=0 -ffile-prefix-map=/tmp/go-build1924873596=/tmp/go-build -gno-record-gcc-switches'

What did you do?

Python's re and google/re2 have no issue compiling named capture groups with international characters.

Go doesn't support it, as you can see by running this sample: https://go.dev/play/p/d1pVihwOznE

package main

import (
    "regexp"
)

func main() {
    regexp.MustCompile(`(?P<тест>a)`)
}

What did you expect to see?

No errors.

What did you see instead?

panic: regexp: Compile(`(?P<тест>a)`): error parsing regexp: invalid named capture: `(?P<тест>`

goroutine 1 [running]:
regexp.MustCompile({0x485746, 0xf})
    /usr/lib/go-1.21/src/regexp/regexp.go:319 +0xb4
main.main()
    /root/test.go:8 +0x1f

gopherbot commented 9 months ago

Change https://go.dev/cl/548997 mentions this issue: regexp/syntax: allow extended Unicode characters in capture names

prattmic commented 9 months ago

cc @rsc

seankhliao commented 9 months ago

We rejected #60784 a while ago, though re2 allows more Unicode these days: https://github.com/google/re2/commit/6a994180b85293eafcce21d9f3eb8a3526498248
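
To make the discussion concrete, here is a rough sketch of what a more permissive capture-name check could look like. This is not the regexp/syntax implementation and not the code from the linked CL or the re2 commit. The function name and the exact set of Unicode categories (letters, letter numbers, marks, decimal digits, connector punctuation) are assumptions chosen for illustration.

```go
package main

import (
	"fmt"
	"unicode"
)

// isValidCaptureName is a hypothetical validator for this sketch only.
// Assumption: a name is valid if it is non-empty and every rune is a
// letter (L), letter number (Nl), nonspacing or spacing mark (Mn, Mc),
// decimal digit (Nd), or connector punctuation such as '_' (Pc).
func isValidCaptureName(name string) bool {
	if name == "" {
		return false
	}
	for _, r := range name {
		if !unicode.In(r, unicode.L, unicode.Nl, unicode.Mn, unicode.Mc, unicode.Nd, unicode.Pc) {
			return false
		}
	}
	return true
}

func main() {
	for _, name := range []string{"test", "тест", "name_1", "a-b", ""} {
		fmt.Printf("%q: %v\n", name, isValidCaptureName(name))
	}
}
```

Under this assumed rule, ASCII names that are accepted today stay valid, since [A-Za-z0-9_] falls inside the listed categories, while names such as тест become valid as well.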