golang / go

The Go programming language
https://go.dev
BSD 3-Clause "New" or "Revised" License

strings: ToLower gives wrong result for uppercase Σ in the word-final position #33005

Open zurk opened 5 years ago

zurk commented 5 years ago

What version of Go are you using (go version)?

$ go version
go version go1.12.5 darwin/amd64

Does this issue reproduce with the latest release?

yes

What operating system and processor architecture are you using (go env)?

go env Output
$ go env
GOARCH="amd64"
GOBIN=""
GOCACHE="/Users/k/Library/Caches/go-build"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="darwin"
GOOS="darwin"
GOPATH="/Users/k/go"
GOPROXY=""
GORACE=""
GOROOT="/usr/local/go"
GOTMPDIR=""
GOTOOLDIR="/usr/local/go/pkg/tool/darwin_amd64"
GCCGO="gccgo"
CC="clang"
CXX="clang++"
CGO_ENABLED="1"
GOMOD=""
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -m64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -fdebug-prefix-map=/var/folders/kw/93jybvs16_954hytgsq6ld7r0000gn/T/go-build305684975=/tmp/go-build -gno-record-gcc-switches -fno-common"

What did you do?

https://play.golang.org/p/fEDCPSV7Dqi

What did you expect to see?

The program output should be β︎δℕ︎ς because if you lowercase Σ at the last position of the word it becomes ς. See https://en.wikipedia.org/wiki/Sigma

Sigma (uppercase Σ, lowercase σ, lowercase in word-final position ς;

What did you see instead?

The output is β︎δℕ︎σ.


I am not sure this is the only case, across all languages, where the lowercase form depends on position. I only noticed it because Python behaves differently:

t = "β︎Δℕ︎Σ"
print(t.lower()) # output: β︎δℕ︎ς

agnivade commented 5 years ago

Does this need another unicode.SpecialCase in https://golang.org/pkg/strings/#ToLowerSpecial ?

I do see a TODO in unicode/casetables.go.

@robpike @ianlancetaylor

ianlancetaylor commented 5 years ago

CC @mpvl

ALTree commented 3 years ago

Unicode's default lowercasing requires handling the final sigma special case, but the rule is overridden in a few standards. For example, Appendix C of RFC 7790 (PRECIS) says:

local case mapping is not applicable to small sigma or final sigma, so case mapping in the PRECIS framework always maps final sigma to small sigma, independent of context

Changing strings.ToLower to handle the final sigma (in full compliance with the Unicode case-mapping rules) may break existing code that relies on the current behaviour. Also, from a cursory look (I may be wrong), the current special-case mechanism in unicode does not support context-sensitive replacement rules, so it may not be trivial to implement the rule in a non-hacky way.

On the other hand, the golang.org/x/text/cases package handles the final sigma special case, and also provides a way to get PRECIS-compliant lowercasing:

package main

import (
    "fmt"

    "golang.org/x/text/cases"
    "golang.org/x/text/language"
)

func main() {
    greekLower1 := cases.Lower(language.Greek)
    greekLower2 := cases.Lower(language.Greek, cases.HandleFinalSigma(false))

    fmt.Println(greekLower1.String("β︎Δℕ︎Σ"))   // prints β︎δℕ︎ς
    fmt.Println(greekLower2.String("β︎Δℕ︎Σ"))   // prints β︎δℕ︎σ
}

My proposal is to preserve the existing strings behaviour, perhaps add a short note about final sigma handling to the documentation, and point users to the golang.org/x/text/cases package for fully Unicode-compliant case mapping.