proposal: unicode: improvement of rune-checking funcs

diegommm commented 4 months ago

Proposal Details

I was writing a text parser and came across the need to check some runes repeatedly. I also used a lot functions like utf8.ValidRune, unicode.IsSpace, etc. I found some of them were slower compared to some versions of other rune-checker funcs I wrote for the parser, so I setup a repo to make a PoC with my findings. I also found that using unicode.RangeTable is for a very specific use cases, and is not very ergonomic for many mundane use cases. It's also true that I found myself calling Contains and Index family of functions in strings and bytes packages, but many times with the same "needles" (things looked up) for many different "haystacks" (where we looked them up). In some cases, these functions will create an internal, temporary data structure to aid with the lookup, but they are discarded and cannot be reused. It would be great if they could be reused somehow.

(aside: I don't want to bother with the needle/haystack terminology, feel free to point me to more Go-ish alternatives).

The proposal can be divided in two items:

Add a way to create efficient func(rune) bool functions so they can be reused many times. Bikeshedding starter:
1. unicode.IsBytesFunc([]byte) func(rune) bool
2. unicode.IsRunesFunc([]rune) func(rune) bool
3. unicode.IsStringFunc(string) func(rune) bool
Improve the performance of some implementations of these kind of functions in unicode, utf8, strings and bytes packages, especially when there are runes > utf8.RuneSelf. I would like to contribute to the project, so let me know if these are changes that are wanted in the standard library and to what extent should they be implemented (there might be other things to consider that I'm missing to strike the right balance for the project). If so, I can go through the contribution guides and do all that stuff once the "what" is agreed, we can discuss starting with code in the PoC repo (which can have some cleanup for sure).

If both items cannot be attended in this same issue, I would prefer that number 1 is addressed here, as that is more of a proposal, and number 2 sounds more like an improvement that doesn't change API and can probably be addressed in other place (and please, could you kindly point me where to do it).

benchstat results for unicode and utf8 improvements

``` goos: linux goarch: amd64 pkg: github.com/diegommm/runes cpu: 11th Gen Intel(R) Stdlib/IsDigit/rune=-1-8 Stdlib/IsDigit/rune=48-8 Stdlib/IsDigit/rune=97-8 Stdlib/IsDigit/rune=209-8 Stdlib/IsDigit/rune=26412-8 Stdlib/IsDigit/rune=55296-8 Stdlib/IsDigit/rune=128169-8 Stdlib/IsDigit/rune=130041-8 Stdlib/IsDigit/rune=33554432-8 Stdlib/IsLower/rune=-1-8 Stdlib/IsLower/rune=97-8 Stdlib/IsLower/rune=209-8 Stdlib/IsLower/rune=26412-8 Stdlib/IsLower/rune=55296-8 Stdlib/IsLower/rune=125251-8 Stdlib/IsLower/rune=128169-8 Stdlib/IsLower/rune=33554432-8 Stdlib/IsLower/rune=0-8 Stdlib/IsPrint/rune=-1-8 Stdlib/IsPrint/rune=32-8 Stdlib/IsPrint/rune=97-8 Stdlib/IsPrint/rune=209-8 Stdlib/IsPrint/rune=26412-8 Stdlib/IsPrint/rune=55296-8 Stdlib/IsPrint/rune=128169-8 Stdlib/IsPrint/rune=917999-8 Stdlib/IsPrint/rune=33554432-8 Stdlib/IsTitle/rune=-1-8 Stdlib/IsTitle/rune=97-8 Stdlib/IsTitle/rune=209-8 Stdlib/IsTitle/rune=453-8 Stdlib/IsTitle/rune=8188-8 Stdlib/IsTitle/rune=26412-8 Stdlib/IsTitle/rune=55296-8 Stdlib/IsTitle/rune=128169-8 Stdlib/IsTitle/rune=33554432-8 Stdlib/IsNumber/rune=-1-8 Stdlib/IsNumber/rune=48-8 Stdlib/IsNumber/rune=97-8 Stdlib/IsNumber/rune=209-8 Stdlib/IsNumber/rune=26412-8 Stdlib/IsNumber/rune=55296-8 Stdlib/IsNumber/rune=128169-8 Stdlib/IsNumber/rune=130041-8 Stdlib/IsNumber/rune=33554432-8 Stdlib/IsSymbol/rune=-1-8 Stdlib/IsSymbol/rune=36-8 Stdlib/IsSymbol/rune=97-8 Stdlib/IsSymbol/rune=209-8 Stdlib/IsSymbol/rune=26412-8 Stdlib/IsSymbol/rune=55296-8 Stdlib/IsSymbol/rune=128169-8 Stdlib/IsSymbol/rune=129994-8 Stdlib/IsSymbol/rune=33554432-8 Stdlib/IsUpper/rune=-1-8 Stdlib/IsUpper/rune=65-8 Stdlib/IsUpper/rune=97-8 Stdlib/IsUpper/rune=209-8 Stdlib/IsUpper/rune=26412-8 Stdlib/IsUpper/rune=55296-8 Stdlib/IsUpper/rune=125217-8 Stdlib/IsUpper/rune=128169-8 Stdlib/IsUpper/rune=33554432-8 Stdlib/IsControl/rune=-1-8 Stdlib/IsControl/rune=0-8 Stdlib/IsControl/rune=97-8 Stdlib/IsControl/rune=159-8 Stdlib/IsControl/rune=209-8 Stdlib/IsControl/rune=26412-8 Stdlib/IsControl/rune=55296-8 Stdlib/IsControl/rune=128169-8 Stdlib/IsControl/rune=33554432-8 Stdlib/IsGraphic/rune=-1-8 Stdlib/IsGraphic/rune=32-8 Stdlib/IsGraphic/rune=97-8 Stdlib/IsGraphic/rune=209-8 Stdlib/IsGraphic/rune=26412-8 Stdlib/IsGraphic/rune=55296-8 Stdlib/IsGraphic/rune=128169-8 Stdlib/IsGraphic/rune=917999-8 Stdlib/IsGraphic/rune=33554432-8 Stdlib/IsLetter/rune=-1-8 Stdlib/IsLetter/rune=65-8 Stdlib/IsLetter/rune=97-8 Stdlib/IsLetter/rune=209-8 Stdlib/IsLetter/rune=26412-8 Stdlib/IsLetter/rune=55296-8 Stdlib/IsLetter/rune=128169-8 Stdlib/IsLetter/rune=205743-8 Stdlib/IsLetter/rune=33554432-8 Stdlib/IsMark/rune=-1-8 Stdlib/IsMark/rune=97-8 Stdlib/IsMark/rune=209-8 Stdlib/IsMark/rune=768-8 Stdlib/IsMark/rune=26412-8 Stdlib/IsMark/rune=55296-8 Stdlib/IsMark/rune=128169-8 Stdlib/IsMark/rune=917999-8 Stdlib/IsMark/rune=33554432-8 Stdlib/IsPunct/rune=-1-8 Stdlib/IsPunct/rune=33-8 Stdlib/IsPunct/rune=97-8 Stdlib/IsPunct/rune=209-8 Stdlib/IsPunct/rune=26412-8 Stdlib/IsPunct/rune=55296-8 Stdlib/IsPunct/rune=125279-8 Stdlib/IsPunct/rune=128169-8 Stdlib/IsPunct/rune=33554432-8 Stdlib/IsSpace/rune=-1-8 Stdlib/IsSpace/rune=9-8 Stdlib/IsSpace/rune=97-8 Stdlib/IsSpace/rune=209-8 Stdlib/IsSpace/rune=12288-8 Stdlib/IsSpace/rune=26412-8 Stdlib/IsSpace/rune=55296-8 Stdlib/IsSpace/rune=128169-8 Stdlib/IsSpace/rune=33554432-8 geomean ``` Core(TM) i7-1165G7 @ 2.80GHz │ stdlib │ local │ │ sec/op │ sec/op vs base │ 1.382n ± 2% 1.296n ± 4% -6.22% (p=0.000 n=20) 1.643n ± 1% 1.215n ± 4% -26.07% (p=0.000 n=20) 1.632n ± 1% 1.201n ± 2% -26.43% (p=0.000 n=20) 1.642n ± 2% 1.210n ± 5% -26.32% (p=0.000 n=20) 7.547n ± 2% 1.198n ± 1% -84.12% (p=0.000 n=20) 7.686n ± 1% 1.220n ± 1% -84.13% (p=0.000 n=20) 7.380n ± 1% 1.207n ± 1% -83.64% (p=0.000 n=20) 6.396n ± 3% 1.212n ± 1% -81.06% (p=0.000 n=20) 6.628n ± 2% 1.289n ± 5% -80.55% (p=0.000 n=20) 3.033n ± 2% 1.290n ± 0% -57.47% (p=0.000 n=20) 1.705n ± 3% 1.216n ± 4% -28.65% (p=0.000 n=20) 1.712n ± 3% 1.211n ± 2% -29.28% (p=0.000 n=20) 8.889n ± 2% 1.195n ± 2% -86.56% (p=0.000 n=20) 9.441n ± 3% 1.213n ± 2% -87.16% (p=0.000 n=20) 7.268n ± 1% 1.194n ± 2% -83.57% (p=0.000 n=20) 7.566n ± 3% 1.317n ± 3% -82.59% (p=0.000 n=20) 7.747n ± 5% 1.287n ± 0% -83.39% (p=0.000 n=20) 1.813n ± 2% 1.289n ± 5% -28.90% (p=0.000 n=20) 11.520n ± 3% 1.291n ± 0% -88.79% (p=0.000 n=20) 1.731n ± 2% 1.199n ± 1% -30.73% (p=0.000 n=20) 1.738n ± 1% 1.201n ± 2% -30.92% (p=0.000 n=20) 1.707n ± 4% 1.218n ± 4% -28.63% (p=0.000 n=20) 10.620n ± 2% 1.218n ± 2% -88.53% (p=0.000 n=20) 49.325n ± 2% 1.197n ± 2% -97.57% (p=0.000 n=20) 39.490n ± 1% 1.197n ± 2% -96.97% (p=0.000 n=20) 18.710n ± 3% 1.214n ± 2% -93.51% (p=0.000 n=20) 39.165n ± 6% 1.286n ± 5% -96.72% (p=0.000 n=20) 1.546n ± 4% 1.295n ± 4% -16.18% (p=0.000 n=20) 1.571n ± 4% 1.295n ± 4% -17.57% (p=0.000 n=20) 1.582n ± 6% 1.286n ± 0% -18.68% (p=0.000 n=20) 5.665n ± 2% 1.213n ± 4% -78.59% (p=0.000 n=20) 7.506n ± 6% 1.204n ± 1% -83.96% (p=0.000 n=20) 3.200n ± 5% 1.292n ± 4% -59.61% (p=0.000 n=20) 3.206n ± 4% 1.287n ± 0% -59.86% (p=0.000 n=20) 3.194n ± 3% 1.287n ± 5% -59.68% (p=0.000 n=20) 3.005n ± 2% 1.287n ± 0% -57.15% (p=0.000 n=20) 3.024n ± 2% 1.327n ± 3% -56.11% (p=0.000 n=20) 1.602n ± 2% 1.214n ± 2% -24.19% (p=0.000 n=20) 1.605n ± 1% 1.221n ± 5% -23.93% (p=0.000 n=20) 1.584n ± 1% 1.202n ± 2% -24.15% (p=0.000 n=20) 8.251n ± 7% 1.202n ± 2% -85.43% (p=0.000 n=20) 8.683n ± 3% 1.209n ± 1% -86.08% (p=0.000 n=20) 8.423n ± 3% 1.197n ± 3% -85.79% (p=0.000 n=20) 8.373n ± 3% 1.200n ± 1% -85.67% (p=0.000 n=20) 8.539n ± 6% 1.287n ± 0% -84.92% (p=0.000 n=20) 3.163n ± 5% 1.321n ± 2% -58.26% (p=0.000 n=20) 1.683n ± 3% 1.214n ± 4% -27.92% (p=0.000 n=20) 1.601n ± 1% 1.211n ± 5% -24.36% (p=0.000 n=20) 1.581n ± 1% 1.195n ± 1% -24.39% (p=0.000 n=20) 9.007n ± 3% 1.216n ± 4% -86.50% (p=0.000 n=20) 7.319n ± 2% 1.214n ± 1% -83.41% (p=0.000 n=20) 8.978n ± 2% 1.211n ± 1% -86.51% (p=0.000 n=20) 8.339n ± 3% 1.211n ± 1% -85.48% (p=0.000 n=20) 8.183n ± 2% 1.291n ± 0% -84.22% (p=0.000 n=20) 2.943n ± 1% 1.288n ± 4% -56.22% (p=0.000 n=20) 1.754n ± 2% 1.197n ± 6% -31.78% (p=0.000 n=20) 1.746n ± 2% 1.194n ± 0% -31.63% (p=0.000 n=20) 1.730n ± 6% 1.212n ± 4% -29.94% (p=0.000 n=20) 10.400n ± 4% 1.216n ± 4% -88.31% (p=0.000 n=20) 10.775n ± 2% 1.204n ± 1% -88.83% (p=0.000 n=20) 7.255n ± 3% 1.196n ± 1% -83.52% (p=0.000 n=20) 7.401n ± 4% 1.286n ± 0% -82.63% (p=0.000 n=20) 7.433n ± 3% 1.290n ± 5% -82.64% (p=0.000 n=20) 1.389n ± 1% 1.296n ± 4% -6.69% (p=0.000 n=20) 1.175n ± 1% 1.224n ± 4% +4.13% (p=0.000 n=20) 1.190n ± 2% 1.198n ± 2% +0.63% (p=0.050 n=20) 1.211n ± 4% 1.213n ± 4% ~ (p=0.365 n=20) 1.175n ± 0% 1.292n ± 1% +10.00% (p=0.000 n=20) 1.406n ± 4% 1.291n ± 1% -8.18% (p=0.000 n=20) 1.387n ± 6% 1.290n ± 0% -7.03% (p=0.000 n=20) 1.421n ± 3% 1.295n ± 4% -8.90% (p=0.000 n=20) 1.458n ± 3% 1.292n ± 4% -11.36% (p=0.000 n=20) 13.175n ± 4% 1.287n ± 5% -90.23% (p=0.000 n=20) 1.737n ± 3% 1.202n ± 1% -30.75% (p=0.000 n=20) 1.736n ± 2% 1.213n ± 4% -30.16% (p=0.000 n=20) 1.658n ± 1% 1.217n ± 4% -26.60% (p=0.000 n=20) 10.055n ± 2% 1.198n ± 6% -88.09% (p=0.000 n=20) 46.660n ± 2% 1.213n ± 4% -97.40% (p=0.000 n=20) 38.305n ± 1% 1.220n ± 4% -96.82% (p=0.000 n=20) 18.635n ± 8% 1.196n ± 2% -93.58% (p=0.000 n=20) 45.490n ± 5% 1.291n ± 1% -97.16% (p=0.000 n=20) 3.203n ± 3% 1.289n ± 1% -59.76% (p=0.000 n=20) 1.632n ± 3% 1.197n ± 1% -26.65% (p=0.000 n=20) 1.598n ± 1% 1.201n ± 2% -24.84% (p=0.000 n=20) 1.603n ± 2% 1.214n ± 4% -24.26% (p=0.000 n=20) 10.205n ± 9% 1.201n ± 2% -88.24% (p=0.000 n=20) 11.520n ± 6% 1.212n ± 5% -89.48% (p=0.000 n=20) 10.395n ± 3% 1.214n ± 1% -88.33% (p=0.000 n=20) 10.135n ± 2% 1.215n ± 2% -88.01% (p=0.000 n=20) 10.180n ± 2% 1.288n ± 1% -87.35% (p=0.000 n=20) 3.040n ± 2% 1.296n ± 5% -57.34% (p=0.000 n=20) 4.468n ± 2% 1.295n ± 1% -71.00% (p=0.000 n=20) 4.430n ± 2% 1.311n ± 3% -70.41% (p=0.000 n=20) 8.604n ± 2% 1.236n ± 2% -85.63% (p=0.000 n=20) 9.505n ± 4% 1.247n ± 2% -86.89% (p=0.000 n=20) 10.530n ± 1% 1.218n ± 1% -88.43% (p=0.000 n=20) 8.802n ± 3% 1.198n ± 2% -86.39% (p=0.000 n=20) 8.084n ± 2% 1.223n ± 2% -84.87% (p=0.000 n=20) 8.223n ± 3% 1.300n ± 1% -84.19% (p=0.000 n=20) 3.086n ± 8% 1.288n ± 1% -58.25% (p=0.000 n=20) 1.698n ± 4% 1.213n ± 1% -28.59% (p=0.000 n=20) 1.677n ± 5% 1.220n ± 4% -27.22% (p=0.000 n=20) 1.665n ± 4% 1.215n ± 2% -27.05% (p=0.000 n=20) 8.579n ± 5% 1.197n ± 2% -86.05% (p=0.000 n=20) 9.496n ± 6% 1.193n ± 2% -87.44% (p=0.000 n=20) 7.013n ± 7% 1.198n ± 1% -82.91% (p=0.000 n=20) 7.309n ± 1% 1.287n ± 5% -82.39% (p=0.000 n=20) 7.533n ± 3% 1.287n ± 5% -82.91% (p=0.000 n=20) 3.048n ± 3% 1.288n ± 0% -57.73% (p=0.000 n=20) 1.754n ± 5% 1.196n ± 1% -31.82% (p=0.000 n=20) 1.679n ± 5% 1.208n ± 1% -28.07% (p=0.000 n=20) 1.669n ± 1% 1.195n ± 2% -28.42% (p=0.000 n=20) 6.161n ± 2% 1.240n ± 3% -79.87% (p=0.000 n=20) 3.161n ± 3% 1.287n ± 0% -59.29% (p=0.000 n=20) 3.062n ± 3% 1.286n ± 1% -57.98% (p=0.000 n=20) 3.080n ± 2% 1.291n ± 4% -58.06% (p=0.000 n=20) 3.067n ± 3% 1.288n ± 1% -58.01% (p=0.000 n=20) 4.342n 1.240n -71.45%

gabyhelp commented 4 months ago

Similar Issues

_{(Emoji vote if this was helpful or unhelpful; more detailed feedback welcome in this discussion.)}

ianlancetaylor commented 4 months ago

Can you show us the doc comments for the three functions that you are proposing? It's not obvious what they do.

diegommm commented 4 months ago

Doc comment

Can you show us the doc comments for the three functions that you are proposing? It's not obvious what they do.

The three do the same but with different input types. A possible doc comment for unicode.IsStringFunc(string) func(rune) bool is:

// IsStringFunc returns a function that checks if a rune is found in the provided string. Example:
//
//  openingRune := IsStringFunc("([{<《（")
//  for s != "" {
//      i := strings.IndexFunc(s, openingRune)
//      // if i >= 0 ...
//  }

I initially considered the name IsInStringFunc, which I think may sound clearer, but then dropped the In word to make it more like unicode.IsSpace.

More on motivation using the previous example

Note that it is perfectly fine to use strings.IndexAny(s, "([{<《（"), but then in each iteration strings.IndexAny goes through some strategies and ends up doing the following:

for i, c := range s {
    if IndexRune(chars, c) >= 0 { // IndexRune("([{<《（", c)
        return i
    }
}

Then IndexRune calls IndexByte as long as it works on byte-sized c runes, which is great because it's very fast. But if my text has a lot of non-ASCII text (e.g. if you're parsing chinese, then your c will likely be ASCII very few times) then it will call Index(s, string(r)). And there it gets hairy. For our use case, it will end up calling bytealg.IndexString("([{<《（", c), but if instead of "([{<《（" we have something longer (>31 even on some -old, pre 2013- amd64) then it may still try a few more looping tricks.

Having said all that, it is no longer clear what's going to be the O of strings.IndexAny for our mostly non-ASCII case. It may be some O(x * y * z) in the worst case (I may be wrong in the * z, after all there's some architecture dependent stuff too). And that should be multiplied by the number of iterations in our enclosing for loop.

If we had the chance to cache the strategy of what needs to be searched, then that could be reused. In the shared repo, I'm proposing a solution that provides constant time when checking any rune. So calling strings.IndexFunc(s, openingRune) becomes O( utf8.RuneCountString(s) ).

diegommm commented 4 months ago

Another option:

IsInSet([]rune) func(rune) bool
IsInSetString(string) func(rune) bool
IsInSetBytes([]byte) func(rune) bool

(Thinking that we provide a "RuneSet" to match against) And yet another one:

func IsInSet[S interface{ []rune | string | []byte }](s S) func(rune) bool {
    swtich ss := any(s).(type){
    case []rune:
    // ...
    }
}

ianlancetaylor commented 4 months ago

This sounds like something that might be appropriate for a third party library outside of the standard library. https://go.dev/doc/faq#x_in_std

diegommm commented 4 months ago

This sounds like something that might be appropriate for a third party library outside of the standard library.

I'm thinking of these functions as a way to compile a set of runes to be later matched, much like a regular expression, but for runes. The performance advantage is huge since this is done everywhere, and given that rune matching is cached and you can reuse that in parsing loops. Sure, regular expressions are an infinitely more complex case, but consider that even in the standard library there are many hot paths in parsers using functions like bytes.IndexAny and bytes.ContainsAny. Imagine if instead of providing regexp.Compile() we would just have regexp.Match(regexp string, test string) (EDIT: this is an exaggeration, just to exemplify the concept).

I found that the bitmask approach even performs better than a func checking with a switch a handful of runes. This is very common. There are also cases where we want to check if it's just any of " \t\n", so we call one of the IndexAny functions. This is also ubiquitous. And it is not always just ASCII text being queried, take as an example html/template/escape.go, where it is parsing HTML and it needs to handle some JavaScript space characters and calls bytes.ContainsAny(s[written:i1], "\n\r\u2028\u2029"). There's also the case of mime/grammar.go:

func isToken(s string) bool {
        if s == "" {
                return false
        }
        return strings.IndexFunc(s, isNotTokenChar) < 0
}

func isNotTokenChar(r rune) bool {
        return !isTokenChar(r)
}

func isTokenChar(r rune) bool {
        // token := 1*<any (US-ASCII) CHAR except SPACE, CTLs,
        //             or tspecials>
        return r > 0x20 && r < 0x7f && !isTSpecial(r)
}

func isTSpecial(r rune) bool {
        return strings.ContainsRune(`()<>@,;:\"/[]?=`, r)
}

Now consider this alternative:

var (
        isTSpecialChars = `()<>@,;:\"/[]?=`
        isTSpecial = whatever.IsInStringSet(isTSpecialChars)
        isTokenChar func(rune) bool
)

func init(){
        runes := make([]rune, 0, 0x7f - 0x21 - len(isTSpecialChars))
        for r := rune(0x21); r < 0x7f; r++ {
                if !isTSpecial(r) {
                        runes = append(runes, r)
                }
        }
        isTokenChar = whatever.IsInRunesSet(runes)
}

func isNotTokenChar(r rune) bool {
        return !isTokenChar(r)
}

Note that all these function checks are done in constant time in the approach I'm proposing in the linked repo, as opposed to O(x * y) (and possibly O(x * y * z)?). In my machine, the version I'm proposing runs in 52% the time of the original, using a basic benchmark with the test strings I found in mime/mediatype_test.go (actually, I duplicated the strings and added an @ at the end to force some negatives too).

Benchmark stuff

```shell goos: linux goarch: amd64 pkg: test/mime-test cpu: 11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz │ old.txt │ new.txt │ │ sec/op │ sec/op vs base │ IsToken-8 1108.0n ± 1% 574.9n ± 0% -48.11% (p=0.000 n=10) ``` The code ```go package main import ( "strings" "testing" ) // original func isToken(s string) bool { if s == "" { return false } return strings.IndexFunc(s, isNotTokenChar) < 0 } func isNotTokenChar(r rune) bool { return !isTokenChar(r) } func isTokenChar(r rune) bool { return r > 0x20 && r < 0x7f && !isTSpecial(r) } func isTSpecial(r rune) bool { return strings.ContainsRune(`()<>@,;:\"/[]?=`, r) } // changed var ( isTSpecialChars = `()<>@,;:\"/[]?=` isTSpecial2 = IsStringFunc(isTSpecialChars) isTokenChar2 func(rune) bool ) func init() { runes := make([]rune, 0, 0x7f-0x21-len(isTSpecialChars)) for r := rune(0x21); r < 0x7f; r++ { if !isTSpecial(r) { runes = append(runes, r) } } isTokenChar2 = IsRunesFunc(runes) } func isNotTokenChar2(r rune) bool { return !isTokenChar2(r) } func isToken2(s string) bool { if s == "" { return false } return strings.IndexFunc(s, isNotTokenChar2) < 0 } // bench stuff var mediaTypes = []string{ "noslash noslash @", "foo bar/baz foo bar/baz @", "foo/bar baz foo/bar baz @", "attachment attachment @", "attachment attachment @", "attachment attachment @", "foo/BAR foo/BAR @", "foo/BAR foo/BAR @", "foo/BAR foo/BAR @", "foo/BAR foo/BAR @", "foo/BAR foo/BAR @", "foo/BAR foo/BAR @", "foo/BAR foo/BAR @", "foo/BAR foo/BAR @", "foo/BAR foo/BAR @", "foo/BAR foo/BAR @", "foo/bar foo/bar @", "foo/bar foo/bar @", "foo foo @", "noslash", "foo bar/baz", "foo/bar baz", "attachment", "attachment", "attachment", "foo/BAR", "foo/BAR", "foo/BAR", "foo/BAR", "foo/BAR", "foo/BAR", "foo/BAR", "foo/BAR", "foo/BAR", "foo/BAR", "foo/bar", "foo/bar", "foo", } func BenchmarkIsToken(b *testing.B) { for i := 0; i < b.N; i++ { for _, mt := range mediaTypes { isToken(mt) } } } func BenchmarkIsToken2(b *testing.B) { for i := 0; i < b.N; i++ { for _, mt := range mediaTypes { isToken2(mt) } } } ```

This is also not something uncommon to find in the wild, I think unicode.IsSpace is a great example of what rune checking does, but we need to write parsers everywhere and we need to check for special chars as well.

Examples in standard library where this would help

This is not a final list, there are some other cases as well. | File | Func/Method | Operation | Context | |--------|--------|--------|--------| | cmd/cgo/ast.go | (*File).ParseGo | strings.ContainsAny(abspath, "\r\n") | 2nd level nested loop | | cmd/cover/cover.go | annotate | strings.ContainsAny(name, "\r\n") | 1st level loop | | cmd/cover/cover.go | (*Package).annotateFile | strings.ContainsAny(name, "\r\n") | 1st level loop | | cmd/go/internal/fsys/fsys.go | hasMeta | strings.ContainsAny(path, magicChars) | magicChars depends on OS, called twice in Glob | | cmd/vendor/github.com/google/pprof/internal/symbolizer/symbolizer.go | looksLikeDemangledCPlusPlus and removeMatching | strings.ContainsAny | called in a loop, top func is Demangle | | go/types/resolver.go | dir | strings.LastIndexAny(path, `/\`) | Called in a loop in (*Checker).collectObjects, called for each file being type-checked | | html/template/escape.go | (*escaper).escapeText and contextAfterText | bytes.IndexAny | Called in a loop at least once, thrice worst case, whenever escaping a text node, and includes funny characters because of JavaScript | | html/template/html.go | stripTags | bytes.IndexAny | called in a loop, runs in many places, processes a lot of HTML | | mime/mediatype.go | consumeToken | strings.IndexFunc and strings.ContainsRune | called in many loops | | mime/grammar.go | isToken | strings.IndexFunc and strings.ContainsRune | called several times in FormatMediaType | | log/slog/level.go | (*Level).parse | strings.IndexAny | called in UnmarshalJSON and in MarshalText | | runtime/debug/mod.go | quoteKey and quoteValue | strings.ContainsAny | called in loops to parse BuildInfo and to convert it to string | | archive/tar/reader.go | (*Reader).readHeader | bytes.IndexFunc | called in a loop to read the next block header of the tar archive each time you want to advance to the next entry in the archive, which means it's called in many other loops when iterating over the entries | | encoding/csv/writer.go | (*Writer).Write | strings.IndexAny and strings.ContainsAny | 1st and 2nd level nested loop | | encoding/pem/pem.go | bytes.ContainsAny | bytes.ContainsAny | 2nd level nested loop when decoding each PEM | | net/http/cookiejar/jar.go | isIP | strings.ContainsAny | Called in a loop to add each cookie, when setting, and when retrieving a cookie | | net/http/cookie.go | sanitizeCookieValue and isCookieNameValid | strings.IndexFunc and strings.ContainsAny | Many loops | | net/http/request.go | validMethod | strings.IndexFunc | Called every time you need to validate an HTTP method name | | net/mail/message.go | (*Address).String, (*addrParser).consumeDisplayNameComment, ParseDate | strings.FieldsFunc, strings.IndexAny, and strings.ContainsAny | Called in many loops |

golang / go