temoto / robotstxt

The robots.txt exclusion protocol implementation for Go language
MIT License
269 stars 55 forks source link

optimize Group.findRule #27

Closed moredure closed 3 years ago

temoto commented 3 years ago

replacer/asterisk is very confusing change. If this is a draft work, you can mark pull request as draft with a button here.

moredure commented 3 years ago

In practice TrimRight using TrimRightFunc internally, just no allocation, replacer also should allocate less one time string building vs two. Also we can use EqualFold to reduce one more memory allocation of strings.ToLower in parseLine function, I guess Disallow/Allow/Crawl-Delay tokens very frequently used in camelcase, so it will be exactly one allocation less in this cases in practice such tokens might appear more than once per robots.txt across the internet so -4-6 allocations per document.

moredure commented 3 years ago
BenchmarkFromBytes/One_Replacer
BenchmarkFromBytes/One_Replacer-4            7473553           149 ns/op          48 B/op          3 allocs/op
BenchmarkFromBytes/Two_strings.Replace
BenchmarkFromBytes/Two_strings.Replace-4     8631030           139 ns/op          24 B/op          4 allocs/op

Maybe replacer a little bit slower, but one allocation less. But with larger lines faster:

BenchmarkFromBytes/One_Replacer
BenchmarkFromBytes/One_Replacer-4            6257397           180 ns/op          96 B/op          3 allocs/op
BenchmarkFromBytes/Two_strings.Replace
BenchmarkFromBytes/Two_strings.Replace-4     6205563           196 ns/op         128 B/op          4 allocs/op

TrimRight vs TrimRightFunc:

BenchmarkTrimFunc/TrimRightFunc
BenchmarkTrimFunc/TrimRightFunc-4           90882796            12.4 ns/op         0 B/op          0 allocs/op
BenchmarkTrimFunc/TrimRight
BenchmarkTrimFunc/TrimRight-4               24250952            46.0 ns/op        32 B/op          1 allocs/op
moredure commented 3 years ago

Check for yourself:

package robotstxt

import (
    "strings"
    "testing"
)

var replacer = strings.NewReplacer(`\*`, `.*`, `\$`, `$`)

func star(r rune) bool {
    return r == '*'
}

func BenchmarkReplacer(b *testing.B) {
    t2 := `sdfsd \*\$\* sdfsdf`
    b.Run("One Replacer", func(b *testing.B) {
        b.ReportAllocs()
        for i := 0; i < b.N; i ++  {
            replacer.Replace(t2)
        }
    })

    b.Run("Two strings.Replace", func(b *testing.B) {
        b.ReportAllocs()
        for i := 0; i < b.N; i ++  {
            strings.Replace(strings.Replace(t2, `\*`, `.*`, -1), `\$`, `$`, -1)
        }
    })
}

func BenchmarkTrimFunc(b *testing.B) {
    b.Run("TrimRightFunc", func(b *testing.B) {
        b.ReportAllocs()
        for i := 0; i < b.N; i ++  {
            strings.TrimRightFunc("sdfsdffsdfsdf*", star)
        }
    })

    b.Run("TrimRight", func(b *testing.B) {
        b.ReportAllocs()
        for i := 0; i < b.N; i ++  {
            strings.TrimRight("sdfsdffsdfsdf*", "*")
        }
    })
}

func BenchmarkEqualFold(b *testing.B) {
    t := `Disallow`
    b.Run("EqualFold inside switch", func(b *testing.B) {
        b.ReportAllocs()
        for i := 0; i < b.N; i ++  {
            switch {
            case strings.EqualFold(t, "disallow"):
            case strings.EqualFold(t, "allow"):
            case strings.EqualFold(t, "useragent"):
            case strings.EqualFold(t, "user-agent"):
            case strings.EqualFold(t, "crawl-delay"):
            case strings.EqualFold(t, "crawldelay"):
            }
        }
    })

    b.Run("To Lower then comare in switch", func(b *testing.B) {
        b.ReportAllocs()
        for i := 0; i < b.N; i ++  {
            switch strings.ToLower(t) {
            case "disallow":
            case "allow":
            case "useragent":
            case "user-agent":
            case "crawl-delay":
            case "crawldelay":
            }
        }
    })
}

In case of equal fold

BenchmarkEqualFold
BenchmarkEqualFold/EqualFold_inside_switch
BenchmarkEqualFold/EqualFold_inside_switch-4            56437368            19.2 ns/op         0 B/op          0 allocs/op
BenchmarkEqualFold/To_Lower_then_comare_in_switch
BenchmarkEqualFold/To_Lower_then_comare_in_switch-4     30694930            37.3 ns/op         8 B/op          1 allocs/op

One less allocation and 2 times faster.

codecov[bot] commented 3 years ago

Codecov Report

Merging #27 (99dc19f) into master (87d22cd) will decrease coverage by 0.82%. The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master      #27      +/-   ##
==========================================
- Coverage   92.07%   91.25%   -0.83%     
==========================================
  Files           3        3              
  Lines         303      263      -40     
==========================================
- Hits          279      240      -39     
+ Misses         13       12       -1     
  Partials       11       11              
Impacted Files Coverage Δ
robotstxt.go 92.85% <100.00%> (-1.20%) :arrow_down:
scanner.go 90.78% <0.00%> (-0.70%) :arrow_down:
parser.go 90.59% <0.00%> (-0.61%) :arrow_down:

Continue to review full report at Codecov.

Legend - Click here to learn more Δ = absolute <relative> (impact), ø = not affected, ? = missing data Powered by Codecov. Last update 87d22cd...99dc19f. Read the comment docs.

moredure commented 3 years ago

Looks like for replacer benchmark is worser when no symbols to replace (no asterisks, etc), I'll revert this changes

moredure commented 3 years ago

In worst case scenario (last switch case) equal-fold might be a little bit slower, let's take a look in benchmarks in travis.

temoto commented 3 years ago

Check for yourself:

package robotstxt
func BenchmarkReplacer(b *testing.B) {
func BenchmarkTrimFunc(b *testing.B) {
func BenchmarkEqualFold(b *testing.B) {

This should be in patch.

moredure commented 3 years ago

Let's start from small changes)

moredure commented 3 years ago

@temoto