temoto / robotstxt

The robots.txt exclusion protocol implementation for the Go language
MIT License

panic: runtime error: invalid memory address or nil pointer dereference #20

Closed: crosscue closed this issue 5 years ago

crosscue commented 6 years ago

During crawling with https://github.com/gocolly/colly this would periodically be thrown:

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x796daf]
goroutine 245473 [running]:
github.com/temoto/robotstxt.(*RobotsData).FindGroup(0x0, 0x9121e2, 0x28, 0xc42f2f4dd8)
    /mnt/hgfs/test-repo/go/src/github.com/temoto/robotstxt/robotstxt.go:166 +0x6f

I locally modified lines 166-180 of robotstxt.go as follows, and it appears to work okay now:

    if ret != nil {
        if ret == r.groups["*"] {
            // Weakest match possible
            prefixLen = 1
        }

        for a, g := range r.groups {
            if a != "*" && strings.HasPrefix(agent, a) {
                if l := len(a); l > prefixLen {
                    prefixLen = l
                    ret = g
                }
            }
        }
    }
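For what it's worth, the first argument in the stack trace above is `0x0`, which is the method receiver: `FindGroup` was called on a nil `*RobotsData`, so the panic happens before any matching logic runs. A minimal sketch with stand-in types (not the library's actual definitions) reproduces that class of panic:

```go
package main

import "fmt"

// Stand-in types for illustration only; the real library's
// RobotsData and Group have more fields and methods.
type Group struct{ name string }

type RobotsData struct {
	groups map[string]*Group
}

func (r *RobotsData) FindGroup(agent string) *Group {
	// If the caller holds a nil *RobotsData, this field access
	// dereferences a nil pointer and panics.
	return r.groups[agent]
}

func main() {
	defer func() {
		if err := recover(); err != nil {
			fmt.Println("recovered:", err)
		}
	}()
	var r *RobotsData // nil, e.g. the result of a failed cache lookup
	r.FindGroup("Googlebot")
	// prints: recovered: runtime error: invalid memory address or nil pointer dereference
}
```

In Go, calling a method on a nil pointer receiver is legal; the crash only happens when the method body touches a field, which is why the trace points inside `FindGroup` rather than at the call site.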
temoto commented 6 years ago

@crosscue could you share the robots file that produces this error?

narzeja commented 6 years ago

I get the same (or a similar) error in the tests of my private project. The robots.txt in question:

User-agent: *
Allow: /allowed
Disallow: /disallowed

The modification above did not work for me. Maybe r.groups is not created correctly?

Actual error:

--- FAIL: TestRobotsTxt (0.00s)
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
    panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x664d79]

goroutine 36 [running]:
testing.tRunner.func1(0xc4201ae000)
    /usr/lib/go/src/testing/testing.go:742 +0x29d
panic(0x6bbb80, 0x8ae4b0)
    /usr/lib/go/src/runtime/panic.go:505 +0x229
censored/vendor/github.com/temoto/robotstxt.(*RobotsData).FindGroup(0x0, 0x72039d, 0x19, 0xf)
    /censored/vendor/github.com/temoto/robotstxt/robotstxt.go:167 +0x69
temoto commented 6 years ago

@narzeja can you share the code that checks that robots.txt content?

narzeja commented 6 years ago

Hold on, I messed around my own code and have fixed it. My code in question:

    robot, ok := e.robotsCache[u.Host]
    if !ok {
        resp, _ := downloader.Client.Get(u.Scheme + "://" + u.Host + "/robots.txt")
        robot, err := robotstxt.FromResponse(resp)
        if err != nil {
            return err
        }
        e.robotsCache[u.Host] = robot
    }
    group := robot.FindGroup(e.UserAgent)

The problem here is that robot, err := robotstxt.FromResponse(resp) declares robot and err as new variables scoped to the if block, so the outer robot from the cache lookup stays nil (the zero value a map lookup returns when the key is missing), and robot.FindGroup then panics on the nil pointer.

The solution is to change the line to robot, err = robotstxt.FromResponse(resp).
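This shadowing bug can be reproduced in isolation. In the sketch below, `fetch` is a made-up stand-in for `robotstxt.FromResponse`; the only difference between the two functions is `:=` versus `=` inside the if block:

```go
package main

import "fmt"

// fetch stands in for robotstxt.FromResponse; its name and
// behavior are invented for this sketch.
func fetch() (*string, error) {
	s := "robots data"
	return &s, nil
}

func shadowed() *string {
	var robot *string // e.g. a failed cache lookup: robot is nil
	if robot == nil {
		robot, err := fetch() // := declares a NEW robot inside the if block
		if err != nil {
			return nil
		}
		_ = robot // the inner robot is set here...
	}
	return robot // ...but the outer robot is still nil
}

func fixed() *string {
	var robot *string
	var err error
	if robot == nil {
		robot, err = fetch() // = assigns to the outer robot
		if err != nil {
			return nil
		}
	}
	return robot
}

func main() {
	fmt.Println(shadowed() == nil) // true: the outer variable never changed
	fmt.Println(fixed() == nil)    // false
}
```

Recent versions of `go vet` and most linters (e.g. `shadow`, `govet` in golangci-lint) can flag this pattern, which is worth enabling for exactly this kind of cache-fill code.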

So long story short, my code was rubbish, yours is fine :). I am fairly certain you can close the issue. Thank you for your time.

temoto commented 6 years ago

@narzeja did you also use gocolly?

crosscue commented 6 years ago

The robots code within gocolly that causes the same issue:

func (c *Collector) checkRobots(u *url.URL) error {
    // var robot *robotstxt.RobotsData
    // var ok bool
    var err error

    c.lock.RLock()
    robot, ok := c.robotsMap[u.Host]
    c.lock.RUnlock()

    if !ok {
        // no robots file cached
        resp, _ := c.backend.Client.Get(u.Scheme + "://" + u.Host + "/robots.txt")
        robot, err = robotstxt.FromResponse(resp)
        if err != nil {
            return err
        }
        c.lock.Lock()
        c.robotsMap[u.Host] = robot
        c.lock.Unlock()
    }

    uaGroup := robot.FindGroup(c.UserAgent)
    if uaGroup == nil {
        return nil
    }

    if !uaGroup.Test(u.EscapedPath()) {
        return ErrRobotsTxtBlocked
    }
    return nil
}

So, as narzeja mentions, the issue comes down to the line robot, err = robotstxt.FromResponse(resp) (plain assignment) versus robot, err := robotstxt.FromResponse(resp) (short declaration, which shadows robot).

temoto commented 6 years ago

Maybe I'm missing something, but the gocolly code seems fine: it assigns with =, so it is different from narzeja's code.

@crosscue can you share robots.txt contents that cause trouble with gocolly?

temoto commented 6 years ago

@narzeja beware of map access races: gocolly's approach of surrounding the cache with a lock is correct.

narzeja commented 6 years ago

Thank you @temoto, I added RWMutex to my code.
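For anyone landing here later, the locking pattern gocolly uses (RLock for lookups, Lock for inserts) can be sketched like this. The type and method names are invented for the example, and the cached value is a plain string to keep it dependency-free; in real code it would be a *robotstxt.RobotsData:

```go
package main

import (
	"fmt"
	"sync"
)

// robotsCache guards a host -> robots-data map with a sync.RWMutex,
// mirroring the read/write locking gocolly uses around robotsMap.
type robotsCache struct {
	mu    sync.RWMutex
	cache map[string]string
}

func newRobotsCache() *robotsCache {
	return &robotsCache{cache: make(map[string]string)}
}

func (c *robotsCache) get(host string) (string, bool) {
	c.mu.RLock() // shared lock: many concurrent readers may proceed
	defer c.mu.RUnlock()
	v, ok := c.cache[host]
	return v, ok
}

func (c *robotsCache) set(host, data string) {
	c.mu.Lock() // exclusive lock for writes
	defer c.mu.Unlock()
	c.cache[host] = data
}

func main() {
	c := newRobotsCache()
	var wg sync.WaitGroup
	for i := 0; i < 50; i++ {
		wg.Add(1)
		go func(i int) { // concurrent writers are safe under the mutex
			defer wg.Done()
			c.set(fmt.Sprintf("host%d.example", i), "User-agent: *")
		}(i)
	}
	wg.Wait()
	_, ok := c.get("host0.example")
	fmt.Println(ok) // true
}
```

Without the mutex, concurrent map reads and writes make the Go runtime abort with "fatal error: concurrent map read and map write", and unlike a panic that error cannot be recovered, so the lock is not optional here.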