Closed crosscue closed 5 years ago
@crosscue could you share robots file that produces this error?
I get the same (or a similar?) error in my tests in my private project. The robots.txt in question:
User-agent: *
Allow: /allowed
Disallow: /disallowed
The modification above did not work for me. Maybe r.groups is not created correctly?
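As an aside, Allow/Disallow rules like the ones above are conventionally evaluated with longest-match-wins semantics (per Google's robots.txt spec). A minimal stdlib-only illustration of that convention, not the library's actual code (rule and allowed are hypothetical names):

```go
package main

import (
	"fmt"
	"strings"
)

// rule is one Allow/Disallow line from a robots.txt group.
type rule struct {
	allow  bool
	prefix string
}

// allowed applies the longest-match-wins convention: the matching rule
// with the longest prefix decides; no match at all means allowed.
func allowed(rules []rule, path string) bool {
	best := rule{allow: true, prefix: ""}
	for _, r := range rules {
		if strings.HasPrefix(path, r.prefix) && len(r.prefix) > len(best.prefix) {
			best = r
		}
	}
	return best.allow
}

func main() {
	rules := []rule{
		{true, "/allowed"},
		{false, "/disallowed"},
	}
	fmt.Println(allowed(rules, "/allowed/page"))    // true
	fmt.Println(allowed(rules, "/disallowed/page")) // false
	fmt.Println(allowed(rules, "/other"))           // true: no rule matches
}
```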
Actual error:
--- FAIL: TestRobotsTxt (0.00s)
panic: runtime error: invalid memory address or nil pointer dereference [recovered]
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x664d79]
goroutine 36 [running]:
testing.tRunner.func1(0xc4201ae000)
/usr/lib/go/src/testing/testing.go:742 +0x29d
panic(0x6bbb80, 0x8ae4b0)
/usr/lib/go/src/runtime/panic.go:505 +0x229
censored/vendor/github.com/temoto/robotstxt.(*RobotsData).FindGroup(0x0, 0x72039d, 0x19, 0xf)
/censored/vendor/github.com/temoto/robotstxt/robotstxt.go:167 +0x69
@narzeja can you share code that is checking that robotstxt content?
Hold on, I messed around my own code and have fixed it. My code in question:
robot, ok := e.robotsCache[u.Host]
if !ok {
	resp, _ := downloader.Client.Get(u.Scheme + "://" + u.Host + "/robots.txt")
	robot, err := robotstxt.FromResponse(resp)
	if err != nil {
		return err
	}
	e.robotsCache[u.Host] = robot
}
group := robot.FindGroup(e.UserAgent)
The problem here is that robot, err := robotstxt.FromResponse(resp) declares robot and err as new variables in that local scope, while the robot variable from the cache lookup is still a nil pointer, thus robot.FindGroup panics. The solution is to change the line to robot, err = robotstxt.FromResponse(resp).
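For anyone else hitting this, the pitfall generalizes to any Go cache lookup where := inside a block shadows an outer variable. A minimal stdlib-only sketch of the buggy and fixed patterns (fetch, lookupBuggy, and lookupFixed are hypothetical names, with strings standing in for *robotstxt.RobotsData):

```go
package main

import "fmt"

// fetch stands in for robotstxt.FromResponse.
func fetch(host string) (*string, error) {
	s := "robots data for " + host
	return &s, nil
}

// lookupBuggy reproduces the bug: ":=" inside the if block declares a
// NEW robot that shadows the outer one, so the outer robot stays nil
// on a cache miss and a later robot.FindGroup call would panic.
func lookupBuggy(cache map[string]*string, host string) *string {
	robot, ok := cache[host]
	if !ok {
		robot, err := fetch(host) // shadows the outer robot
		if err != nil {
			return nil
		}
		cache[host] = robot
	}
	return robot // still nil on a cache miss
}

// lookupFixed pre-declares err and assigns with "=", so the outer
// robot is actually populated.
func lookupFixed(cache map[string]*string, host string) *string {
	var err error
	robot, ok := cache[host]
	if !ok {
		robot, err = fetch(host)
		if err != nil {
			return nil
		}
		cache[host] = robot
	}
	return robot
}

func main() {
	fmt.Println(lookupBuggy(map[string]*string{}, "example.com") == nil) // true
	fmt.Println(lookupFixed(map[string]*string{}, "example.com") == nil) // false
}
```

go vet -vettool with the shadow analyzer catches exactly this class of bug.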
So long story short, my code was rubbish, yours is fine :). I am fairly certain you can close the issue. Thank you for your time.
@narzeja did you also use gocolly?
The robots code within gocolly that causes the same issue:
func (c *Collector) checkRobots(u *url.URL) error {
	// var robot *robotstxt.RobotsData
	// var ok bool
	var err error
	c.lock.RLock()
	robot, ok := c.robotsMap[u.Host]
	c.lock.RUnlock()
	if !ok {
		// no robots file cached
		resp, _ := c.backend.Client.Get(u.Scheme + "://" + u.Host + "/robots.txt")
		robot, err = robotstxt.FromResponse(resp)
		if err != nil {
			return err
		}
		c.lock.Lock()
		c.robotsMap[u.Host] = robot
		c.lock.Unlock()
	}
	uaGroup := robot.FindGroup(c.UserAgent)
	if uaGroup == nil {
		return nil
	}
	if !uaGroup.Test(u.EscapedPath()) {
		return ErrRobotsTxtBlocked
	}
	return nil
}
So as narzeja mentions, it's the line robot, err = robotstxt.FromResponse(resp) causing the issue, rather than using robot, err := robotstxt.FromResponse(resp).
Maybe I'm missing something but the gocolly code seems fine:
- robot, ok := map lookup
- if ok is false: robot, err = robotstxt.FromResponse
- only stores robot in cache if err is nil
- FindGroup: at this point we have a valid robot either from the cache or from the response

This is different from narzeja's code.
@crosscue can you share robots.txt contents that cause trouble with gocolly?
@narzeja beware of map access races; gocolly surrounding the cache with a lock is correct.
Thank you @temoto, I added RWMutex to my code.
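For reference, the pattern gocolly uses (and that narzeja added above) wraps the map in a sync.RWMutex so concurrent goroutines don't race on it. A minimal sketch under assumed names (robotsCache is hypothetical; string values stand in for *robotstxt.RobotsData):

```go
package main

import (
	"fmt"
	"sync"
)

// robotsCache guards a host -> robots-data map with an RWMutex.
type robotsCache struct {
	mu sync.RWMutex
	m  map[string]string
}

func newRobotsCache() *robotsCache {
	return &robotsCache{m: make(map[string]string)}
}

// get takes the cheaper read lock; concurrent readers don't block each other.
func (c *robotsCache) get(host string) (string, bool) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	v, ok := c.m[host]
	return v, ok
}

// set takes the exclusive write lock for the map mutation.
func (c *robotsCache) set(host, data string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.m[host] = data
}

func main() {
	c := newRobotsCache()
	var wg sync.WaitGroup
	for i := 0; i < 10; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			c.set("example.com", "User-agent: *")
			c.get("example.com")
		}()
	}
	wg.Wait()
	v, ok := c.get("example.com")
	fmt.Println(ok, v)
}
```

Running this under go run -race shows no data races, whereas the same loop against a bare map would.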
During crawling with https://github.com/gocolly/colly this would periodically be thrown:
I locally modified lines 166-180 of robotstxt.go as follows, and it appears to work okay now: