temoto / robotstxt

The robots.txt exclusion protocol implementation for Go language
MIT License

No error is returned by FromResponse() when the file is incorrect #10

Closed: alexander-bauer closed this issue 11 years ago

alexander-bauer commented 11 years ago

I've been working on a project which makes use of this library, and I thank you greatly for its usefulness. However, I've been encountering a problem. example.com/robots.txt is, as it turns out, a redirect. Rather than returning an error, FromResponse() seems to produce an object whose FindGroup() returns nil, so the subsequent Test() call causes a runtime panic.

To illustrate:

package main

import (
    "net/http"
    "github.com/temoto/robotstxt.go"
)

func main() {
    resp, err := http.Get("http://example.com/robots.txt")
    if err != nil {
        //This is not reached.
        println(err.Error())
    }
    robots, err := robotstxt.FromResponse(resp)
    if err != nil {
        //This is also not reached. This is the problem!
        println(err.Error())
    }

    //group will be nil after this line.
    group := robots.FindGroup("MyBot")

    //This will cause a panic.
    print(group.Test("/"))
}
temoto commented 11 years ago

Hello. Thanks for your interest in this package. And thank you for providing a code snippet to reproduce the problem.

Now if you add println("Response:"); resp.Write(os.Stdout) after http.Get and the err check, you will see that the response is actually 200 OK; it is not a redirect. http.Get followed the redirect silently and downloaded the final page.
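
For reference, a minimal standalone sketch of that check, using only the standard library:

package main

import (
    "net/http"
    "os"
)

func main() {
    resp, err := http.Get("http://example.com/robots.txt")
    if err != nil {
        println(err.Error())
        return
    }

    // Dump the status line, headers and body of the response that
    // http.Get actually received after silently following any redirects.
    // Note: resp.Write consumes and closes the response body.
    println("Response:")
    resp.Write(os.Stdout)
}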

Maybe FindGroup returning nil is indeed a problem. What behaviour would you expect instead?

alexander-bauer commented 11 years ago

Having browsed the code slightly, it looks as if FromBytes() returns nil under certain circumstances when it fails to parse. I believe that, given these circumstances (no error message, but an invalid file), one of two things should happen: either FromResponse() should return (nil, err), where err is caught by the if err != nil check, or it should return (allowAll, err).

If the former approach is taken, then allowAll and disallowAll should be made public; that would simplify the error handling, for example:

//...
//Let's say this returns (nil, err)
robots, err := robotstxt.FromResponse(resp)
if err != nil {
    return robotstxt.AllowAll
}
return robots
temoto commented 11 years ago

The complexity here is that robots.txt is defined pretty loosely: many arbitrary texts can be considered a valid robots.txt file with no rules. I will look into this particular file to determine why it does not return an error.

If you are in a big hurry, FromString("") is a shortcut for AllowAll, but yeah, I guess they could be made public.
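
For illustration, that workaround applied to the snippet from the report might look like this (a sketch, assuming FromString returns a (*RobotsData, error) pair like FromResponse):

//...
robots, err := robotstxt.FromResponse(resp)
if err != nil || robots == nil {
    // Fall back to an empty robots.txt, which allows everything.
    robots, _ = robotstxt.FromString("")
}
group := robots.FindGroup("MyBot")
print(group.Test("/"))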

I'm sorry, I don't have much spare time right now; it may wait until the weekend.

temoto commented 11 years ago

@SashaCrofter please see that this problem is fixed now. I understand this is not how you wanted it to be fixed, but it is required to be consistent with the Google specification. robotstxt.go has been compatible with Google's approach since issue #6: "invalid content = no errors, everything is allowed".
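
With that change, the original reproducer should no longer panic: content that cannot be parsed as robots.txt yields a permissive ruleset rather than a nil one. Roughly:

//...
robots, err := robotstxt.FromResponse(resp)
if err != nil {
    // Any remaining error is handled as before.
    println(err.Error())
    return
}

// Unparseable content now maps to "everything is allowed", so
// FindGroup returns a usable group instead of nil.
group := robots.FindGroup("MyBot")
print(group.Test("/")) // true when everything is allowed; no panic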