temoto / robotstxt

A robots.txt exclusion protocol implementation for the Go language
MIT License

Groups are not handled correctly #6

Closed mna closed 12 years ago

mna commented 12 years ago

Based on my reading of robots.txt, only one group (a User-agent: line followed by one or more allow/disallow lines) can apply to a given user-agent string. From Google's robots.txt spec:

"Only one group of group-member records is valid for a particular crawler. The crawler must determine the correct group of records by finding the group with the most specific user-agent that still matches. All other groups of records are ignored by the crawler. The user-agent is non-case-sensitive."

This is coherent with examples on robotstxt.org:

To allow a single robot

User-agent: Google
Disallow:

User-agent: *
Disallow: /

This case doesn't work as expected with the library (TestAgent("Google", "/") returns false). From a quick glance at the code, this is because it keeps looking for matching rules, regardless of the specificity of the match (i.e. "*" versus a lengthy part of the user-agent string), until a disallow is found.
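For illustration, the spec's "most specific matching group" rule can be sketched like this. This is a simplified model, not the library's actual code; `group` and `selectGroup` are hypothetical names:

```go
package main

import (
	"fmt"
	"strings"
)

// group models one robots.txt group: a User-agent line plus its
// allow/disallow rules. Hypothetical type, not the library's.
type group struct {
	agent    string   // e.g. "Google" or "*"
	disallow []string // paths this group disallows
}

// selectGroup implements the spec's rule: exactly one group applies,
// the one with the most specific matching user-agent; "*" is the
// least specific and only wins when nothing else matches.
func selectGroup(groups []group, agent string) *group {
	var best *group
	bestLen := -1
	lowered := strings.ToLower(agent)
	for i := range groups {
		g := &groups[i]
		if g.agent == "*" {
			if bestLen < 0 {
				best, bestLen = g, 0
			}
		} else if strings.Contains(lowered, strings.ToLower(g.agent)) && len(g.agent) > bestLen {
			best, bestLen = g, len(g.agent)
		}
	}
	return best
}

func main() {
	groups := []group{
		{agent: "Google"},                     // allow everything for Google
		{agent: "*", disallow: []string{"/"}}, // disallow everything for the rest
	}
	fmt.Println(selectGroup(groups, "Google").agent)  // Google
	fmt.Println(selectGroup(groups, "SomeBot").agent) // *
}
```

Under this rule, once the "Google" group is selected, the "*" group is ignored entirely, so TestAgent("Google", "/") should return true.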

I will try to send a pull request your way in the coming days.

Resources:

http://www.robotstxt.org/robotstxt.html
http://support.google.com/webmasters/bin/answer.py?hl=en&answer=156449
https://developers.google.com/webmasters/control-crawl-index/docs/robots_txt

mna commented 12 years ago

Also, diving a little deeper into the library's code, I'm not sure why the Rule type is public while RobotsData's rules slice is private; there is no way to get at the rules, so is this only for when you want to use the parser directly? This is obviously your call to make, but I would make Parser, ByteScanner and Rule private, and offer only the FromXxx and TestXxx methods to the outside world (MatchAgent, MatchRule and String would become private, since Rule would be too).

And another point: the Test and TestAgent signatures return an error as a second value, but no code path returns anything other than nil for that error. I'm preparing a pull request that removes this error return value and (hopefully; not done yet!) takes care of my prior comment.

EDIT: The Test() method does have a path that returns an error, when DefaultAgent is empty, but since the same check is not done in TestAgent (an empty agent does not raise an error), I would remove it. It's the caller's responsibility, and an empty agent still matches the * rules anyway, so this is acceptable behaviour.

Thanks!

temoto commented 12 years ago

Man, you put a lot of effort into it, thank you very much.

About the exported fields and parser: I come from a Python background and value easy access to the internals of the libraries I use. I had that in mind when those names were made public (the rules slice should have been public too). Then I discovered this is not idiomatic in Go, and that it is still easy to play with internals by modifying the library sources, so your change is probably better.

On the error second value: now that you've made the scanner/parser return a slice of errors, print_errors is perhaps unnecessary; we could just return them all, maybe squashed into a single error. I agree on the caller's responsibility; the check was meant as a kind of protection against wrong caller code, and I now agree it was a bad idea.

What do you think of making RobotsData.sitemaps public?

I'm going to play with it (maybe a few days; I have a job) to get a feeling for the API changes.

mna commented 12 years ago

Thanks for the quick feedback. I thought about it yesterday: I completely forgot to make Sitemaps public and to add a function that returns the crawl delay for a given user agent. I'll add these while you take the time to go over the changes.

I agree with you for the print_errors, could return an error with a slice of messages in it or something.

temoto commented 12 years ago

I gave it a shot, and I'm sorry it was kind of silly to delay this; it should have been obvious from the sources that the only thing that really changed is the absence of the second error return in TestAgent. Looks great, thank you again.

But since it changes the public API, before merging maybe we could take other possible API changes (below) into account?

mna commented 12 years ago

I agree with just about everything you said:

  • RobotsData.Test seems redundant, a highly representative set of 2 projects :) does not use it

Make that 3 projects :) I don't use it in my gocrawl project either.

  • Personally, i like second option most, although it's most verbose, but is clear and efficient.

Same here. I think func FindGroup(agent string) Group would be a good way to save repeated searches when many paths must be tested for the same agent (and/or the crawl delay is needed). That said, I'd still leave a TestAgent(path, agent string) bool for cases that want a quick and dirty check, with no optimization required and no interest in the crawl delay (it would obviously call FindGroup() internally).

(this is really minor) i think it would be idiomatic and friendly to export CrawlDelay as time.Duration instead of float.

Oh man you're so right, I hate the fact that I didn't think of it. Guess I need more Go-immersion :) That's the kind of minor detail I love to get right.

As a final thought, how about a FromResponse(*http.Response)? It could replace the current FromResponse(), which could be renamed to FromResponseContent() or something similar. I kind of dislike having to read the body before the call; from a library user's standpoint it would be clean and simple.

Do you want me to go ahead and make those changes? I think the pull request is "live", in that if I push to this branch it will be reflected in the PR.

temoto commented 12 years ago

Yes, if you don't mind, please make the changes. Don't worry about the updates in the PR.

P.S.: I already counted gocrawl as the second :)

temoto commented 12 years ago

Closed the pull request because I spoiled it; please ignore this (hopefully) little inconvenience. Your patches will make it into the repo with authorship untouched.

mna commented 12 years ago

I added the changes discussed into the issue6 branch. Do you want me to re-submit the PR?

EDIT: Ok, just saw your comment in the closed PR; I understand it would not work anyway now that master has changed. Everything should be in the branch, let me know how it goes.

temoto commented 12 years ago

Merged. Thank you for all your hard work.

mna commented 12 years ago

Thanks. Final note as a reminder: you'll want to update the readme eventually. I didn't do much on that front.

temoto commented 12 years ago

Yes, I'm finishing it, and meanwhile I wanted to change a few other things while I'm at it. What do you think of this new logic? (Status codes < 200 and 3xx yield an error instead of disallowAll.)

func FromStatusAndBytes(statusCode int, body []byte) (*RobotsData, error) {
    switch {
    case statusCode >= 200 && statusCode < 300:
        return FromBytes(body)

    // [comment]
    case statusCode >= 400 && statusCode < 500:
        return allowAll, nil

    // [comment]
    case statusCode >= 500:
        return disallowAll, nil
    }

    return nil, errors.New("Unexpected status: " + strconv.FormatInt(int64(statusCode), 10))
}
mna commented 12 years ago

Seems reasonable, given how unlikely those cases are. You may want to bound the 5xx case at < 600, so that funky web server responses above 599 get the error too (undefined in the HTTP spec, but still possible).
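With that bound added, the switch might look like this. It is a runnable sketch: the stub type and values at the top merely stand in for the library's RobotsData, allowAll, disallowAll and FromBytes so the example compiles on its own:

```go
package main

import (
	"errors"
	"fmt"
	"strconv"
)

// Stubs so the sketch compiles on its own; in the library these
// are the real RobotsData, allowAll, disallowAll and FromBytes.
type RobotsData struct{ allow bool }

var (
	allowAll    = &RobotsData{allow: true}
	disallowAll = &RobotsData{allow: false}
)

func FromBytes(body []byte) (*RobotsData, error) { return allowAll, nil }

// Same logic as the switch above, with the 5xx case bounded at
// < 600 so out-of-spec status codes (600 and up) also yield an error.
func FromStatusAndBytes(statusCode int, body []byte) (*RobotsData, error) {
	switch {
	case statusCode >= 200 && statusCode < 300:
		return FromBytes(body)
	case statusCode >= 400 && statusCode < 500:
		return allowAll, nil
	case statusCode >= 500 && statusCode < 600:
		return disallowAll, nil
	}
	return nil, errors.New("Unexpected status: " + strconv.FormatInt(int64(statusCode), 10))
}

func main() {
	_, err := FromStatusAndBytes(777, nil)
	fmt.Println(err) // Unexpected status: 777
}
```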


temoto commented 12 years ago

Thank you for the feedback.