Also, diving a little deeper into the library's code, I'm not sure why the Rule type is public while RobotsData's rules slice is private: there is no way to get at the rules, so is the Rule type only there for people who want to use the parser directly? This is obviously your call to make, but I would leave Parser, ByteScanner and Rule private and offer only the FromXxx and TestXxx methods to the outside world (MatchAgent, MatchRule and String would become private too, since Rule would be).
Another point: the Test and TestAgent signatures return an error as a second value, but no code path ever returns anything other than nil for it. I'm preparing a pull request that removes this error return value and (hopefully, it's not done yet!) also takes care of my prior comment.
EDIT: The Test() method does have a path that returns an error, when DefaultAgent is empty, but since the same check is not done in TestAgent (an empty agent does not raise an error), I would remove it. It's the caller's responsibility, and an empty agent still matches the * rules anyway, so this is acceptable behaviour.
Thanks!
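For illustration, a minimal runnable sketch of the surface being proposed here: only FromXxx constructors and TestXxx methods exported, everything else private. RobotsData, FromBytes and TestAgent below are stubs whose signatures are taken from later in this thread, not the library's actual code.

package main

import "fmt"

// RobotsData keeps its parsed state private; callers only see the
// constructors and Test methods.
type RobotsData struct {
    rules []string // parser output stays private
}

// FromBytes is a stub standing in for the real constructor.
func FromBytes(body []byte) (*RobotsData, error) {
    _ = body // parsing elided in this sketch
    return &RobotsData{}, nil
}

// TestAgent is a stub with the (path, agent) signature discussed below.
func (r *RobotsData) TestAgent(path, agent string) bool {
    _, _ = path, agent // rule matching elided in this sketch
    return true
}

func main() {
    robots, err := FromBytes([]byte("User-agent: *\nDisallow:\n"))
    if err != nil {
        panic(err)
    }
    fmt.Println(robots.TestAgent("/", "somebot"))
}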
Man, you put a lot of effort into it, thank you very much.
About exported fields and the parser: I come from a Python background and I value easy access to the internals of the libraries I use; I had that in mind when those names were made public, and the rules slice should have been public too. Then I discovered this is not idiomatic in Go, and it is still easy to play with internals by modifying the library sources, so your change is probably better.
On the error second value: now that you made the scanner/parser return a slice of errors, print_errors is perhaps unnecessary; it could just return them all, perhaps squashed into one error. I agree on the caller's responsibility; the check was actually meant as a kind of protection against wrong caller code, and I now agree it was a bad idea.
What do you think about making RobotsData.sitemaps public?
I'm going to play with it (maybe a few days, I have a job) to get a feeling for the API changes.
Thanks for the quick feedback. I thought about it yesterday: I completely forgot to make Sitemaps public and to add a function to get the crawl delay based on the user agent. I'll add this while you take the time to go over the changes.
I agree with you on print_errors; it could return an error with a slice of messages in it, or something like that.
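A minimal sketch of that squashing idea; the parseErrors type and the messages below are purely illustrative and not the library's API.

package main

import (
    "errors"
    "fmt"
    "strings"
)

// parseErrors squashes a slice of scanner/parser errors into one error value,
// so callers get a single error instead of a printed list.
type parseErrors struct {
    errs []error
}

func (e parseErrors) Error() string {
    msgs := make([]string, len(e.errs))
    for i, err := range e.errs {
        msgs[i] = err.Error()
    }
    return "robots.txt: " + strings.Join(msgs, "; ")
}

func main() {
    errs := []error{
        errors.New("line 3: unknown directive"),
        errors.New("line 7: unexpected token"),
    }
    var err error = parseErrors{errs}
    fmt.Println(err)
}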
I gave it a shot, and I'm sorry, it was kind of stupid to delay this; it should have been obvious from the sources that the only thing that really changed is the absence of the second error return in TestAgent. Looks great, thank you again.
But since it changes the public API, before merging I think maybe we could take into account other possible API changes (following)?
RobotsData.Test seems redundant; a highly representative set of 2 projects :) does not use it. Getting the crawl delay for an agent would also mean a second pass through findGroup(). This may sound like premature optimization, but I'm kind of aiming for efficiency, so unless this is silly, here are possible solutions to avoid a second full scan of the parsed groups:
- TestAgent() returns Allow bool and CrawlDelay, or the whole group (for possible new fields).
- FindGroup() returns a public Group with a public CrawlDelay and a Test(url) method.
- SetGroup() finds the group and sets an internal pointer to it for later reuse by TestAgent().

I agree with just about everything you said:
> RobotsData.Test seems redundant; a highly representative set of 2 projects :) does not use it
Make that 3 projects :) I don't use it in my gocrawl project either.
> Personally, I like the second option most; although it's the most verbose, it is clear and efficient.
Same here. I think func FindGroup(agent string) Group would be a good way to save multiple searches when many paths must be tested for the same agent (and/or the crawl delay is needed). Though I guess I'd leave a TestAgent(path, agent string) bool for those cases that want a quick and dirty check, with no optimization required and no interest in the crawl delay (it would obviously call FindGroup() internally).
> (This is really minor.) I think it would be idiomatic and friendly to export CrawlDelay as a time.Duration instead of a float.
Oh man you're so right, I hate the fact that I didn't think of it. Guess I need more Go-immersion :) That's the kind of minor detail I love to get right.
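A rough, runnable sketch of that second option with CrawlDelay as a time.Duration; the types, fields and matching logic below are stand-ins for illustration, not the library's implementation.

package main

import (
    "fmt"
    "strings"
    "time"
)

// Group stands in for the public group type discussed above: it carries the
// crawl delay and can test paths without another agent lookup.
type Group struct {
    CrawlDelay time.Duration
    disallow   []string // stand-in for this group's parsed rules
}

// Test reports whether path is allowed for this group (simplified prefix
// matching, just to make the sketch runnable).
func (g *Group) Test(path string) bool {
    for _, prefix := range g.disallow {
        if prefix != "" && strings.HasPrefix(path, prefix) {
            return false
        }
    }
    return true
}

type RobotsData struct {
    groups map[string]*Group
}

// FindGroup picks the group for an agent, falling back to "*", so a caller
// can test many paths and read CrawlDelay with a single lookup.
func (r *RobotsData) FindGroup(agent string) *Group {
    if g, ok := r.groups[agent]; ok {
        return g
    }
    return r.groups["*"]
}

func main() {
    r := &RobotsData{groups: map[string]*Group{
        "*": {CrawlDelay: 2 * time.Second, disallow: []string{"/private/"}},
    }}
    g := r.FindGroup("gocrawl")
    fmt.Println(g.Test("/private/page"), g.Test("/public/page"), g.CrawlDelay)
}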
As a final thought, how about a FromResponse(*http.Response)? It could replace the current FromResponse(), which could be renamed FromResponseContent() or something; I kind of dislike having to read the body before the call. It would be clean and simple from a library user's standpoint.
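A sketch of that idea under the assumed rename: FromResponse accepts an *http.Response, reads the body itself, and delegates. RobotsData and FromResponseContent are stubs here, and the FromResponseContent signature is an assumption, not the merged API.

package main

import (
    "fmt"
    "io/ioutil"
    "net/http"
    "strings"
)

// RobotsData and FromResponseContent stand in for the library's types.
type RobotsData struct{}

func FromResponseContent(statusCode int, body []byte) (*RobotsData, error) {
    _, _ = statusCode, body // parsing elided in this stub
    return &RobotsData{}, nil
}

// FromResponse reads the body from the response and delegates, so callers no
// longer have to read the body themselves before the call.
func FromResponse(res *http.Response) (*RobotsData, error) {
    defer res.Body.Close()
    body, err := ioutil.ReadAll(res.Body)
    if err != nil {
        return nil, err
    }
    return FromResponseContent(res.StatusCode, body)
}

func main() {
    res := &http.Response{
        StatusCode: 200,
        Body:       ioutil.NopCloser(strings.NewReader("User-agent: *\nDisallow:\n")),
    }
    robots, err := FromResponse(res)
    fmt.Println(robots, err)
}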
Do you want me to go ahead and make those changes? I think the pull request is "live", in that if I push to this branch it will be reflected in the PR.
Yes, if you don't mind, please do make the changes. Don't worry about updates to the PR.
P.S.: I already counted gocrawl as the second :)
I closed the pull request because I spoiled it; please ignore this (hopefully) little inconvenience. Your patches will make it into the repo with authorship untouched.
I added the changes discussed into the issue6 branch. Do you want me to re-submit the PR?
EDIT: OK, just saw your comment in the closed PR; I understand this would not work anyway now that master has changed. In any case, everything should be there, so let me know how it goes.
Merged. Thank you for all your hard work.
Thanks. Final note as a reminder: you'll want to update the readme eventually. I didn't do much on that front.
Yes, I'm finishing it, and meanwhile I wanted to change a few other things while I'm at it. What do you think of this new logic? (Status codes below 200 and 3xx yield an error instead of disallowAll.)
func FromStatusAndBytes(statusCode int, body []byte) (*RobotsData, error) {
    switch {
    case statusCode >= 200 && statusCode < 300:
        return FromBytes(body)
    // 4xx: robots.txt is treated as unavailable, so everything is allowed.
    case statusCode >= 400 && statusCode < 500:
        return allowAll, nil
    // 5xx: server error, so play it safe and disallow everything.
    case statusCode >= 500:
        return disallowAll, nil
    }
    return nil, errors.New("Unexpected status: " + strconv.FormatInt(int64(statusCode), 10))
}
Seems reasonable given how unlikely this is. In this case you may want to limit the 5xx to < 600, so that funky web server responses above 599 get the error too (undefined in HTTP spec, but still possible).
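A small runnable sketch of that adjustment, with the 5xx branch capped at 599 so anything outside the handled ranges falls through to the error; classify and its string results are illustrative stand-ins, not the library code.

package main

import (
    "errors"
    "fmt"
    "strconv"
)

// classify mirrors the status handling above, but caps the 5xx case at 599.
func classify(statusCode int) (string, error) {
    switch {
    case statusCode >= 200 && statusCode < 300:
        return "parse body", nil
    case statusCode >= 400 && statusCode < 500:
        return "allow all", nil
    case statusCode >= 500 && statusCode < 600: // capped so 600+ is an error
        return "disallow all", nil
    }
    return "", errors.New("unexpected status: " + strconv.Itoa(statusCode))
}

func main() {
    for _, code := range []int{200, 404, 503, 301, 600} {
        action, err := classify(code)
        fmt.Println(code, action, err)
    }
}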
Thank you for the feedback.
Based on my readings on robots.txt, only one group (a User-agent: line followed by one or more allow/disallow lines) can be applied to a given user-agent string. Google's robots.txt spec states this, and it is coherent with the examples on robotstxt.org, in particular the "To allow a single robot" example.
That case doesn't work as expected with the library: TestAgent("/", "Google") returns false. From a quick glance at the code, this is because it keeps looking for matching rules (regardless of the specificity of the match, i.e. "*" versus a longer part of the user-agent string) until a disallow is found. I will try to send a pull request your way in the coming days.
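A sketch of the robotstxt.org "allow a single robot" case described above, using FromBytes and TestAgent(path, agent) as discussed in this thread; the import path is taken from the issue URL and the package name "robotstxt" is assumed.

package main

import (
    "fmt"

    // Import path from the issue URL above; package name is assumed.
    robotstxt "github.com/temoto/robotstxt.go"
)

func main() {
    // The robotstxt.org "allow a single robot" example: Google gets its own
    // empty-disallow group, everyone else is disallowed entirely.
    body := []byte("User-agent: Google\nDisallow:\n\nUser-agent: *\nDisallow: /\n")

    robots, err := robotstxt.FromBytes(body)
    if err != nil {
        panic(err)
    }

    // Expected: true, because only Google's own group should apply to it.
    // Per the report above, the library currently returns false here because
    // the "*" group's Disallow: / is also matched.
    fmt.Println(robots.TestAgent("/", "Google"))
}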
Resources:
- http://www.robotstxt.org/robotstxt.html
- http://support.google.com/webmasters/bin/answer.py?hl=en&answer=156449
- https://developers.google.com/webmasters/control-crawl-index/docs/robots_txt