See also
http://googlewebmastercentral.blogspot.com/2008/06/improving-on-robots-exclusion-protocol.html
Original comment by kkrugler...@transpac.com
on 17 Mar 2013 at 6:26
Fuad also reported finding robots.txt files that contain just a query portion:
Disallow: /?field=
It's unclear whether that would disallow ANY URL that has a query parameter of "field" at any position. Something to try with Google's tool?
Original comment by kkrugler...@transpac.com
on 17 Mar 2013 at 6:32
See also
https://developers.google.com/webmasters/control-crawl-index/docs/robots_txt
Original comment by kkrugler...@transpac.com
on 17 Mar 2013 at 6:42
Hi kkrugler_lists,
I have tested the situations you mentioned in Google Webmaster Tools, and I'm attaching the results as a screenshot.
As you can see, "Disallow: /*.htm$" does NOT match "http://domain.com/page.htm?session=23", but it does match "http://domain.com/page.htm".
Also, neither "Disallow: /?field=" nor "Disallow: /*?field=" disallows ANY URL that has a query parameter of "field" at any position. "Disallow: /*?field=" only matches if "field" is the first query parameter (e.g. "http://crawltest.blogspot.com.tr/page.htm?field=23").
Original comment by alparsla...@gmail.com
on 6 Mar 2014 at 12:35
Attachments:
I want to contribute to the project by implementing this feature. I will make the changes in the SimpleRobotRules class and upload a patch as soon as possible.
Do you have any comments?
Original comment by alparsla...@gmail.com
on 6 Mar 2014 at 12:39
Hi alparslanavci,
Thanks for working on this issue - it's important, and something I've put off
for way too long :)
As far as comments go, just put in lots of test cases...the main challenge with
this type of change is avoiding introducing some new problem in the URL
matching logic. Webmasters get _very_ upset if a crawler doesn't obey one of
their rules, even if their robots.txt isn't completely valid.
Original comment by kkrugler...@transpac.com
on 8 Mar 2014 at 1:13
One other comment - it would be good to include issue #17 in the scope of this
change, since that also covers regular expressions in URL patterns.
Original comment by kkrugler...@transpac.com
on 8 Mar 2014 at 1:14
Hi Ken,
I am uploading a patch for both issues, #17 and #18. However, as you can see in the "Robots.txt Directives" section of http://googlewebmastercentral.blogspot.com.tr/2008/06/improving-on-robots-exclusion-protocol.html, only the ($) and (*) wildcards are supported. I have prepared the patch accordingly and added some tests based on the "Example path matches" in https://developers.google.com/webmasters/control-crawl-index/docs/robots_txt.
Would you please check the patch and comment?
Original comment by alparsla...@gmail.com
on 10 Mar 2014 at 12:18
Attachments:
I have also run some performance tests and found that the approach used in the patch is about three times faster than regular-expression matching.
Original comment by alparsla...@gmail.com
on 10 Mar 2014 at 12:21
I should be able to check this out later today, or tomorrow at the latest.
Original comment by kkrugler...@transpac.com
on 10 Mar 2014 at 2:27
I took a look at the patch.
The hand-rolled 'regex-ish' functionality scares me a bit :)
What was the performance issue with using a regular regex? I note your comment
above, where you say that your approach is 3x faster, but it also seems like
your approach would generate a lot more object allocations (splitting &
substringing lines of text).
Original comment by krugl...@gmail.com
on 11 Mar 2014 at 10:36
Actually, robots.txt pattern matching only supports the ($) and (*) wildcards, not the full regex syntax, so we don't need regular regex operations. Moreover, the rule syntax is not the same as regex syntax, so the rules would first have to be parsed and translated (which itself requires object allocations) before regular regex operations could be used. And lastly, in Java 7 the String.split() method was optimized to take a lightweight non-regex path when splitting on a single character; see the bug report at http://bugs.java.com/view_bug.do?bug_id=6840246. I have used that new split implementation in the patch.
IMHO, this way is faster and more stable.
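For comparison, translating a rule into a java.util.regex.Pattern would look roughly like the sketch below (an illustration only, not code from the patch) - note the per-rule parsing and allocations it requires:

import java.util.regex.Pattern;

// Illustration of the regex-translation alternative: turn a robots.txt path
// pattern (which only understands "*" and a trailing "$") into a Pattern.
public final class RobotsPatternToRegex {

    public static Pattern toRegex(String rulePattern) {
        boolean anchored = rulePattern.endsWith("$");
        String body = anchored
                ? rulePattern.substring(0, rulePattern.length() - 1) : rulePattern;

        StringBuilder regex = new StringBuilder("^");
        StringBuilder literal = new StringBuilder();
        for (int i = 0; i < body.length(); i++) {
            char c = body.charAt(i);
            if (c == '*') {
                // Flush the literal run, quoted so characters like '?' and '.' stay literal.
                if (literal.length() > 0) {
                    regex.append(Pattern.quote(literal.toString()));
                    literal.setLength(0);
                }
                regex.append(".*");
            } else {
                literal.append(c);
            }
        }
        if (literal.length() > 0) {
            regex.append(Pattern.quote(literal.toString()));
        }
        // Unanchored robots rules are prefix matches, so allow anything afterwards.
        regex.append(anchored ? "$" : ".*");

        return Pattern.compile(regex.toString());
    }
}

Every rule would need this translation plus a compiled Pattern object, which is exactly the overhead the hand-rolled matcher avoids.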
Original comment by alparsla...@gmail.com
on 12 Mar 2014 at 8:35
OK, took another look, and I agree with what you say above - it's not a regular
regex, so the hand-rolled version is a good idea.
The one change I'd like to see is avoiding the text.substring() call (and thus
allocations) by using text.indexOf(card, curOffset) and walking the offset
forward...this is in the ruleMatches() method.
I'm happy to make that change, I'm just out of time for it currently.
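Roughly, the idea is something like the sketch below (an illustration of the indexOf()-based approach, not the code that was committed):

// Sketch: instead of calling text.substring() for each wildcard segment, keep an
// offset into the URL path/query and walk it forward with indexOf().
public final class WildcardMatchSketch {

    static boolean ruleMatches(String text, String pattern) {
        boolean anchored = pattern.endsWith("$");
        if (anchored) {
            pattern = pattern.substring(0, pattern.length() - 1);
        }

        int textOffset = 0;
        int patternOffset = 0;
        boolean firstSegment = true;

        while (patternOffset < pattern.length()) {
            int star = pattern.indexOf('*', patternOffset);
            if (star == -1) {
                star = pattern.length();
            }
            String segment = pattern.substring(patternOffset, star);
            boolean lastSegment = (star == pattern.length());

            if (!segment.isEmpty()) {
                if (firstSegment) {
                    // Without a leading '*', the first segment must match at the start.
                    if (!text.startsWith(segment)) {
                        return false;
                    }
                    textOffset = segment.length();
                } else if (anchored && lastSegment) {
                    // With a trailing '$', the final segment must end the text.
                    if (!text.endsWith(segment)
                            || text.length() - segment.length() < textOffset) {
                        return false;
                    }
                    textOffset = text.length();
                } else {
                    // Otherwise the segment may match anywhere at or after the current offset.
                    int found = text.indexOf(segment, textOffset);
                    if (found == -1) {
                        return false;
                    }
                    textOffset = found + segment.length();
                }
            }
            firstSegment = false;
            patternOffset = star + 1; // step past the '*', if any
        }

        // Anchored patterns that don't end in '*' must have consumed the whole text.
        return !anchored || pattern.endsWith("*") || textOffset == text.length();
    }

    public static void main(String[] args) {
        System.out.println(ruleMatches("/page.htm", "/*.htm$"));             // true
        System.out.println(ruleMatches("/page.htm?session=23", "/*.htm$"));  // false
        System.out.println(ruleMatches("/page.htm?field=23", "/*?field="));  // true
    }
}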
Original comment by kkrugler...@transpac.com
on 13 Mar 2014 at 3:06
I've gone ahead and made the change I suggested above, and also added sorting
to the rules (this then gets the expected result when checking URLs against the
Amazon.com robots.txt file). Changes made in r113 and r114.
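For what it's worth, the intent of the sorting is "most specific rule wins", i.e. something along these lines (a sketch only - the RobotRule type here is a stand-in, not the actual class touched in r113/r114):

import java.util.Collections;
import java.util.Comparator;
import java.util.List;

// Sketch of "most specific rule wins" ordering: longer patterns are checked first,
// and on a tie Allow beats Disallow (Google's documented tie-break).
final class RobotRule {
    final String pattern;
    final boolean allow;

    RobotRule(String pattern, boolean allow) {
        this.pattern = pattern;
        this.allow = allow;
    }
}

final class RuleSorter {
    static void sortRules(List<RobotRule> rules) {
        Collections.sort(rules, new Comparator<RobotRule>() {
            @Override
            public int compare(RobotRule a, RobotRule b) {
                int byLength = b.pattern.length() - a.pattern.length();
                if (byLength != 0) {
                    return byLength;
                }
                return (a.allow ? 0 : 1) - (b.allow ? 0 : 1);
            }
        });
    }
}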
Original comment by kkrugler...@transpac.com
on 13 Mar 2014 at 11:51
Thanks for the comments and changes, Ken!
Original comment by alparsla...@gmail.com
on 14 Mar 2014 at 7:24
Original issue reported on code.google.com by
kkrugler...@transpac.com
on 17 Mar 2013 at 6:24