Closed (lmas closed this issue 5 years ago)
I can add an exception: use the regex only when $
is at a word boundary.
I could also add manual control over whether the regex is used, but then you'd have
to add a switch based on the URL coming from a certain domain?
On Mon, Oct 15, 2018, 16:27 A. Svensson notifications@github.com wrote:
Hi, I hit a problem when trying to adhere to https://developer.mozilla.org/robots.txt.
It contains lines like Disallow: /*$history, but this pkg still allows links like /en-US/docs/Web/HTML/Element/blink$history when it shouldn't.
I see you're following the Google recommendations and correctly parse and use the $ as a regexp end symbol, so obviously I blame MDN for incorrect usage of it in the robots file.
Not sure how to handle this and if it's of any relevance to you at all?
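To see why a rule like Disallow: /*$history never matches anything, here is a minimal sketch (my own illustration, not this package's actual implementation) of Google-style robots.txt pattern compilation, where * matches any run of characters and $ anchors the end of the URL:

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// patternToRegexp is a hypothetical sketch of Google-style robots.txt
// pattern compilation: '*' becomes '.*' and '$' becomes an end-of-URL
// anchor; everything else is matched literally.
func patternToRegexp(pattern string) *regexp.Regexp {
	var b strings.Builder
	b.WriteString("^")
	for _, r := range pattern {
		switch r {
		case '*':
			b.WriteString(".*")
		case '$':
			b.WriteString("$")
		default:
			b.WriteString(regexp.QuoteMeta(string(r)))
		}
	}
	return regexp.MustCompile(b.String())
}

func main() {
	re := patternToRegexp("/*$history")
	// The '$' compiles to an end-of-string anchor, so the literal
	// "history" after it can never match: this rule disallows nothing,
	// and pages ending in "$history" slip through.
	fmt.Println(re.MatchString("/en-US/docs/Web/HTML/Element/blink$history"))
}
```

Under these rules, MDN presumably intended the $ as a literal character, but a spec-following parser reads it as an anchor, which is exactly the mismatch reported here.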
Hmm I don't know. Feels like unneeded complexity for such a nice pkg. Gonna think for a couple of days and see if there's a better way to handle weird robot files.
If you're not building a general tool but scraping one concrete website, the simplest thing you can do is override the robots decision for some known patterns.
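That per-site override could look something like the sketch below. Both allowedByRobots (a stand-in for whatever robots.txt check you already run, e.g. this package's result) and the override suffixes are hypothetical names chosen for illustration:

```go
package main

import (
	"fmt"
	"strings"
)

// allowedByRobots is a stub standing in for the real robots.txt check.
// Here it models the buggy case from the issue: the malformed
// "Disallow: /*$history" rule matches nothing, so everything passes.
func allowedByRobots(path string) bool {
	return true
}

// overrides lists known path suffixes whose robots decision we force
// to "disallowed", compensating for a site's malformed robots.txt.
// The suffixes are hypothetical examples.
var overrides = []string{"$history", "$edit"}

// allowed applies the manual overrides before falling back to the
// regular robots.txt decision.
func allowed(path string) bool {
	for _, suffix := range overrides {
		if strings.HasSuffix(path, suffix) {
			return false
		}
	}
	return allowedByRobots(path)
}

func main() {
	fmt.Println(allowed("/en-US/docs/Web/HTML/Element/blink$history"))
	fmt.Println(allowed("/en-US/docs/Web/HTML/Element/blink"))
}
```

This keeps the site-specific workaround in the crawler, so the library itself stays spec-compliant.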
Sorry for the lack of activity, but yeah, I decided to handle this problem per site instead of messing around with this lib. Not your fault that some sites don't follow the recommendations!
It's not like robots.txt was ever strictly defined, so I wouldn't blame anyone. If you need something changed, please say so. Glad you solved the problem.
Nah, I think adding flags that mess with the regexp behavior is an overly complex change. It feels better to just warn about a bad robots.txt and let the user decide how to handle it.
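A warning along those lines could be done entirely outside the library. This is a rough sketch, assuming the Google-style rule that $ is only meaningful at the very end of a pattern, so a mid-pattern $ flags a rule that can never match:

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// warnOnMidPatternDollar scans robots.txt text and returns Allow/Disallow
// rules whose pattern contains '$' somewhere other than the last position.
// Under Google-style matching such a rule can never match any URL.
func warnOnMidPatternDollar(robots string) []string {
	var warnings []string
	sc := bufio.NewScanner(strings.NewReader(robots))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		lower := strings.ToLower(line)
		if !strings.HasPrefix(lower, "allow:") && !strings.HasPrefix(lower, "disallow:") {
			continue
		}
		_, pattern, _ := strings.Cut(line, ":")
		pattern = strings.TrimSpace(pattern)
		if i := strings.IndexByte(pattern, '$'); i >= 0 && i != len(pattern)-1 {
			warnings = append(warnings, line)
		}
	}
	return warnings
}

func main() {
	robots := "User-agent: *\nDisallow: /*$history\nDisallow: /private$\n"
	fmt.Println(warnOnMidPatternDollar(robots))
}
```

Running this against MDN's robots.txt would surface the suspicious /*$history rules without changing how the parser itself behaves.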