temoto / robotstxt

The robots.txt exclusion protocol implementation for the Go language
MIT License

Incorrect usage of $ symbols #21

Closed lmas closed 5 years ago

lmas commented 5 years ago

Hi, I hit a problem when trying to adhere to https://developer.mozilla.org/robots.txt.

It contains lines like Disallow: /*$history, but this pkg still allows links like /en-US/docs/Web/HTML/Element/blink$history when it shouldn't.

I see you're following the Google recommendations and correctly parse and use the $ as a regexp end-of-URL symbol, so obviously I blame MDN for incorrect usage of it in their robots file.

Not sure how to handle this and if it's of any relevance to you at all?
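
For reference, here's a minimal sketch of what I'm seeing (assuming the package's FromString / TestAgent API; the agent name is made up):

```go
package main

import (
	"fmt"

	"github.com/temoto/robotstxt"
)

func main() {
	// Simplified excerpt from https://developer.mozilla.org/robots.txt
	robots, err := robotstxt.FromString("User-agent: *\nDisallow: /*$history\n")
	if err != nil {
		panic(err)
	}

	// The $ is parsed as an end-of-URL anchor, so the rule can never match
	// and the path comes back as allowed, even though MDN meant a literal $.
	fmt.Println(robots.TestAgent("/en-US/docs/Web/HTML/Element/blink$history", "mybot"))
	// prints: true
}
```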

temoto commented 5 years ago

I can add an exception, like only treating $ as a regex anchor when it's at a word boundary. I can also add manual control over whether to use the regex behaviour. You'd then have to add a switch based on which domain the URL comes from?
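
Roughly, a sketch of the first option (a hypothetical helper, not something that exists in the package): treat $ as an anchor only when it is the last character of the pattern, literal everywhere else:

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// patternToRegexp is a hypothetical sketch: $ is treated as an end-of-URL
// anchor only when it is the last character of the pattern, and as a literal
// character everywhere else; * stays a wildcard.
func patternToRegexp(pattern string) (*regexp.Regexp, error) {
	anchored := strings.HasSuffix(pattern, "$")
	pattern = strings.TrimSuffix(pattern, "$")

	// Escape everything, then restore * as "match any characters".
	expr := "^" + strings.ReplaceAll(regexp.QuoteMeta(pattern), `\*`, `.*`)
	if anchored {
		expr += "$"
	}
	return regexp.Compile(expr)
}

func main() {
	re, _ := patternToRegexp("/*$history")
	// true: the path would be disallowed, which is what MDN intended.
	fmt.Println(re.MatchString("/en-US/docs/Web/HTML/Element/blink$history"))
}
```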


lmas commented 5 years ago

Hmm, I don't know. Feels like unneeded complexity for such a nice pkg. I'm going to think for a couple of days and see if there's a better way to handle weird robots.txt files.

temoto commented 5 years ago

If you're not making a general tool, but scraping one concrete website, the simplest thing you can do is override the robots.txt decision for some known patterns.
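
Something like this hypothetical wrapper (the $history suffix check is just an example for MDN, not part of the library):

```go
package main

import (
	"fmt"
	"strings"

	"github.com/temoto/robotstxt"
)

// allowed consults the parsed robots.txt first, then applies manual
// exceptions for patterns this particular site expresses in a way the
// parser reads differently than intended.
func allowed(robots *robotstxt.RobotsData, path, agent string) bool {
	// MDN writes "Disallow: /*$history" meaning a literal $, so treat any
	// path ending in "$history" as disallowed regardless of the parser.
	if strings.HasSuffix(path, "$history") {
		return false
	}
	return robots.TestAgent(path, agent)
}

func main() {
	robots, _ := robotstxt.FromString("User-agent: *\nDisallow: /*$history\n")
	fmt.Println(allowed(robots, "/en-US/docs/Web/HTML/Element/blink$history", "mybot")) // false
}
```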


lmas commented 5 years ago

Sorry for the lack of activity, but yeah, I decided to handle this problem per site instead of messing around with this lib. Not your fault some sites don't follow the recommendations!

temoto commented 5 years ago

It's not like robots.txt was ever strictly defined, so I wouldn't blame anyone. If you need something changed, please say so. Glad you solved the problem.

lmas commented 5 years ago

Nah, I think adding flags that mess with the regexp behavior is just an overly complex change. Feels better to just warn about a bad robots.txt instead and let users decide how to handle it themselves.
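
E.g. something like this hypothetical check, run once per fetched robots.txt, so it stays the user's call what to do with the warning:

```go
package main

import (
	"log"
	"strings"
)

// warnSuspiciousDollar flags Allow/Disallow patterns where $ appears
// somewhere other than the last position, where it was probably meant as a
// literal character rather than an end-of-URL anchor.
func warnSuspiciousDollar(body string) {
	for _, line := range strings.Split(body, "\n") {
		line = strings.TrimSpace(line)
		lower := strings.ToLower(line)
		if !strings.HasPrefix(lower, "allow:") && !strings.HasPrefix(lower, "disallow:") {
			continue
		}
		_, pattern, _ := strings.Cut(line, ":")
		pattern = strings.TrimSpace(pattern)
		if i := strings.Index(pattern, "$"); i >= 0 && i != len(pattern)-1 {
			log.Printf("robots.txt: %q has a $ in the middle of the pattern, probably meant literally", pattern)
		}
	}
}

func main() {
	warnSuspiciousDollar("User-agent: *\nDisallow: /*$history\n")
}
```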