temoto / robotstxt

The robots.txt exclusion protocol implementation for the Go language
MIT License

Make the `AllowAll` and `DisallowAll` instances public #36

Closed: thewizardplusplus closed this 4 months ago

thewizardplusplus commented 2 years ago

It may be useful in the following cases:

  • Unit tests. Suppose we are testing a function that filters a list of links against a previously parsed robots.txt file. How do we cover the case where the robots.txt file disallows all links? Currently this is possible only by actually parsing a corresponding robots.txt file.
  • Errors not related to the network. Suppose we are building a cache of parsed robots.txt files. What should we put in it when an error occurs? Yes, the library has useful functions that analyze the status of a response, but not all errors are due to network problems (for example, suppose the cache accepts links as strings and parses them to construct the robots.txt URL; that parsing may fail).
  • Support for temporarily ignoring robots.txt files. Suppose we want to let a user ignore robots.txt files via a special option, that is, allow all links. Again, this is currently possible only by actually parsing a corresponding robots.txt file.

I suppose it would be more convenient to use prepared public instances for allowing and disallowing all links.
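
For illustration, here is a minimal sketch of the current workaround: parsing a prepared string each time an allow-all or disallow-all value is needed (the agent name and paths are made up):

```go
package main

import (
	"fmt"

	"github.com/temoto/robotstxt"
)

func main() {
	// Disallow-all: today the only way to get this value from outside
	// the package is to parse a prepared robots.txt body.
	disallowAll, err := robotstxt.FromString("User-agent: *\nDisallow: /")
	if err != nil {
		panic(err)
	}

	// Allow-all: an empty robots.txt allows everything.
	allowAll, err := robotstxt.FromString("")
	if err != nil {
		panic(err)
	}

	fmt.Println(disallowAll.TestAgent("/page", "MyBot")) // false
	fmt.Println(allowAll.TestAgent("/page", "MyBot"))    // true
}
```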

temoto commented 2 years ago

You can't just make a variable public, because that allows accidental or malicious mutation from outside the package.

Current unit tests check that some agent is allowed on some path. I'm not sure that extracting (dis)allowAll is beneficial: you would repeat part of the group.Test function in your code to check against the new value. How is that better?

Other arguments seem to want marshalling, to save the result of parsing and reuse it later. But what's wrong with just reparsing the robots.txt file? It's pretty fast.
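
For context, the check pattern those tests exercise looks roughly like this (the agent name and paths are made up):

```go
package main

import (
	"fmt"

	"github.com/temoto/robotstxt"
)

func main() {
	robots, err := robotstxt.FromString("User-agent: *\nDisallow: /private")
	if err != nil {
		panic(err)
	}

	// FindGroup picks the group matching the agent; Test checks one path.
	group := robots.FindGroup("MyBot")
	fmt.Println(group.Test("/private/page")) // false
	fmt.Println(group.Test("/public/page"))  // true
}
```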


thewizardplusplus commented 2 years ago

> You can't just make a variable public, because that allows accidental or malicious mutation from outside the package.

This is common practice in Go; recall io.EOF or http.DefaultClient. The latter is an exact equivalent of our case.

But you are right. Instead, I can suggest functions that return these constants, such as NewAllowAll() and NewDisallowAll() (the names are up to you).
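
A sketch of what such accessors might look like inside the package (hypothetical API; allowAll and disallowAll stand in for the package's existing private instances, exact names assumed):

```go
package robotstxt

// NewAllowAll returns a RobotsData that permits every path.
// Hypothetical accessor, not part of the current API.
func NewAllowAll() *RobotsData { return allowAll }

// NewDisallowAll returns a RobotsData that rejects every path.
// Hypothetical accessor, not part of the current API.
func NewDisallowAll() *RobotsData { return disallowAll }
```

Returning the shared pointers keeps the memory saving but still exposes them to mutation; returning copies would address that concern at the cost of an allocation per call.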

> Current unit tests check that some agent is allowed on some path. I'm not sure that extracting (dis)allowAll is beneficial: you would repeat part of the group.Test function in your code to check against the new value. How is that better?

> Other arguments seem to want marshalling, to save the result of parsing and reuse it later. But what's wrong with just reparsing the robots.txt file? It's pretty fast.

You are right, I can parse prepared strings every time I need those special cases. But I find it non-idiomatic: I need a special value, the library already has that value, but it is private, so I have to parse the same string over and over again.

After all, why does the library have these constants? You could also parse prepared strings each time to get AllowAll and DisallowAll. That's fast. But you chose to use constants, because it's obviously more convenient (and faster).

temoto commented 2 years ago

> After all, why does the library have these constants?

To save memory.

The library can't save CPU time, because it would still need to prove that the incoming text is equivalent to a popular constant.

These constants have no special meaning. They're just a cache of common inputs.
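
A hypothetical illustration of that idea (not the library's actual code): well-known bodies map to shared pre-built instances, and everything else gets a full parse. The string comparison still costs time proportional to the input, which is why no CPU is saved:

```go
package robotstxt

import "strings"

// fromBytesCached is a sketch only; allowAll and disallowAll stand in
// for the package's private pre-built instances (exact names assumed).
func fromBytesCached(body []byte) (*RobotsData, error) {
	switch strings.TrimSpace(string(body)) {
	case "":
		// An empty robots.txt allows everything.
		return allowAll, nil
	case "User-agent: *\nDisallow: /":
		// The canonical "block everything" file.
		return disallowAll, nil
	}
	return FromBytes(body) // anything else gets a full parse
}
```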


temoto commented 2 years ago

> I can parse prepared strings every time I need those special cases. But I find it non-idiomatic: I need a special value

This is where we don't agree yet.

The library doesn't use the (dis)allowAll constants as special values. Maybe the RobotsData (dis)allowAll bool fields should be exported, or maybe via a method like RobotsData.SetAll(bool).

Let's imagine RobotsData supported the Marshaler interface and the result were stable between Go versions, like protobuf. Would that be OK, or would you still want SetAll?
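
A sketch of the SetAll idea (hypothetical API; the allowAll and disallowAll field names are assumptions):

```go
package robotstxt

// SetAll is a hypothetical method: it forces the parsed data to answer
// every query with allow (true) or disallow (false), without reparsing.
func (r *RobotsData) SetAll(allow bool) {
	r.allowAll = allow
	r.disallowAll = !allow
}
```

With something like this, a crawler's "ignore robots.txt" option could flip one cached RobotsData value instead of parsing a prepared string.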

codecov[bot] commented 2 years ago

Codecov Report

Merging #36 (60a6ab8) into master (a68aeca) will not change coverage. The diff coverage is 100.00%.

@@           Coverage Diff           @@
##           master      #36   +/-   ##
=======================================
  Coverage   92.59%   92.59%           
=======================================
  Files           3        3           
  Lines         405      405           
=======================================
  Hits          375      375           
  Misses         19       19           
  Partials       11       11           
Impacted Files Coverage Δ
robotstxt.go 92.85% <100.00%> (ø)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data