temoto / robotstxt

The robots.txt exclusion protocol implementation for the Go language
MIT License

TestAgent and Test (for the same user-agent) give different results in case of a temporary error when fetching the robots.txt file #40

Open masonlouchart opened 11 months ago

masonlouchart commented 11 months ago
diff --git a/robotstxt_test.go b/robotstxt_test.go
index 6ccb730..6cbda57 100644
--- a/robotstxt_test.go
+++ b/robotstxt_test.go
@@ -291,3 +291,38 @@ func newHttpResponse(code int, body string) *http.Response {
        ContentLength: int64(len(body)),
    }
 }
+
+func TestDisallowAll(t *testing.T) {
+   r, err := FromStatusAndBytes(500, nil) // We got a 500 response => Disallow all
+   require.NoError(t, err)
+
+   a := r.TestAgent("/", "*")
+   assert.False(t, a) // Resource access NOT allowed (EXPECTED)
+
+   b := r.FindGroup("*").Test("/")
+   assert.True(t, b) // Resource access allowed (UNEXPECTED)
+
+   assert.Equal(t, a, b) // Results for the Agent and Group tests are different...
+
+   /*
+       This happens because `disallowAll` is checked by `TestAgent` but not
+       by `Test`.
+
+       `TestAgent` calls `FindGroup` internally but does not expose the
+       group's `CrawlDelay`, so users of this library may prefer the
+       (`FindGroup` + `Test`) combination to read `CrawlDelay` when the
+       path is allowed:
+
+       FindGroup -> Test (ok) -> check CrawlDelay
+
+       Unfortunately, the `Test` method ignores the `disallowAll` member set
+       on responses with a status in the range [500; 599]. This behavior is
+       unexpected and can lead to involuntary politeness policy violations,
+       unless we resign ourselves to calling both `TestAgent` and
+       `FindGroup` to get the `CrawlDelay` value:
+
+       TestAgent (ok) -> FindGroup -> check CrawlDelay
+
+       That way, `FindGroup` is called twice (a caller-side sketch of this
+       trade-off follows below). Is there a way to avoid the double call
+       without risking a politeness policy violation?
+   */
+}

Run:

go test ./... -run TestDisallowAll
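
Until this is fixed, the safest pattern is the one described in the comment above: let `TestAgent` decide, then call `FindGroup` only to read `CrawlDelay`. Below is a minimal sketch of such a caller-side helper, assuming the library's exported API (`FromStatusAndBytes`, `TestAgent`, `FindGroup`, `Group.CrawlDelay`); the helper name `crawlDecision` and the agent string "mybot" are hypothetical. Note that `FindGroup` still runs twice internally, which is exactly the redundancy this issue asks about.

package main

import (
	"fmt"
	"time"

	"github.com/temoto/robotstxt"
)

// crawlDecision is a hypothetical caller-side helper. TestAgent gives the
// authoritative allow/deny answer (it honors the internal disallowAll flag
// set for 5xx responses); FindGroup is then used only to read CrawlDelay.
func crawlDecision(r *robotstxt.RobotsData, path, agent string) (bool, time.Duration) {
	if !r.TestAgent(path, agent) {
		return false, 0
	}
	// Second group lookup: TestAgent already called FindGroup internally,
	// but the exported API offers no other way to reach CrawlDelay.
	g := r.FindGroup(agent)
	return true, g.CrawlDelay
}

func main() {
	// A 5xx response means "disallow all" in this library.
	r, err := robotstxt.FromStatusAndBytes(500, nil)
	if err != nil {
		panic(err)
	}
	allowed, delay := crawlDecision(r, "/", "mybot")
	fmt.Println(allowed, delay) // false 0s
}

Until `Test` also honors `disallowAll` (for example by giving `Group` access to the global flags), this double lookup appears unavoidable with the exported API.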