Open mazhigali opened 6 years ago
Hey @mazhigali
You might want to look at https://github.com/gocolly/colly/issues/187 and https://github.com/gocolly/colly/pull/188 to get an idea of how I think it should be handled. As far as I can see, at the moment it caches every response with a status code lower than 500, which is odd.
I'm not convinced that responses in the 400s should be retried. These errors indicate that there is a problem on the client side, and continually making such requests would put your service at risk of being detected. Caching the actual response may not seem very meaningful, but at least it protects you from exposing your servers. Perhaps it could be handled a different way?
The only code in the 400s that I can think of where a retry would be a good idea is 429 (Too Many Requests).
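A minimal sketch of that retry policy in Go, in case it helps the discussion. The function name shouldRetry is hypothetical and not part of colly, and treating 5xx responses as retryable is an extra assumption on my part (they may be transient server-side failures):

```go
package main

import "net/http"

// shouldRetry reports whether a request that came back with the given
// status code is worth sending again. Hypothetical helper, not a colly API.
func shouldRetry(status int) bool {
	switch {
	case status == http.StatusTooManyRequests:
		// 429: the server asks the client to slow down, so retry after backing off.
		return true
	case status >= 500:
		// Assumption: server-side errors may be transient.
		return true
	default:
		// Other 4xx codes point at a client-side problem; repeating them does not help.
		return false
	}
}
```

A caller would still need to wait before retrying a 429, for example by honoring the Retry-After header when the server sends one.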
@vosmith Imagine that I received a 403 response from 1,000 pages and those responses are cached. The server has already determined that I'm a robot and banned my IPs. I have also already received 40,000 responses with code 200. The logical thing would be to change IP and start parsing again, so that I don't bother the server by re-requesting the 40,000 pages that were already processed, and only continue with the ones that returned 403. Of course, I could work around this by checking before each request whether the URL is already in the database of 40,000 pages, but why resort to such crutches when it would be enough not to cache selected types of server responses?
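To make the idea concrete, here is a rough sketch of the kind of check I have in mind. The names noCache and shouldCache are made up for illustration and are not part of colly; the "below 500" rule mirrors the current behavior described earlier in this thread:

```go
package main

// noCache lists status codes whose responses should never be written to
// the cache. Hypothetical configuration; the codes are only examples.
var noCache = map[int]bool{
	403: true, // banned or blocked: worth refetching from another IP
	429: true, // rate limited: worth refetching later
}

// shouldCache reports whether a response with the given status code
// should be stored in the cache.
func shouldCache(status int) bool {
	if status >= 500 {
		// Server errors are not cached (same as the behavior described above).
		return false
	}
	return !noCache[status]
}
```

With a check like this, the 40,000 successful pages stay cached, while the 403 pages are fetched again on the next run.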
@mazhigali, that scenario makes a lot of sense. I see your point. I think the work that @krzysztofantczak is doing with the CacheFilter will be a good solution for this situation.
It looks like @krzysztofantczak's progress with CacheFilter may have stalled, so I thought I'd share a quick function I whipped up to delete the cached response for a given URL from the cache directory (hard-coded as .cache here; you'll likely need to change it to match your code). Based on this, it should be fairly easy to plug into your code to delete the cache file in circumstances where you don't want a response cached.
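import (
	"crypto/sha1"
	"encoding/hex"
	"log"
	"os"
	"path"
)

// unCache deletes the cached response for URL from the .cache directory.
func unCache(URL string) {
	log.Println("Trying to remove cached response for:", URL)
	// The cache file is named after the SHA-1 hex digest of the URL and
	// lives in a subdirectory named after the digest's first two characters.
	sum := sha1.Sum([]byte(URL))
	hash := hex.EncodeToString(sum[:])
	filename := path.Join(".cache", hash[:2], hash)
	log.Println("Deleting cached file:", filename)
	if err := os.Remove(filename); err != nil {
		// A missing cache file should not abort the crawl, so only log the error.
		log.Println("Could not delete cached file:", err)
	}
}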
Hi, why does it cache error responses? Has anyone solved this problem?