gocolly / colly

Elegant Scraper and Crawler Framework for Golang
https://go-colly.org/
Apache License 2.0
23.39k stars 1.77k forks

How to ignore cache error responses? #189

Open mazhigali opened 6 years ago

mazhigali commented 6 years ago

Hi, why does it cache error responses? Has anyone solved this problem?

krzysztofantczak commented 6 years ago

Hey @mazhigali

You might want to look here https://github.com/gocolly/colly/issues/187 and here https://github.com/gocolly/colly/pull/188, just to get some idea of how I think it should be handled. As far as I can see, at the moment it caches every response with a status code lower than 500, which is odd.

vosmith commented 6 years ago

I'm not convinced that responses in the 400s should be retried. These errors indicate a problem on the client side, and continually making such requests would put your service at risk of being detected. Caching the actual response may not seem very meaningful, but at least it protects you from exposing your servers. Perhaps it could be handled a different way?

The only code in the 400s that I can think of where a retry would be a good idea is 429 (Too Many Requests).

mazhigali commented 6 years ago

@vosmith Imagine that I received a 403 response from 1,000 pages and those responses were cached. The server has already decided that I am a robot and banned my IPs. By then I had already received 40,000 responses with code 200. The logical thing would be to change IP and start parsing again: that way I would not bother the server by re-requesting the 40,000 pages that were already processed, and would continue only with the ones that returned 403. Of course, I could work around this by checking before each request whether the URL is already in a database of those 40,000 pages, but why resort to such crutches when it would be enough not to cache selected types of server responses?
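
To make the request concrete, here is a minimal sketch of the kind of policy being asked for. It is not colly's API: `shouldCache` is a hypothetical helper, and the rule it applies (keep only 2xx responses) is an assumption chosen for illustration.

```go
package main

import "fmt"

// shouldCache is a hypothetical policy helper (not part of colly). It reports
// whether a response with the given HTTP status code should be kept in the cache.
// Assumption: only successful (2xx) responses are kept, so 403s, 429s and
// server errors would be fetched again on the next run.
func shouldCache(status int) bool {
	return status >= 200 && status < 300
}

func main() {
	for _, status := range []int{200, 403, 429, 503} {
		fmt.Printf("status %d cached: %v\n", status, shouldCache(status))
	}
}
```

With a rule like this, the 40,000 successful pages would be served from the cache after an IP change, while the 1,000 pages that returned 403 would be requested again.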

vosmith commented 6 years ago

@mazhigali, that scenario makes a lot of sense. I see your point. I think the work that @krzysztofantczak is doing with the CacheFilter will be a good solution for this situation.

n8henrie commented 5 years ago

It looks like @krzysztofantczak's progress with CacheFilter may have stalled, so I thought I'd share a quick function I whipped up to delete the cached response for a given URL from the cache directory (hard-coded as .cache here; you'll likely need to change it to match your code). Based on this, it should be fairly easy to plug into your code and delete the cache file in circumstances where you don't want a response cached.

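import (
        "crypto/sha1"
        "encoding/hex"
        "log"
        "os"
        "path"
)

// unCache deletes the cached response for URL from the ".cache" directory.
func unCache(URL string) {
        log.Println("Trying to remove cached response for:", URL)
        // The file is named after the hex SHA-1 of the URL, inside a
        // subdirectory named after the first two characters of that hash.
        sum := sha1.Sum([]byte(URL))
        hash := hex.EncodeToString(sum[:])
        dir := path.Join(".cache", hash[:2])
        filename := path.Join(dir, hash)
        log.Println("Deleting cached file:", filename)
        // Note: log.Fatal exits the program if the file cannot be removed.
        if err := os.Remove(filename); err != nil {
                log.Fatal(err)
        }
}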