o0111 / ruralcafe

Automatically exported from code.google.com/p/ruralcafe

Automatically remove 404 pages #8

Closed GoogleCodeExporter closed 8 years ago

GoogleCodeExporter commented 8 years ago
The search results page from RuralCafe sometimes returns 404 pages. These 
should be filtered out of the results and removed from the Lucene index.
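
A minimal sketch of that filtering step, in Python for illustration only. It assumes each search result carries the HTTP status recorded when the page was fetched; `SearchResult` and `split_results` are hypothetical names, not part of RuralCafe or Lucene.

```python
from dataclasses import dataclass


@dataclass
class SearchResult:
    url: str
    status: int  # assumption: the HTTP status is stored alongside each indexed page


def split_results(results):
    """Return (kept, stale): results to display, and 404 results to purge from the index."""
    kept = [r for r in results if r.status != 404]
    stale = [r for r in results if r.status == 404]
    return kept, stale


results = [
    SearchResult("http://example.org/a", 200),
    SearchResult("http://example.org/missing", 404),
]
kept, stale = split_results(results)
```

Only `kept` would be shown to the user; the URLs in `stale` would then be handed to whatever routine deletes documents from the index.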

The 404 pages could possibly also be blacklisted (in the crawler only), since 
they obviously cannot be cached. The one concern with blacklisting them 
outright is that if the dynamic caching improves later, the pages could be 
re-included, so they should probably be marked differently from other 
blacklist entries if this is done.
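
One way to keep 404 entries distinguishable is to store a reason with each blacklisted URL. The sketch below is in Python for illustration only; `Reason` and `CrawlerBlacklist` are hypothetical and do not reflect RuralCafe's actual crawler.

```python
from enum import Enum


class Reason(Enum):
    NOT_FOUND = "404"    # page returned 404 when fetched
    MANUAL = "manual"    # blacklisted for any other reason


class CrawlerBlacklist:
    def __init__(self):
        self._entries = {}  # url -> Reason

    def add(self, url, reason):
        self._entries[url] = reason

    def is_blocked(self, url):
        return url in self._entries

    def release(self, reason):
        """Remove and return (sorted) all URLs blacklisted for the given reason."""
        released = sorted(u for u, r in self._entries.items() if r is reason)
        for u in released:
            del self._entries[u]
        return released
```

With this layout, `release(Reason.NOT_FOUND)` would re-admit only the 404 entries and leave every other blacklisted URL in place.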

Original issue reported on code.google.com by shouldab...@gmail.com on 3 Oct 2010 at 8:53

GoogleCodeExporter commented 8 years ago
I cannot reproduce this.

If the remote proxy encounters a 404 (or any other error), the request status 
will be Failed and nothing will be put into the cache. So if Google finds a 404 
page and the user tries to download it, it will not be cached.

Therefore no 404 pages should ever be in the cache, and the local search should 
not find any 404 pages.
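
The behaviour described above can be sketched roughly as follows (Python, for illustration only). The `Failed` status comes from this comment; the `Completed` status, the function name, and the dictionary cache are assumptions, not RuralCafe's actual code.

```python
def handle_response(cache, url, status, body):
    """Cache the body only for successful responses and return the request status."""
    if 200 <= status < 300:
        cache[url] = body
        return "Completed"  # assumed name for the success status
    # 404 or any other error: nothing is written to the cache
    return "Failed"


cache = {}
ok = handle_response(cache, "http://example.org/a", 200, "<html>a</html>")
missing = handle_response(cache, "http://example.org/missing", 404, "Not Found")
```

Because the local search only indexes what is in the cache, a URL that failed this way would never reach the index.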

I will close this issue. If I am missing something, please tell me.

Original comment by satiaher...@gmx.de on 5 Jul 2013 at 12:01