jasononeil / haxelib

Provisional project for the Haxe 3 related haxelib changes.

Prevent google crawling, or make it faster. #24

Closed · jasononeil closed this 9 years ago

jasononeil commented 9 years ago

We just had a Google bot start crawling "preview.lib.haxe.org". (Still not sure where it scraped the URL from, but oh well.)

It hit the File Browser, which currently displays a source file by opening the haxelib zip, unpacking the file, rendering it, and sending it to the client. Needless to say, with tens (hundreds?) of thousands of files, this was causing significant strain on the server.

I've turned the preview site off for now until I fix this, either by having a faster (cached?) implementation, or by using robots.txt to block Google from the file browser section.
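
For reference, the robots.txt option is a very small change; a minimal sketch, assuming the file browser is served under a path like `/files/` (the actual route isn't confirmed in this thread):

```
User-agent: *
Disallow: /files/
```

This keeps well-behaved crawlers out of the expensive pages while leaving the rest of the site indexable.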

markknol commented 9 years ago

Ah, that's odd. Google reads our mail! Since most content is static per lib version, you might just render the stuff out once (if it doesn't exist yet) to plain HTML, store that in a cache or on disk, and serve that?
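
A minimal sketch of that render-once idea in Haxe (all names here are hypothetical, not actual haxelib site code): check for a pre-rendered file on disk first, and only unpack and render from the zip on a cache miss.

```haxe
import sys.FileSystem;
import sys.io.File;

class PreviewCache {
    // Hypothetical cache location; not part of the real haxelib setup.
    static var cacheDir = "/var/cache/haxelib-preview";

    // Return rendered HTML for one file in a library zip, rendering it
    // at most once per (project, version, path).
    public static function get(project:String, version:String, path:String, zipPath:String):String {
        var cached = '$cacheDir/$project/$version/$path.html';
        if (FileSystem.exists(cached))
            return File.getContent(cached);

        var source = extractFromZip(zipPath, path);
        var html = renderToHtml(source);

        // Persist the rendered page so the next request skips the zip work.
        FileSystem.createDirectory(haxe.io.Path.directory(cached));
        File.saveContent(cached, html);
        return html;
    }

    // Pull a single file's contents out of the haxelib zip.
    static function extractFromZip(zipPath:String, fileName:String):String {
        var input = File.read(zipPath, true);
        var entries = haxe.zip.Reader.readZip(input);
        input.close();
        for (entry in entries)
            if (entry.fileName == fileName)
                return haxe.zip.Reader.unzip(entry).toString();
        throw 'File not found in zip: $fileName';
    }

    // Placeholder renderer; the real site presumably syntax-highlights.
    static function renderToHtml(source:String):String {
        return "<pre>" + StringTools.htmlEscape(source) + "</pre>";
    }
}
```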

jasononeil commented 9 years ago

Yes, I think that's a suitable solution for text files; we could cache them in the DB. Images and binaries we can perhaps block from web crawlers, as they won't be valuable in search results and might be too large to cache sensibly, especially some of the ndll files etc.
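
A sketch of how that split might look (the extension list is an assumption, not taken from the project): cache and index only recognisably textual files, and send everything else down the blocked/uncached path.

```haxe
// Hypothetical filter: only textual files get cached in the DB and
// exposed to crawlers; images and binaries (ndll etc.) are left out.
class IndexFilter {
    static var textExtensions = ["hx", "hxml", "xml", "json", "md", "txt"];

    public static function isIndexableText(path:String):Bool {
        var ext = haxe.io.Path.extension(path).toLowerCase();
        return textExtensions.indexOf(ext) != -1;
    }
}
```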


markknol commented 9 years ago

What is the state of this?