rockdaboot / mget

Multithreaded metalink/file/website downloader (like Wget) and C library
GNU Lesser General Public License v3.0

respect /robots.txt #5

Closed · rockdaboot closed this issue 11 years ago

rockdaboot commented 11 years ago

Mget should download and respect /robots.txt (the "Robots Exclusion Standard") as well as <META name="robots" ...> tags. More information here: http://en.wikipedia.org/wiki/Robots_exclusion_standard
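
As a rough illustration of what "respect" means here, a minimal sketch in C of matching a URL path against parsed `Disallow:` prefixes before queuing a download. The names (`robots_rule_t`, `robots_path_allowed`) are hypothetical and not part of the Mget API:

```c
#include <stdbool.h>
#include <string.h>
#include <stdio.h>

typedef struct {
	const char *prefix;   /* value of a "Disallow:" line from robots.txt */
} robots_rule_t;

/* A path is allowed unless it starts with one of the disallowed prefixes. */
static bool robots_path_allowed(const robots_rule_t *rules, size_t nrules, const char *path)
{
	for (size_t i = 0; i < nrules; i++) {
		/* an empty "Disallow:" line disallows nothing */
		if (*rules[i].prefix && strncmp(path, rules[i].prefix, strlen(rules[i].prefix)) == 0)
			return false;
	}
	return true;
}

int main(void)
{
	robots_rule_t rules[] = { { "/cgi-bin/" }, { "/private/" } };

	printf("%d\n", robots_path_allowed(rules, 2, "/index.html"));      /* 1 = allowed */
	printf("%d\n", robots_path_allowed(rules, 2, "/private/x.html"));  /* 0 = disallowed */
	return 0;
}
```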

Because of the multithreaded nature of Mget, this feature needs some prioritization/synchronization of the downloader threads: robots.txt has to be downloaded and parsed before any other file of a host is requested... (see the sketch below)
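
One way to get that ordering is a per-host gate: the first worker thread fetches and parses robots.txt, all other threads for that host block until it is done. This is a sketch with assumed names (`host_t`, `fetch_and_parse_robots`), not Mget internals:

```c
#include <pthread.h>
#include <stdbool.h>

typedef struct {
	pthread_mutex_t mutex;
	pthread_cond_t cond;
	bool robots_started;   /* one thread has claimed the robots.txt fetch */
	bool robots_done;      /* robots.txt fetched and parsed (or failed) */
} host_t;

/* hypothetical: performs the HTTP GET and fills the host's rule list */
extern void fetch_and_parse_robots(host_t *host);

/* Called by every worker before it downloads a URL of this host. */
void host_wait_for_robots(host_t *host)
{
	pthread_mutex_lock(&host->mutex);

	if (!host->robots_started) {
		/* this thread wins the race and fetches robots.txt itself */
		host->robots_started = true;
		pthread_mutex_unlock(&host->mutex);

		fetch_and_parse_robots(host);

		pthread_mutex_lock(&host->mutex);
		host->robots_done = true;
		pthread_cond_broadcast(&host->cond);
	} else {
		/* all other threads block until the rules are available */
		while (!host->robots_done)
			pthread_cond_wait(&host->cond, &host->mutex);
	}

	pthread_mutex_unlock(&host->mutex);
}
```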

magemore commented 10 years ago

Hmm... multithreading over different proxies to bypass site limits would be really cool for crawlers... but evil for websites. Also, lots of people may want to parse eBay or similar big websites. What if such a crawler were combined with torrents, so the same pages aren't downloaded twice... like Google, but with the full sources and index stored inside the torrent network. Cool for SEO or technology research.

rockdaboot commented 10 years ago

It wouldn't be evil for websites. It would push technology. AFAIK, websites like eBay have APIs that should be much faster than a brute-force 'download-all-and-then-parse' approach. The few technology researchers out there wouldn't cause much additional traffic, even if they downloaded the "whole internet".

If you want to see support for multiple proxies, open an issue - it shouldn't be too hard to implement such a feature...