nullenc0de closed this issue 3 years ago
Yeah, I've been thinking about this. Instead of adding a "scope" feature, how would you feel about just limiting crawling to the current domain?
I can't think of a use-case for basically crawling the entire internet...
There are a few ways I am using it right now:
1) `cat allsubdomains.txt | hakrawler`
2) `echo subdomain | hakrawler`
3) `cat all_subs_from_all_domains.txt | hakrawler`
Of course, implementing a "scope" flag would be the best-case scenario for me, personally. That way I could specify something like *.root_domain, or a txt file with all the domains/subs that I want in scope (a rough sketch of that kind of matching is below).
But limiting it to the current domain is certainly easier from a developer perspective. Whichever you choose, it would be really useful. Like you said, if I give hakrawler a depth of more than 2, I'm going to end up crawling random sites.
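For illustration, here is a minimal sketch in Go of how a wildcard scope entry like *.root_domain could be matched against a crawled host. This is a hypothetical helper, not anything hakrawler actually shipped at the time of this thread:

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// matchesScope reports whether host falls under a scope entry such as
// "*.example.com" (the root domain and any subdomain of it) or a bare
// hostname (exact match only). Hypothetical: hakrawler has no -scope flag here.
func matchesScope(host, scope string) bool {
	if strings.HasPrefix(scope, "*.") {
		root := strings.TrimPrefix(scope, "*.")
		return host == root || strings.HasSuffix(host, "."+root)
	}
	return host == scope
}

func main() {
	u, _ := url.Parse("https://api.dev.example.com/v1")
	fmt.Println(matchesScope(u.Hostname(), "*.example.com")) // true
	fmt.Println(matchesScope(u.Hostname(), "google.com"))    // false
}
```

A txt file of scope entries, as suggested above, would just be this check run over every line of the file.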
hmmmmm 🤔 leave it with me
Okay so what I've done is limited the scope of the crawling to whatever hostname is in the URL you provided. If you provide multiple URLs in a file, it will take the hostname from whichever URL is the current one that is being crawled. I think this is way more sane. If someone wanted to crawl through hosts recursively, they could just take the output of hakrawler and pipe it back into itself recursively :)
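A minimal sketch of that hostname-scoping idea, assuming the standard net/url package for parsing. This is illustrative only, not hakrawler's actual source:

```go
package main

import (
	"fmt"
	"net/url"
)

// inScope reports whether candidate shares a hostname with the seed URL,
// mirroring the behaviour described above: scope is whatever hostname the
// currently crawled URL carries.
func inScope(seed, candidate string) bool {
	s, err := url.Parse(seed)
	if err != nil {
		return false
	}
	c, err := url.Parse(candidate)
	if err != nil {
		return false
	}
	return s.Hostname() == c.Hostname()
}

func main() {
	seed := "https://example.com/login"
	for _, link := range []string{
		"https://example.com/admin",      // same host: crawl
		"https://www.facebook.com/share", // different host: skip
	} {
		fmt.Println(link, "in scope:", inScope(seed, link))
	}
}
```

And as the comment notes, recursive crawling then reduces to piping hakrawler's output back into itself, e.g. `cat urls.txt | hakrawler | hakrawler`.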
Love the new version. Would be awesome if a -scope flag was added. Crawling some sites can get out of hand quickly.
Pretty much any site I crawl results in more Facebook or Google crawling than expected.