hakluke / hakrawler

Simple, fast web crawler designed for easy, quick discovery of endpoints and assets within a web application
https://hakluke.com
GNU General Public License v3.0
4.49k stars 497 forks

Scope feature #97

Closed: nullenc0de closed this issue 3 years ago

nullenc0de commented 3 years ago

Love the new version. Would be awesome if a -scope flag was added. Crawling some sites can get out of hand quickly.

Pretty much any site I crawl results in more Facebook or Google crawling than expected.

hakluke commented 3 years ago

Yeah I've been thinking about this. Instead of adding a "scope" feature - how would you feel about just limiting to the current domain?

hakluke commented 3 years ago

I can't think of a use-case for basically crawling the entire internet...

nullenc0de commented 3 years ago

There are a few ways I am using it right now:

1. `cat allsubdomains.txt | hakrawler`
2. `echo subdomain | hakrawler`
3. `cat all_subs_from_all_domains.txt | hakrawler`

Of course, implementing a "scope" flag would be the best-case scenario for me, personally. That way I could specify something like `*.root_domain`, or maybe a txt file with all the domains/subs that I want in scope.

But limiting it to the current domain is certainly easier from a developer perspective. Whichever you choose, it would be really useful. Like you said, if I give hakrawler a depth of more than 2, I'm going to be crawling random sites.
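For illustration, the wildcard-scope idea above could be sketched in Go roughly like this. The `inScope` helper is hypothetical (hakrawler does not implement a `-scope` flag); it just shows how `*.root_domain`-style patterns from a scope file might be matched against a discovered hostname:

```go
package main

import (
	"fmt"
	"strings"
)

// inScope reports whether host matches any scope pattern.
// A pattern like "*.example.com" matches the root domain and any
// subdomain; a plain pattern must match the host exactly.
// (Hypothetical helper, not part of hakrawler.)
func inScope(host string, patterns []string) bool {
	for _, p := range patterns {
		if strings.HasPrefix(p, "*.") {
			root := strings.TrimPrefix(p, "*.")
			if host == root || strings.HasSuffix(host, "."+root) {
				return true
			}
		} else if host == p {
			return true
		}
	}
	return false
}

func main() {
	scope := []string{"*.example.com"}
	for _, h := range []string{"example.com", "app.example.com", "evil.com"} {
		fmt.Printf("%s in scope: %v\n", h, inScope(h, scope))
	}
}
```

With this kind of matcher, links to Facebook or Google would be dropped before they are ever queued for crawling.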

hakluke commented 3 years ago

hmmmmm 🤔 leave it with me

hakluke commented 3 years ago

Okay, so what I've done is limit the scope of the crawling to whatever hostname is in the URL you provided. If you provide multiple URLs in a file, it takes the hostname from whichever URL is currently being crawled. I think this is way more sane. If someone wanted to crawl through hosts recursively, they could just take the output of hakrawler and pipe it back into itself :)
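The behaviour described above amounts to comparing the hostname of each discovered link against the hostname of the seed URL currently being crawled. A minimal sketch of that check using Go's standard `net/url` package (not hakrawler's actual code) might look like:

```go
package main

import (
	"fmt"
	"net/url"
)

// sameHost reports whether a discovered link shares the hostname of
// the seed URL currently being crawled. Sketch of the behaviour
// described above, not hakrawler's implementation.
func sameHost(seed, link string) bool {
	s, err := url.Parse(seed)
	if err != nil {
		return false
	}
	l, err := url.Parse(link)
	if err != nil {
		return false
	}
	return s.Hostname() == l.Hostname()
}

func main() {
	seed := "https://example.com/start"
	for _, link := range []string{
		"https://example.com/about",      // same host: kept
		"https://www.facebook.com/page",  // different host: dropped
	} {
		fmt.Println(link, sameHost(seed, link))
	}
}
```

When multiple seed URLs are piped in, the comparison is simply made against whichever seed is being crawled at the time, so each input host stays within its own scope.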