lc / gau

Fetch known URLs from AlienVault's Open Threat Exchange, the Wayback Machine, and Common Crawl.
MIT License
3.75k stars, 430 forks

Regarding feat: implement unthrottled concurrency using task queue #141

Open: wumpus opened 1 month ago

wumpus commented 1 month ago

Can you stop attacking the Common Crawl CDX API?

lc commented 1 month ago

I’m not? This is an open source tool to find archived URLs for a given domain…

wumpus commented 1 month ago

Yes, and because it isn't throttled, use of this package harms the target, which is me.

wumpus commented 1 month ago

Any progress? I was hoping for rate limiting, honoring 503 and 429 status codes, and exponential backoff.

And not just "unthrottled concurrency".

lc commented 1 month ago

It’s open source, so PRs are welcome.

It is going to be a busy month with some life changes for me, so I will put this on my TODO list. Unfortunately, it will likely not get done until late June or early July.

lc commented 1 month ago

Accidentally closed when commenting

wumpus commented 1 month ago

Thanks for adding to your TODO list, I appreciate it!

Here's an example of making a single query in Athena that's much more efficient than gau: https://positive.security/blog/ransack-data-exfiltration#common-crawl

lc commented 1 month ago

Thanks for the reference & sorry about the slowness to implement. Getting hitched!

wumpus commented 1 month ago

Congratulations!