opsdisk / metagoofil

Search Google and download specific file types

Doesn't play nicely when it receives a status 429 #21

Closed sbrun closed 3 years ago

sbrun commented 3 years ago

Hello,

When you run this command (in Kali):

metagoofil -d https://sans.org -t doc,pdf,xls -l 200 -o sans_files -f

it fails with an unhandled exception instead of handling it gracefully:

urllib.error.HTTPError: HTTP Error 429: Too Many Requests

Issue was first reported here: https://bugs.kali.org/view.php?id=7005

opsdisk commented 3 years ago

Hi @sbrun - Thank you for taking the time to submit this issue. The HTTP 429 occurs because Google rightfully thinks the script is a bot and is throttling searches from your IP, so the exception itself is expected.

From https://bugs.kali.org/view.php?id=7005:

"It SHOULD deal with the 429 gracefully and back off the request rate a bit."

So are you requesting backoff logic? I've played around with some, but it's hard to know how long to wait "for the server to get out of it's grumpy mood".
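For illustration only, here is a minimal sketch of what such backoff logic could look like. It is not metagoofil's or pagodo's actual code: the names fetch_with_backoff and backoff_delay, the 30-second base delay, and the 10-minute cap are all assumptions chosen for the example. It honors a numeric Retry-After header when the server sends one and otherwise falls back to exponential backoff with jitter, which is one common way around not knowing how long to wait.

```python
import random
import time
import urllib.error
import urllib.request


def backoff_delay(attempt, retry_after=None, base=30.0, cap=600.0):
    """Return the number of seconds to wait before retry `attempt` (0-based).

    `base` and `cap` are illustrative guesses, not values tuned for Google.
    """
    if retry_after is not None and retry_after.isdigit():
        # The server said how long to wait (seconds form only; the
        # HTTP-date form of Retry-After is ignored in this sketch).
        return min(float(retry_after), cap)
    # No usable hint: double the delay on each attempt, up to the cap,
    # and add jitter so repeated runs do not retry in lockstep.
    delay = min(base * (2 ** attempt), cap)
    return random.uniform(delay / 2, delay)


def fetch_with_backoff(url, max_retries=4,
                       opener=urllib.request.urlopen, sleep=time.sleep):
    """Fetch `url`, retrying on HTTP 429 with an increasing delay.

    `opener` and `sleep` are injectable (hypothetical parameters) so the
    retry logic can be exercised without network access or real waiting.
    """
    for attempt in range(max_retries + 1):
        try:
            with opener(url) as response:
                return response.read()
        except urllib.error.HTTPError as e:
            # Only 429 is retried; anything else, or running out of
            # retries, is re-raised for the caller to handle.
            if e.code != 429 or attempt == max_retries:
                raise
            retry_after = e.headers.get("Retry-After") if e.headers else None
            sleep(backoff_delay(attempt, retry_after))
```

Even with a scheme like this, the throttling may outlast every retry, so a longer fixed delay or proxies can still be the more reliable option.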

You're better off either increasing the delay (through -e), at the cost of a longer run, or running the script through a bank of proxies. Another one of my tools, pagodo, runs into the same issue, and that's basically what I recommend there:

https://github.com/opsdisk/pagodo/blob/master/pagodo.py#L144

As for the metadata extraction, this was my stance on it: https://github.com/opsdisk/metagoofil#metadata-extraction

sbrun commented 3 years ago

Hi, I don't know what the best solution is, but I think it should not fail with a raw Python error. To the user, it looks like a bug in the script, and they are left without any clue how to solve it. Maybe you could catch the error and print a message, as you have done in pagodo?
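As a rough sketch of that suggestion (not the code from pagodo or from the eventual fix), the search call could be wrapped so that an HTTPError produces a readable hint instead of a traceback. The names explain_http_error and run_search and the message wording are hypothetical; only the -e delay option and the proxy advice come from this thread.

```python
import sys
import urllib.error


def explain_http_error(error):
    """Turn an HTTPError into a one-line, user-facing message."""
    if error.code == 429:
        return (
            "[!] Google returned HTTP 429 (Too Many Requests). Your IP is "
            "probably being throttled. Try a longer delay between searches "
            "(-e) or run the script through proxies."
        )
    return f"[!] Search failed: HTTP {error.code} {error.reason}"


def run_search(search_fn, out=sys.stderr):
    """Call `search_fn()` and report HTTP errors instead of crashing.

    `search_fn` stands in for whatever function performs the Google
    search; it is a placeholder, not a real metagoofil function.
    """
    try:
        return search_fn()
    except urllib.error.HTTPError as e:
        print(explain_http_error(e), file=out)
        return None
```

Returning None lets the caller decide whether to stop or continue with whatever results were already collected.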

As for the metadata extraction, it's not an issue from my point of view, since you clearly decided not to keep this feature in the tool.

opsdisk commented 3 years ago

Fixed in https://github.com/opsdisk/metagoofil/pull/22