nsonaniya2010 / SubDomainizer

A tool to find subdomains and interesting things hidden inside external JavaScript files of a page, folder, and GitHub.
MIT License

Ignore decode errors when parsing files #27

Closed: jokki closed this 4 years ago

jokki commented 4 years ago

Background for PR: Suppose you're testing a site that has a CAPTCHA which, to some extent, blocks tools like SubDomainizer from getting any useful responses. My workaround for this situation so far has been to browse the site manually with an intercepting proxy like ZAP or BurpSuite and click myself past the CAPTCHAs. Once I'm happy with my exploration of the site I'll dump all the response data to files and run python3 SubDomainizer.py -f <dir with files> on it. That initially failed for me because the response data contains binary content that breaks "file.read()" when the file is opened in text mode ('t' flag). I added errors='ignore' to resolve that problem.
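For context, here's a minimal sketch of the failure mode (the path is made up, and this isn't SubDomainizer's actual code):

```python
# Minimal sketch; "responses/logo.png.dump" is a made-up example path.
with open("responses/logo.png.dump") as f:   # text mode, default encoding (usually UTF-8)
    data = f.read()                          # raises UnicodeDecodeError on binary content

# Passing errors='ignore' makes the decoder skip undecodable bytes instead of raising,
# so the readable text in the file can still be scanned for subdomains/secrets:
with open("responses/logo.png.dump", errors="ignore") as f:
    data = f.read()                          # binary junk dropped, text kept
```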

If you've got a better way or any thoughts or ideas about this I'd love to hear it. Thanks for making this tool!

Commit comment: This change only applies when using the "-f" argument for parsing files in a directory. If you have binary data in your file(s), you will end up with decoding errors when reading them. Ignoring the errors lets you move past the binary data and continue processing the remainder of the file(s).
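To illustrate, here's a rough sketch of the kind of directory read loop the change affects; the function and variable names are illustrative, not SubDomainizer's actual implementation:

```python
import os

def read_dump_dir(dir_path):
    """Illustrative sketch only (not SubDomainizer's code): read every file
    in a directory as text, ignoring bytes that can't be decoded."""
    contents = []
    for name in os.listdir(dir_path):
        path = os.path.join(dir_path, name)
        if not os.path.isfile(path):
            continue
        # errors='ignore' skips undecodable (binary) bytes, so a stray
        # PDF/image/woff2 in the dump no longer aborts the whole run.
        with open(path, errors="ignore") as f:
            contents.append(f.read())
    return contents
```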

nsonaniya2010 commented 4 years ago

Thanks for creating a pull request. Do you have some examples of the responses for which you're getting these errors? Also, how are you getting a binary file?

jokki commented 4 years ago

They could be, for example, PDFs or images. I think I've seen font data like "woff2" (also binary) somewhere too...

nsonaniya2010 commented 4 years ago

Yeah, that's exactly what I'm talking about. How are you getting that data? I'm not downloading PDFs or fonts, only the files inside script tags.

jokki commented 4 years ago

Ah, I see. I've come to pass all my response data from ZAP/Burp to 'SubDomainizer.py -f', because if I just point 'SubDomainizer' at a site with a CAPTCHA, the responses returned to 'SubDomainizer' won't be the actual site contents. And either way, just letting 'SubDomainizer' work through the data from a browsing session, looking for subdomains and secrets, seems like a good idea to me...

nsonaniya2010 commented 4 years ago

Ah, got it. I still have doubts about the decoding thing, but I'm merging your pull request anyway.

Thanks for that.