lc / gau

Fetch known URLs from AlienVault's Open Threat Exchange, the Wayback Machine, and Common Crawl.
MIT License

Richer JSON output #89

Open ocervell opened 1 year ago

ocervell commented 1 year ago

It would be nice to have response data other than just the URL in the JSON output, such as:

{
  "url": "https://test.domain.synology.me/.htaccess-local",
  "status_code": 200,
  "words": 1066,
  "lines": 100,
  "content_length": 4516,
  "content_type": "text/html; charset=utf-8",
  "duration": 57779116,
  "host": "test.domain.synology.me"
}

That would avoid scraping the endpoint again to find those details.

Maybe even consider using httpx as a client instead of fasthttp, as it seems to give more info on the response?

lc commented 1 year ago

gau is completely passive at the moment. It issues no HTTP requests to URLs that are archived from Wayback, OTX, etc. It can be piped into a tool such as httpx for additional info. Would you prefer that gau had an option for this instead?

ocervell commented 1 year ago

Ah, I thought that since there is a --mc option ("--mc strings  # list of status codes to match"), some crawling was still happening. What is the purpose of the --mc flag, then? Otherwise, an option for running an httpx query could be added, even though we would not really control httpx input options such as tech detection.

zerodivisi0n commented 1 year ago

I think it would be useful to add provider, timestamp, status_code, mimetype and content_length to the JSON output. That would make it possible to filter by these values at later stages. I checked all the providers, and all of them return most of these fields. I am ready to implement this change if you agree.

lc commented 1 year ago

Hey @zerodivisi0n, I definitely agree.

zerodivisi0n commented 1 year ago

Great! Then I'll do it soon.