tomnomnom / waybackurls

Fetch all the URLs that the Wayback Machine knows about for a domain

Feature request: set a flag to http or https only to remove duplicates #9

Open · ghost opened 4 years ago

ghost commented 4 years ago

Hey Tom, I've been playing around with waybackurls for a few days. Love it, thanks for writing a cool tool.

I'd like to preface this by saying it's probably easier to just use sort and uniq from the start to remove duplicates - or even run sed to strip everything up to and including '://' so only the domain remains, then sort out the duplicates.
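For reference, a pre-processing one-liner along those lines might look like the sketch below (the filename domains is just a stand-in for whatever input list you use):

```sh
# Strip any http:// or https:// prefix, then dedupe before querying
sed -E 's|^https?://||' domains | sort -u | waybackurls > urls
```

That avoids the duplicate queries entirely, but it's an extra step every time.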

Anyway, here's the crux of the problem and the suggestion. When piping in a domain list with cat (`cat domains | waybackurls > urls`), you can easily end up querying the same thing twice, since the Wayback Machine itself indexes pages on both http and https regardless of which scheme, if any, you specify.

Running waybackurls with a domain list of:

```
example.com
http://example.com
https://example.com
```

will return the exact same results for each entry, with both http and https archives. The http(s) archives themselves aren't the issue - that's expected. The problem is that the run takes twice as long to finish if the full URLs haven't been sorted and filtered out of the input file first.
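To illustrate (a rough sketch, using example.com as a stand-in): if the results really are identical across schemes, comparing the sorted output of two input variants should show no difference:

```sh
# diff prints nothing if the two result sets match
echo example.com | waybackurls | sort -u > bare.txt
echo https://example.com | waybackurls | sort -u > https.txt
diff bare.txt https.txt
```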

While the example in your readme.md uses a domain name without http(s), the tool also accepts input with http(s) prefixed. I think a good solution would be an optional bool flag that filters to one scheme or the other.
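To be clear about the idea, something like the invocation below - the flag name is purely hypothetical, just to illustrate:

```sh
# Hypothetical flag (name illustrative only): only query the https variant
cat domains | waybackurls -https-only > urls
```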

Thanks for your time :+1: