tomnomnom / waybackurls

Fetch all the URLs that the Wayback Machine knows about for a domain

Fix CommonCrawl URLs #4

Closed: Rhynorater closed this issue 6 years ago

Rhynorater commented 6 years ago

The current CommonCrawl fetch URL is this:

http://index.commoncrawl.org/CC-MAIN-2018-22-index?url=*.%s&output=json

I would suggest that it should be this:

http://index.commoncrawl.org/CC-MAIN-2018-22-index?url=*.%s/*&output=json

To see the difference, compare the results with the trailing /* on the url parameter: http://index.commoncrawl.org/CC-MAIN-2018-22-index?url=blog.innerht.ml/*&output=json against the results without it, which is how you currently have it: http://index.commoncrawl.org/CC-MAIN-2018-22-index?url=blog.innerht.ml&output=json
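For illustration, here is a minimal Go sketch of how the two query URLs could be built from a domain. The helper name `commonCrawlURL` and its `withPathWildcard` parameter are hypothetical and are not taken from the waybackurls source. Only the index endpoint and the query format come from the URLs quoted above. The sketch only constructs the strings and makes no network request.

```go
package main

import "fmt"

// ccIndex is the index endpoint quoted in this issue.
const ccIndex = "http://index.commoncrawl.org/CC-MAIN-2018-22-index"

// commonCrawlURL is a hypothetical helper, not the tool's actual code.
// It builds the index query for a domain and all of its subdomains.
// When withPathWildcard is true, it appends "/*" to the url pattern,
// which is the change suggested in this issue.
func commonCrawlURL(domain string, withPathWildcard bool) string {
	pattern := "*." + domain
	if withPathWildcard {
		pattern += "/*"
	}
	return fmt.Sprintf("%s?url=%s&output=json", ccIndex, pattern)
}

func main() {
	// Current form, without the trailing path wildcard.
	fmt.Println(commonCrawlURL("example.com", false))
	// Suggested form, with the trailing path wildcard.
	fmt.Println(commonCrawlURL("example.com", true))
}
```

Here `example.com` is a placeholder domain. The two printed URLs differ only in the trailing `/*` on the `url` parameter.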

Thanks, Justin

tomnomnom commented 6 years ago

Ah! Good spot! Should be sorted as of 3279764 :)