richardg867 / WaybackProxy

HTTP proxy for tunneling requests through the Internet Archive Wayback Machine
GNU General Public License v3.0

Random URL "leakage" from newer dates outside date tolerance #32

Closed breadtf closed 4 months ago

breadtf commented 4 months ago

I am using WaybackProxy to crawl pages saved on the Wayback Machine (because it was the easiest and fastest thing to set up). However, I've noticed some random "leakage" from newer dates. The proxy is set to January 1st, 2003, but some pages from later years are randomly appearing. For example: as we all know, YouTube launched in 2005 and was bought by Google in 2006, but here it is in my data, showing up on a Google support page (which I doubt even existed in 2003!):

...
{
    "url": "https://support.google.com/",
    "title": "Google Help",
    "tags": [
        "center",
        "search",
        "youtube"
    ]
},
...

(The tags are based on the most common words on a page.) This isn't just a one-off thing, as a bit further down...

...
{
    "url": "https://accounts.google.com/ServiceLogin?hl=en&passive=true&continue=http://support.google.com/&ec=GAZAdQ",
    "title": "Sign in - Google Accounts",
    "tags": [
        "use",
        "account",
        "email"
    ]
},
...

... we see a Google Accounts page, which definitely was NOT a thing in 2003. There are 41 occurrences of this after running the proxy and crawler for just two-ish minutes.

I don't see any other pages experiencing this "leakage", only Google pages. Is there any way to fix this?
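
As a stopgap on the crawler side, one option might be to check each snapshot's capture time against the target date before recording the page. Below is a minimal sketch, assuming the crawler can get at the 14-digit Wayback timestamp (YYYYMMDDhhmmss) for each page, for example from a `web.archive.org/web/<timestamp>/<url>` address. The function name `within_tolerance`, the 365-day tolerance, and the sample timestamps are all made up for illustration; none of them are read from WaybackProxy's configuration.

```python
from datetime import datetime, timedelta

TARGET_DATE = datetime(2003, 1, 1)   # the date the proxy is set to
TOLERANCE = timedelta(days=365)      # assumed value, not taken from WaybackProxy


def within_tolerance(timestamp, target=TARGET_DATE, tolerance=TOLERANCE):
    """Return True if a 14-digit Wayback timestamp lies within the tolerance of the target date."""
    captured = datetime.strptime(timestamp, "%Y%m%d%H%M%S")
    return abs(captured - target) <= tolerance


# Hypothetical snapshot timestamps, for illustration only
print(within_tolerance("20030215120000"))  # True
print(within_tolerance("20130401000000"))  # False
```

This only hides the symptom in the crawl output; it doesn't explain why the newer Google pages come through the proxy in the first place.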