qri-io / walk

Webcrawler/sitemapper
GNU General Public License v3.0

Recognize URLs that should not be crawled #22

Open Mr0grog opened 5 years ago

Mr0grog commented 5 years ago

I just finished cleaning out some website login pages from EDGI Web Monitoring’s archives that we never really should have been tracking (they’d been getting captured because of a crawl that was done early on). We had several, since login URLs tend to look something like /login?return_to=/some/other/page.html, where the query string varies widely without changing the page being returned.

Whether we want to track a login page at all isn’t a big deal and probably isn’t something Walk should address. But the noise created by so many URLs that are really the exact same page might be.

This is definitely not a short-term/MVP feature, but I wanted to record it here while it was fresh in my mind.
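
To make the duplicate-URL problem above concrete, here is a rough sketch of one possible approach: collapse URLs that differ only in redirect-style query parameters before queuing them. It is written in Python for brevity rather than as Walk code, and the names `REDIRECT_PARAMS`, `canonical_url`, and `dedupe` are hypothetical, as is the list of parameter names.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical set of query parameters that only carry redirect state.
# These names are assumptions for illustration, not something Walk defines.
REDIRECT_PARAMS = frozenset({"return_to", "next", "redirect"})


def canonical_url(url, ignored_params=REDIRECT_PARAMS):
    """Return url with ignored query parameters removed and the rest sorted."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in ignored_params
    ]
    query = urlencode(sorted(kept))
    # Drop the fragment as well, since it is never sent to the server.
    return urlunsplit((parts.scheme, parts.netloc, parts.path, query, ""))


def dedupe(urls):
    """Keep the first URL seen for each canonical form, preserving order."""
    seen = set()
    unique = []
    for url in urls:
        key = canonical_url(url)
        if key not in seen:
            seen.add(key)
            unique.append(url)
    return unique
```

With this sketch, /login?return_to=/some/other/page.html and /login?return_to=/a/different/page.html both reduce to /login, so only the first would be kept. A fixed parameter list is only a heuristic: on some sites the same parameter names could select genuinely different pages, so a real implementation would probably need the list to be configurable per crawl.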