Recognize URLs that should not be crawled

I just finished cleaning some website login pages out of EDGI Web Monitoring’s archives. We never really should have been tracking them; they’d been captured because of a crawl that was done early on. We had several, since login URLs tend to look something like /login?return_to=/some/other/page.html, where the querystring varies widely without changing the page being returned.

Whether we want to track a login page at all isn’t a big deal and probably isn’t something Walk should address. But the noise created by so many URLs that are really the exact same page might be.

In the most general case (arbitrary query args that are just noise), there’s probably not much we can do automatically. But the configuration could accept a list of patterns (regexes?) to ignore.
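
As a rough sketch of what a configured ignore list could look like (the `IgnoreList` type and its methods are made-up names for illustration, not anything that exists in Walk today), patterns could be compiled once at startup and checked before a URL is queued:

```go
package main

import (
	"fmt"
	"regexp"
)

// IgnoreList is a hypothetical set of compiled URL patterns.
// It is an illustration only, not part of Walk's configuration.
type IgnoreList []*regexp.Regexp

// NewIgnoreList compiles each pattern and fails on the first invalid one,
// so a bad configuration is caught before the crawl starts.
func NewIgnoreList(patterns []string) (IgnoreList, error) {
	list := make(IgnoreList, 0, len(patterns))
	for _, p := range patterns {
		re, err := regexp.Compile(p)
		if err != nil {
			return nil, fmt.Errorf("invalid ignore pattern %q: %w", p, err)
		}
		list = append(list, re)
	}
	return list, nil
}

// ShouldIgnore reports whether rawURL matches any configured pattern.
func (l IgnoreList) ShouldIgnore(rawURL string) bool {
	for _, re := range l {
		if re.MatchString(rawURL) {
			return true
		}
	}
	return false
}
```

With a pattern like `/login(\?|$)`, a URL such as https://example.com/login?return_to=/some/other/page.html would be skipped, while unrelated pages would still be crawled.
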
For some more recognizable cases (like the login pages I was dealing with), we might be able to gin up some heuristics for automatically recognizing them. Would it be better to…
1. Use the heuristics to auto-ignore some URLs, or
2. Use the heuristics to report (via logs or some other site/job-level metadata file/endpoint) that some URLs might be good candidates for ignoring via configuration (so in this case, Walk would still have taken the conservative approach and crawled them because they weren’t explicitly configured to be ignored)
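
To make the heuristic idea concrete, here is one possible check for the login case. Everything in it is an assumption for the sake of discussion: the function name, the list of path segments, and the list of redirect-style parameter names are guesses, not a vetted rule set. Either option above could be built on top of it (auto-ignore when it returns true, or just log the URL as a candidate).

```go
package main

import (
	"net/url"
	"path"
	"strings"
)

// Hypothetical lists; real sites will use other names as well.
var loginSegments = map[string]bool{
	"login": true, "signin": true, "sign_in": true, "sign-in": true,
}
var redirectParams = []string{"return_to", "redirect", "next", "continue"}

// LooksLikeLogin guesses whether rawURL is a login page whose querystring
// only carries a "where to go afterwards" value. It requires both a
// login-like final path segment and a redirect-like query parameter.
func LooksLikeLogin(rawURL string) bool {
	u, err := url.Parse(rawURL)
	if err != nil {
		return false
	}
	last := strings.ToLower(path.Base(u.Path))
	if !loginSegments[last] {
		return false
	}
	query := u.Query()
	for _, name := range redirectParams {
		if _, ok := query[name]; ok {
			return true
		}
	}
	return false
}
```

Requiring both signals keeps the check conservative: a bare /login URL with no querystring is not flagged, since a single copy of the page isn't the noise problem described here.
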
We might be able to write a fancy resource handler that can identify snapshots where the URLs differed only by the querystring but returned the exact same content. (Or close to exact, ignoring certain things in the HTML?) We could then use that info similarly to how we’d use the heuristics above. The same two questions about what we’d do with that data would apply here, too.
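
A minimal sketch of the exact-match version of that idea follows. The `Snapshot` type and `QueryNoiseGroups` function are invented for illustration and don't correspond to Walk's actual resource handler interface; the sketch also only handles byte-identical bodies, not the "close to exact" case.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"net/url"
)

// Snapshot is a hypothetical stand-in for a crawled resource.
type Snapshot struct {
	URL  string
	Body []byte
}

// QueryNoiseGroups groups snapshot URLs that share the same URL minus its
// querystring AND the same SHA-256 body hash. Only groups with two or more
// URLs are returned. It assumes each URL appears at most once in snaps.
func QueryNoiseGroups(snaps []Snapshot) map[string][]string {
	groups := map[string][]string{}
	for _, s := range snaps {
		u, err := url.Parse(s.URL)
		if err != nil {
			continue // skip anything we can't parse
		}
		// Drop the parts of the URL we consider potential noise.
		u.RawQuery = ""
		u.ForceQuery = false
		u.Fragment = ""
		sum := sha256.Sum256(s.Body)
		key := u.String() + " " + hex.EncodeToString(sum[:])
		groups[key] = append(groups[key], s.URL)
	}
	// Keep only groups where several URLs returned identical content.
	for key, urls := range groups {
		if len(urls) < 2 {
			delete(groups, key)
		}
	}
	return groups
}
```

Each remaining group is a set of URLs that differ only by querystring and returned identical bytes, which is the kind of list that could be reported as ignore candidates.
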

This is definitely not a short-term/MVP feature, but I wanted to record it here while it was fresh in my mind.

qri-io / walk

Recognize URLs that should not be crawled #22