lede / lede

canonicalize URLs #99

Open ryanmeador opened 11 years ago

ryanmeador commented 11 years ago

We need to follow obfuscated links (such as those used for click tracking, e.g. on CNN) to their real destination and store the result, so that we can match them with other links to the same source. No two tracking links will be identical, so resolving them to a canonical form is the only way we can use them to build our graph. This process will also let us expand short URLs (tinyurl, bit.ly, etc.). And if we do it right, it might let us coalesce links that point to the HTTP and HTTPS versions of the same resource.
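A rough sketch of what this could look like, not a proposed implementation: follow redirects hop by hop until they stop, then normalize the final URL so that HTTP and HTTPS variants compare equal. All names here (`resolve`, `normalize`, `canonicalize`, `next_hop`) are hypothetical. `next_hop` stands in for whatever performs the actual HTTP request and returns the redirect target, or `None` when there is no further redirect.

```python
from urllib.parse import urljoin, urlsplit, urlunsplit


def resolve(url, next_hop, max_hops=10):
    """Follow redirects until they stop, a loop is detected, or max_hops is hit.

    next_hop is a hypothetical callable: given a URL, it returns the redirect
    target (e.g. the Location header value) or None if there is no redirect.
    """
    seen = set()
    for _ in range(max_hops):
        if url in seen:  # redirect loop: stop where we are
            break
        seen.add(url)
        target = next_hop(url)
        if target is None:
            break
        url = urljoin(url, target)  # Location may be relative
    return url


def normalize(url):
    """Reduce a URL to a comparison key.

    Assumptions of this sketch: the host is lowercased, the fragment is
    dropped, ports 80 and 443 are treated as defaults, and the scheme is
    collapsed to "https" so HTTP and HTTPS variants map to the same key.
    """
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    if parts.port and parts.port not in (80, 443):
        host = f"{host}:{parts.port}"
    path = parts.path or "/"
    return urlunsplit(("https", host, path, parts.query, ""))


def canonicalize(url, next_hop):
    """Resolve redirects, then normalize the final destination."""
    return normalize(resolve(url, next_hop))
```

For experimentation, a plain dict's `get` method can play the role of `next_hop`, mapping each URL to its redirect target. Note that collapsing the scheme is only safe if both variants really serve the same resource, which is the "if we do it right" part.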

ryanmeador commented 11 years ago

Some sites, notably CNN's blogs, include a canonical link element (`<link rel="canonical">`), which we should probably use.
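Assuming this refers to the standard `<link rel="canonical" href="...">` element in the page head, a minimal sketch of reading it with Python's standard library might look like the following. The class and function names are hypothetical, and the markup in the usage example is made up for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class CanonicalLinkParser(HTMLParser):
    """Records the href of the first <link rel="canonical"> encountered."""

    def __init__(self):
        super().__init__()
        self.href = None

    def handle_starttag(self, tag, attrs):
        if tag != "link" or self.href is not None:
            return
        attr = dict(attrs)
        # rel may hold several space-separated tokens, or have no value at all
        rels = (attr.get("rel") or "").lower().split()
        if "canonical" in rels and attr.get("href"):
            self.href = attr["href"]


def declared_canonical(html, page_url):
    """Return the page's self-declared canonical URL, or None if absent.

    The href is resolved against page_url because it may be relative.
    """
    parser = CanonicalLinkParser()
    parser.feed(html)
    return urljoin(page_url, parser.href) if parser.href else None


page = '<html><head><link rel="canonical" href="/2013/01/story/"></head></html>'
print(declared_canonical(page, "http://blogs.example.com/x?src=feed"))
```

When a page declares a canonical URL, it could be preferred over the URL reached by following redirects; when it returns `None`, the resolved URL would still be the fallback.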

ryanmeador commented 11 years ago

Part of issue #141