passiomatic / coldsweat

Web RSS aggregator and reader compatible with the Fever API
MIT License

Clean up feed URL from tracking params before adding it to the database #60

Closed: passiomatic closed this issue 10 years ago

passiomatic commented 10 years ago

Bottle Fever has a function to strip tracking params from feed URLs. This makes URLs cleaner and avoids feed duplication, e.g. when two URLs point to the same feed but differ in a character-by-character match.

Original code follows:

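try:
    from urllib.parse import urldefrag  # Python 3
except ImportError:
    from urlparse import urldefrag  # Python 2, as in the original

def scrub_query(url):
    """Clean query arguments"""

    scrub = ["utm_source","utm_campaign","utm_medium","piwik_campaign","piwik_kwd"]

    # drop the '#fragment' part
    url = urldefrag(url)[0]
    base, _, query = url.partition('?')
    seen = set()
    result = []
    for field in query.split('&'):
        name, _, value = field.partition('=')
        if name in seen:
            # keep only the first occurrence of each param
            continue
        elif name in scrub:
            continue
        else:
            result.append(field)
            seen.add(name)
    # NOTE: joined with '&' here; the original reused `sep`, which holds
    # '=' (or '') after the loop and would glue the fields together wrongly
    result = '?'.join([base, '&'.join(result)]) if result else base
    # strip dangling '?'
    if result[-1:] == '?':
        result = result[:-1]
    return result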

From: https://github.com/rcarmo/bottle-fever/blob/master/lib/utils/urlkit.py
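
For comparison, here is a minimal Python 3 sketch of the same idea built on `urllib.parse`. It is not Coldsweat's or Bottle Fever's code: the `strip_tracking_params` name and the `example.com` URLs are made up for illustration, and only the list of tracking params is taken from the function quoted above. Note that `urlencode` re-encodes the remaining values, so percent-encoding in the output may differ from the input.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Same parameter names as the scrub list in the quoted function
TRACKING_PARAMS = {"utm_source", "utm_campaign", "utm_medium",
                   "piwik_campaign", "piwik_kwd"}

def strip_tracking_params(url):
    """Hypothetical helper: drop the fragment, tracking params and repeated params."""
    parts = urlsplit(url)
    seen = set()
    kept = []
    for name, value in parse_qsl(parts.query, keep_blank_values=True):
        # Skip tracking params and any param name already kept once
        if name in TRACKING_PARAMS or name in seen:
            continue
        seen.add(name)
        kept.append((name, value))
    # Rebuild the URL with the filtered query and an empty fragment
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))

# Two URLs that differ character by character but point to the same feed
a = strip_tracking_params("http://example.com/feed?format=rss&utm_source=twitter#top")
b = strip_tracking_params("http://example.com/feed?utm_medium=email&format=rss")
print(a)       # http://example.com/feed?format=rss
print(a == b)  # True
```

Running the cleaned URL through a check like `a == b` before inserting a feed is what would avoid the duplication described above. Param order is preserved rather than sorted, so URLs whose remaining params appear in a different order would still count as different.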