hoarder-app / hoarder

A self-hostable bookmark-everything app (links, notes and images) with AI-based automatic tagging and full text search
https://hoarder.app
GNU Affero General Public License v3.0
6.34k stars 226 forks

Remove tracking params from URL before hoarding it #633

Open raviwarrier opened 1 week ago

raviwarrier commented 1 week ago

Describe the feature you'd like

Most of the time, when you hoard something from a website or an app, the URL comes with tracking params (utm, fb, ms, etc.). It would be nice to be able to save a clean URL.

I created a URL-cleaning workflow in n8n for myself that cleans a copied URL, re-copies the clean URL, and then sends it to my other devices (laptop to phone, phone to laptop).

Here's the code I wrote in JS for the n8n workflow, in case it helps:

// Define the tracking parameters as a Set for efficient lookup
const trackingParams = new Set([
    'utm_source', 'utm_medium', 'utm_campaign', 'utm_term', 'utm_content',
    'utm_id', 'utm_cid', 'utm_reader', 'utm_name', 'utm_social', 'utm_placement',
    'gclid', 'dclid', 'gclsrc', 'wbraid', 'gbraid', '_gl', 'fbclid', 
    'fb_action_ids', 'fb_action_types', 'fb_source', 'fb_ref', 'fb_comment_id',
    'msclkid', 'ocid', 'wt.mc_id', 'cvid', 'WT.mc_id', 'wt.mc_ev', '_hsenc',
    '_hsmi', 'mc_cid', 'mc_eid', '_ga', '_ke', 'ir_campaign_id',
    'ir_ad_id', 'cid', 'eid', '_bta_tid', '_bta_c', 'trk_contact', 'trk_msg',
    'trk_module', 'trk', 'mkt_tok', 'wickedid', 'wickedsource', 'share',
    'share_id', 'share_source', 'cmpid', 'social', 'socid', 'twclid', 'pinid',
    'igshid', 'igref', 'lipi', 'vero_id', 'mc_', 'ref_', 'dm_i', 'epik', 
    'mailid', 'mid', 'spMailingID', 'spReportId', 'spUserID', 'ss_campaign_id',
    'ss_email_id', 'ss_source', 'subscriber', 'tag', 'psc', 'pd_rd_r',
    'pd_rd_w', 'pd_rd_wg', '_encoding', 'linkCode', 'linkId', 'affiliate',
    'affiliate_id', 'affid', 'sourceid', 'source_id', 'source', 'ref', 'referral',
    'referer', 'referrer', 'refid', 'ref_id', 'mpid', 'clickid', 'click_id',
    'adjust_tracker', 'adjust_campaign', 'adjust_adgroup', 'app_id', 'app_name',
    'device_id', 'platform', 'admitad_uid', 'adset', 'adset_name', 'adgroup',
    'ad_id', 'ad_name', 'adposition', 'campaignid', 'placement', 'target',
    'loc_physical_ms', 'loc_interest_ms', 'action', 'campaign_date', 'campaign_id',
    'force_sid', 'geo', 'id', 'key', 'medium', 'notification', 'offer', 'sid',
    'site', 'timestamp', 'tracking', 'user', 'visitor', 'webid', 'wmid', 
    'browser', 'browser_version', 'os', 'os_version', 'viewport', 'resolution',
    'language', 'country', 'region', 'city', 'session_id', 'session', 'user_id', 'h_ad_id',
    'userid', 'visitor_id', 'client_id', 'clientid', 'cust_id', 'custid', 'fbc_id', 'igsh'
]);

/**
 * Function to clean a URL by removing tracking parameters.
 * @param {string} inputUrl - The original URL to be cleaned.
 * @returns {string|null} - The cleaned URL or null if invalid.
 */
function cleanUrl(inputUrl) {
    try {
        // Remove all null characters and control characters from the URL
        const sanitizedUrl = inputUrl.replace(/[\u0000-\u001F\u007F]/g, '').trim();

        // Parse the sanitized URL
        const url = new URL(sanitizedUrl);
        const params = url.searchParams;

        // Collect keys to delete
        const keysToDelete = [];

        for (const key of params.keys()) {
            // Check for exact match
            if (trackingParams.has(key)) {
                keysToDelete.push(key);
            } else {
                // Check for prefix matches (e.g., 'mc_', 'ref_')
                for (const param of trackingParams) {
                    if (param.endsWith('_') && key.startsWith(param)) {
                        keysToDelete.push(key);
                        break; // No need to check other prefixes
                    }
                }
            }
        }

        // Remove the identified tracking parameters
        keysToDelete.forEach(key => params.delete(key));

        // Deleting entries from url.searchParams updates the URL in place,
        // so serializing it keeps the rest of the URL (credentials, port, hash)
        // and omits the '?' when no query parameters remain
        return url.toString();
    } catch (error) {
        // Log the error for debugging purposes
        console.error('Error cleaning URL:', error);
        return null;
    }
}

// Access the incoming data from the webhook
const items = $input.all(); // Get all incoming items

// Process each incoming item
return items.map(item => {
    // Extract the URL from the webhook payload (empty string if it is missing)
    const inputUrl = item.json?.body?.body || '';

    if (!inputUrl) {
        // Handle cases where the URL is not found
        return {
            json: {
                error: 'URL not found in the webhook payload.'
            }
        };
    }

    // Remove tracking parameters from the input URL
    const cleanedUrl = cleanUrl(inputUrl);

    if (cleanedUrl) {
        // Return the cleaned URL
        return {
            json: {
                cleanedUrl
            }
        };
    } else {
        // Handle invalid URL scenarios
        return {
            json: {
                error: 'Invalid URL provided.'
            }
        };
    }
});

These were all the tracking params I could find, but I am sure there are more. Maybe, if and when you incorporate this functionality, you could also use AI to check whether the cleaned-up URL has any tracking params remaining.
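
Because no fixed list will ever be complete, one option is to lean more on prefix matching, which the workflow code already does for 'mc_' and 'ref_'. The following is a minimal standalone sketch of that idea (no n8n `$input`). The `stripTracking` helper, the shortened lists, and treating `utm_` as a prefix are all assumptions for illustration, not part of the workflow above.

```javascript
// Hypothetical, shortened lists for illustration; not the full set from the workflow.
const EXACT = new Set(['gclid', 'fbclid', 'msclkid', 'igshid']);
const PREFIXES = ['utm_', 'mc_', 'ref_']; // 'utm_' as a prefix is an assumption

// Hypothetical helper: returns the URL with matching query parameters removed.
function stripTracking(inputUrl) {
    const url = new URL(inputUrl); // throws a TypeError on an invalid URL
    // Copy the keys first so that deleting does not disturb the iteration
    for (const key of [...url.searchParams.keys()]) {
        if (EXACT.has(key) || PREFIXES.some(prefix => key.startsWith(prefix))) {
            url.searchParams.delete(key);
        }
    }
    return url.toString();
}

console.log(stripTracking('https://example.com/post?utm_source=x&page=2&fbclid=abc#top'));
// prints "https://example.com/post?page=2#top"
```

Keeping prefixes in their own array avoids scanning the whole exact-match set for every key, and a new `utm_*` variant is caught without touching the list.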

Describe the benefits this would bring to existing Hoarder users

It would give users clean URLs that do not automatically start tracking them when they open the links.

Can the goal of this request already be achieved via other means?

Not really. People could manually clean up links or use automation workflows like I do, but that's not possible when you are sharing directly to Hoarder (either via the extension or the mobile app).

Have you searched for an existing open/closed issue?

Additional context

No response

raviwarrier commented 1 week ago

Also, this could be a toggleable setting: "Clean all URLs before saving?" yes/no, and "Ask to clean before saving?" yes/no.

The reason for "ask to clean" is that some links (such as those of web apps) may break without certain tracking params, so a person can choose not to clean a specific URL even when "Clean all URLs..." is set to 'yes'.
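
The way the two toggles interact could be sketched roughly as below. The setting names `cleanAllUrls` and `askBeforeCleaning` and the `cleaningAction` function are hypothetical, chosen only to illustrate the proposal; they are not existing Hoarder options.

```javascript
// Hypothetical settings object mirroring the two proposed yes/no toggles.
const exampleSettings = { cleanAllUrls: true, askBeforeCleaning: true };

// Decide what to do with a URL that is about to be saved.
// Returns 'keep' (save as-is), 'ask' (prompt the user), or 'clean' (strip silently).
function cleaningAction(settings, hasTrackingParams) {
    // Nothing to do if cleaning is off or the URL carries no tracking params
    if (!settings.cleanAllUrls || !hasTrackingParams) return 'keep';
    // With "ask" enabled, the user can skip cleaning for links that would break
    return settings.askBeforeCleaning ? 'ask' : 'clean';
}

console.log(cleaningAction(exampleSettings, true)); // prints "ask"
```

In this sketch "ask" only matters when "clean all" is on, which matches the reasoning above: the prompt is the escape hatch for links that depend on their params.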