spamscanner / url-regex-safe

Regular expression matching for URLs. Maintained, safe, and browser-friendly version of url-regex. Resolves CVE-2020-7661 for Node.js servers.
https://forwardemail.net/docs/url-regex-javascript-node-js
MIT License

Extra characters at the end #26

Closed katlim-br closed 2 years ago

katlim-br commented 2 years ago

Hey

I used this code to extract URLs from an HTML file.

const urlRegex = require("url-regex-safe");
function urlsFromText(text) {
    if (!text) {
        return [];
    }

    const matchingUrls = text.match(urlRegex({
        localhost: true,
    }));
    return matchingUrls || [];
}

When I run it, I get the following matches (the HTML contains the same links in several places). As you can see, some matches are correct, while others keep extra characters at the end. Note: I changed the hostname, but that is not the part that is failing.

[
"http://thisdomain.com/c/eJwtjsEKwyAQRL8muUU2qzF68NBLf6NsNGkEbYJryO_XQmHgwTDMTHBLmLzEPjoERBhBgVYGjcAwWw12lAq8DOg7BZliEr7kclAQ_sj97qzXejVWBW2lwcnONM4G1DLJAITr1ie313pyJx8dPpvu-xb3mhJvVN7Hr6WZZ1mZG9u--gMQphdnSmlYLo6fFhiOs8YcOffF8U47lfZpScS1UIgXC4pf7dg_Yw",
"http://thisdomain.com/c/eJwtjsEKwyAQRL8muUU2qzF68NBLf6NsNGkEbYJryO_XQmHgwTDMTHBLmLzEPjoERBhBgVYGjcAwWw12lAq8DOg7BZliEr7kclAQ_sj97qzXejVWBW2lwcnONM4G1DLJAITr1ie313pyJx8dPpvu-xb3mhJvVN7Hr6WZZ1mZG9u--gMQphdnSmlYLo6fFhiOs8YcOffF8U47lfZpScS1UIgXC4pf7dg_Yw>.",
"https://thisdomain.com/ls/click?upn=Edw9Mjq0OQ4hwVYZdSS19DmzTkQO2hOfz77XO47T2bIMc3fVT4uKWvVpzZJxLCO-2BnMr4_n1llmOec-2BgJkFgpT9Du8t95rbeVygh6Lk33ithME8pCC9rzKG6j6Ja37TSev7QnwrTdkQhfH80qgFSxfMHmNaXGZNOk-2Fah53KlwQ7jgpJujJwj8MSytOl1hYAwh9wbU6yCqiOm0BH8MT1C606xPjKSfXMcQhi6XbDKFSpeCfAX2BSplJFosHqoO-2B47y56WQ-2BMAjh5TPyYzCTBsVurHpCTeYNo17KesLVQSfiE4yBkMNN-2BlStPCGUbntKRMrf-2BnL0cbPriBj1FSi86bbTY6q6vT2wXwB-2BognImKofq803zMLG2JNz6lR1-2Bo7ms72uVRfaNP2xuG3hM2hDzfXDhcTuJXCMrdnKreeZEhSHuS77-2FYXZ1IP35IVzKn6H8MD05V758Ig6FB5GALPf6RS7g7aV-2Fw7U-2FxFGrxjg6QgEWdYh1Jg-3D",
"https://thisdomain.com/ls/click?upn=Edw9Mjq0OQ4hwVYZdSS19DmzTkQO2hOfz77XO47T2bIMc3fVT4uKWvVpzZJxLCO-2BnMr4_n1llmOec-2BgJkFgpT9Du8t95rbeVygh6Lk33ithME8pCC9rzKG6j6Ja37TSev7QnwrTdkQhfH80qgFSxfMHmNaXGZNOk-2Fah53KlwQ7jgpJujJwj8MSytOl1hYAwh9wbU6yCqiOm0BH8MT1C606xPjKSfXMcQhi6XbDKFSpeCfAX2BSplJFosHqoO-2B47y56WQ-2BMAjh5TPyYzCTBsVurHpCTeYNo17KesLVQSfiE4yBkMNN-2BlStPCGUbntKRMrf-2BnL0cbPriBj1FSi86bbTY6q6vT2wXwB-2BognImKofq803zMLG2JNz6lR1-2Bo7ms72uVRfaNP2xuG3hM2hDzfXDhcTuJXCMrdnKreeZEhSHuS77-2FYXZ1IP35IVzKn6H8MD05V758Ig6FB5GALPf6RS7g7aV-2Fw7U-2FxFGrxjg6QgEWdYh1Jg-3D>;",
"https://thisdomain.com/ls/click?upn=Edw9Mjq0OQ4hwVYZdSS19LQ-2FdstCwdNG97aq-2BoKXcUNnhvG3KpLkcq0oeyJNtaudeZ0V_n1llmOec-2BgJkFgpT9Du8t95rbeVygh6Lk33ithME8pCC9rzKG6j6Ja37TSev7QnwrTdkQhfH80qgFSxfMHmNaXGZNOk-2Fah53KlwQ7jgpJujJwj8MSytOl1hYAwh9wbU6yCqiOm0BH8MT1C606xPjKY4AYzg-2BbyFJle44p2Nwr3WIWW3AiLXnesEuTNuz17FZAbx6h2oWpO8I-2FbW4LJl88L6h6QCn5mnYgDikeWl-2FKWL-2BrgosEqEoH-2FskquLIQktySB1kz6M-2FT-2BhXu8C2DdXlfI3ahSRNjQIvkwp-2FzFbTdlxJ32vRnbdSrmTJS97orQlk0q2wr9jr9QMYq4hKUIrjNuyEO7AFhK7N8pzPq-2FNbR4BJEauwBP33v7NWzR-2BQ4VFbdI-2B7E4t04555TlbB0ndkLaGJ2hyI1o1YECwmiqWkceI-3D>[image:",
"https://thisdomain.com/ls/click?upn=Edw9Mjq0OQ4hwVYZdSS19PFHnXZ1cLWsRvhx9RaY-2BQAS5Vos-2BHGFwfuQwfhpbU-2FZ7JjEkayk2WmqvPwVmk2DWQ-3D-3DI3QJ_n1llmOec-2BgJkFgpT9Du8t95rbeVygh6Lk33ithME8pCC9rzKG6j6Ja37TSev7QnwrTdkQhfH80qgFSxfMHmNaXGZNOk-2Fah53KlwQ7jgpJujJwj8MSytOl1hYAwh9wbU6yCqiOm0BH8MT1C606xPjKcCDbuMBMgW5oVfHk-2BCaODfQayFCp9YHYhzQAPKJJbSmYqbtbTZ98nj0XgwxLBsj8NSfuVuXc1KqTFvvMKzlByWqJSDk7JWWOhJEoG3D9NphMRpU69JGqsu-2BDnC8c4XQxC-2BeSx-2FvQ1J0C0dEMt1kAQilciDJK926NIxxyok4LZSp-2FVoIe4H3LLTYGv1H8MSN1R4REVk8n6uCvjmox0-2Blq-2FOUFtwLCOQF-2BkqqM9gbAPhWSBnVBZcwHalHIktdK2pN-2BmXznQQ4R0yYRELFY2-2BcMrI-3D>[image:",
]

Is this a bug? Or is it something we can improve on our side without changing the library?

Thanks!

katlim-br commented 2 years ago

More info: this comes from the text/plain version of an email that was originally HTML. Some clients convert the HTML to a text version so that it renders correctly in any other client.

It seems to be caused by the ">" character.

As a workaround, we added a preprocessing step, text.replace(/>/g, " > "), which pads every ">" with spaces before matching, but it is suboptimal.
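An alternative to padding the input is to clean the matches afterwards. The following is only a minimal sketch of that idea, not something the library provides: `trimAtAngleBracket` and `cleanMatches` are hypothetical helper names, the sample URLs are shortened stand-ins for the matches above, and the de-duplication step is an extra added for illustration. It assumes a raw ">" is never part of the URL itself, which should hold because RFC 3986 does not allow ">" unencoded in a URI.

```javascript
// Hypothetical helper, not part of url-regex-safe: cut a match at the first ">".
// Assumption: a raw ">" never belongs to the URL, so everything from ">"
// onward (">.", ">;", ">[image:") is trailing debris from the text/plain email.
function trimAtAngleBracket(match) {
  const cut = match.indexOf(">");
  return cut === -1 ? match : match.slice(0, cut);
}

// Clean every match, then drop duplicates while keeping first-seen order.
function cleanMatches(matches) {
  return [...new Set(matches.map(trimAtAngleBracket))];
}

// Shortened stand-ins for the matches reported in this issue.
const raw = [
  "http://thisdomain.com/c/abc",
  "http://thisdomain.com/c/abc>.",
  "https://thisdomain.com/ls/click?upn=xyz>;",
  "https://thisdomain.com/ls/click?upn=xyz>[image:",
];

console.log(cleanMatches(raw));
```

In the function from the original post, this could be applied as `return cleanMatches(matchingUrls || []);`. Unlike the `replace` workaround, it leaves the input text untouched and only post-processes what the regex returned.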

niftylettuce commented 2 years ago

If you can submit a PR to fix this, and/or add failing tests, that would be great!

titanism commented 1 year ago

v4.0.0 has been released with this fixed.

Release notes: https://github.com/spamscanner/url-regex-safe/releases/tag/v4.0.0

Note: this version now requires Node v14+.