matomo-org / matomo

wrong results for getHostName(url) in piwik.js #19634

Open volker-attempto opened 2 years ago

volker-attempto commented 2 years ago

I noticed that the tracked hostname is wrong when the URL contains query parameters that include the @ (at) symbol.

Expected Behavior

The hostname is extracted correctly from the href / URL.

Current Behavior

The string after the first @ sign is treated as the hostname, even when that @ is part of the query string rather than the user-info part.

Possible Solution

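Fix the RegExp in getHostName() in piwik.js: the optional user-info group (?:[^@]+@)? can match across the path and query, and the hostname class [^:/#]+ does not stop at '?'.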

Steps to Reproduce (for Bugs)

  1. Set up the tracking script on a page with a URL like 'http://www.example.org:3000/passwort-zuruecksetzen?email=email@example.com&code=844815'
  2. In the Matomo backend, 'example.com&code=844815' appears as the hostname

To reproduce, I copied the getHostName function from piwik.js into a separate file (hostname.js):

function getHostName(url) {
    // scheme : // [username [: password] @] hostname [: port] [/ [path] [? query] [# fragment]]
    var e = new RegExp('^(?:(?:https?|ftp):)/*(?:[^@]+@)?([^:/#]+)'),
        matches = e.exec(url);

    return matches ? matches[1] : url;
}

console.log(1, getHostName('https://www.example.org'));
console.log(2, getHostName('https://www.example.org?code=1234'));
console.log(3, getHostName('https://user:passwd@www.example.org?code=1234'));
console.log(4, getHostName('https://user:passwd@www.example.org/bla?code=1234'));
console.log(5, getHostName('http://www.example.org:3000/passwort-zuruecksetzen?email=email@example.com&code=844815'));
console.log(6, getHostName('http://www.example.org:3000/passwort-zuruecksetzen?email=email%40example.com&code=844815'));
console.log(7, getHostName('http://user:pass@www.example.org:3000/passwort-zuruecksetzen?email=email@example.com&code=844815'));

The output should IMO always be 'www.example.org'. Actual output of node hostname.js:

1 www.example.org
2 www.example.org?code=1234
3 www.example.org?code=1234
4 www.example.org
5 example.com&code=844815
6 www.example.org
7 www.example.org
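
One possible way to tighten the expression, as a minimal sketch rather than the actual Matomo fix: keep the optional user-info group from crossing '/', '?' or '#', and let the hostname end at '?' as well. The name getHostNameFixed is hypothetical and only used here for illustration.

```javascript
// Hypothetical variant of getHostName (not the code shipped in piwik.js).
function getHostNameFixed(url) {
    // scheme : // [username [: password] @] hostname [: port] [/ [path] [? query] [# fragment]]
    // Changes compared to the original expression:
    //   - user-info part: [^@]+    -> [^@/?#]+  (cannot run into path, query or fragment)
    //   - hostname part:  [^:/#]+  -> [^:/?#]+  (also stops at the start of the query)
    var e = new RegExp('^(?:(?:https?|ftp):)/*(?:[^@/?#]+@)?([^:/?#]+)'),
        matches = e.exec(url);

    // Same fallback as the original: return the input if nothing matches.
    return matches ? matches[1] : url;
}

var urls = [
    'https://www.example.org',
    'https://www.example.org?code=1234',
    'https://user:passwd@www.example.org?code=1234',
    'https://user:passwd@www.example.org/bla?code=1234',
    'http://www.example.org:3000/passwort-zuruecksetzen?email=email@example.com&code=844815',
    'http://www.example.org:3000/passwort-zuruecksetzen?email=email%40example.com&code=844815',
    'http://user:pass@www.example.org:3000/passwort-zuruecksetzen?email=email@example.com&code=844815'
];

// Prints the case number followed by the extracted hostname.
urls.forEach(function (url, i) {
    console.log(i + 1, getHostNameFixed(url));
});
```

With this variant, all seven cases from the reproduction print www.example.org.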

justinvelluppillai commented 2 years ago

Great bug report, thanks! We can improve this regex which is also currently used in optOut.js. I will put it in the queue for prioritisation by the product team.
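
When reworking the regex, the built-in URL parser could serve as a reference to compare against. The sketch below assumes an environment where the global URL constructor exists (current browsers and Node.js); whether piwik.js itself could rely on it depends on which browsers it has to support, so this is meant as a cross-check, not as a replacement. The name getHostNameViaUrl is hypothetical.

```javascript
// Hypothetical reference helper built on the standard URL parser.
function getHostNameViaUrl(url) {
    try {
        // .hostname excludes user info, port, path, query and fragment.
        return new URL(url).hostname;
    } catch (e) {
        // Mirror the fallback of getHostName: return the input unchanged
        // when it cannot be parsed as an absolute URL.
        return url;
    }
}

console.log(getHostNameViaUrl(
    'http://www.example.org:3000/passwort-zuruecksetzen?email=email@example.com&code=844815'
)); // www.example.org
```

Note that the URL parser normalises the hostname (for example, it lowercases it), so results may differ from a purely regex-based extraction for mixed-case input.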