matomo-org / matomo

https://matomo.org/
GNU General Public License v3.0

Normalize referrer domains #4033

Open gka opened 11 years ago

gka commented 11 years ago

The list of referrer websites can be significantly improved by normalizing the domain names. Currently, subdomains such as "www7" are treated as separate websites. Here's an example of such a referrer list, in which you can see that lemonde.fr is listed several times:

[[Image(http://new.tinygrab.com/f3aa221edeba52ea05e91e20b51690a2c38c508b47.png)]]

Of course this is not trivial, as some sub-domains are pointing to separate websites while others are only mirrors or mobile variants of the same site.

To solve this issue, Mozilla maintains a list of "effective" TLD names. This list includes domains such as bl0gsp0t.com and dyndns.org, because X.dyndns.org should be treated as a separate website.

http://mxr.mozilla.org/mozilla-central/source/netwerk/dns/effective_tld_names.dat?raw=1

Using this list, it is easy to normalize the domains, or in other words, to extract the "effective" websites. The list is not perfect (for instance, tumblr.com is missing), but it should solve 95% of the problem.
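To make the normalization step concrete, here is a minimal sketch of how a hostname could be reduced to its "effective" website using such a list. The function name `effectiveDomain` and the tiny hard-coded `$publicSuffixes` array are assumptions for illustration only; the real list is far longer and also contains wildcard and exception rules, which this sketch ignores.

```php
<?php
// Illustrative subset only (assumption): the real effective TLD list has
// thousands of entries plus wildcard and exception rules not handled here.
$publicSuffixes = ['com', 'org', 'fr', 'uk', 'co.uk', 'br', 'com.br', 'dyndns.org'];

// Hypothetical helper: returns the matched suffix plus one label to its left.
function effectiveDomain(string $host, array $suffixes): string
{
    $labels = explode('.', strtolower(trim($host, '.')));
    $count = count($labels);

    // Try candidate suffixes from longest to shortest so that
    // "dyndns.org" wins over "org".
    for ($i = 0; $i < $count; $i++) {
        $candidate = implode('.', array_slice($labels, $i));
        if (in_array($candidate, $suffixes, true)) {
            // If the whole host is itself a suffix, return it unchanged.
            return $i === 0 ? $candidate : implode('.', array_slice($labels, $i - 1));
        }
    }

    // Unknown suffix: fall back to the last two labels.
    return implode('.', array_slice($labels, -2));
}

echo effectiveDomain('www7.lemonde.fr', $publicSuffixes), "\n";
echo effectiveDomain('foo.dyndns.org', $publicSuffixes), "\n";
echo effectiveDomain('www.guardian.co.uk', $publicSuffixes), "\n";
```

With this subset, `www7.lemonde.fr` collapses to `lemonde.fr`, while `foo.dyndns.org` stays a separate website because `dyndns.org` is itself on the list.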

mattab commented 11 years ago

Good idea to use a list to improve the referrer websites report. For the lemonde example, though, I feel that listing all the subdomains brings value, as it helps to see which sub-sites bring the most traffic. lemonde is not in the list, so it makes sense.

We could also implement this as a plugin in the upcoming marketplace at: http://plugins.piwik.org/

gka commented 11 years ago

Another very smart solution would be to just group the visits by domain and subdomain. This seems easier, as we don't need to maintain the effective TLD list at all. The result could look like this:

||= Website =||= Visits =||
|| guardian.co.uk || 503108 ||
|| lemonde.fr || 303471 ||
|| - www.lemonde.fr || 177113 ||
|| - decodeurs.blog.lemonde.fr || 83375 ||
|| - emploi.blog.lemonde.fr || 30323 ||
|| - abonnes.lemonde.fr || 7412 ||
|| - mobile.lemonde.fr || 2652 ||
|| - alicedsl.lemonde.fr || 2596 ||
|| derstandard.at || 58850 ||

Ok, we might still need to maintain a shorter list of effective TLDs that covers country-specific ones such as co.uk, but we don't need to cover company-specific TLDs such as blogsp0t.com, as users can easily unfold the domain to see which blogs are linking most.
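The grouping described above could be sketched roughly as follows. This is only an illustration: `COUNTRY_SUFFIXES`, `parentDomain` and `groupByDomain` are hypothetical names, the short suffix list is deliberately incomplete, and the hostname `www.guardian.co.uk` is an assumed input (the table only shows the grouped domain).

```php
<?php
// Hypothetical short list of country-specific second-level suffixes.
const COUNTRY_SUFFIXES = ['co.uk', 'com.br'];

// Keep the last two labels, or three when the last two form a country suffix.
function parentDomain(string $host): string
{
    $labels = explode('.', strtolower($host));
    $lastTwo = implode('.', array_slice($labels, -2));
    $keep = in_array($lastTwo, COUNTRY_SUFFIXES, true) ? 3 : 2;
    return implode('.', array_slice($labels, -$keep));
}

// Build one parent row per domain, with the original hosts as a subtable.
function groupByDomain(array $visitsByHost): array
{
    $groups = [];
    foreach ($visitsByHost as $host => $visits) {
        $domain = parentDomain($host);
        $groups[$domain]['visits'] = ($groups[$domain]['visits'] ?? 0) + $visits;
        $groups[$domain]['subdomains'][$host] = $visits;
    }
    // Sort parent rows by total visits, largest first.
    uasort($groups, function ($a, $b) {
        return $b['visits'] <=> $a['visits'];
    });
    return $groups;
}

$grouped = groupByDomain([
    'www.lemonde.fr'            => 177113,
    'decodeurs.blog.lemonde.fr' => 83375,
    'emploi.blog.lemonde.fr'    => 30323,
    'abonnes.lemonde.fr'        => 7412,
    'mobile.lemonde.fr'         => 2652,
    'alicedsl.lemonde.fr'       => 2596,
    'www.guardian.co.uk'        => 503108, // assumed hostname
    'derstandard.at'            => 58850,
]);

foreach ($grouped as $domain => $group) {
    echo $domain, ' ', $group['visits'], "\n";
    foreach ($group['subdomains'] as $host => $visits) {
        echo '  - ', $host, ' ', $visits, "\n";
    }
}
```

Because the subdomain rows are kept under each parent, a "flat" view remains available by simply listing the subtables without their parents.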

(btw I hate this comment system which always blacklists my comments just because I include blogsp0t.com. silly!)

mattab commented 11 years ago

Great idea to add a new "view" of the report with subtables showing subdomains.

Maybe we could show this new report as a Related Report footer link, "Websites by Domain", under the "Websites" report.

Or maybe as a "COG" dropdown option.

gka commented 11 years ago

I would prefer making the hierarchical view the new default and then letting the user "make it flat", as we do with the Pages report.

Anyone thinking that the flat view is better than grouping by domain?

mattab commented 10 years ago

Nice idea for a plugin which could filter out the Referrers dataTable to make the grouping as explained here!

gka commented 10 years ago

As a first step toward this, I worked on a PHP implementation for extracting the "effective" domain name of a hostname.

Usage is very simple:

> include('EffectiveDomainName.php');

> print EffectiveDomainName::get('mobile.nytimes.com') . "\n";
nytimes.com

> print EffectiveDomainName::get('flightjs.github.io') . "\n";
flightjs.github.io

> print EffectiveDomainName::get('www.google.com.br') . "\n";
google.com.br

https://github.com/gka/effective-domain-name

mattab commented 9 years ago

@gka Thanks for the tip.

Weird that this issue got closed, I don't think I closed it unless it was by mistake...

It would be relatively easy to create a plugin that either modifies the existing getWebsites report or adds a new related report, in which we call a GroupBy filter that groups rows by "effective domain".

mattab commented 9 years ago

Would you also group t.co under twitter.com?

And maybe group m.facebook.com and lm.facebook.com under facebook.com?

gka commented 9 years ago

Since facebook.com is not listed as an effective TLD (aka "public suffix"), any subdomain *.facebook.com will indeed be "normalized" to facebook.com. However, t.co is not "grouped" with twitter.com, as the two are entirely different domains.

mattab commented 9 years ago

Hi @gka alright

Maybe we could use your list and then customise it with, for example, all known social network domains. I'm setting this to Short term, as it's quite easy to build, at least as a plugin on the Marketplace.

We'd simply apply the normalisation function in a custom filter that groups the row labels by their normalised value. Ideally, it would be possible to disable it from the Cog icon menu.
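As a rough sketch of that idea, the snippet below groups plain PHP rows by a normalisation callback and lets the caller switch the grouping off. It does not use the actual plugin or DataTable API: `groupRows`, the `$aliases` map (including the t.co entry), the simplified "last two labels" normaliser and the sample visit counts are all assumptions made for illustration.

```php
<?php
// Hypothetical customisation: short-link domains mapped to their network.
$aliases = ['t.co' => 'twitter.com'];

// Simplified normaliser (assumption): alias lookup first, then the last two
// labels. A real version would consult the effective TLD list instead.
$normalise = function (string $host) use ($aliases): string {
    $host = strtolower($host);
    if (isset($aliases[$host])) {
        return $aliases[$host];
    }
    return implode('.', array_slice(explode('.', $host), -2));
};

// Sum visits of all rows whose labels normalise to the same value.
// Passing $enabled = false returns the flat, ungrouped rows.
function groupRows(array $rows, callable $normalise, bool $enabled = true): array
{
    if (!$enabled) {
        return $rows;
    }
    $grouped = [];
    foreach ($rows as $row) {
        $label = $normalise($row['label']);
        if (!isset($grouped[$label])) {
            $grouped[$label] = ['label' => $label, 'nb_visits' => 0];
        }
        $grouped[$label]['nb_visits'] += $row['nb_visits'];
    }
    return array_values($grouped);
}

// Made-up sample rows.
$rows = [
    ['label' => 'm.facebook.com',  'nb_visits' => 40],
    ['label' => 'lm.facebook.com', 'nb_visits' => 10],
    ['label' => 't.co',            'nb_visits' => 25],
    ['label' => 'twitter.com',     'nb_visits' => 5],
];

print_r(groupRows($rows, $normalise));
```

Keeping the alias map separate from the suffix-based rule means the social network customisation can be maintained independently of the underlying list.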