matomo-org / matomo

Empowering People Ethically with the leading open source alternative to Google Analytics that gives you full control over your data. Matomo lets you easily collect data from websites & apps and visualise this data and extract insights. Privacy is built-in. Liberating Web Analytics. Star us on Github? +1. And we love Pull Requests!
https://matomo.org/
GNU General Public License v3.0

Allow to use JavaScript tracking and Log Analytics at the same time, and merge the data / deduplicate to avoid double counting #9665

Open hpvd opened 8 years ago

hpvd commented 8 years ago

Sometimes more than one data source is available to describe or document the same activity. In most cases the data sources have different strengths and weaknesses. But by combining them, the picture of reality is always better than when using only one source.

To give an example: there is a place with two different cameras looking at it from two different directions. One camera is an HD color camera mounted at a height of 10 m; the other is a black-and-white model with lower resolution, mounted at a height of 2 m, but it can also take pictures in the dark.

Neither of them can, on its own, document everything happening in the place all day long in perfect quality. But together they don't miss anything.

The same situation exists when trying to track activities using Piwik:

both in the future, when Piwik becomes a "universal activity tracker" with v3, and today, when tracking "only" websites.

With Piwik's JavaScript tracking you can track many, many details. But there are things that may block Piwik's JS: browser settings, browser add-ons, etc.

In this case these visits are not tracked. And what is even worse from a statistics point of view: not only does one not know what these visitors have done, one also does not know how many visits were missed. As a result, some numbers in the statistics, such as the total number of visitors, are badly broken. This may affect other things, e.g. conversion rate, not only in ecommerce (number of visitors / goals reached), impression counting for advertisements, etc.

With Piwik's analysis of server log files, all visitors are tracked, always. But not in the great detail that JS tracking can provide.

=> So why not make it possible to use data from different sources and combine the best of both worlds to build a perfect image of reality?

When starting structural work on the core of Piwik for v3.0, it is a perfect time to think about these possibilities.

tsteur commented 8 years ago

It's a great idea and would be an awesome feature indeed. However, technically it is probably quite difficult. I presume we won't find time to work on this soon, as we can maybe provide more value by spending the time on some other features. It would be really cool if someone could put some thought into how it could work technically. E.g. how can we 100% correctly match a user tracked with JS with some logs from the web server? Not sure if it's possible, especially when requests are coming from the same IP / company.

hpvd commented 8 years ago

Thanks for the positive answer!

Well, of course great things are not always easy, and often they cannot be achieved at the first attempt ;-)

But there are things that could be done relatively easily:

e.g. ensuring that all sources always use the exact same time base is a good start to make syncing and combining possible; accepting and allowing some kind of actions that are not assigned to a user (e.g. page visits from the same IP but different visitors) is another one.

=> With this one can already optimize statistics on all fields where url + counting is enough:

Probably there are some more, especially if one assumes that a visitor in most cases (99%) would not change his "I'm trackable with JS / I'm not trackable with JS" status during a visit to the website....
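The shared time base mentioned above can be sketched as follows. This is only a minimal sketch, assuming the server log uses the Apache-style access log timestamp (e.g. `10/Oct/2016:13:55:36 +0200`) and the JS side reports milliseconds since the epoch; both function names are hypothetical and not part of Piwik.

```php
<?php
// Hypothetical helpers: bring both sources onto one time base (UTC Unix seconds).

// Parse an Apache-style access log timestamp, e.g. "10/Oct/2016:13:55:36 +0200".
function serverLogTimeToUtc(string $raw): int
{
    $dt = DateTimeImmutable::createFromFormat('d/M/Y:H:i:s O', $raw);
    if ($dt === false) {
        throw new InvalidArgumentException("Unparseable log timestamp: $raw");
    }
    return $dt->getTimestamp(); // Unix timestamps do not depend on the time zone
}

// Assumption: the JS side reports Date.now(), i.e. milliseconds since the epoch.
function jsTimeToUtc(int $millis): int
{
    return intdiv($millis, 1000);
}
```

Once both values are plain UTC seconds, the difference between a server hit and a JS hit can be compared directly, regardless of the time zone each source was recorded in.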

hpvd commented 8 years ago

When doing this kind of statistics quality check manually, one would e.g. set up two websites within Piwik for the same website and let

After that one can look for a given period in the data above on both websites and compare them to

tsteur commented 8 years ago

But there are things that could be done relatively easily:

Good point and very true. Interesting approach of using it in 2 sites and comparing. Didn't even think of this initially. It could kind of work like tracking into 2 sites separately; then we check visits / actions against each other and merge them, e.g. into a third site. It's still not super trivial, but a simple proof of concept could maybe be made. There are still challenges, e.g. when the IP is anonymized it will probably be impossible to know whether an individual was already tracked or not. This applies especially to German users.

A nice thing is the new kind of reports it would allow. For example, we could have a site tracked with JavaScript but still have bandwidth reports that are usually only available with the log importer. We would maybe know how many resources were loaded etc. Still, merging this data won't be easy (e.g. when dealing with dates/times to find a matching user there are always problems :) )
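The anonymization challenge mentioned above can be illustrated with a small sketch. It assumes IPv4 addresses, a 64-bit PHP build, and an anonymization scheme that zeroes the trailing bytes of the address; the function name is hypothetical.

```php
<?php
// Hypothetical helper: zero the last $bytes bytes of an IPv4 address.
function anonymizeIpv4(string $ip, int $bytes): string
{
    $long = ip2long($ip);
    if ($long === false || $bytes < 0 || $bytes > 4) {
        throw new InvalidArgumentException("Invalid input: $ip / $bytes");
    }
    // Build a mask with the lowest ($bytes * 8) bits cleared.
    $mask = (0xFFFFFFFF << ($bytes * 8)) & 0xFFFFFFFF;
    return long2ip($long & $mask);
}

// Two different visitors with different addresses...
$a = anonymizeIpv4('203.0.113.17', 2);
$b = anonymizeIpv4('203.0.200.99', 2);
// ...end up with the same stored value once two bytes are masked,
// so the IP alone can no longer tell them apart.
$indistinguishable = ($a === $b);
```

If only one of the two sources stores anonymized addresses, an exact IP comparison between them will never match, so any merge would have to compare the masked form on both sides.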

hpvd commented 8 years ago

merge them eg into a third site

Perfect idea! ...good discussion takes one further than one can go alone.

=> regarding the other points you are mentioning: looks like you got hooked on this idea :-)

hpvd commented 8 years ago

For doing a combination like that, it would help very much to keep as much raw tracking data as possible and to process and filter it later if needed (hide bots, spam, deleted visits).

Hmm, the more I think about it, keeping raw data is not only helpful but essential to have a chance of doing combinations in an efficient way (and many other things). See +1 keeping raw data https://github.com/piwik/piwik/issues/8955#issuecomment-178479720

(and storage is becoming cheaper and faster every day, but visitor count (data production) on websites tracked with Piwik is not growing at the same speed)

gaumondp commented 8 years ago

Another use case for "more than just JS tracking": external file downloads.

If someone links to a file from their own website, Piwik's JS tracking alone will not be enough, since such downloads never trigger the Piwik tracker at all.

I have exactly that request right now: "more precise" (external) download numbers, which are only possible through Apache log files...

tsteur commented 8 years ago

that's a pretty good use case!

hpvd commented 8 years ago

To make this usable (and in general), Log Analytics should be easier to use / accessible to more users. Opened a new ticket for this: #9711

hpvd commented 8 years ago

Being able to easily compare JS tracking results with log import tracking results would make it easier to notice and identify problems and implausible values in either of them. So the quality of the resulting data would rise further.
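Such a comparison could look roughly like the sketch below. It assumes both sources have already been reduced to lists of hits with a `url` key; the function names and the array layout are hypothetical, not Piwik's data model.

```php
<?php
// Count hits per URL for one source. Each hit is an array with a 'url' key.
function countByUrl(array $hits): array
{
    return array_count_values(array_column($hits, 'url'));
}

// For every URL seen in the server log, report how many hits JS tracking missed.
function missedByJs(array $serverHits, array $jsHits): array
{
    $js = countByUrl($jsHits);
    $report = [];
    foreach (countByUrl($serverHits) as $url => $serverCount) {
        $jsCount = $js[$url] ?? 0;
        $report[$url] = [
            'server' => $serverCount,
            'js'     => $jsCount,
            // A negative value means JS saw more hits than the server log: implausible.
            'missed' => $serverCount - $jsCount,
        ];
    }
    return $report;
}
```

In practice the server side would first have to be filtered (bots, static resources), otherwise the "missed" column overstates what JS tracking actually lost.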

masteranalyze commented 8 years ago

:+1: for the idea, this is exactly what I was thinking: "Interesting approach of using it in 2 sites and comparing. Didn't even think of this initially. It could kinda work like tracking into 2 sites separately, then we check visits / actions against each other and merge them eg into a third site. It's still not super trivial but a simple proof of concept could be maybe made."

2 sites, one tracked with JavaScript, one with server-side tracking, and a 3rd website matching the data.

On the server side, to get a real picture: are we currently able to filter GOOD BOTS + BAD BOTS?

If we can filter :+1: GOOD BOTS :+1: BAD BOTS :+1: real humans, we can practically get a real picture.

Most of the good bots can of course be easily identified, because they follow good practice, like having the word "bot" in their name: googlebot, bingbot, adsensemediabot, etc. Others use the word "crawler" or "robot". The problem, I think, is identifying the bad bots... maybe somebody has an idea how to filter today's bad bots, which use neither "bot" nor "crawler" nor "robot", etc.
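The keyword approach for self-declared bots described above can be sketched like this. It is a naive illustration only; the function name is hypothetical and the keyword list is just the three words named above.

```php
<?php
// Naive classifier: flags user agents that announce themselves as bots.
// It cannot detect "bad" bots that send a browser-like user agent.
function looksLikeDeclaredBot(string $userAgent): bool
{
    foreach (['bot', 'crawler', 'robot'] as $keyword) {
        // stripos() is case-insensitive, so "Googlebot" and "BingBot" both match.
        if (stripos($userAgent, $keyword) !== false) {
            return true;
        }
    }
    return false;
}
```

A plain substring check can also produce false positives for legitimate user agents that happen to contain one of the keywords, and it says nothing about bots that hide their identity, which is exactly the open question raised here.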

On Joomla, for example, there is the EORISIS Piwik plugin, and there is another one, if I remember correctly from yoat or something like that, which is only for server-side tracking.

Eorisis Piwik can track on Joomla with: JavaScript, JavaScript + image, server side. The other plugin can track only server side.

I actually tried to run both on the same website, Eorisis with JavaScript and the other one with server-side tracking. The problem is that if you enable both plugins, Joomla crashes; they conflict with each other, so you can't compare data.

Anyhow, this should be done as @hpvd noted here: https://github.com/piwik/piwik/issues/9711

And "great details" like screen resolution and plugins used can be solved if we implement a misc tracker like AWStats does. I can detail this, as it's documented and can be done for Piwik as well.

https://github.com/piwik/piwik/issues/9963#issuecomment-202895669

Like @hpvd said: "With Piwik's analyses of server log files, all visitors are tracked -always." The server logs are the only certain thing that you control as a website owner.

Maybe we can set this up as a milestone for Piwik 3.

masteranalyze commented 8 years ago

@tsteur about : "There are still challenges eg when IP is anonymized it will be probably impossible to know if an individual was already tracked or not. This applies especially to German users."

Can you detail this? Maybe I can help. Give a more precise example of what you mean and of which IPs you are talking about.

@tsteur about : " A nice thing is the new kind of reports it would allow. For example we could have a site tracked with JavaScript, but still have bandwidth reports that are usually only available with log importer. We would maybe know how many resources were loaded etc. Still, merging this data won't be easy (eg when dealing with dates/times to find matching user etc there are always problems :) ) "

Why not just have 2 websites so we can compare, if anyone wishes that, and maybe implement a misc tracker, as AWStats does, to get the users' resolution, plugins used and so on?

That way, via server logs, the nice data tracked with JavaScript won't be missed, and users with JavaScript enabled can be tracked directly into just 1 website.

And the real picture of the data can be achieved only if we can filter REAL HUMANS, GOOD BOTS + BAD BOTS (I think filtering the bad bots is harder), and if we implement what @hpvd said in #9711, Piwik will be the only analytics tool with real data. If we can filter that, people will be able to use either JavaScript or server-side tracking, or both for comparison on the same website.

nicolasbadia commented 7 years ago

Being able to combine server logs and JavaScript tracking logs is also one of the first things I thought about when I saw the log analytics features.

I don't really know how Piwik works internally, but what seems feasible and really reliable to me would be to use an iterative process to merge JavaScript tracking logs into server logs. Server logs would be our reference, as we are 100% sure they are correct. Then we would try to find a matching JS log for each of them. For this, we could run several loops which become less and less restrictive when merging the data. Here is an example of the conditions we could use:

If we can't find a matching server log for a JS log, we ignore the JS log and add it to a no_matching_server_log.log file (which we might use to improve our process).

I believe this would avoid the need for 2 sites, which I do not find really practical from a user perspective.

Here is a basic PHP implementation of what I am thinking of:

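// Merge JS tracking logs into server logs; server logs are the reference.
// Each log entry is an array with 'time' (Unix seconds), 'url' and 'ip'.
foreach ($serverLogs as &$sl) {
  $match = null;

  // Pass 1 requires the same IP; pass 2 drops the IP condition.
  foreach ([true, false] as $requireIp) {
    // Widen the time window step by step, from 0.5 s up to 60 s.
    for ($sec = 0.5; $match === null && $sec < 60; $sec *= 1.5) {
      foreach ($jsLogs as $jk => $jl) {
        // abs() so that a JS hit far in the past does not count as a match.
        if (abs($jl['time'] - $sl['time']) < $sec
            && $jl['url'] === $sl['url']
            && (!$requireIp || $jl['ip'] === $sl['ip'])) {
          $match = $jk;
          break;
        }
      }
    }
  }

  // Strict null check: a matching key may be 0, which is falsy.
  if ($match !== null) {
    $sl['jsLog'] = $jsLogs[$match];
    unset($jsLogs[$match]); // each JS log is merged at most once
  }
}
unset($sl); // drop the reference left over from the loop
// Whatever remains in $jsLogs had no matching server log.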

Any thoughts on this?

mattab commented 7 years ago

Hi @nicolasbadia, yes, that's the general idea (didn't look at the pseudo code). We'd need to make it really efficient and implement this feature directly in the Piwik Tracking API somehow, not in the log importer script.
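One way the efficiency concern could be addressed is to index the JS logs by IP and URL, so that each server log only inspects plausible candidates instead of scanning every JS log. This is only a sketch under stated assumptions: the function name, the array layout (`time` in Unix seconds, `url`, `ip`) and the 60-second window are hypothetical, and unlike the iterative proposal above it only performs the strict pass that requires the same IP.

```php
<?php
// Sketch: match server logs to JS logs via a hash index instead of nested scans.
function mergeLogs(array $serverLogs, array $jsLogs, float $maxSeconds = 60.0): array
{
    // Index JS logs by "ip|url" so each lookup touches only plausible candidates.
    $index = [];
    foreach ($jsLogs as $key => $jl) {
        $index[$jl['ip'] . '|' . $jl['url']][$key] = $jl;
    }

    foreach ($serverLogs as &$sl) {
        $bucket = $sl['ip'] . '|' . $sl['url'];
        $best = null;
        $bestDelta = $maxSeconds;
        foreach ($index[$bucket] ?? [] as $key => $jl) {
            $delta = abs($jl['time'] - $sl['time']);
            if ($delta < $bestDelta) { // keep the closest candidate in time
                $best = $key;
                $bestDelta = $delta;
            }
        }
        if ($best !== null) {
            $sl['jsLog'] = $index[$bucket][$best];
            unset($index[$bucket][$best]); // a JS log is merged at most once
        }
    }
    unset($sl); // drop the reference left over from the loop

    return $serverLogs;
}
```

Building the index takes one pass over the JS logs, after which each server log is compared only against the JS logs that share its IP and URL.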

mattab commented 5 years ago

Question/comment from a user in email

Is it possible, when using the on-premise version of Matomo, to use the 'log analytics' method by default, but enable the 'javascript tracking' on a user-by-user basis?

Further context: According to the PECR we cannot create or access a cookie on a user's device for non-essential purposes without consent. According to the ICO, the GDPR's definition of consent applies here, which means it needs to be an explicit opt-in. Also according to the ICO, web analytics is not essential. Therefore, we cannot use the 'javascript tracking' unless the user gives consent, as you set a cookie in order to do this. Obviously, we don't want to completely lose tracking if the user does not consent, so we would like to be able to fall back to the 'log analytics' method.

Note: it's possible to disable cookies in Matomo tracker.

kwisatz commented 5 years ago

Question/comment from a user in email

Further context: According to the PECR we cannot create or access a cookie on a user's device for non-essential purposes without consent. According to the ICO, the GDPR's definition of consent applies here, which means it needs to be an explicit opt-in. Also according to the ICO, web analytics is not essential. Therefore, we cannot use the 'javascript tracking' unless the user gives consent, as you set a cookie in order to do this. Obviously, we don't want to completely lose tracking if the user does not consent, so we would like to be able to fall back to the 'log analytics' method.

Note that this is the exact question that brought me here. Using log analytics whenever consent hasn't been given and augmenting that data with js tracking if it has would be really useful.

Also, I honestly don't know whether cookies are the actual issue. The "idea" of GDPR is to ask consent for processing data, and whether you set a cookie or not, you'll still be processing personal data by injecting the JavaScript snippet and even by analysing logs. IANAL, but saying "we don't set a cookie and that makes all the problems go away" seems a little simplistic.

KokoKoder commented 5 years ago

I think you are correct, kwisatz: GDPR doesn't allow processing Apache logs for tracking purposes if consent was not given.

‘processing’ means any operation or set of operations which is performed on personal data or on sets of personal data, whether or not by automated means, such as collection, recording, organisation, structuring, storage, adaptation or alteration, retrieval, consultation, use, disclosure by transmission, dissemination or otherwise making available, alignment or combination, restriction, erasure or destruction;

mattab commented 4 years ago

This other technical solution is interesting too: https://github.com/matomo-org/matomo/issues/13023 using an SDK, e.g. a PHP SDK, on the server rather than using log files. It has some upsides (not having to use log analytics) and downsides (only works for PHP, will need an SDK implementation for each language, might be hard to send asynchronous https requests without performance impact to the site, ...)

masteranalyze commented 4 years ago

I think you are correct, kwisatz: GDPR doesn't allow processing Apache logs for tracking purposes if consent was not given.

GDPR, folks, applies as a general rule of thumb to Europe only. GDPR has nothing to do with your server or with your server location, which can be OFFSHORE, and nothing to do with your Apache logs. GDPR is a legal act that applies only in EU countries; if my servers are outside the EU, the EU has no jurisdiction. Whether you follow those rules or not is your problem.

The fact that a piece of software offers the option to track analytics data, which is a must for any website, with or without cookies, is called a software option. If you don't like that option, you have the option not to use it; you have the option not to use any software that is not suitable for your use.

GDPR was made in order to protect users from viruses and malicious injections via cookies. GDPR has nothing to do with the way you process your data on your servers; that is strictly your problem.

You are the master of your analytics, not the user.

Should we ask the user: Hey user, do you consent to receive $100 right now?! Just click yes!

Users are just users; they are not a technically specialised IT workforce. Depending on the question you ask, any user could answer YES or NO, but for a webmaster and his analytics it is not important whether the answer is YES or NO; what matters is being able to see the real picture of his website.

Also, all websites have what is called a TOS. In your terms and conditions you can write your own website rules, so if a user wants to use your website, by accepting your TOS he accepts cookies and all your terms and conditions; otherwise he can go to any other website if he does not like your TOS, simple as that. In order to use my website you must respect my TOS; if you don't want to respect my TOS, go somewhere else, simple as that.

GDPS, JSLSA, EJAIS cannot dictate my TOS, as GDPR is not paying for servers, technical support, etc. It is not their business, it is your business. Or will GDPR pay for all the losses of your business, or what? Businesses can take critical decisions based on analytics, and because of those decisions a company can grow or a company can go bankrupt.

If I have 100,000 users who do not give CONSENT TO COOKIES and 100 who do, I will know that I had 100 visits, not 100,100 visits, which is something else. As a web analytics webmaster I want my analytics to be clear, not fake because of some stupid non-technical bureaucrats who issue all kinds of laws, which are even more stupid than they are.

Instead of using fake analytics and data, you had better use no analytics at all; you can go BLIND, by GDPR, KHFAL, KJGM or whatever stupidity they might think of next time.

Whether the user gives consent or not, they are not protecting the user, especially on any clean white-hat website.

Now on a black-hat website, whose main purpose is to infect users with a virus via malware, do you really think GDPR can protect the users from the "bad guys"???

The only way DUMB users can protect themselves is IT education; only education can protect them. If we pass a law tomorrow saying ALL USERS of the internet will be protected because we say so, do you think the creators of viruses and malware care about what we write on some paper and will not harm the dumb users?? Of course they will harm them, no matter what law is or is not written. You can't just pass a law and automatically protect anyone, so that all bad people become ANGELS from tomorrow and everyone is happy; unfortunately this is not the way things work in this world.

GDPR cannot protect anything; it is just some rules that you should follow, and it was written in order not to harm the user with viruses, malware, etc. With web analytics you cannot harm anyone; you are just collecting data about your users, you are not infecting people with viruses or malware by tracking their actions.

For example, take FREE websites that are based on advertising; without advertising those websites are dead. As lots of users use AdBlock, uBlock and all kinds of blockers, webmasters implemented solutions to detect users who use an ad blocker, and as a user you must UNLOCK the website so you will see ads; otherwise you can go wherever you want, but you cannot access my servers, my website, my resources, etc.

It's the final choice of the user: if he wants to enter my house, he needs to respect my rules; if he does not want to respect them, no problem, he will not enter my house, very simple.

masteranalyze commented 4 years ago

JavaScript unique identifier: IP

Server-side unique identifier: IP

Server logs would be our reference as we are 100% sure they are correct. Then we would try to find a matching JS log with it.

IP as unique identifier, for merging accurate data.

The server log is the 100% correct reference every time; the JavaScript log has the same unique identifier, the IP.

For JavaScript logs not found, as nicolasbadia said: "If we can't find a matching JS log for a server log, we ignore the JS log and add it to a no_matching_server_log.log file (which we might use to improve our process)."

But I don't think there will be such a case, because on the server side everything is tracked 100%, and we just need to merge the JavaScript reports into the same user report; the main identifier in both worlds is the IP address.

LeoniePhiline commented 4 years ago

GDPR was made in order to protect users from viruses and malicious injections via cookies. GDPR has nothing to do with the way you process your data on your servers; that is strictly your problem.

This is just plain false. GDPR protects users from aggregation and creation of unwanted online profiles.

masteranalyze commented 4 years ago

It is not false; that was their initial intention.

GDPR does not protect users from aggregation, creation or even selling of their user data, because the user has to accept the TOS of the website; the user cannot do anything except leave the website if he does not accept its TOS.

Users are not owners of websites. They don't even need GDPR if they don't like the TOS of a website; it is simple: exit.

But if the user registers on that website and gives consent by accepting its TOS, GDPR will not protect that user. If that user requests that his info be deleted from the website, the webmaster will just delete that user, and that user won't have access to the resources of that website anymore; it is very simple.

It is a "false protection". It is like someone giving you a written law on paper right now saying that they will protect users against Coronavirus; unfortunately they cannot do anything, and they can't throw Coronavirus in jail, because Coronavirus does not know any law.

You have to protect yourself by education, not by relying on some bureaucratic piece of paper; they won't protect anyone, neither the users nor the webmasters.

Anyhow, I think GDPR is off topic, because the topic is: Allow to use JavaScript tracking and Log Analytics at the same time, and merge the data / deduplicate to avoid double counting; it is not about GDPR or KClm or whatever they will invent in the future.

LeoniePhiline commented 4 years ago

GDPR does not protect users from aggregation, creation or even selling of their user data, because the user has to accept the TOS of the website; the user cannot do anything except leave the website if he does not accept its TOS.

Nope – GDPR states that your service has to be usable regardless of the user's agreement for having their behavior tracked.

But if the user registers on that website and gives consent by accepting its TOS, GDPR will not protect that user. If that user requests that his info be deleted from the website, the webmaster will just delete that user, and that user won't have access to the resources of that website anymore; it is very simple.

Nope – the "webmaster", as you call them, has to delete the data you are asking to get deleted. Exceptions are data that they are obliged to keep for legal reasons (e.g. for their tax declaration).

It is a "false protection". It is like someone giving you a written law on paper right now saying that they will protect users against Coronavirus; unfortunately they cannot do anything, and they can't throw Coronavirus in jail, because Coronavirus does not know any law.

Under GDPR, those who do not adhere to the "written law on paper" can be and are being fined. The fines are a lot higher than what you are going to want to pay.

masteranalyze commented 4 years ago

Nope – GDPR states that your service has to be usable regardless of the user's agreement for having their behavior tracked.

You're in error. GDPR is not the owner of the website and server. If I don't want you in my club because you don't dress like my TOS says (white shirt), you're out of the club. If I want to throw you out, I throw you out, period. If you don't respect my house and my rules, go build your own. The server can be in a location where GDPR does not apply.

Yes, the data requested by the user would be deleted; however, as the user submitted that data voluntarily and was not forced by anyone, what they post, share or request is the user's responsibility.

GDPR is not the owner of any website. And it applies only in the EU.

GDPR has nothing to do with the features which are on/off in a piece of software.

There is no fine if your server is in Russia or Sudan or Belize; they have no jurisdiction over there.

Like I said, GDPR is just an EU directive; they cannot impose an EU directive outside their jurisdiction.

If you don't like the TOS of a website, they have no obligation to make that site or its resources available to you, simple as that. They can even ban your IP and you're gone if you are a troublemaker for that website owner.

martin-neumann-gurus commented 3 years ago

Chiming in on the GDPR discussion. GDPR says that for profiling a user's activity you need to have their permission. That means they need to agree even before you set any cookies. But it allows processing of statistical data on the grounds of the legitimate interest of the website provider. You can collect statistical data without needing to ask for consent. But even then IP addresses need to be stored anonymized. I believe, though, that it would be justifiable to match the log file IP with the JavaScript-tracked IP and anonymize them after the fact for permanent storage.
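The ordering suggested above (match on the full IP first, anonymize only before permanent storage) could be sketched as follows. This only illustrates the order of the two steps; it assumes IPv4 addresses, a 64-bit PHP build, entries that were already matched in memory and carry an optional nested `jsLog` record, and a hypothetical function name.

```php
<?php
// Assumption: entries were already matched on the full IP in memory.
// Before permanent storage, mask the trailing bytes of every IPv4 address.
function anonymizeForStorage(array $mergedLogs, int $bytes = 2): array
{
    $mask = (0xFFFFFFFF << ($bytes * 8)) & 0xFFFFFFFF;

    return array_map(function (array $entry) use ($mask): array {
        $entry['ip'] = long2ip(ip2long($entry['ip']) & $mask);
        if (isset($entry['jsLog'])) {
            // The nested JS record must not keep the full IP either.
            unset($entry['jsLog']['ip']);
        }
        return $entry;
    }, $mergedLogs);
}
```

The full addresses exist only for the duration of the matching step; everything that reaches storage carries the masked form.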