matomo-org / plugin-TrackingSpamPrevention

GNU General Public License v3.0

Add exclusion of Direct view and x seconds #112

Open matomoto opened 8 months ago

matomoto commented 8 months ago

There are more bots than those covered by the 5 cloud IP lists, and these bots are also not detectable via the user agent (navigator.userAgent) or the information in navigator.userAgentData.

In my experience, most bots of this kind arrive as a "Direct view" and have a "Visit duration" of only a few seconds.

To exclude these bots, a filter rule with user-defined settings is needed, for example:

Exclude viewers with (example): Direct View = true, Visit Duration < 10 seconds

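// Settings (all getters here are hypothetical, for illustration only):
$tsp_filter_start_seconds = get_tsp_filter_start_seconds(); // example: 10 seconds
$tsp_filter_direct_view = get_tsp_filter_direct_view(); // example: true (filter applies to direct views)

// Properties of the current visit (hypothetical getters as well):
$view_is_direct_view = get_view_is_direct_view(); // example: true
$view_visit_duration = get_view_visit_duration(); // example: 5 seconds

$track_bool = true;

// Exclude only if the filter is enabled, the visit is a direct view,
// and the visit is shorter than the configured threshold.
if (($tsp_filter_direct_view === true) && ($view_is_direct_view === true) && ($view_visit_duration < $tsp_filter_start_seconds)) {
  $track_bool = false;
}

if ($track_bool === true) {
  // track the visit
} else {
  // don't track the visit
  // this branch is reached with the example values, because:
  // direct view: true
  // start seconds: 10
  // visit duration: 5
}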

Furthermore, this filter could be expanded:

Exclude viewers with (example): Direct View = false / true, Referrer = Google / Bing / Wikipedia, Visit Duration < x seconds

So, if the referrer is "Google" and the visit duration is less than x seconds: don't track. This prevents tracking of speedy website hoppers (Google → Website[1] → back to Google → Website[2] → back to Google → Website[3] → back to Google ...).
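As a rough illustration of the expanded rule, here is a minimal JavaScript sketch. The rule and visit shapes (`directView`, `referrers`, `minSeconds`, `referrerName`, `durationSeconds`) are invented for this example and are not part of Matomo or the plugin.

```javascript
// Hypothetical rule set: each rule names a traffic source and a minimum
// visit duration below which the visit would be excluded.
const exampleRules = [
  { directView: true, minSeconds: 10 },
  { directView: false, referrers: ["Google", "Bing", "Wikipedia"], minSeconds: 5 },
];

// Returns true if any rule matches the visit's source and the visit is
// shorter than that rule's threshold.
// visit.referrerName is null for a direct view (an assumption of this sketch).
function isExcluded(visit, rules) {
  const isDirect = visit.referrerName === null;
  return rules.some((rule) => {
    const sourceMatches = isDirect
      ? rule.directView === true
      : (rule.referrers || []).includes(visit.referrerName);
    return sourceMatches && visit.durationSeconds < rule.minSeconds;
  });
}
```

With these example rules, a 3-second visit coming from "Google" would be excluded, while a 30-second visit from the same referrer would be kept.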

snake14 commented 8 months ago

Hi @matomoto. Thank you for this enhancement suggestion. These sound like good ideas for potential spam criteria that could be added. I am marking this to be reviewed and prioritised by our Product team.

AltamashShaikh commented 8 months ago

@matomoto Do you use the DeviceDetector plugin along with the TrackingSpamPrevention plugin?

matomoto commented 8 months ago

@AltamashShaikh In my Matomo instance, DevicesDetection (core) is active and the Tracking Spam Prevention (TSP) plugin is not installed. Until now, TSP has not been of interest to me. I have a bit of my own code (PHP/JS) active to keep out website hoppers, cloud bots, and headless browsers. I have been watching TSP for a long time. Until now I was missing some information about the plugin and its functions. The plugin also needs a few enhancements.

Based on new information: this issue is not about preventing these viewers from being saved in the database (because that is not possible), but about excluding these viewers when the reports are created.

My solution for some time now has been to delay loading the Matomo tracking code by x seconds, on the entry page only (not on further pages). The (example) code snippet is published here: https://forum.matomo.org/t/erfahrungsbericht-matomo-tracking-rauschen/46151/30 The problem with this method is that the visit time lags by those x seconds on entry pages. So the method is not optimal, but it has the advantage of preventing these viewers from being saved in the database. Both in one would be perfect, but it is probably not possible to include that in Matomo.
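For readers unfamiliar with the approach, the following is a minimal sketch of the delay idea in JavaScript. It is not the snippet from the forum post. The function names and the way the entry page is detected are assumptions made for illustration.

```javascript
// Decides how long to wait before loading the tracker.
// `referrer` would typically be document.referrer; `isEntryPage` says whether
// this is the first page of the session (how that is detected, for example
// via a sessionStorage flag, is left open here).
function trackingDelayMs({ referrer, isEntryPage }, delaySeconds) {
  const isDirectView = referrer === "";
  return isDirectView && isEntryPage ? delaySeconds * 1000 : 0;
}

// Calls loadTracker immediately, or after the delay. The timer function is
// injectable so the logic can be exercised outside a browser.
function scheduleTracking(loadTracker, delayMs, setTimer = setTimeout) {
  if (delayMs <= 0) {
    loadTracker();
    return null;
  }
  return setTimer(loadTracker, delayMs);
}
```

In a page, `loadTracker` would be whatever function injects the Matomo tracking script. A visitor who leaves before the timer fires never sends a tracking request, which is also why the recorded visit time on entry pages lags by the delay.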

AltamashShaikh commented 8 months ago

@matomoto Did you try using the timer trigger in MatomoTagManager? That might solve your problem, but I don't see why we should add this to TrackingSpam, as this is a short-lived user, and that is useful for many to determine the bounce rate.

Screenshot from 2024-03-13 05-36-45

AltamashShaikh commented 7 months ago

@matomoto Can we close this issue now?

matomoto commented 7 months ago

@AltamashShaikh I'm still thinking about it. I don't use the Tag Manager and I'm not going to use it either. Besides, it is not possible to combine the trigger/timer rule with "Direct view" and restrict it to entry pages only. https://help.piwik.pro/support/tag-manager/time-on-a-website-trigger/ Matomo core has been thinking about ways to decrease the bounce rate for a long time, but without result. Matomo core is not the right place for this. The Tracking Spam Prevention plugin is.

AltamashShaikh commented 7 months ago

@matomoto Can you please explain how this short visit could be spam for everyone? There could be many reasons for a short visit, and just counting that as spam doesn't seem right to me.

matomoto commented 7 months ago

@AltamashShaikh Yes, it's not obvious. It comes from experience. For some time now I have been making the effort to check the IPs of these viewers (Direct view, only 1 page, few seconds visit time). They are almost never IPs from internet access providers, but from hosters or similar companies (secure check).

Website hoppers are not included because they have a referrer (i.e. no direct view). Some Matomo users only like real viewers and do not count website hoppers that are only on the website for a few seconds. But that is a different topic. This is about direct view. These are almost always bots.

In my experience, the "Direct view, only 1 page, few seconds visit time" viewers are mostly bots, just like the cloud bots, only from other clouds/servers. They come very regularly.

AltamashShaikh commented 7 months ago

@matomoto Can you share a list of IPs that are of hosters for our reference?

matomoto commented 7 months ago

@AltamashShaikh Yes, but it's not filtered down to only these bots, the IPs are shortened, and it's not up to date (some entries are months old). It's my private collection.

23.95.251.0/24
27.115.0.0/17
34.64.0.0/10
34.133.64.0/20
34.135.0.0/20
34.172.0.0/17
35.192.0.0/11
35.184.192.0/20
35.226.80.0/20
35.238.64.0/20
35.239.128.0/20
37.19.211.0/24
42.224.0.0/12
54.190.0.0/16
54.212.0.0/16
54.218.0.0/17
65.128.0.0/11
69.160.160.0/24
79.125.0.0/18
82.165.0.0/16
83.149.64.0/18
102.165.0.0/18
136.244.80.0/20
146.148.0.0/17
156.146.49.0/24
163.116.136.0/24
174.235.48.0/20
176.9.46.0/23
180.160.0.0/13
204.101.0.0/16
205.169.39.0/24
207.102.0.0/16
209.170.64.0/18
2a03:2880:1000::/36
2a03:2880:2000::/36
2600:3c00::/32
2600:3c01::/32
2600:4040:4000::/36
2604:f440::/48
2606:54c0::/32
23.229.*.*
23.236.*.*
24.235.*.*
38.68.*.*
38.152.*.*
38.154.*.*
38.170.*.*
66.84.*.*
66.146.*.*
68.65.*.*
68.234.*.*
69.58.*.*
74.84.*.*
93.104.*.*
111.7.*.*
123.6.*.*
138.229.*.*
139.180.*.*
141.164.*.*
142.147.*.*
149.20.*.*
148.59.*.*
152.44.*.*
154.13.*.*
156.252.*.*
162.244.*.*
167.160.*.*
168.91.*.*
172.81.*.*
172.96.*.*
172.245.*.*
192.149.*.*
192.171.*.*
192.186.*.*
192.198.*.*
192.210.*.*
198.20.*.*
198.245.*.*
199.34.*.*
199.250.*.*
205.185.*.*
206.198.*.*
207.182.*.*
208.103.*.*
209.251.*.*
211.95.*.*
101.227.*.*
101.67.*.*
107.127.*.*
142.132.*.*
189.217.*.*
209.141.*.*
216.180.*.*
216.213.*.*

AltamashShaikh commented 7 months ago

@matomoto Thanks for sharing the IPs. It looks like some of them belong to service:datacenter. But from a quick check, it doesn't look like we get this information from the GeoIP DB. If we did, we could have added an option to exclude visits from certain services. The delay part cannot be added to this plugin; for that you need a custom approach or the Tag Manager approach I shared in the comment above.
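To make the idea of excluding visits by IP range concrete, here is a small JavaScript sketch that checks an IPv4 address against a list in the two notations used in the list above (CIDR such as 34.64.0.0/10, and wildcards such as 23.229.*.*). All function names are invented for this example, IPv6 entries are simply skipped, and nothing here reflects how the plugin or the GeoIP DB works.

```javascript
// Convert a dotted IPv4 string to an unsigned 32-bit integer.
function ipv4ToInt(ip) {
  if (!/^\d{1,3}(\.\d{1,3}){3}$/.test(ip)) {
    throw new Error(`Not an IPv4 address: ${ip}`);
  }
  const parts = ip.split(".").map(Number);
  if (parts.some((p) => p > 255)) {
    throw new Error(`Octet out of range: ${ip}`);
  }
  return ((parts[0] << 24) | (parts[1] << 16) | (parts[2] << 8) | parts[3]) >>> 0;
}

// Turn "23.229.*.*" into "23.229.0.0/16"; CIDR entries pass through unchanged.
// Assumes wildcards only appear as trailing octets.
function normalizeRange(entry) {
  if (!entry.includes("*")) return entry;
  const octets = entry.split(".");
  const fixed = octets.filter((o) => o !== "*").length;
  const base = octets.map((o) => (o === "*" ? "0" : o)).join(".");
  return `${base}/${fixed * 8}`;
}

// True if the address falls inside the given IPv4 CIDR block.
function inCidr(ip, cidr) {
  const [base, bitsText] = cidr.split("/");
  const bits = Number(bitsText);
  // A shift by 32 is a no-op in JavaScript, so /0 needs its own case.
  const mask = bits === 0 ? 0 : (0xffffffff << (32 - bits)) >>> 0;
  return ((ipv4ToInt(ip) & mask) >>> 0) === ((ipv4ToInt(base) & mask) >>> 0);
}

// True if the IPv4 address matches any IPv4 entry in the list.
function isListedIp(ip, ranges) {
  return ranges
    .filter((entry) => !entry.includes(":")) // skip IPv6 entries in this sketch
    .some((entry) => inCidr(ip, normalizeRange(entry)));
}
```

For example, `isListedIp("34.100.1.1", ["34.64.0.0/10"])` is true, because 34.64.0.0/10 spans 34.64.0.0 through 34.127.255.255.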

Screenshot from 2024-03-21 07-29-55

matomoto commented 7 months ago

@AltamashShaikh You forget that I already have my own solution (with JavaScript), and that other users need one too. So including it in a plugin is better than custom JavaScript code, because not all users can handle a JavaScript solution. And in the end, it is only an option that plugin users can choose to use. More than "please" I can't say here.

matomoto commented 7 months ago

I have found a free datacenter IP list: https://github.com/growlfm/ipcat It also includes a few of the cloud IP lists/ranges. Unfortunately, DataCamp Limited (for example) is not included, but many others are.

There are more such lists available, but only for sale.

AltamashShaikh commented 7 months ago

@matomoto We cannot add that delay JS code to this plugin, as it's not designed that way. I can keep this issue open and change the title to "Add exclusion based on services by I/P", and in the future, if the GeoIP DB starts returning the data, we can implement this feature. For now I would recommend using your JS implementation.