matomo-org / matomo-log-analytics

Import any kind of server logs in Matomo for powerful log analytics. Universal log file parsing and reporting.
https://matomo.org/log-analytics/
GNU General Public License v3.0
224 stars 118 forks source link

Custom dimensions in Log Analytics to replace custom variables #290

Open sgiehl opened 3 years ago

sgiehl commented 3 years ago

This is currently a port of #214 to 4.x-dev branch.

Will start applying the suggested changes here

sgiehl commented 3 years ago

Since no one else commented, I'll put my opinions:

Sure! Would you have some specs of which parameters you'd expect to choose between custom vars Vs. custom dims?

Since the default behavior before was to always add them as custom variables, I think that should be retained. Then there could be a new option like --use-custom-dimensions (can't think of a better name) that would use custom dimensions instead.

Note that these are still supported. I just amended the documentation with DEPRECATED.

Gotcha, :+1:

There I mimicked the custom variables behavior. Before this PR custom variables were implicitly consumed to store Bot/Not-Bot/HTTP-Code/HTTP-Method. So there is probably a need to further amend the CLI, and make things more explicit. I can take care of amending it you can come up with exact specs of the CLI you think would be the most correct.

Here are some ideas:

  • an --auto-create-dimensions option that when specified will try to create dimensions. if not supplied, and a dimension doesn't exist, we throw an exception. would probably be good in this case to check for each dimension's existence when starting the script, so we don't end up importing some logs & then failing. Maybe it would also be a good idea to check for the dimensions specified in --regex-group-to-...-cdim options, in case of a typo.
  • ... actually that should do it I think, don't need more options.

Thoughts?

Re: tests in PHP, I am currently slower on PHP than Python, so I'd appreciate help on that side 😄

No worries, the system tests in matomo are particularly unfun to newcomers, and I wouldn't want you to stop contributing :)

@diosmosis that was you last comment on #214, which aimed to fix #188 Was about to apply your suggested changes, but then I was wondering if it wouldn't be better to break BC here. Imho it might be better to switch using custom dimensions by default, but maybe provide an option --use-custom-variables to switch back to the old behavior. If I got that right the target was to get rid of custom variables and replace them with custom dimensions. If we don't do that by default new users would likely still use custom variables.

Checking if all custom dimensions exists before the log is being parsed isn't possible if the dynamic resolver is used, as it's determined for each single tracking request which site_id is being used. Not sure how to handle that.

Also I was wondering if we should completely remove the custom variable / dimension for Bot / Non-Bot. Those variables currently contain the user agent as value. Thinking of GDPR I think that's actually a no-go as it's considered as personal information. And that behavior currently can't even be disabled as soon as tracking bots is enabled. In addition the bot detection in log analytics is very rudimentary and inaccurate. It's good in order to filter away the most common bots, but everything else is already done in Matomo using the device detector. If anyone want's to have a custom dimension to identify bots it would be better to do that in a simple custom plugin instead.

Tracking the HTTP-Status-Code is also automatically done. I would actually do that on demand only or at least make it possible to deactivate that

@tsteur @mattab any opinion on that?

mattab commented 3 years ago

Was about to apply your suggested changes, but then I was wondering if it wouldn't be better to break BC here.

IMHO @sgiehl it would be acceptable to break BC for this. (it would have been better to have included this in the 4.0.0 release already but it's still fine). Just need to document it clearly in the chnagelog (can you create a new issue specifically to announce the BC break more clearly?)

Also I was wondering if we should completely remove the custom variable / dimension for Bot / Non-Bot. Those variables currently contain the user agent as value.

Thanks for your explanation. Sounds good to remove the user agent value from it. Would it be possible to still have a Custom Dimension "Bot" set to 0 or 1 based on device detector detection? it would be quite helpful to still be able to segment out the "bots" visits, since we know so much of the traffic is from bot, and people need/want to see the non-bot traffic (which they could do by creating a segment Bot is 0)

Tracking the HTTP-Status-Code is also automatically done. I would actually do that on demand only or at least make it possible to deactivate that

Is there a potential issue to do it automatically? if not, then it might be easier to leave it tracked automatically (and means we don't have to implement a new flag/parameter to the tool to disable the feature)

diosmosis commented 3 years ago

Checking if all custom dimensions exists before the log is being parsed isn't possible if the dynamic resolver is used, as it's determined for each single tracking request which site_id is being used. Not sure how to handle that.

Some ideas:

sgiehl commented 3 years ago

@mattab

Thanks for your explanation. Sounds good to remove the user agent value from it. Would it be possible to still have a Custom Dimension "Bot" set to 0 or 1 based on device detector detection? it would be quite helpful to still be able to segment out the "bots" visits, since we know so much of the traffic is from bot, and people need/want to see the non-bot traffic (which they could do by creating a segment Bot is 0)

Sure, we could do that. But as I explained before, that would be very inaccurate, as the log analytics script only detects a generic set of bots. Matomo does actually detect a lot as it uses the device detector. So if we want to provide an easy way to enable bot tracking we could create a simply plugin that only has that one dimension and is filled with the result of device detector.

Is there a potential issue to do it automatically? if not, then it might be easier to leave it tracked automatically (and means we don't have to implement a new flag/parameter to the tool to disable the feature)

It's not an issue, but we should at least give the possibility to disable it, as it otherwise might consume a custom dimension, where people might need it for something else maybe.

@diosmosis

Some ideas:

  • check when a new site id is encountered, on failure don't track that site and inform user they need to run the command again, while filtering for the specific sites (not sure if an option already exists for that)
  • require custom dim options to mapped to a specific idsite (or set of idsites). so not just mapping regex to custom dimension slot, but also idsite?

Imho that sounds a bit too complex. Maybe we should simply ignore not existing dimensions if the auto-create option is not given. Everything else might also possibly break automatic imports at some point. E.g. when someone removes a custom dimension in the UI, a log import might fail, as the dimension can't be found anymore.

So we should imho document that custom dimensions with specific names are required to import those values (unless the auto-import option is set)

diosmosis commented 3 years ago

Imho that sounds a bit too complex. Maybe we should simply ignore not existing dimensions if the auto-create option is not given.

The only potential problem I see w/ that is that if a user doesn't realize this or doesn't realize that those custom dimensions shouldn't be applied to every log, and the log importer fails silently, then they will have data they don't expect. So they may not realize their data is broken, may create bug reports, and when they realize the problem, will have to re-import, which could be a very complicated and long process depending on how they imported in the first place (and can break some data that requires logs be imported in order). This seems like it could lead to a very poor UX for some users.