matomo-org / matomo

Empowering People Ethically with the leading open source alternative to Google Analytics that gives you full control over your data. Matomo lets you easily collect data from websites & apps and visualise this data and extract insights. Privacy is built-in. Liberating Web Analytics. Star us on Github? +1. And we love Pull Requests!
https://matomo.org/
GNU General Public License v3.0

Keep (some) raw data to regenerate extracted values in log_visit #8955

Open ThaDafinser opened 9 years ago

ThaDafinser commented 9 years ago

The log_visit table currently holds a lot of processed visitor data.

All of these values have one thing in common: they are extracted from the $_SERVER (or similar) request data, and then the original data is lost.

I think it should be possible to keep the raw data so that the processed values can be regenerated later.

Why? Let's take the device as an example. Its value is filled in by the wonderful device-detector, which keeps getting better. But after a device-detector upgrade I am not able to fill in the missing values, because the original value is lost.
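To make the idea concrete, here is a minimal sketch of such a reprocessing pass. It is written in Python only for illustration and is not Matomo code: the row keys `raw_user_agent` and `device`, the function `reprocess_devices`, and the stand-in `toy_detector` are all hypothetical names, and a real job would call the upgraded device-detector instead.

```python
def reprocess_devices(rows, detect):
    """Fill missing 'device' values from the stored raw user agent.

    rows   -- list of dicts with the hypothetical keys 'raw_user_agent' and 'device'
    detect -- callable that maps a user agent string to a device name, or None
    Returns the number of rows that were updated.
    """
    updated = 0
    for row in rows:
        # Only touch rows that have no device yet but still have raw data.
        if row.get("device") is None and row.get("raw_user_agent"):
            device = detect(row["raw_user_agent"])
            if device is not None:
                row["device"] = device
                updated += 1
    return updated


def toy_detector(user_agent):
    # Stand-in for an upgraded detector (assumption, not the real device-detector).
    if "iPhone" in user_agent:
        return "smartphone"
    if "Windows NT" in user_agent:
        return "desktop"
    return None


visits = [
    {"raw_user_agent": "Mozilla/5.0 (iPhone; CPU iPhone OS 9_0)", "device": None},
    {"raw_user_agent": "Mozilla/5.0 (Windows NT 10.0)", "device": "desktop"},
    {"raw_user_agent": None, "device": None},  # raw value lost: cannot be refilled
]
updated_count = reprocess_devices(visits, toy_detector)
```

The third row shows the current situation: without the raw value there is nothing to reparse.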

I think of a very simple solution:

Drawback: each entry grows by a few (kilo)bytes...

Thoughts?

tsteur commented 9 years ago

:+1: :+1: I don't think the number of bytes is a big deal nowadays, and if it is, one can still disable the feature, clear the raw data, set up log deletion, etc.

ThaDafinser commented 9 years ago

The only thing I remember from doing this some years ago in my own little custom log/analysis table is that the $_SERVER variable can get really huge when everything is included.

So it should be limited to the parts that are currently useful.
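A rough sketch of what "limited to the useful parts" could look like, again in Python purely for illustration. The allow-list in `ALLOWED_KEYS`, the function name `select_raw_fields`, and the `max_len` cap are assumptions made for this example, not an existing Matomo setting.

```python
# Hypothetical allow-list: only these request fields would be kept as raw data.
ALLOWED_KEYS = ("HTTP_USER_AGENT", "HTTP_ACCEPT_LANGUAGE", "HTTP_ACCEPT")


def select_raw_fields(server_vars, allowed=ALLOWED_KEYS, max_len=512):
    """Return only the allow-listed entries, each truncated to max_len characters."""
    return {
        key: str(server_vars[key])[:max_len]
        for key in allowed
        if key in server_vars
    }


request_vars = {
    "HTTP_USER_AGENT": "Mozilla/5.0 (X11; Linux x86_64)",
    "HTTP_ACCEPT_LANGUAGE": "de-AT,de;q=0.9",
    "HTTP_COOKIE": "session=secret",      # dropped: not on the allow-list
    "REMOTE_ADDR": "203.0.113.77",        # dropped: not on the allow-list
}
kept = select_raw_fields(request_vars)
```

An allow-list keeps the stored size bounded and avoids accidentally persisting cookies or other sensitive entries.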

mattab commented 8 years ago

Maybe we could start by reducing the scope to storing only the raw user agent value in the visit, assuming the user agent is the most useful field.

We could not store the whole of $_SERVER, as we need to make sure privacy is respected and that fields such as the IP address are properly sanitised.
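As an illustration of the sanitising concern, here is a small sketch that masks the host part of an IP address before anything is stored. It uses Python's standard `ipaddress` module; the function name `anonymise_ip` and the default prefix lengths (24 bits for IPv4, 64 bits for IPv6) are assumptions for this example and would have to follow the configured privacy settings in practice.

```python
import ipaddress


def anonymise_ip(ip, ipv4_keep_bits=24, ipv6_keep_bits=64):
    """Zero out everything after the kept prefix of an IPv4 or IPv6 address."""
    addr = ipaddress.ip_address(ip)
    keep = ipv4_keep_bits if addr.version == 4 else ipv6_keep_bits
    # strict=False lets us build the network from a host address.
    network = ipaddress.ip_network(f"{addr}/{keep}", strict=False)
    return str(network.network_address)


masked_v4 = anonymise_ip("203.0.113.77")
masked_v6 = anonymise_ip("2001:db8:1234:5678:9abc:def0:1234:5678")
```

Any raw field that can carry an address would need the same treatment before being written to the visit.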

hpvd commented 8 years ago

+1 for keeping raw data!

There are also some other very interesting topics that go in the same direction as keeping raw data:

Possibility to give visits a type like "standard", "deleted", "bot" etc. https://github.com/piwik/piwik/issues/9205

Do not delete bots but make them filterable afterwards (simple switch include or ignore them) https://github.com/piwik/piwik/issues/9067

Centralized list to store visits to ignore: bots, deleted visits, spam etc. https://github.com/piwik/piwik/issues/9184

(...and storage is becoming cheaper and faster every day, while the visitor count (data production) on websites tracked with Piwik is not growing at the same speed)

ThaDafinser commented 8 years ago

I added a really simple plugin to add a column with the serialized HTTP headers https://github.com/ThaDafinser/Piwik-KeepVisitorHttpRawData/blob/master/Columns/KeepVisitorHttpRawData.php#L36-L51

_NOTE_ It currently does not respect the privacy settings, and there is not yet a job to reparse the headers after an update.

For now it is just there to get a feeling for how much storage is needed.
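For a quick feeling of the per-visit overhead without installing anything, one can also measure a serialized header set directly. The sketch below is Python and uses JSON as a stand-in for whatever serialization the plugin uses; the function name `serialized_size` and the sample headers are made up for illustration.

```python
import json


def serialized_size(headers):
    """Return the size in bytes of the headers when stored as compact JSON."""
    payload = json.dumps(headers, separators=(",", ":"), sort_keys=True)
    return len(payload.encode("utf-8"))


sample_headers = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64)",
    "Accept-Language": "de-AT,de;q=0.9",
}
bytes_per_visit = serialized_size(sample_headers)
# Extrapolation for a hypothetical one million visits, in bytes.
bytes_per_million_visits = bytes_per_visit * 1_000_000
```

Running this over a sample of real requests would give a more honest estimate than a single hand-written header set.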