matomo-org / matomo

Empowering People Ethically with the leading open source alternative to Google Analytics that gives you full control over your data. Matomo lets you easily collect data from websites & apps and visualise this data and extract insights. Privacy is built-in. Liberating Web Analytics. Star us on Github? +1. And we love Pull Requests!
https://matomo.org/
GNU General Public License v3.0

[Bug] Incorrect aggregated info for imported data #21496

Open vkovalcik opened 10 months ago

vkovalcik commented 10 months ago

What happened?

After importing data from Google Analytics, and even after invalidating everything, core:archive seems to ignore the actual day data and creates week archives with 0 visits (the same happens for month and year archives). Not ALL weeks are 0, though. There is usually one week per month with some data in it (I guess copied from just a single day or two).

When I view this data in the web UI and select a Custom Range of up to six days, everything seems to work and the total stats are correct.

When I select 7 days (even if it runs from a Wednesday to the next Tuesday), it shows "0" for most stats.

See the difference in the stats below the charts. Working one:

[Screenshot: pastel cz, from 2016-11-03 to 2016-11-08, Web Analytics Reports, Matomo]

And a wrong one with the same days plus one more:

[Screenshot: pastel cz, from 2016-11-03 to 2016-11-09, Web Analytics Reports, Matomo]

What should happen?

There should be correct week stats in the database, and the UI should show them correctly. (I am not sure whether these are two separate issues, but perhaps not.)

How can this be reproduced?

This will be tough :/ I did it like this:

I imported Google Analytics data from 2007 to 2023 on a different computer (but with the Matomo and DB settings copied) under a different siteid. I then migrated from this "testing" DB to the DB with my live stats and changed the siteid in phpMyAdmin, so there are some possible points of failure, but it seems to work at the day level.

Then I invalidated all the reports using the Invalidate plugin (which, per the docs, seems to keep day data when there are no logs for the respective days).

Then I ran core:archive, which printed a lot of “Archiving week XY: 0 total visits” lines, even though that is incorrect.

Matomo Version

Matomo 4

Matomo Patch or Minor Version

4.15.1

PHP Version

8.1.18

Server Operating System

Debian GNU/Linux 9

What browsers are you seeing the problem on?

Firefox

Computer Operating System

Windows 10

Relevant log output

This is the output of core:archive. This part shows the dates around the chart, but other dates are similar.

INFO [2023-11-02 13:52:39] 13721  Archived website id 1, period = month, date = 2016-12-01, segment = '', 46 visits found. Time elapsed: 1.822s
INFO [2023-11-02 13:52:39] 13721  Archived website id 1, period = week, date = 2016-11-21, segment = '', 0 visits found. Time elapsed: 1.824s
INFO [2023-11-02 13:52:39] 13721  Archived website id 1, period = week, date = 2016-11-28, segment = '', 71 visits found. Time elapsed: 0.574s
INFO [2023-11-02 13:52:39] 13721  Archived website id 1, period = week, date = 2016-11-14, segment = '', 0 visits found. Time elapsed: 0.576s
INFO [2023-11-02 13:52:39] 13721  Archived website id 1, period = week, date = 2016-11-07, segment = '', 0 visits found. Time elapsed: 0.579s
INFO [2023-11-02 13:52:40] 13721  Archived website id 1, period = month, date = 2016-11-01, segment = '', 184 visits found. Time elapsed: 0.752s
INFO [2023-11-02 13:52:40] 13721  Archived website id 1, period = week, date = 2016-10-24, segment = '', 0 visits found. Time elapsed: 0.754s
INFO [2023-11-02 13:52:40] 13721  Archived website id 1, period = week, date = 2016-10-17, segment = '', 0 visits found. Time elapsed: 0.756s
INFO [2023-11-02 13:52:41] 13721  Archived website id 1, period = week, date = 2016-10-31, segment = '', 71 visits found. Time elapsed: 0.621s
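To pick out the affected periods from a long core:archive run, the log lines above can be parsed mechanically. The following is only a rough sketch: the function names are made up for illustration, and the rule "a 0-visit week that lies entirely inside a month with visits" is a heuristic for finding candidates to compare against the day archives, not proof that a week is wrong.

```python
import re
from datetime import date, timedelta

# Matches the "Archived website id ..." lines in the format shown in the log above.
LINE_RE = re.compile(
    r"Archived website id (?P<site>\d+), period = (?P<period>\w+), "
    r"date = (?P<date>\d{4}-\d{2}-\d{2}), segment = '(?P<segment>[^']*)', "
    r"(?P<visits>\d+) visits found"
)

def parse_archive_log(text):
    """Return a list of (site, period, start_date, visits) tuples."""
    return [
        (int(m["site"]), m["period"], date.fromisoformat(m["date"]), int(m["visits"]))
        for m in LINE_RE.finditer(text)
    ]

def suspicious_weeks(rows):
    """Start dates of 0-visit weeks lying entirely inside a month that has visits."""
    months = {(s, d.year, d.month): v for s, p, d, v in rows if p == "month"}
    flagged = []
    for site, period, start, visits in rows:
        if period != "week" or visits != 0:
            continue
        end = start + timedelta(days=6)
        inside_one_month = (start.year, start.month) == (end.year, end.month)
        if inside_one_month and months.get((site, start.year, start.month), 0) > 0:
            flagged.append(start)
    return sorted(flagged)

# A few lines copied from the log output above.
SAMPLE = """\
INFO [2023-11-02 13:52:39] 13721  Archived website id 1, period = week, date = 2016-11-21, segment = '', 0 visits found. Time elapsed: 1.824s
INFO [2023-11-02 13:52:39] 13721  Archived website id 1, period = week, date = 2016-11-28, segment = '', 71 visits found. Time elapsed: 0.574s
INFO [2023-11-02 13:52:39] 13721  Archived website id 1, period = week, date = 2016-11-07, segment = '', 0 visits found. Time elapsed: 0.579s
INFO [2023-11-02 13:52:40] 13721  Archived website id 1, period = month, date = 2016-11-01, segment = '', 184 visits found. Time elapsed: 0.752s
INFO [2023-11-02 13:52:40] 13721  Archived website id 1, period = week, date = 2016-10-24, segment = '', 0 visits found. Time elapsed: 0.754s
"""

for week_start in suspicious_weeks(parse_archive_log(SAMPLE)):
    print("check day archives for week starting", week_start)
```

The week starting 2016-10-24 is not flagged here only because the sample contains no month line for October; with the full log it would be compared against that month's total as well.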


MatomoForumNotifications commented 10 months ago

This issue has been mentioned on Matomo forums. There might be relevant details there:

https://forum.matomo.org/t/incorrect-aggregated-info-for-imported-data/53897/5

Stan-vw commented 10 months ago

Keen to hear if more people have this issue. It sounds like it might be a result of this very specific setup. If more people run into it, that will help us understand the problem and prioritise it.

vkovalcik commented 10 months ago

I completely understand. I am waiting for the final release of Matomo 5 to upgrade and see whether the bug happens to vanish :) If not, I will try to dig a bit deeper into this to see if I can find something interesting.

vkovalcik commented 9 months ago

EDIT: I have detached the bug from this single comment to a separate issue: https://github.com/matomo-org/matomo/issues/21808

OLD: After a lot of digging in the code and messing with phpMyAdmin I think I found the underlying issue:

In some cases core:invalidate-report-data invalidates even day data for which there are no logs. (Furthermore, that data is subsequently deleted, probably during some maintenance operation... fortunately, I have some backups.)

What exactly happens:

During invalidation, ArchiveInvalidator::findOlderDateWithLogs() checks whether log data actually exists for the archive, but the check only uses the number of days from the "Delete logs when older than..." option. If the user tries to invalidate archives older than that number of days, the invalidation doesn't proceed. However, since for me this option is essentially set to infinity, the check always succeeds and the archives are happily invalidated even though there are no matching logs.

I guess the actual minimum date of the entries in the logs should be used instead.
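To make the difference concrete, here is a small model of the two checks. This is not Matomo's code: both function names and their parameters are hypothetical, and the first function only mirrors the behaviour as described in this comment.

```python
from datetime import date, timedelta

def may_invalidate_by_retention(archive_date, today, delete_logs_older_than_days):
    """Model of the check as described above: only the retention setting is consulted.

    delete_logs_older_than_days=None stands for "never delete logs".
    """
    if delete_logs_older_than_days is None:
        return True  # no cutoff at all, so every date passes the check
    cutoff = today - timedelta(days=delete_logs_older_than_days)
    return archive_date >= cutoff

def may_invalidate_by_min_log_date(archive_date, min_log_date):
    """Suggested alternative: compare against the oldest visit actually in the logs."""
    if min_log_date is None:
        return False  # no logs at all, so there is nothing to rebuild the archive from
    return archive_date >= min_log_date

today = date(2023, 11, 2)
imported_day = date(2016, 11, 7)    # a day that exists only as imported archive data
oldest_log_visit = date(2023, 1, 1)  # assumed example value, not taken from the report

# With log deletion disabled, the retention-based check lets the imported day through...
print(may_invalidate_by_retention(imported_day, today, None))
# ...while a check against the real minimum log date would protect it.
print(may_invalidate_by_min_log_date(imported_day, oldest_log_visit))
```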

As a workaround for now, I will use core:invalidate-report-data with --periods=week,month,year. The InvalidateReports plugin has no such option, so it is pretty dangerous.

vkovalcik commented 9 months ago

And there is also ANOTHER bug/weird behaviour:

Through a sequence of actions I ended up with a table that sometimes contains the same "archiveid" number for different sets of siteid+date1+date2+period, which I believe shouldn't be possible.

[Screenshot: matomo_archiveid]

(Again, I expect that when I now invalidate weeks, it will use this archiveid, mark some days as invalidated too, and delete them.)
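A quick way to look for such collisions in an exported table is to group the rows by archiveid and report every id that is attached to more than one siteid+date1+date2+period combination. This is a minimal sketch: the tuple layout, the period labels and the sample rows are invented for illustration and are not taken from the real table.

```python
from collections import defaultdict

def find_reused_archive_ids(rows):
    """rows: iterable of (archiveid, siteid, date1, date2, period) tuples.

    Returns {archiveid: sorted list of distinct (siteid, date1, date2, period)}
    for every archiveid that appears with more than one combination.
    """
    seen = defaultdict(set)
    for archiveid, siteid, date1, date2, period in rows:
        seen[archiveid].add((siteid, date1, date2, period))
    return {aid: sorted(keys) for aid, keys in seen.items() if len(keys) > 1}

# Invented sample rows: archiveid 7 is shared by a day and a week, which should not happen.
rows = [
    (7, 1, "2016-11-03", "2016-11-03", "day"),
    (7, 1, "2016-11-07", "2016-11-13", "week"),
    (8, 1, "2016-11-04", "2016-11-04", "day"),
    (8, 1, "2016-11-04", "2016-11-04", "day"),  # same combination twice is fine
]

for archiveid, combos in find_reused_archive_ids(rows).items():
    print("archiveid", archiveid, "is shared by", combos)
```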

I did this:

So far I guess there is something weird with Sequence and how a new archiveid is obtained, but I wasn't able to understand the inner workings.

I can send you the table as it was before the attempt to invalidate and archive it. I would rather not share it publicly, so I would send it privately, if that is an option.

vkovalcik commented 8 months ago

I have split the FIRST bug off into its own issue: https://github.com/matomo-org/matomo/issues/21808

As noted in the previous comment, there is probably at least one other bug, not related to that one.

vkovalcik commented 8 months ago

Sorry, misclick. I didn't mean to close it.

vkovalcik commented 8 months ago

I suspect I might know the cause of the wrong archiveid numbers in the archives:

It was caused by merging archives from one Matomo installation into another by JUST copying the archive tables to the other database (while adjusting siteid). The problem is that I didn't copy the contents of the _sequence table, which is undocumented but seems to be critical for the archives. It probably contains the last used archiveid for each archive table, from which the new archiveid is generated. If this table is missing or contains wrong (too low) numbers, I guess the archiveid starts from 0/1 again and can mistakenly reuse numbers that are already in use.
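The suspected mechanism can be shown with a toy counter. This is a sketch of the guess above, not of Matomo's actual Sequence implementation: the class, its method names and the resync helper are all hypothetical.

```python
class Sequence:
    """Toy stand-in for a per-table counter that hands out the next archiveid."""

    def __init__(self, last_value=0):
        self.last_value = last_value

    def next_id(self):
        self.last_value += 1
        return self.last_value

def resync(sequence, existing_ids):
    """Raise the counter so it is never below the highest archiveid already stored."""
    sequence.last_value = max([sequence.last_value, *existing_ids])
    return sequence

# Archive rows copied from another installation already use ids 1..3,
# but the counter on the target installation was never updated.
copied_ids = {1, 2, 3}
stale = Sequence(last_value=0)
new_id = stale.next_id()
print("collision" if new_id in copied_ids else "ok", new_id)

# After syncing the counter with the copied rows, new ids no longer collide.
synced = resync(Sequence(last_value=0), copied_ids)
new_id = synced.next_id()
print("collision" if new_id in copied_ids else "ok", new_id)
```

If the guess is right, copying archive tables between installations would also need the matching counter values to be copied or raised on the target side.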

My proposal is to:

This is probably not the end of my journey :) There might be at least one more issue concerning the blobs. But for that I need more information, so for now I have posted a question on the forum: https://forum.matomo.org/t/merging-archive-data-what-exactly-is-in-blobs/54981