207GB archive_blob table contributing to an overall bloated database of over 600GB #22438

Open mark-webster-catalyst opened 1 month ago

mark-webster-catalyst commented 1 month ago

One of my archive_blob tables is truly massive: it's 207GB. When I run console diagnostics:analyze-archive-table 2023_01 it takes a few hours to complete. Most of the entries look relatively normal, but at the bottom I get the following:

+--------------------------------------------+------------+---------------+-------------+---------+-----------+----------------+-------------+-------------+
| Group                                      | # Archives | # Invalidated | # Temporary | # Error | # Segment | # Numeric Rows | # Blob Rows | # Blob Data |
+--------------------------------------------+------------+---------------+-------------+---------+-----------+----------------+-------------+-------------+
[...]
| week[2023-01-16 - 2023-01-22] idSite = 56  | 12         | 168           | -           | -       | -         | -              | -           | -           |
| year[2023-01-01 - 2023-12-31] idSite = 56  | 432        | 2377474       | -           | -       | -         | -              | -           | -           |
| year[2023-01-01 - 2023-12-31] idSite = 61  | 303        | 1571421       | -           | -       | -         | -              | -           | -           |

and the task exits with:

Uncaught exception: TypeError: Unsupported operand types: int + string in /bitnami/matomo/plugins/Diagnostics/Commands/AnalyzeArchiveTable.php:69
Stack trace:
#0 /opt/bitnami/matomo/vendor/symfony/console/Symfony/Component/Console/Command/Command.php(257): Piwik\Plugins\Diagnostics\Commands\AnalyzeArchiveTable->execute()
#1 /opt/bitnami/matomo/vendor/symfony/console/Symfony/Component/Console/Application.php(874): Symfony\Component\Console\Command\Command->run()
#2 /opt/bitnami/matomo/vendor/symfony/console/Symfony/Component/Console/Application.php(195): Symfony\Component\Console\Application->doRunCommand()
#3 [internal function]: Symfony\Component\Console\Application->doRun()
#4 /opt/bitnami/matomo/core/Console.php(135): call_user_func()
#5 /opt/bitnami/matomo/core/Access.php(670): Piwik\Console->Piwik\{closure}()
#6 /opt/bitnami/matomo/core/Console.php(136): Piwik\Access::doAsSuperUser()
#7 /opt/bitnami/matomo/core/Console.php(87): Piwik\Console->doRunImpl()
#8 /opt/bitnami/matomo/vendor/symfony/console/Symfony/Component/Console/Application.php(126): Piwik\Console->doRun()
#9 /opt/bitnami/matomo/console(32): Symfony\Component\Console\Application->run()
#10 {main}
Uncaught exception in /bitnami/matomo/plugins/Diagnostics/Commands/AnalyzeArchiveTable.php line 69:
Unsupported operand types: int + string

Sites with those IDs don't exist.

I've also performed core:purge-old-archive-data. That took over 4 hours to optimize just the largest table, and it made no difference to the size.

Could this be contributing to my huge table? The DB itself is over 600GB and I'm trying to rein it in!
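
To put a number on that, a query along these lines (only a sketch, and it assumes the default matomo_ table prefix) should show how much of the blob table belongs to site IDs that no longer exist in the site table:

-- Rough check: blob rows and blob data per site ID that is missing from matomo_site.
SELECT a.idsite,
       COUNT(*) AS blob_rows,
       ROUND(SUM(LENGTH(a.value)) / 1024 / 1024 / 1024, 2) AS blob_gb
FROM matomo_archive_blob_2023_01 a
LEFT JOIN matomo_site s ON s.idsite = a.idsite
WHERE s.idsite IS NULL
GROUP BY a.idsite
ORDER BY blob_gb DESC;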

sgiehl commented 1 month ago

@mark-webster-catalyst We recently came across some bugs in archiving that might cause data pollution. Maybe they also affect you. Could you provide some further details on your setup and instance? For example: how many visits/actions are you tracking per month on average, is archiving handled by one or multiple servers, how often is archiving triggered, and how long does archiving normally run?

mark-webster-catalyst commented 1 month ago

Hi, sure.

We're on a fairly old version, 4.12.3, for various "reasons"... It's running in Kubernetes with 2 pods. We run archiving on cron every hour; if I run the archiving job manually it completes very quickly.

We mostly deal with higher education platforms, so monthly tracking varies. At our busiest it's:

14,944,598 visits
114,484,326 pageviews
122,074,631 actions

down to the following in July, when everyone is away for the summer:

3,063,882 visits
36,043,032 pageviews
37,656,929 actions

Could both pods trying to do the same thing on cron be affecting it?
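
If it's useful, a query along these lines (again only a sketch, assuming the default matomo_ prefix and that the archive status rows are the name LIKE 'done%' rows in the numeric table) should show periods that ended up with more than one status flag, which is what I'd expect double archiving to leave behind:

-- Periods with more than one "done" row for the same site/period/flag.
SELECT idsite, period, date1, date2, name, COUNT(*) AS copies
FROM matomo_archive_numeric_2023_01
WHERE name LIKE 'done%'
GROUP BY idsite, period, date1, date2, name
HAVING copies > 1
ORDER BY copies DESC
LIMIT 20;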

sgiehl commented 1 month ago

@mark-webster-catalyst Ok, I see, that's a bigger instance.

> Could both pods trying to do the same thing on cron be affecting it?

Yes. We have discovered some concurrency issues in Matomo 5 that we are currently preparing fixes for.

I would recommend updating to Matomo 5 at some point. There were a lot of improvements around archiving, and the next release will even include a couple more fixes, also around data pollution. Matomo 5 btw didn't include any bigger database updates, so even updating a bigger instance shouldn't take too long. Might still be better to test it upfront 🙈

mark-webster-catalyst commented 1 month ago

Thank you for the response. I'll try scaling to a single larger pod and see if that works (we use an SQS queue to store all the tracking events; previously a single instance had trouble keeping up, but I can try again). If not, I'll look into a separate cron deployment that's not exposed by the service.

I'm aware it's a pretty big instance, so I'm not expecting miracles, but that one table is 3 times the size of all but one other table in the DB, so it kinda stands out!

My log_link_visit_action table is also over 200GB; not sure if that's normal or not. I'm currently running core:fix-duplicate-log-actions and it took an hour and a half to fix one duplicate. It found 10692... so I'll let you know if that worked in 2 years, I guess :laughing:
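
In case it helps anyone else reading, this is a standard way to see which tables are taking the space (assuming the schema is simply named matomo; adjust to your database name):

-- List the 20 largest tables by data + index size, in GB.
SELECT table_name,
       ROUND((data_length + index_length) / 1024 / 1024 / 1024, 1) AS size_gb
FROM information_schema.tables
WHERE table_schema = 'matomo'
ORDER BY (data_length + index_length) DESC
LIMIT 20;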

sgiehl commented 1 month ago

@mark-webster-catalyst To avoid double archiving and data pollution, it might be better in your case to run archiving on only one machine for now (until you've updated to the version we will release in the next few weeks).

The archive table for January is always a bit bigger, as it also contains the year archives. But 3 times bigger seems a bit too much. We are currently working on a solution for the data pollution problem. Once there are more details, we will let you know. There should also be ways to clean that up manually by running certain SQL queries.
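
Just to illustrate the kind of query such a manual cleanup would start from (this is only a sketch, not the official instructions; it assumes the default matomo_ prefix and that an invalidated "done" flag is stored with value = 4, i.e. DONE_INVALIDATED):

-- Sketch only: list archives in the January numeric table whose done flag is marked invalidated.
SELECT idarchive, idsite, period, date1, date2
FROM matomo_archive_numeric_2023_01
WHERE name LIKE 'done%'
  AND value = 4
ORDER BY idsite, date1;

A manual cleanup would then delete those idarchive values from both the numeric and the blob table for that month, but please take a backup and ideally wait for the detailed instructions before deleting anything.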

mark-webster-catalyst commented 1 month ago

@sgiehl Thank you. I'll scale it down for right now and await the release before embarking on an upgrade mission.