Open · mark-webster-catalyst opened this issue 1 month ago
@mark-webster-catalyst We recently came across some bugs in archiving that might cause data pollution. Maybe they affect you as well. Could you provide some further details on your setup and instance? For example: how many visits/actions are you tracking per month on average, is archiving handled by one or multiple servers, how often is archiving triggered, and how long does archiving normally run?
Hi, sure.
We're on a fairly old version, 4.12.3, for various "reasons"... It's running in Kubernetes with 2 pods. We run archiving on cron every hour; if I run the archiving job manually, it completes very quickly.
We mostly deal with higher education platforms, so monthly tracking varies. Our busiest month:

- 14,944,598 visits
- 114,484,326 pageviews
- 122,074,631 actions

down to the following in July, when everyone is away for the summer:

- 3,063,882 visits
- 36,043,032 pageviews
- 37,656,929 actions
Could both pods trying to do the same thing on cron be affecting it?
@mark-webster-catalyst Ok, I see, that's a bigger instance.
> Could both pods trying to do the same thing on cron be affecting it?

Yes. We have discovered some concurrency issues in Matomo 5 we are currently preparing fixes for.
I would recommend updating to Matomo 5 at some point. There were a lot of improvements around archiving, and the next release will even include a couple more fixes, also around data pollution. Matomo 5, by the way, didn't include any bigger database updates, so even updating a bigger instance shouldn't take too long. Might still be better to test it upfront 🙈
Thank you for the response. I'll try scaling to a single larger pod and see if that works (we use an SQS queue to store all the tracking events, previously a single instance had trouble keeping up, but I can try again). If not I'll look into a separate cron deployment that's not exposed by the service.
I'm aware it's a pretty big instance, so I'm not expecting miracles, but that one table is 3 times the size of all but one other table in the DB, so it kinda stands out!
My `log_link_visit_action` table is also over 200GB, not sure if that's normal or not? Currently running `core:fix-duplicate-log-actions`, and it took an hour and a half to fix one duplicate. It found 10692... So I'll let you know if that worked in 2 years I guess :laughing:
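For what it's worth, the "separate cron deployment" idea can be sketched as a Kubernetes CronJob that runs the archiver in its own pod, independent of the web deployment, with `concurrencyPolicy: Forbid` so two runs never overlap. The name, image, schedule, and URL below are placeholders, not taken from this thread:

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: matomo-archive               # hypothetical name
spec:
  schedule: "0 * * * *"              # hourly, matching the existing cron
  concurrencyPolicy: Forbid          # never start a run while one is active
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: archive
              image: matomo:4.12.3   # placeholder image tag
              command:
                - php
                - /var/www/html/console
                - core:archive
                - --url=https://analytics.example.org/   # placeholder URL
```

Because the CronJob pod is separate from the service-backed pods, it is never exposed by the service and only one archiving run exists cluster-wide at a time.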
@mark-webster-catalyst To avoid double archiving and data pollution, it might be better in your case to run the archiving on only one machine for now (until you've updated to the version we will release sometime in the next few weeks).
The archive table for January is always a bit bigger, as it also contains the year archives. But 3 times bigger seems a bit too much. We are currently working on a solution for the data pollution problem. Once there are more details, we will let you know. There should also be ways to clean that up manually by running certain SQL queries.
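As a stopgap, a per-host lock can at least stop hourly cron runs from overlapping on the same machine. This is only a sketch: `flock` locks a file on the local filesystem, so it prevents overlapping runs within one pod but does not coordinate two pods, and the lock path and console path are assumptions:

```shell
#!/bin/sh
# Run the archiver only if no other invocation holds the lock; skip otherwise.
# NOTE: flock is per-host -- it does NOT coordinate two Kubernetes pods.
archive_once() {
  lockfile=$1
  shift
  (
    # Fail fast instead of queueing behind a long-running archive pass.
    flock -n 9 || { echo "skipped: archiving already running"; exit 0; }
    "$@"    # e.g. php /var/www/html/console core:archive --url=...
  ) 9>"$lockfile"
}

# Demo with a harmless placeholder command instead of the real archiver:
archive_once /tmp/matomo-archive.lock echo "archiving would run here"
```

In the real crontab the last line would invoke the Matomo console command instead of `echo`.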
@sgiehl Thank you. I'll scale it down for right now and await the release before embarking on an upgrade mission.
One of my archive_blob tables is truly massive: it's 207GB. When I run `console diagnostics:analyze-archive-table 2023_01`, it takes a few hours to run. I get lots of entries that look relatively normal, but then at the bottom I get the following, and the task exits with:

> Sites with those IDs don't exist.
I've also run `core:purge-old-archive-data`. That took over 4 hours to optimize just the largest table and made no difference to the size. Could this be contributing to my huge table? The DB itself is over 600GB and I'm trying to rein it in!
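For narrowing this down, two read-only queries are commonly used: one to rank tables by size, and one to count archive rows belonging to sites that no longer exist (which is what `diagnostics:analyze-archive-table` appears to be flagging). The schema name `matomo`, the `matomo_` table prefix, and the month suffix are assumptions; adjust them for the actual install:

```shell
#!/bin/sh
# Print read-only diagnostic SQL; pipe it into the mysql client, e.g.:
#   diagnostics_sql | mysql -u matomo -p matomo
diagnostics_sql() {
  cat <<'SQL'
-- Largest tables in the schema (assumes the schema is named 'matomo')
SELECT table_name,
       ROUND((data_length + index_length) / 1024 / 1024 / 1024, 1) AS size_gb
FROM information_schema.tables
WHERE table_schema = 'matomo'
ORDER BY (data_length + index_length) DESC
LIMIT 10;

-- Archive rows whose idsite no longer exists (assumes 'matomo_' prefix)
SELECT idsite, COUNT(*) AS rows_for_deleted_site
FROM matomo_archive_blob_2023_01
WHERE idsite NOT IN (SELECT idsite FROM matomo_site)
GROUP BY idsite;
SQL
}
diagnostics_sql
```

Both queries only read data; any actual cleanup of orphaned archive rows is better left until the fixes the maintainers mentioned are released.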