kiwix / operations

Kiwix Kubernetes Cluster
http://charts.k8s.kiwix.org/

Matomo purge strategy #85

Closed rgaudin closed 1 year ago

rgaudin commented 1 year ago

Matomo's database grows bigger and bigger over time. At the moment, the whole MySQL folder is 84 GiB; 68 GiB of that is actual data (the rest is MySQL binlog files). Over 54 GiB consists of what I believe is raw visit data (the piwik_log_link_visit_action table).

Ah, actually there's a plugin that displays DB stats:

[Screenshot 2023-05-05 at 16:07:56]

Matomo's doc on disk space management is all about purging old data, either old raw data or old reports.

The difficulty is choosing what to purge (raw data and/or reports) and how old the purged data should be. The reasons for keeping this data are all explained in the doc.
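For reference, these choices map to settings in Matomo's config/config.ini.php. The section and option names below come from Matomo's data-purging settings; the values are placeholders to illustrate the knobs, not a decided policy:

```ini
; Hypothetical example values — not our decided policy.
[Deletelogs]
; enable automatic purging of raw visit data
delete_logs_enable = 1
; delete raw data older than N days (730 ~ 2 years)
delete_logs_older_than = 730

[Deletereports]
; 0 keeps all processed reports (purge raw data only)
delete_reports_enable = 0
```

The raw-data/reports split in the config mirrors the choice the doc asks us to make.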

It's important we document here what we want to do.

I have no personal suggestion here, as I am not a Matomo user myself and am not familiar with what we might need past data for in the future.

Popolechien commented 1 year ago

We do not have the bandwidth to do serious data mining on these logs, nor are we looking for new angles that wouldn't be analysed or aggregated within the vanilla reports. I would therefore suggest we remove old raw data and keep reports and metrics only (not sure what is filed under "Other tables" but I don't think it's strategic).

The only remaining question for me is: when does "old" start? Beyond two years?

Popolechien commented 1 year ago

Following our discussion, I'd recommend making sure we keep raw data for the last full calendar year (1.01 - 31.12) rather than the past 12 months, and making sure that reports are generated.
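To make the difference concrete, here's a minimal sketch (function names are ours, not Matomo's) of the "last full calendar year" cutoff versus a rolling 12-month window, expressed as the "older than N days" threshold that Matomo's purge settings work with:

```python
from datetime import date

def raw_data_cutoff(today: date) -> date:
    """Oldest day we keep: January 1st of the previous calendar year,
    so the last *full* year (1.01 - 31.12) is always retained."""
    return date(today.year - 1, 1, 1)

def purge_threshold_days(today: date) -> int:
    """Equivalent 'older than N days' value on a given day.
    Unlike a fixed rolling window, this grows through the year."""
    return (today - raw_data_cutoff(today)).days

# On 2023-06-01 the cutoff is 2022-01-01: data older than 516 days
# would be purged, where a rolling 12-month window would use a fixed 365.
```

Note that a fixed day count can't express this policy exactly: the threshold would need to be recomputed (or the purge run manually) as the year advances.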

rgaudin commented 1 year ago

As it didn't go smoothly, here's the post-mortem:

On Wednesday May 31st, I applied the new strategy and ran the purge via the UI. It didn't work.

I then ran the core:purge-old-archive-data command directly in the shell, as per the doc. That worked.

The size on the filesystem didn't shrink, though. Rows were deleted but the space was not reclaimed: InnoDB keeps freed pages inside the tablespace, and returning them to the filesystem requires rebuilding the table (e.g. with OPTIMIZE TABLE). The size actually increased a little.

I then manually ran an OPTIMIZE TABLE query on the main table (piwik_log_link_visit_action), but the container crashed mid-run. This left the large table corrupt.

What apparently happened is that k8s evicted the pod (along with others) due to a lack of ephemeral storage (which is separate from volume storage but sits on the same physical disk on the node).

I then took down every other service and job on that node (matomo app, matomo web, metrics, zimfarm monitoring and the matomo jobs) so that most of the node's resources would be dedicated to the DB.

Repairing the table took multiple attempts and required freeing some extra space on the disk. Once the repair completed OK, the OPTIMIZE query ran properly as well.

The service was brought up but it could not be reached.

We again faced the known issue of incorrect firewall rules: the node's ports 80/443 were not forwarded to k8s (the nginx DaemonSet's pod). Because no other service was running, and because I knew it would fix the problem, I quickly resorted to rebooting the server.

It worked and the service was back up.

I then manually re-imported the logs for download.kiwix, download.openzim and library.kiwix.

We did not lose any stats data for those three sites (because their stats are based on web-server logs), but other websites rely on Matomo's JS call to the service, and since the service was down, their stats were not collected. So there are no (or very few) stats for all other services for May 31st and June 1st.

Question: since we now have plenty of time; we could disable the automatic (monthly) purge of 730d+ raw data. Should we?

rgaudin commented 1 year ago
[Screenshot 2023-06-01 at 13:50:09]

Popolechien commented 1 year ago

since we now have plenty of time ; we could disable the automatic (monthly) purge of 730d+ raw data.

I am not sure I understand the rationale for disabling it, unless you mean that this problem is bound to recur monthly. Other than that, a yearly clean-up could also do, couldn't it? When was the last time we had one (or when was Matomo set up)?

rgaudin commented 1 year ago

I meant plenty of space 😅 As of today, we have 2 years of raw data. In a month (and every month after that), we'll still have 2 years of raw data, because the purge keeps trimming it. Since we now have more space available, we could disable this automatic purge, so that in 2 months we'd have 26 months of raw data, and in a year we'd have 3 years of raw data.

kelson42 commented 1 year ago

@rgaudin I don't understand your explanation, nor the pros/cons of that move. Let's discuss this directly.