matomo-org / matomo

Empowering People Ethically with the leading open source alternative to Google Analytics that gives you full control over your data. Matomo lets you easily collect data from websites & apps and visualise this data and extract insights. Privacy is built-in. Liberating Web Analytics. Star us on Github? +1. And we love Pull Requests!
https://matomo.org/
GNU General Public License v3.0

Report archives have tripled in size since update to 2.10 #7181

Closed skyhawk669 closed 8 years ago

skyhawk669 commented 9 years ago

Since upgrading to 2.10 the archive blobs and archive numbers tables have tripled in size (blob tables usually between 30-40MB, now they are 140-500MB).

The archiving process is set in cron to run every hour and nothing else has changed in the system beyond upgrading from 2.9 to 2.10 (no drastic change in amount of visitors, or to the structure of the site).

The tables are reduced in size a bit by running core:run-scheduled-tasks --force, but they're still quite a bit bigger than they used to be.

More background info in the following thread: http://forum.piwik.org/read.php?2,123852

Running Piwik on: Red Hat 5, Apache 2.4, PHP 5.4.35, MySQL 5.0.45
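
To quantify the bloat, something along these lines reports size, reclaimable free space, and approximate row count per archive table (a sketch; the piwikdb schema name and piwik_ table prefix are assumptions, adjust them to your install):

# Approximate on-disk size, reclaimable space and row count per archive table.
# 'piwikdb' and the 'piwik_' prefix are assumptions; adjust to your install.
SELECT table_name,
       ROUND((data_length + index_length) / 1024 / 1024, 1) AS size_mb,
       ROUND(data_free / 1024 / 1024, 1) AS free_mb,
       table_rows
FROM information_schema.tables
WHERE table_schema = 'piwikdb' AND table_name LIKE 'piwik_archive_%'
ORDER BY table_name;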

PatrickChen commented 9 years ago

After running mysqlcheck -o piwikdb piwik_archive_blob_2015_01 -u root -p on all the bloated blob and numeric tables since January 2015, they've all gone down to the appropriate size. For reference, blob_2015_01 dropped from 4 GB to 753 MB, which is in line with the expected linear growth of the monthly db tables. What's interesting is that this was far more effective than OPTIMIZE TABLE piwik_archive_blob_2015_01;, which only shaved off maybe 10-20% of the ballooned size, in contrast to mysqlcheck shaving off 70%+ and bringing it to the appropriate size. Will reply again in a few days to see whether the tables balloon again or not while on this 2.12.0-rc2.

skyhawk669 commented 9 years ago

@diosmosis I am using cron archiving (console core:archive). I have included the output of the latest cron run below (it shows no scheduled tasks being run).

I'm using MySQL 5.0.45

INFO CoreConsole[2015-03-24 18:05:01] ---------------------------
INFO CoreConsole[2015-03-24 18:05:01] INIT
INFO CoreConsole[2015-03-24 18:05:01] Piwik is installed at: https://archive.austinisd.org/track/index.php
INFO CoreConsole[2015-03-24 18:05:01] Running Piwik 2.12.0-rc2 as Super User
INFO CoreConsole[2015-03-24 18:05:02] ---------------------------
INFO CoreConsole[2015-03-24 18:05:02] NOTES
INFO CoreConsole[2015-03-24 18:05:02] - Reports for today will be processed at most every 3600 seconds. You can change this value in Piwik UI > Settings > General Settings.
INFO CoreConsole[2015-03-24 18:05:02] - Reports for the current week/month/year will be refreshed at most every 3600 seconds.
INFO CoreConsole[2015-03-24 18:05:02] - Archiving was last executed without error 59 min 59s ago
INFO CoreConsole[2015-03-24 18:05:02] - Will process 1 websites with new visits since 59 min 58s , IDs: 1
INFO CoreConsole[2015-03-24 18:05:02] ---------------------------
INFO CoreConsole[2015-03-24 18:05:02] START
INFO CoreConsole[2015-03-24 18:05:02] Starting Piwik reports archiving...
INFO CoreConsole[2015-03-24 18:05:22] Archived website id = 1, period = day, 48184 visits in last last2 days, 16717 visits today, Time elapsed: 20.584s
INFO CoreConsole[2015-03-24 18:05:28] Archived website id = 1, period = week, 92272 visits in last last2 weeks, 48184 visits this week, Time elapsed: 5.316s
INFO CoreConsole[2015-03-24 18:06:33] Archived website id = 1, period = month, 1361615 visits in last last2 months, 506440 visits this month, Time elapsed: 65.514s
INFO CoreConsole[2015-03-24 18:06:49] Archived website id = 1, period = year, 10625010 visits in last last2 years, 2095133 visits this year, Time elapsed: 15.511s
INFO CoreConsole[2015-03-24 18:06:49] Archived website id = 1, period = range, 68665 visits in last last7 ranges, 68665 visits this range, Time elapsed: 0.708s
INFO CoreConsole[2015-03-24 18:06:49] Archived website id = 1, 5 API requests, Time elapsed: 107.675s [1/1 done]
INFO CoreConsole[2015-03-24 18:06:49] Done archiving!
INFO CoreConsole[2015-03-24 18:06:49] ---------------------------
INFO CoreConsole[2015-03-24 18:06:49] SUMMARY
INFO CoreConsole[2015-03-24 18:06:49] Total visits for today across archived websites: 16717
INFO CoreConsole[2015-03-24 18:06:49] Archived today's reports for 1 websites
INFO CoreConsole[2015-03-24 18:06:49] Archived week/month/year for 1 websites
INFO CoreConsole[2015-03-24 18:06:49] Skipped 0 websites: no new visit since the last script execution
INFO CoreConsole[2015-03-24 18:06:49] Skipped 0 websites day archiving: existing daily reports are less than 3600 seconds old
INFO CoreConsole[2015-03-24 18:06:49] Skipped 0 websites week/month/year archiving: existing periods reports are less than 3600 seconds old
INFO CoreConsole[2015-03-24 18:06:49] Total API requests: 5
INFO CoreConsole[2015-03-24 18:06:49] done: 1/1 100%, 16717 vtoday, 1 wtoday, 1 wperiods, 5 req, 107777 ms, no error
INFO CoreConsole[2015-03-24 18:06:49] Time elapsed: 107.777s
INFO CoreConsole[2015-03-24 18:06:49] ---------------------------
INFO CoreConsole[2015-03-24 18:06:49] SCHEDULED TASKS
INFO CoreConsole[2015-03-24 18:06:49] Starting Scheduled tasks...
INFO CoreConsole[2015-03-24 18:06:50]  No task to run
INFO CoreConsole[2015-03-24 18:06:50] done
INFO CoreConsole[2015-03-24 18:06:50] ---------------------------
diosmosis commented 9 years ago

@skyhawk669 Many scheduled tasks run daily, so the core:archive command (which can be run hourly) may not always run the scheduled tasks. If you're able, can you look through the last core:archive runs (i.e., within a 24 hour period) to see if there is any scheduled task output?

skyhawk669 commented 9 years ago

@diosmosis I found the following:

INFO CoreConsole[2015-03-25 00:10:43] SCHEDULED TASKS
INFO CoreConsole[2015-03-25 00:10:43] Starting Scheduled tasks...
INFO CoreConsole[2015-03-25 00:10:47] task,output
Piwik\Plugins\CoreAdminHome\Tasks.purgeOutdatedArchives,Time elapsed: 1.147s
Piwik\Plugins\CoreAdminHome\Tasks.purgeInvalidatedArchives,Time elapsed: 1.250s
Piwik\Plugins\PrivacyManager\Tasks.deleteReportData,Time elapsed: 0.006s
Piwik\Plugins\PrivacyManager\Tasks.deleteLogData,Time elapsed: 0.001s
Piwik\Plugins\CorePluginsAdmin\Tasks.clearAllCacheEntries,Time elapsed: 0.257s
Piwik\Plugins\CorePluginsAdmin\Tasks.sendNotificationIfUpdatesAvailable,Time elapsed: 0.008s
Piwik\Plugins\CoreAdminHome\Tasks.optimizeArchiveTable,Time elapsed: 1.028s
Piwik\Plugins\CoreUpdater\Tasks.sendNotificationIfUpdateAvailable,Time elapsed: 0.140s
INFO CoreConsole[2015-03-25 00:10:47] done
INFO CoreConsole[2015-03-25 00:10:47] ---------------------------
skyhawk669 commented 9 years ago

I had to run core:purge-old-archive-data this morning to reduce the size of the 2015-01 and 2015-03 tables. Curiously, even though the tables were reduced after running that task, I had to run Optimize from phpMyAdmin to reduce their size a bit further.

PatrickChen commented 9 years ago

Apologies, how can I log my archive and scheduled task output? I've been running core:archive in the pattern of php "D:\www\piwik\console" core:archive --accept-invalid-ssl-certificate --url=https://piwik.example.org/ through a Windows Scheduled Task. Do I append something onto the command?

Update: After 6 days on 2.12.0-rc2, my archive_blob/numeric_2015_01 and archive_blob/numeric_2015_03 are still growing at an unusually fast rate. The automated functions do not seem to have prevented the ballooning, nor are they cleaning it up.

Update: Running core:purge-old-archive-data had no effect at all on the blob/numeric files.

Purging outdated archives for 2015_03...Done. [Time elapsed: 1.094s]
Purging invalidated archives for 2015_03...Done. [Time elapsed: 4.703s]
Purging custom range archives for 2015_03...Done. [Time elapsed: 0.031s]
Optimizing archive tables...
Optimizing table piwik_archive_numeric_2015_03...Done. [Time elapsed: 0.438s]
Optimizing table piwik_archive_blob_2015_03...Done. [Time elapsed: 0.031s]

Considering that my blob_2015_03.ibd is 995 MB, the optimize finishing so quickly might indicate it's not really optimizing the table?

skyhawk669 commented 9 years ago

Manually running core:purge-old-archive-data seems to mostly work for me, but, like PatrickChen, I find it doesn't really optimize the tables; I have to manually run a table optimization after running the task.

diosmosis commented 9 years ago

If you use MySQL w/ InnoDB tables, it's expected that OPTIMIZE TABLES isn't run. At least, the code has been like that for years. It's strange that this code is only now causing problems.

@skyhawk669 @PatrickChen Do you notice the row counts of the archive tables increasing rapidly? In other words, if you leave your piwik for a day w/o running core:purge-old-archive-data so the scheduled task is run, does the row count still go up? Or is it just the table size?

If the row count doesn't balloon significantly, then it seems to be that optimize tables is not being executed, and perhaps it should be. But again, it's strange this is only a problem now.
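
A simple way to track this is to snapshot exact row counts once a day and compare; e.g. (table names assume the default prefix and the months discussed here):

# Exact row counts for comparison across days (COUNT(*) can be slow on large tables).
SELECT COUNT(*) FROM piwik_archive_numeric_2015_03;
SELECT COUNT(*) FROM piwik_archive_blob_2015_03;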

@PatrickChen:

Apologies, how can I log my archive and scheduled task output?

You need to redirect the process output to a file. I'm not sure how to do that from a windows scheduled task.

rainyroads commented 9 years ago

Generally, as far as I'm aware, you should rarely ever need (or want) to run OPTIMIZE TABLES on InnoDB tables. This is one of the rare situations in which you'd want to. When deleting _huge_ amounts of data, it can leave a lot of fragmentation behind like this. Under normal use, I don't think it should ever make such a noticeable impact.

Quoting from the docs,

After doing substantial insert, update, or delete operations on an InnoDB table that has its own .ibd file because it was created with the innodb_file_per_table option enabled. The table and indexes are reorganized, and disk space can be reclaimed for use by the operating system. https://dev.mysql.com/doc/refman/5.1/en/optimize-table.html

The latest release of MariaDB seems to offer the ability to actually defragment InnoDB tables when running OPTIMIZE TABLES, without needing to actually rebuild everything, so I'm guessing it's good and reasonable to periodically run optimizations for servers running recent MariaDB releases.
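
If I understand it correctly, that behaviour is opt-in; roughly (a sketch for MariaDB 10.1+, check your version's documentation before relying on it):

# MariaDB 10.1+ (assumption): with innodb_defragment enabled, OPTIMIZE TABLE
# defragments the table in place instead of rebuilding it.
SET GLOBAL innodb_defragment = ON;
OPTIMIZE TABLE piwik_archive_blob_2015_01;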

skyhawk669 commented 9 years ago

@diosmosis
On Friday evening:
2015-01 blob table: ~100,000 rows
2015-03 blob table: ~80,000 rows

On Monday morning:
2015-01 and 2015-03: over 500,000 rows each.

Yes, the table rows still go up. After running the purge-old-archive-data task, both tables were back down to around 100,000 and 85,000 rows respectively.

diosmosis commented 9 years ago

@skyhawk669 Can you run the following queries when one of your tables is bloated?

# selects invalidated archive count
SELECT COUNT(*) FROM archive_numeric WHERE name LIKE 'done%' AND value = 4;

# selects temporary archive count
SELECT COUNT(*) FROM archive_numeric WHERE name LIKE 'done%' AND value = 3;

# selects errored archive count
SELECT COUNT(*) FROM archive_numeric WHERE name LIKE 'done%' AND value = 2;

If the first query results in a very large count, then something is constantly invalidating archives. In this case, could you provide more details about your setup? I.e., do you use the log importer and do you import visits for dates that have already been archived?

If the second query results in an abnormally high count, there's a bug in Piwik's core archiving logic that's affecting your system. Finding the cause of this would probably require access to your server.

If the third query results in an abnormally high count, an error is occurring during archiving (either the core:archive command or browser triggered archiving) that results in archiving being aborted. Enabling debug logging during an archiving run that causes errors may help find the error that's occurring.
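
The three checks can also be run as a single grouped query against a concrete monthly table (value 1 is a completed archive; the other values are as described above):

# Breakdown of all done flags: 1 = completed, 2 = errored, 3 = temporary, 4 = invalidated.
SELECT value, COUNT(*) AS archives
FROM piwik_archive_numeric_2015_03
WHERE name LIKE 'done%'
GROUP BY value;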

I'm not sure why the scheduled task isn't purging archives for you. I'm currently working on the task scheduler system as well, but for the time being I would suggest adding core:purge-old-archive-data as a daily cron task, replacing line 166 in https://github.com/piwik/piwik/blob/master/plugins/CoreAdminHome/Commands/PurgeOldArchiveData.php#L166 with Db::exec("OPTIMIZE TABLE $numericTable"); and line 171 with Db::exec("OPTIMIZE TABLE $blobTable");
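
For a given month, the forced optimization amounts to statements like these (default piwik_ prefix assumed; note that for InnoDB, MySQL implements OPTIMIZE TABLE as a table rebuild plus analyze):

# Reclaim space from one month's archive tables after purging.
OPTIMIZE TABLE piwik_archive_numeric_2015_01;
OPTIMIZE TABLE piwik_archive_blob_2015_01;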

diosmosis commented 9 years ago

@FujiMakoto

Generally, as far as I'm aware, you should rarely ever need (or want) to run OPTIMIZE TABLES on InnoDB tables. This is one of the rare situations in which you'd want to. When deleting huge amounts of data, it can leave a lot of fragmentation behind like this. Under normal use, I don't think it should ever make such a noticeable impact.

This clears things up, I'll add a way to force table optimizations for this sort of thing.

ThaDafinser commented 9 years ago

@diosmosis what are "high" numbers for 2/3/4? Currently I'm not running optimize nor purge... just the core:archive command!

E.g. on my Piwik, for piwik_archive_numeric_2015_03, I see this:

// Update: I fetched all updates up to commit 1b545d301d8248118243dd21e33a3e1aca285f6c and will test now!

skyhawk669 commented 9 years ago

@diosmosis

invalidated archive count: 118 (after purging: 0)
temporary archive count: 60 (after purging: 66)
errored archive count: 0 (after purging: 0)

The above results are for 2015_03; I ran the queries against 2015_01 as well and got lower results. I assume those are low values? I had run the purge yesterday morning, so the above is after a day.

I will add the purge as a daily cron task; by default it will purge the current month, right? It needs to purge from 2015_01 as well.

PatrickChen commented 9 years ago

@diosmosis

One day since manually running core:purge-old-archive-data 2015-03:

piwik_archive_numeric_2015_03 row count: 344489 (298299 after purge)
invalidated archive count: 3114 (130 after purge)
temporary archive count: 4998 (5366 after purge)
errored archive count: 0 (0 after purge)

diosmosis commented 9 years ago

@skyhawk669 @PatrickChen @ThaDafinser

The amount of temporary + normal archives shouldn't exceed (31 + 4 + 1 + 1 + number of range archives) * number of sites. The number of invalidated archives depends on what you're doing w/ Piwik. I think, unless you are tracking visits in the past (eg, via the log importer), there should be 0. If archives become invalidated soon after they are created, it makes sense that this would continually bloat the table until purging executes. And if purging via the scheduled task doesn't work, it makes sense that the table would just get bigger.
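
As a concrete instance of that bound: a single site with, say, 5 stored range archives should stay at or under (31 + 4 + 1 + 1 + 5) = 42 such archives per monthly table. A query along these lines (a sketch, default prefix assumed) counts completed plus temporary archives per site for comparison:

# Completed (1) and temporary (3) archives per site in one monthly table.
SELECT idsite, COUNT(*) AS archive_count
FROM piwik_archive_numeric_2015_03
WHERE name LIKE 'done%' AND value IN (1, 3)
GROUP BY idsite
ORDER BY archive_count DESC;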

I think Piwik is incorrectly invalidating all archives when tracking for the current month, resulting in archives being invalidated immediately after core:archive, creating table bloat. The bloat is then made worse by scheduled tasks not running correctly.

I'll be working on fixing these two issues ASAP.

@skyhawk669

I will add the purge as a daily cron task; by default it will purge the current month, right? It needs to purge from 2015_01 as well.

You're right, I hadn't thought of this. I will add the forced optimize-table change plus purging of current-year archives momentarily; then you can set up the cron task.

diosmosis commented 9 years ago

@skyhawk669 @PatrickChen @ThaDafinser Forgot to ask, can you tell me how you track visits?

skyhawk669 commented 9 years ago

@diosmosis JavaScript in page footers, through the Drupal Piwik module.

diosmosis commented 9 years ago

@skyhawk669 You can set up a cron job for core:purge-old-archive-data after doing the following:

1) Save this file: https://github.com/piwik/piwik/blob/af2265857532355a9f74fc8efa0f26bf393d94fd/plugins/CoreAdminHome/Commands/PurgeOldArchiveData.php as plugins/CoreAdminHome/Commands/PurgeOldArchiveData.php
2) Save this file: https://github.com/piwik/piwik/blob/af2265857532355a9f74fc8efa0f26bf393d94fd/core/Db.php as core/Db.php

Then run the command as ./console core:purge-old-archive-data --include-year-archives --force-optimize-tables.

ThaDafinser commented 9 years ago

@diosmosis all tracking is done via the JS tracking code. My archiving is done with hourly cronjobs (GUI triggering is disabled).

I deleted all archives for now and recreated them. That of course lowered the numbers I posted yesterday.

One problem I found for those SELECT * FROM piwik_archive_numeric_2015_03 WHERE name LIKE 'done%' AND value = 2 is that the archiving is done over HTTP, which can sometimes result in a timeout... can't that be changed away from HTTP to only use the console?

/console core:purge-old-archive-data --include-year-archives --force-optimize-tables: I didn't see the --include-year-archives and --force-optimize-tables options in the CLI output, see the help output below

(I'm using commit 1b545d301d8248118243dd21e33a3e1aca285f6c)

php D:\htdocs\piwik\console core:purge-old-archive-data --help
Usage:
 core:purge-old-archive-data [--exclude-outdated] [--exclude-invalidated] [--exclude-ranges] [--skip-optimize-tables] [dates1] ... [datesN]

Arguments:
 dates                   The months of the archive tables to purge data from. By default, only deletes from the current month. Use 'all' for all dates. (default: ["2015-04-01"])

Options:
 --exclude-outdated      Do not purge outdated archive data.
 --exclude-invalidated   Do not purge invalidated archive data.
 --exclude-ranges        Do not purge custom ranges.
 --skip-optimize-tables  Do not run OPTIMIZE TABLES query on affected archive tables.
 --help (-h)             Display this help message
 --quiet (-q)            Do not output any message
 --verbose (-v|vv|vvv)   Increase the verbosity of messages: 1 for normal output, 2 for more verbose output and 3 for debug
 --version (-V)          Display this application version
 --ansi                  Force ANSI output
 --no-ansi               Disable ANSI output
 --no-interaction (-n)   Do not ask any interactive question
 --piwik-domain          Piwik URL (protocol and domain) eg. "http://piwik.example.org"

Help:
 By default old and invalidated archives are purged. Custom ranges are also purged with outdated archives.

 Note: archive purging is done during scheduled task execution, so under normal circumstances, you should not need to run this command manually.
diosmosis commented 9 years ago

One problem I found for those SELECT * FROM piwik_archive_numeric_2015_03 WHERE name LIKE 'done%' AND value = 2 is that the archiving is done over HTTP, which can sometimes result in a timeout... can't that be changed away from HTTP to only use the console?

You can disable browser triggered archiving as described in http://piwik.org/docs/setup-auto-archiving/#disable-browser-triggers-for-piwik-archiving-and-limit-piwik-reports-to-updating-every-hour. Is that what you were looking for?

/console core:purge-old-archive-data --include-year-archives --force-optimize-tables: I didn't see the --include-year-archives and --force-optimize-tables options in the CLI output, see the help output below

The two options were added in https://github.com/piwik/piwik/pull/7594, which was merged yesterday, so your code isn't quite up to date.

mattab commented 9 years ago

Our next step here is to make sure we have identified the bug.

Anyone still experiencing the issue: could you please send us temporary access to your Piwik server? (FTP/SSH, Piwik URL, login/pwd). If you can email matt at piwik.org or benaka at piwik.org, it would be very appreciated!

Otherwise we are not sure how to proceed. Thanks!

ThaDafinser commented 9 years ago

You can disable browser triggered archiving as described in http://piwik.org/docs/setup-auto-archiving/#disable-browser-triggers-for-piwik-archiving-and-limit-piwik-reports-to-updating-every-hour. Is that what you were looking for?

Nope. I disabled that already, but the CLI command itself executes HTTP requests to get the generated data, which I don't like since it sometimes runs into timeouts.

The two options were added in #7594, which was merged yesterday, so your code isn't quite up to date.

I downloaded master again with this patch and it's in now. Currently my table size is "normal" but I will watch it. In 2015-03 I had 1,558,413 total rows... now there are 90,000 :smile: The problem is that I recreated it, so the table started fresh...

I also executed the query with your two new options...

These are my current cronjobs:

Do I need to execute the purge command with the 2 options from time to time? In scheduled tasks I already see Tasks.purgeOutdatedArchives, Tasks.purgeInvalidatedArchives, Tasks.optimizeArchiveTable ...

diosmosis commented 9 years ago

Nope. I disabled that already, but the cli command itself executes http requests to get the generated data. Which i dont like since it sometimes run into timeouts.

This means your system doesn't support the use of the CliMulti class. There are a lot of reasons for this, and I don't think the system check will say why the class isn't supported, so I'll link to the code: https://github.com/piwik/piwik/blob/master/core/CliMulti.php#L181 .

Do I need to execute the purge command with the 2 options from time to time?

If you are affected by the issue where the archive purging scheduled task does not run or does not run correctly, running the purge command yourself will keep the archive tables from bloating. If you run the task daily, you shouldn't need --force-optimize-tables (you also shouldn't need the option if you don't use InnoDB tables).

skyhawk669 commented 9 years ago

@mattab I can't give access to the server backend as it's behind our school district's firewall. I'll have to figure out if there's a way to copy that server over to an external location.

ThaDafinser commented 9 years ago

@diosmosis I have kept the default cronjobs for the long weekend now: php console core:run-scheduled-tasks and php console core:archive --disable-scheduled-tasks --url=...

I never ran php console core:purge-old-archive-data --include-year-archives --force-optimize-tables, to see the (possible) growth again.

The result is after the weekend:

SELECT count(1), piwik_archive_numeric_2015_04.* 
FROM piwik_archive_numeric_2015_04 WHERE name LIKE 'done%' AND value = 3
group by period

20716 15661 done 1 2015-04-01 2015-04-01 1 2015-04-01 06:02:27 3
1650 46346 done 12 2015-04-06 2015-04-12 2 2015-04-04 00:05:07 3
4290 15843 done 2 2015-04-01 2015-04-30 3 2015-04-01 07:12:37 3

-> the counts for value = 4 and value = 2 are zero... so everything is fine there.

After running php console core:purge-old-archive-data all the numbers look good (afaik, at least), but the database size stayed the same:

4038 56117 done 1 2015-04-06 2015-04-06 1 2015-04-06 00:02:05 3
1530 56987 done 12 2015-04-06 2015-04-12 2 2015-04-06 06:03:19 3
1530 57047 done 12 2015-04-01 2015-04-30 3 2015-04-06 06:03:36 3

So I ran your new command php console core:purge-old-archive-data --include-year-archives --force-optimize-tables:

2015-01 went from 131 MB to 73 MB
2015-04 went from 106 MB to 32 MB

:+1: So I will run your new command once a day; then I think it should be good again.

But I think there is a (small) issue with core:purge-old-archive-data: it always takes the current month, so it's not possible to "clean" during the last seconds/minutes of the last day of a month, because it will already have switched to the new month.

skyhawk669 commented 9 years ago

Is there a reason why the past blobs 2015_02 and 2015_03 would still increase a little bit? I see their row counts increase by 5,000; they go back down a bit after I manually run the purge-old-archive job for those tables.

diosmosis commented 9 years ago

@skyhawk669 If users view reports in those tables for range periods or for segments that aren't pre-archived, then new data will be archived and stored.
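
If range or segment archives are the cause, they are easy to spot, since range periods are stored with period = 5 and segment- or plugin-specific archives use a done flag other than the plain 'done' (a sketch, default prefix assumed):

# Custom range archives in an older month's table (range periods use period = 5).
SELECT COUNT(*) FROM piwik_archive_blob_2015_03 WHERE period = 5;

# Archives whose done flag carries a segment hash or plugin suffix.
SELECT COUNT(*) FROM piwik_archive_numeric_2015_03
WHERE name LIKE 'done%' AND name <> 'done';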

skyhawk669 commented 9 years ago

I don't think anybody has checked anything since I last checked the table sizes (yesterday). But I'll look into it.

diosmosis commented 9 years ago

One other possible reason: archives are invalidated for older periods (for some reason), and on the next core:archive run they are re-archived. This would be hard to check for, though.

gaumondp commented 9 years ago

Here's my report after upgrading from 2.11b3 to 2.13b2:

                2.11.b3    2.13.b2
MySQL DB size   34.6 GB    22.1 GB
Report Table    24 GB      12.8 GB
Metric Table    99.3 M     82.4 M
March Report    8 GB       56.6 M

What I did :

  1. Upgrade using CLI: ./console core:update
  2. Wait 22 minutes for it to complete
  3. Run ./console core:purge-old-archive-data all
  4. Delete the February, March and April 2015 piwik_archive tables to re-process those reports (http://piwik.org/faq/how-to/faq_59/)
  5. Run ./console core:archive --force-all-websites --force-all-periods=864000 --force-date-last-n=90 --url=http://1.2.3.4
  6. Run ./console cache:clear (should be part of "core:update" IMHO)
patriiiiiiiiiick commented 9 years ago

I am running 2.13.1 and was under the impression this issue was solved but it seems something remains a bit off:

archive_blob_2015_06 105.8 M 11.5 M 103,901
archive_blob_2015_05 11 G 291.9 M 5,483,520
archive_blob_2015_04 382 M 44.7 M 855,032
archive_blob_2015_03 407 M 54.7 M 778,564
archive_blob_2015_02 385 M 52.8 M 888,204
archive_blob_2015_01 12.9 G 261 M 3,050,483
archive_blob_2014_12 870 M 119 M 1,617,256
archive_blob_2014_11 904 M 118 M 1,423,529
archive_blob_2014_10 746 M 94.9 M 1,467,789

I presume I'll run "./console core:purge-old-archive-data all". Anything needed before I do that?

patriiiiiiiiiick commented 9 years ago

Also of interest, I suppose:

archive_numeric_2015_06 19.6 M 24.1 M 119,289
archive_numeric_2015_05 313 M 282.4 M 2,074,809
archive_numeric_2015_04 8.5 M 16.1 M 88,495
archive_numeric_2015_03 7.5 M 16.1 M 78,662
archive_numeric_2015_02 6.5 M 16.1 M 84,414
archive_numeric_2015_01 136.9 M 112.5 M 982,224
archive_numeric_2014_12 30.6 M 65.3 M 395,139
archive_numeric_2014_11 28.6 M 61.3 M 335,984
archive_numeric_2014_10 23.6 M 51.3 M 286,512

gaumondp commented 9 years ago

Gotta also chime in:

Archive_blob in April = 200 M
Archive_blob in May = 2.2 GB!

I'm still under 2.13.b2 and I haven't run "./console core:purge-old-archive-data all".

quba commented 9 years ago

Please update to 2.13.1 and run core:purge-old-archive-data all

If you are using InnoDB engine then please also run:

optimize table piwik_archive_blob_2015_05;
patriiiiiiiiiick commented 9 years ago

Can this really take extremely long? It was started about 18 hours ago and the last line of output is: Purging invalidated archives for 2015_01... Should we kill it?

Those tables currently stand at:

archive_blob_2015_01 14.1 G 276 M 2,183,038
archive_numeric_2015_01 150.9 M 124.5 M 911,659

diosmosis commented 9 years ago

@patriiiiiiiiiick 14.1 gigabytes is very large for an archive table, so it's conceivable that it would take a long, long time. Though 18 hours seems a bit strange to me. Can you check what queries are taking a long time? Note that if you stop the command, you should be able to start it again without necessarily having to start from scratch.
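
To see what the command is stuck on, the running statements can be inspected from a second MySQL session:

# List running statements and how long each has been executing.
SHOW FULL PROCESSLIST;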

Another note for those who are experiencing this: the core:purge-old-archive-data command isn't a fix for this issue, it's a workaround. We can't reproduce the issue, and so far we don't have access to a Piwik that has the issue, so we added the workaround. If you can give us access to an instance that has this problem (by emailing credentials to hello@piwik.org), we can debug and diagnose the real problem.

patriiiiiiiiiick commented 9 years ago

@diosmosis Seeing as giving access to phpMyAdmin might not be the easiest, would it be possible to upload to some FTP server one of the following, in order of preference because of the size of the DB:

1) a dump of a few tables
2) the dump of the database (50 GB as estimated by Piwik)
3) the image of a VMware machine?

I'll replicate this in an email to hello at piwik.org.

diosmosis commented 9 years ago

We'd need to debug the PHP code, so database dumps wouldn't help. An image of the machine might help (assuming Piwik is installed on it and you can replicate the problem), though I imagine it would be hard to upload it. Also, if you can replicate the problem on a test server, you could give us access to the test server.

Thanks for going through the trouble! Probably don't need to say it, but whatever you do, make sure you email sensitive information to hello@piwik.org, as opposed to posting here.

patriiiiiiiiiick commented 9 years ago

When you say you'd need to debug the PHP code, it means that phpMyAdmin access would not be sufficient, right? I fear it's not possible to give remote console access.

If my deductions are correct, sending the VM seems like the only option. I just dropped the tables of May and thereby decreased the size by 10GB. I could do the same for January and remove 22GB extra. You would still have the tables of June where the problem is present.

diosmosis commented 9 years ago

We'd need to be able to modify the PHP files, run commands from an SSH console and query the database.

You would still have the tables of June where the problem is present.

The data would not be helpful in determining a cause. Also, if purging the data and re-archiving fixes the issue permanently for you, then you are not affected by the issue other users are affected by. At least, not anymore.

patriiiiiiiiiick commented 9 years ago

I didn't force a re-archiving other than the one that occurs naturally through cron or through using the interface. Some reports do get archived but probably not all of them. Please tell me if you want me to launch a specific command to decide whether we are hitting the same issue.

My feeling now, thinking about it, is indeed that it is a new issue that came in either in 2.13.0 or 2.13.1, as it all seemed good in 2.12.1.

diosmosis commented 9 years ago

@patriiiiiiiiiick Keep watching your archive table sizes, if they keep growing, then you are affected by this issue.

patriiiiiiiiiick commented 9 years ago

Only the current ones grow excessively:

archive_blob_2015_06 11.6 G 395.9 M 2,709,345
archive_blob_2015_05 315 M 14.6 M 211,745
archive_blob_2015_04 382 M 44.7 M 924,286
archive_blob_2015_03 409 M 55.7 M 962,041
archive_blob_2015_02 387 M 52.8 M 783,770
archive_blob_2015_01 28.4 G 449 M 4,226,535
archive_blob_2014_12 871 M 119 M 1,445,442
archive_blob_2014_11 905 M 118 M 2,179,417
archive_blob_2014_10 746 M 94.9 M 1,155,959

I'll need to drop the table of January (which holds the year archives) soon.

patriiiiiiiiiick commented 9 years ago

This issue persists even after having dropped the tables of June and January under 2.13.1.

archive_blob_2015_07 312 M 19.6 M 270,457
archive_blob_2015_06 4.1 G 114.8 M 1,954,516
archive_blob_2015_05 339 M 13.6 M 43,987
archive_blob_2015_04 383 M 44.7 M 682,825
archive_blob_2015_03 409 M 55.7 M 960,332
archive_blob_2015_02 387 M 52.8 M 701,516
archive_blob_2015_01 5 G 78.8 M 2,099,543

diosmosis commented 9 years ago

@patriiiiiiiiiick Any update on providing access to an affected server (the vmware image would also work)? There have been others who have been affected, and we'd really like to find the root cause of this issue.

patriiiiiiiiiick commented 9 years ago

@diosmosis The easiest for me would be to provide a DB dump. Are you sure this would not be enough? You would just import the dump into an existing configuration, wouldn't you?

diosmosis commented 9 years ago

@patriiiiiiiiiick The data in the DB isn't causing the problem, it's more the symptom of the problem. Since we can't reproduce the problem, it must occur because of a specific setup. It could be due to some internal detail of the OS you are using, or the MySQL version you're using or due to a combination of different things. In other words, Piwik works normally for us, but something in a significant number of users' environments is causing Piwik to malfunction. If I imported your database into a Piwik on my machine, it would probably work.

Unfortunately the only way to find the root cause here is to find a server affected and dig into the code to see what's failing.

If this is not possible, you can use core:purge-old-archive-data to work around the issue; just make sure to run it daily (i.e. as part of a cron), otherwise the table will keep growing.

tassoman commented 9 years ago

I think a full environment to debug in is needed, maybe because of the PHP version, MySQL/MariaDB version, and other library differences.

Have you tried running unit tests on your installation? Could that be useful?

diosmosis commented 9 years ago

Have you tried running unit tests on your installation? Could that be useful?

Running tests could be useful, though I wouldn't recommend doing it on a production instance. Also, our tests can be somewhat of a pain to run...