matomo-org / matomo

Empowering People Ethically with the leading open source alternative to Google Analytics that gives you full control over your data. Matomo lets you easily collect data from websites & apps and visualise this data and extract insights. Privacy is built-in. Liberating Web Analytics. Star us on Github? +1. And we love Pull Requests!
https://matomo.org/
GNU General Public License v3.0
19.89k stars 2.65k forks source link

Archiving process causes CPU spikes on Master DB #19384

Open MorganFujimaka opened 2 years ago

MorganFujimaka commented 2 years ago

During archiving we notice a lot of read queries causing CPU spikes on master DB:

SELECT `arc1` . `idarchive` , `arc1` . `value` , `arc1` . `name` , `arc1` . `ts_archived` , `arc1` . `date1` AS `startDate` , `arc2` . `value` AS `nb_visits` , `arc3` . `value` AS `nb_visits_converted` FROM `matomo_archive_numeric_2022_06` `arc1` LEFT JOIN `matomo_archive_numeric_2022_06` `arc2` ON `arc2` . `idarchive` = `arc1` . `idarchive` AND ( `arc2` . `name` = ? ) LEFT JOIN `matomo_archive_numeric_2022_06` `arc3` ON `arc3` . `idarchive` = `arc1` . `idarchive` AND ( `arc3` . `name` = ? ) WHE

We face performance issues toward the end of the month when a matomo_archive_numeric_2022_** table is becoming too big probably.

The function that call the query: https://github.com/matomo-org/matomo/blob/296de5d46102cfcc7dddd322972edbc631bf2b54/core/DataAccess/Model.php#L436

        $sqlQuery = "SELECT arc1.idarchive, arc1.value, arc1.name, arc1.ts_archived, arc1.date1 as startDate, arc2.value as " . ArchiveSelector::NB_VISITS_RECORD_LOOKED_UP . ", arc3.value as " . ArchiveSelector::NB_VISITS_CONVERTED_RECORD_LOOKED_UP . "
                     FROM $numericTable arc1
                     LEFT JOIN $numericTable arc2 on arc2.idarchive = arc1.idarchive and (arc2.name = '" . ArchiveSelector::NB_VISITS_RECORD_LOOKED_UP . "')
                     LEFT JOIN $numericTable arc3 on arc3.idarchive = arc1.idarchive and (arc3.name = '" . ArchiveSelector::NB_VISITS_CONVERTED_RECORD_LOOKED_UP . "')
                     WHERE arc1.idsite = ?
                         AND arc1.date1 = ?
                         AND arc1.date2 = ?
                         AND arc1.period = ?
                         AND ($sqlWhereArchiveName)
                         $timeStampWhere
                         AND arc1.ts_archived IS NOT NULL
                     ORDER BY arc1.ts_archived DESC, arc1.idarchive DESC";

        $results = Db::fetchAll($sqlQuery, $bindSQL);

Why does it query master DB and not a read replica? Thank you!

bx80 commented 2 years ago

Hi @MorganFujimaka, I can't say for sure, but loading the list of archives from the master database is probably to avoid issues any with ongoing archiving processes and read replica lag.

How many records are typically in matomo_archive_numeric_2022_* by the of the month?

MorganFujimaka commented 2 years ago

Hi @bx80, thank you for the prompt response! We have ~140 000 rows in the matomo_archive_numeric_2022_* tables toward the end of the month. Most of the rows are for our main website, other websites are not impacted by performance issues. We have 30 funnels, 95 goals, and 3 segments for our main website (which increase greatly the number of rows).

bx80 commented 2 years ago

Thanks @MorganFujimaka, these details will be really helpful for understanding how we can improve performance in this area :+1: