matomo-org / matomo

Empowering People Ethically with the leading open source alternative to Google Analytics that gives you full control over your data. Matomo lets you easily collect data from websites & apps and visualise this data and extract insights. Privacy is built-in. Liberating Web Analytics. Star us on Github? +1. And we love Pull Requests!
https://matomo.org/
GNU General Public License v3.0
19.68k stars 2.62k forks source link

Search Keywords with No Results report issue (probably wrong filtering with lots of data) #11280

Open mkrakiewicz opened 7 years ago

mkrakiewicz commented 7 years ago

Hi, I have a problem with "Search Keywords with No Results" report. Piwik version: 2.16.2.

Problem:

  1. Using a segment in this report results in more data than is available for "All visits" setting (makes no sense). In this case I'm using example segment "visitEcommerceStatus==abandonedCart" but I think it does not matter. This is only happening with lots of data (I have a case of 1 day of 250K actions).

Steps to reproduce:

  1. For some specific day (1 day is enough) have enough data so that this method's: https://github.com/piwik/piwik/blob/2.16.2/plugins/Actions/API.php#L280 resulting DataTable is full and is limited to 499 rows (according to the setting datatable_archiving_maximum_rows_actions = 500).

  2. Have records so that this condition below is fulfilled so that rows with `28`= 1 (`28` is 'INDEX_SITE_SEARCH_HAS_NO_RESULT') are spread evenly in the result set: CASE WHEN (MAX(log_link_visit_action.custom_var_v5) = 0 AND log_link_visit_action.custom_var_k5 = '_pk_scount') THEN 1 ELSE 0 END AS `28`. Here is the full archiving query I believe is used for aggregation: https://gist.github.com/mkrakiewicz/e3219368fa32a1a7491d49904a9418cf

  3. Then, resulting rows are filtered AFTER the result set is "clipped" to 500 records: https://github.com/piwik/piwik/blob/2.16.2/plugins/Actions/API.php#L286

  4. If I set datatable_archiving_maximum_rows_actions = -1, then the report is correct (after choosing a segment there are less keywords than in "All results" setting). Also several missing keywords appear. However then in above API method the DataTable row count is way more than default 500 (thousands in my case).

Conclusion: Because the archiving query has several subqueries/groupings, there can be more records matching `28`= 1 than DataTable allows. Also, because each keyword is listed, this column cannot be summed in other way I believe. That means, even if in DataTable there are 500 rows, in the report there can be a few keywords or 0 records after filtering is done, while in reality there are several keywords which were cut off from the result.

Proposed fix:

  1. Remove the filtering in API method.

  2. Apply filtering on query level during archiving and have separate report for that. Currently Actions_sitesearch is used both by getSiteSearchNoResultKeywords and getSiteSearchKeywords.

mattab commented 7 years ago

Thanks for the report @mkrakiewicz

It would be super helpful if you could submit a Pull request with a new System test that shows the bug. This way we can see it reproduced and it's much easier for us to fix it afterwards. Let me know... thanks