matomo-org / device-detector

The Universal Device Detection library will parse any User Agent and detect the browser, operating system, device used (desktop, tablet, mobile, tv, cars, console, etc.), brand and model.
http://devicedetector.net
GNU Lesser General Public License v3.0

import_logs.py doesn't ignore bots #7493

Closed sandrocantagallo closed 1 year ago

sandrocantagallo commented 1 year ago

Crontab:

0 22 * * * python3 /var/www/html/matomo/misc/log-analytics/import_logs.py --url=http://19...1/matomo/ --idsite=2 /var/log/httpd/443-access_log > /home/* /***logs/matomo_import.log

Access Logs:

192.***.*** - - [22/Oct/2023:03:28:20 +0200] "GET /templates/jsn_solid_pro/js/jsn_link_profession_selected.js?ver=1697760000 HTTP/1.1" 200 3440 Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/118.0.5993.70 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)

But the import summary shows:

129793 requests imported successfully
4453 requests were downloads
193049 requests ignored:
    625 HTTP errors
    1116 HTTP redirects
    13443 invalid log lines
    0 filtered log lines
    0 requests did not match any known site
    0 requests did not match any --hostname
    0 requests done by bots, search engines...
    177865 requests to static resources (css, js, images, ico, ttf...)
    0 requests to file downloads did not match any --download-extensions

The problem is this line: "0 requests done by bots, search engines...", even though the log contains Googlebot requests like the one above.

Do you have any suggestions?

sgiehl commented 1 year ago

Hi @sandrocantagallo, the statistic displayed by the log importer is unrelated to device detector. The log importer already tries to filter away some common bots; see https://github.com/matomo-org/matomo-log-analytics/blob/5.x-dev/import_logs.py#L84-L114. Anything that is not detected there might be detected later by device detector in Matomo itself. Depending on its settings, Matomo might then drop those tracking requests.
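To make the importer-side filtering concrete, here is a minimal sketch, assuming the filter is a case-insensitive substring check on the parsed user agent. The names `BOT_FRAGMENTS` and `looks_like_bot` and the fragment list are hypothetical and not taken from import_logs.py.

```python
# Hypothetical, shortened list of user-agent fragments (an assumption,
# not the importer's real exclusion list).
BOT_FRAGMENTS = ("googlebot", "bingbot", "crawler", "spider")


def looks_like_bot(user_agent):
    """Return True if the user agent contains a known bot fragment."""
    if not user_agent:
        # Without a parsed user agent there is nothing to compare against,
        # so the request cannot be counted as a bot.
        return False
    ua = user_agent.lower()
    return any(fragment in ua for fragment in BOT_FRAGMENTS)


googlebot_ua = (
    "Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) "
    "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/118.0.5993.70 "
    "Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
)
print(looks_like_bot(googlebot_ua))
print(looks_like_bot(""))
```

A check of this kind can only work if the chosen log format actually captures a user agent field; with an empty user agent, no request would be classified as a bot.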

sandrocantagallo commented 1 year ago

> The log importer already tries to filter away some common bots

The problem is that in my case this feature didn't work.

The cause was the regular expression the import script used to parse my log lines.

To fix it, I had to modify the import script:

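# Custom format: the common log format followed by an unquoted user agent
# that runs to the end of the line (no referrer field).
_TEST_EXTENDED_LOG_FORMAT = (_COMMON_LOG_FORMAT +
    r'\s+(?P<user_agent>.+)'
)
FORMATS = {
    'common': RegexFormat('common', _COMMON_LOG_FORMAT),
    # Register the custom format under the name 'test'.
    'test': RegexFormat('test', _TEST_EXTENDED_LOG_FORMAT),
    # ... the remaining built-in formats stay unchanged ...
}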

Then I forced the use of this format when launching the import command:

python3 import_logs.py --url=http://localhost/y-analytics access_log.txt --idsite=8 --log-format-name="test"

At this point the script can parse my log and recognizes the user agent.
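The effect of the format change can be reproduced outside the importer. The sketch below is an illustration under assumptions: `COMMON` is a pattern modeled on a typical common-log-format regex rather than copied from import_logs.py, `QUOTED` stands in for a format that expects a quoted referrer and a quoted user agent, and the IP address 192.0.2.10 is a placeholder for the redacted one.

```python
import re

# Assumed common-log-format pattern (modeled on typical CLF regexes).
COMMON = (
    r'(?P<ip>[\w*.:-]+)\s+\S+\s+(?P<userid>\S+)\s+'
    r'\[(?P<date>.*?)\s+(?P<timezone>.*?)\]\s+'
    r'"(?P<method>\S+)\s+(?P<path>.*?)\s+\S+"\s+'
    r'(?P<status>\d+)\s+(?P<length>\S+)'
)
# Expects a quoted referrer followed by a quoted user agent.
QUOTED = COMMON + r'\s+"(?P<referrer>.*?)"\s+"(?P<user_agent>.*?)"'
# Same idea as the 'test' format above: unquoted user agent up to end of line.
UNQUOTED = COMMON + r'\s+(?P<user_agent>.+)'

# Sample line from the thread, with a placeholder IP address.
LINE = (
    '192.0.2.10 - - [22/Oct/2023:03:28:20 +0200] '
    '"GET /templates/jsn_solid_pro/js/jsn_link_profession_selected.js?ver=1697760000 HTTP/1.1" '
    '200 3440 Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) '
    'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/118.0.5993.70 Mobile Safari/537.36 '
    '(compatible; Googlebot/2.1; +http://www.google.com/bot.html)'
)


def parse(pattern, line):
    """Match the whole line and return the named groups, or None on failure."""
    match = re.match(pattern + r'\s*$', line)
    return match.groupdict() if match else None


quoted_result = parse(QUOTED, LINE)
unquoted_result = parse(UNQUOTED, LINE)
print("quoted format matched:", quoted_result is not None)
print("user agent from unquoted format:", unquoted_result["user_agent"])
```

The quoted variant cannot match because the line has no referrer field and no quotes around the user agent, while the unquoted variant captures the full user agent string, including the Googlebot marker.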

Logs import summary

79 requests imported successfully
10 requests were downloads
780 requests ignored:
    3 HTTP errors
    2 HTTP redirects
    21 invalid log lines
    0 filtered log lines
    0 requests did not match any known site
    0 requests did not match any --hostname
    **_39 requests done by bots, search engines..._**
    715 requests to static resources (css, js, images, ico, ttf...)
    0 requests to file downloads did not match any --download-extensions

I also saw that there is a parameter to force the import with a specific regular expression:

--log-format-regex

but I couldn't get it to work. The documentation on this option is too sparse to help when something goes wrong.
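One way to narrow down such a problem is to test a candidate expression against a few log lines in plain Python before passing it to --log-format-regex. This is only a sketch: `check_pattern` is a hypothetical helper, the sample lines are made up, and the group names follow the snippet above (`user_agent`) or are assumptions about what the importer expects.

```python
import re


def check_pattern(pattern, lines):
    """Compile the pattern and report its named groups and full-line matches."""
    regex = re.compile(pattern)
    matched = sum(1 for line in lines if regex.fullmatch(line.rstrip("\n")))
    return {
        "groups": sorted(regex.groupindex),  # named groups defined in the pattern
        "matched": matched,
        "total": len(lines),
    }


# Made-up sample: one well-formed line and one that should not match.
sample = [
    '192.0.2.10 - - [22/Oct/2023:03:28:20 +0200] "GET /index.html HTTP/1.1" 200 3440 ExampleAgent/1.0',
    'not a log line',
]
candidate = (
    r'(?P<ip>\S+)\s+\S+\s+\S+\s+\[(?P<date>.*?)\s+(?P<timezone>.*?)\]\s+'
    r'"(?P<method>\S+)\s+(?P<path>.*?)\s+\S+"\s+(?P<status>\d+)\s+(?P<length>\S+)'
    r'\s+(?P<user_agent>.+)'
)
report = check_pattern(candidate, sample)
print(report)
```

If the pattern matches the sample here but fails on the command line, shell quoting of the expression is a plausible place to look next.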