matomo-org / matomo-log-analytics

Import any kind of server logs in Matomo for powerful log analytics. Universal log file parsing and reporting.
https://matomo.org/log-analytics/
GNU General Public License v3.0

no bot tracked in logs #364

Closed sandrocantagallo closed 2 months ago

sandrocantagallo commented 1 year ago

I’m here to list the problems I’m still experiencing.

At first, the problem seemed to be the crontab entry, which contained parameters telling the importer to import everything.

The new crontab entry therefore looks like this:

0 22 python3 /var/www/html/matomo/misc/log-analytics/import_logs.py --url=http://19...1/matomo/ --idsite=2 /var/log/httpd/443-access_log > /home/ /***logs/matomo_import.log

But it still does not work, and bots are still being recorded in the visit log.

We then enabled user-agent logging in the access log file:

192.168.32.229 - - [22/Oct/2023:03:28:20 +0200] "GET /templates/jsn_solid_pro/js/jsn_link_profession_selected.js?ver=1697760000 HTTP/1.1" 200 3440 Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/118.0.5993.70 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)

Even with this change, the import is still incorrect:

Logs import summary
-------------------

    129793 requests imported successfully
    4453 requests were downloads
    193049 requests ignored:
        625 HTTP errors
        1116 HTTP redirects
        13443 invalid log lines
        0 filtered log lines
        0 requests did not match any known site
        0 requests did not match any --hostname
        0 requests done by bots, search engines...
        177865 requests to static resources (css, js, images, ico, ttf...)
        0 requests to file downloads did not match any --download-extensions

At the moment I don’t know how to solve the problem. It would seem that the import system via logs does not work well.

Do you have any suggestions?

sandrocantagallo commented 1 year ago

I think the problem is that the log format is not compatible.

If I run the import with --dump-log-regex, I get:

2023-10-30 11:29:03,565: [INFO] Using format 'common'.
2023-10-30 11:29:03,565: [INFO] Regex being used: (?P<ip>[\w*.:-]+)\s+\S+\s+(?P<userid>\S+)\s+\[(?P<date>.*?)\s+(?P<timezone>.*?)\]\s+"(?P<method>\S+)\s+(?P<path>.*?)\s+\S+"\s+(?P<status>\d+)\s+(?P<length>\S+)

With this format the user agent is not captured, so the script cannot exclude bots.
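
A minimal sketch of why the 'common' format leaves the bot counter at zero: the regex printed above stops at the response size, so the user agent at the end of each line never ends up in a named group. The sample line is the Googlebot request quoted earlier in this thread; applying the pattern with `re.match` is an assumption about how the importer uses it.

```python
import re

# Regex reported by --dump-log-regex for the 'common' format (copied from the output above).
COMMON = (
    r'(?P<ip>[\w*.:-]+)\s+\S+\s+(?P<userid>\S+)\s+\[(?P<date>.*?)\s+(?P<timezone>.*?)\]\s+'
    r'"(?P<method>\S+)\s+(?P<path>.*?)\s+\S+"\s+(?P<status>\d+)\s+(?P<length>\S+)'
)

# Access log line from this thread, with the user agent appended after the size.
line = ('192.168.32.229 - - [22/Oct/2023:03:28:20 +0200] '
        '"GET /templates/jsn_solid_pro/js/jsn_link_profession_selected.js?ver=1697760000 HTTP/1.1" '
        '200 3440 Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) '
        'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/118.0.5993.70 Mobile Safari/537.36 '
        '(compatible; Googlebot/2.1; +http://www.google.com/bot.html)')

match = re.match(COMMON, line)
fields = match.groupdict()

# Everything after the matched prefix is ignored by this pattern.
leftover = line[match.end():].strip()

print(sorted(fields))          # named groups that were captured
print('user_agent' in fields)  # False: no group holds the user agent
print(leftover)                # the uncaptured user-agent text
```

The line still matches, which would explain why requests are imported while nothing is counted as a bot.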

sandrocantagallo commented 1 year ago

I ran this test:

import re

regex_pattern = r'(?P<ip>[\w*.:-]+)\s+\S+\s+(?P<userid>\S+)\s+\[(?P<date>.*?)\s+(?P<timezone>.*?)\]\s+"(?P<method>\S+)\s+(?P<path>.*?)\s+\S+"\s+(?P<status>\d+)\s+(?P<length>\S+)\s+(?P<user_agent>.+)'

input_string = '192.168.32.229 - - [29/Oct/2023:03:45:19 +0100] "GET /relazione-annuale/relazione_annuale_2022.html HTTP/1.1" 200 27518 Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/118.0.5993.117 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)'

match = re.search(regex_pattern, input_string)

if match:
    ip = match.group('ip')
    userid = match.group('userid')
    date = match.group('date')
    timezone = match.group('timezone')
    method = match.group('method')
    path = match.group('path')
    status = match.group('status')
    length = match.group('length')
    user_agent = match.group('user_agent')

    print(f"IP: {ip}")
    print(f"User ID: {userid}")
    print(f"Date: {date}")
    print(f"Timezone: {timezone}")
    print(f"Method: {method}")
    print(f"Path: {path}")
    print(f"Status: {status}")
    print(f"Agent: {user_agent}")
else:
    print("No match found.")

It works and returns:

IP: 192.168.32.229
User ID: -
Date: 29/Oct/2023:03:45:19
Timezone: +0100
Method: GET
Path: /relazione-annuale/relazione_annuale_2022.html
Status: 200
Agent: Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/118.0.5993.117 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)

Is there a way to set, at import time, which regular expression is used to read the logs, so that the user agent is also captured and bots are excluded from analytics?

sandrocantagallo commented 1 year ago

I found that there is a parameter, --log-format-regex='', that should solve the problem. But if I use the regular expression I created: r'(?P<ip>[\w*.:-]+)\s+\S+\s+(?P<userid>\S+)\s+\[(?P<date>.*?)\s+(?P<timezone>.*?)\]\s+"(?P<method>\S+)\s+(?P<path>.*?)\s+\S+"\s+(?P<status>\d+)\s+(?P<length>\S+)\s+(?P<user_agent>.+)'

the script fails with an error.

sandrocantagallo commented 1 year ago

So I modified the script misc/log-analytics/import_logs.py.

I added a new format regular expression:

_TEST_EXTENDED_LOG_FORMAT = (_COMMON_LOG_FORMAT +
    r'\s+(?P<user_agent>.+)'
)

Then I added the new format to FORMATS:

FORMATS = {
    'common': RegexFormat('common', _COMMON_LOG_FORMAT),
    'test': RegexFormat('test', _TEST_EXTENDED_LOG_FORMAT),
    'common_vhost': RegexFormat('common_vhost', _HOST_PREFIX + _COMMON_LOG_FORMAT),
    'ncsa_extended': RegexFormat('ncsa_extended', _NCSA_EXTENDED_LOG_FORMAT),
    'common_complete': RegexFormat('common_complete', _HOST_PREFIX + _NCSA_EXTENDED_LOG_FORMAT),
    'w3c_extended': W3cExtendedFormat(),
    'amazon_cloudfront': AmazonCloudFrontFormat(),
    'incapsula_w3c': IncapsulaW3CFormat(),
    'iis': IisFormat(),
    'shoutcast': ShoutcastFormat(),
    's3': RegexFormat('s3', _S3_LOG_FORMAT),
    'icecast2': RegexFormat('icecast2', _ICECAST2_LOG_FORMAT),
    'elb': RegexFormat('elb', _ELB_LOG_FORMAT, '%Y-%m-%dT%H:%M:%S'),
    'nginx_json': JsonFormat('nginx_json'),
    'ovh': RegexFormat('ovh', _OVH_FORMAT),
    'haproxy': RegexFormat('haproxy', _HAPROXY_FORMAT, '%d/%b/%Y:%H:%M:%S.%f'),
    'gandi': RegexFormat('gandi', _GANDI_SIMPLE_HOSTING_FORMAT, '%d/%b/%Y:%H:%M:%S')
}

and forced the use of this new format during import:

log-analytics % python3 import_logs.py --url=http://localhost/y-analytics 443-access_log.txt --idsite=8 --log-format-name="test"

Now the import works and bots are detected:

Logs import summary
-------------------

    79 requests imported successfully
    10 requests were downloads
    780 requests ignored:
        3 HTTP errors
        2 HTTP redirects
        21 invalid log lines
        0 filtered log lines
        0 requests did not match any known site
        0 requests did not match any --hostname
        39 requests done by bots, search engines...
        715 requests to static resources (css, js, images, ico, ttf...)
        0 requests to file downloads did not match any --download-extensions

Is there a way to do the same without hard-coding it in the script?
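
One route that avoids editing import_logs.py is the --log-format-regex parameter mentioned earlier in this thread. The following is only a sketch, assuming that parameter accepts the same named-group pattern as the built-in formats; why the earlier attempt failed is not established here. It checks the pattern locally and builds the command as an argument list, so the pattern is passed as a plain string (without Python's r'...' wrapper) and the shell cannot alter quotes or backslashes. The URL, site ID, and log file name are copied from the test command above.

```python
import re
import shlex

# Pattern from the standalone test earlier in the thread: the 'common' regex
# plus a trailing user_agent group.
PATTERN = (
    r'(?P<ip>[\w*.:-]+)\s+\S+\s+(?P<userid>\S+)\s+\[(?P<date>.*?)\s+(?P<timezone>.*?)\]\s+'
    r'"(?P<method>\S+)\s+(?P<path>.*?)\s+\S+"\s+(?P<status>\d+)\s+(?P<length>\S+)'
    r'\s+(?P<user_agent>.+)'
)

# Fail early if the pattern is invalid or lacks the group needed for bot detection.
compiled = re.compile(PATTERN)
if 'user_agent' not in compiled.groupindex:
    raise ValueError('pattern has no user_agent group')

# Argument list; the values are taken from the test command in this thread.
argv = [
    'python3', 'import_logs.py',
    '--url=http://localhost/y-analytics',
    '--idsite=8',
    '--log-format-regex=' + PATTERN,
    '443-access_log.txt',
]

# shlex.join quotes every argument for a POSIX shell, ready to paste into a terminal.
print(shlex.join(argv))

# To run it directly without a shell: subprocess.run(argv, check=True)
```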

L3on1d commented 11 months ago

@sandrocantagallo Hi!

Have you tried this key "--enable-bots"?

https://matomo.org/faq/general/import-additional-data-including-bots-static-files-and-http-errors-tracking/

sandrocantagallo commented 11 months ago

I want to exclude bots from the Matomo stats. With this flag, bots would be included in the Matomo stats. My problem is that the log importer can't find bots in my log. I have worked around it by modifying the Python script.

sgiehl commented 2 months ago

@sandrocantagallo The log importer only detects a very basic set of common bots, to prevent their visits from being sent to the tracker at all. In addition to that, Matomo itself uses device detector to detect a lot more bots. If Matomo detects a bot, the tracker will also drop the visit. So even if the log importer doesn't detect some bots, Matomo might still do it and drop their visits.
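
The first, basic stage described here can be pictured with a small sketch. This is not the importer's actual code: the marker list and the function name are hypothetical, and substring matching on the user agent is an assumed detection strategy. It shows how a short list catches well-known crawlers while letting a less common client through to a later, more thorough check.

```python
# Hypothetical short list of crawler markers; the importer's real list may differ.
BASIC_BOT_MARKERS = ('bot', 'crawler', 'spider')


def looks_like_basic_bot(user_agent: str) -> bool:
    """Return True if the user agent contains any marker from the short list."""
    ua = user_agent.lower()
    return any(marker in ua for marker in BASIC_BOT_MARKERS)


# Googlebot user agent taken from the log line in this thread.
googlebot = ('Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) '
             'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/118.0.5993.117 '
             'Mobile Safari/537.36 (compatible; Googlebot/2.1; '
             '+http://www.google.com/bot.html)')

# Made-up user agent that a short marker list would miss.
obscure = 'ExampleFetcher/1.0'

print(looks_like_basic_bot(googlebot))  # True
print(looks_like_basic_bot(obscure))    # False
```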