Closed sandrocantagallo closed 2 months ago
I think the problem is that the log format is not compatible.
If I use --dump-log-regex
I get:
2023-10-30 11:29:03,565: [INFO] Using format 'common'.
2023-10-30 11:29:03,565: [INFO] Regex being used: (?P<ip>[\w*.:-]+)\s+\S+\s+(?P<userid>\S+)\s+\[(?P<date>.*?)\s+(?P<timezone>.*?)\]\s+"(?P<method>\S+)\s+(?P<path>.*?)\s+\S+"\s+(?P<status>\d+)\s+(?P<length>\S+)
But with this regex the script cannot exclude bots, because it does not capture the user agent.
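A quick way to confirm this, as a minimal sketch: compile the regex printed by --dump-log-regex and list its named groups. The variable names are only for illustration; the pattern itself is copied from the log output above.

```python
import re

# The 'common' regex exactly as printed by --dump-log-regex above,
# split over several lines for readability.
COMMON_REGEX = (
    r'(?P<ip>[\w*.:-]+)\s+\S+\s+(?P<userid>\S+)\s+'
    r'\[(?P<date>.*?)\s+(?P<timezone>.*?)\]\s+'
    r'"(?P<method>\S+)\s+(?P<path>.*?)\s+\S+"\s+'
    r'(?P<status>\d+)\s+(?P<length>\S+)'
)

pattern = re.compile(COMMON_REGEX)

# groupindex maps every named group in the pattern to its position.
captured_fields = sorted(pattern.groupindex)
has_user_agent = 'user_agent' in pattern.groupindex

print(captured_fields)
print(has_user_agent)  # False: nothing after the length field is captured
```

Since there is no user_agent group, whatever follows the length field is never handed to any bot check.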
I ran this test:
import re

# The 'common' regex plus a trailing group that captures the user agent.
regex_pattern = r'(?P<ip>[\w*.:-]+)\s+\S+\s+(?P<userid>\S+)\s+\[(?P<date>.*?)\s+(?P<timezone>.*?)\]\s+"(?P<method>\S+)\s+(?P<path>.*?)\s+\S+"\s+(?P<status>\d+)\s+(?P<length>\S+)\s+(?P<user_agent>.+)'
input_string = '192.168.32.229 - - [29/Oct/2023:03:45:19 +0100] "GET /relazione-annuale/relazione_annuale_2022.html HTTP/1.1" 200 27518 Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/118.0.5993.117 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)'

match = re.search(regex_pattern, input_string)
if match:
    # Read every named group from the match.
    ip = match.group('ip')
    userid = match.group('userid')
    date = match.group('date')
    timezone = match.group('timezone')
    method = match.group('method')
    path = match.group('path')
    status = match.group('status')
    length = match.group('length')
    user_agent = match.group('user_agent')
    print(f"IP: {ip}")
    print(f"User ID: {userid}")
    print(f"Date: {date}")
    print(f"Timezone: {timezone}")
    print(f"Method: {method}")
    print(f"Path: {path}")
    print(f"Status: {status}")
    print(f"Agent: {user_agent}")
else:
    print("No match found.")
It works and returns:
IP: 192.168.32.229
User ID: -
Date: 29/Oct/2023:03:45:19
Timezone: +0100
Method: GET
Path: /relazione-annuale/relazione_annuale_2022.html
Status: 200
Agent: Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/118.0.5993.117 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
Now, is there a way to set, at import time, which regular expression is used to read the logs, so that the user agent is also captured and bots are excluded from analytics?
I found the parameter --log-format-regex='', which should solve the problem, but if I pass it the regular expression I created:
r'(?P
the script fails with an error.
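One thing that may be worth checking, as an assumption rather than a confirmed diagnosis: the r'...' prefix is Python raw-string syntax and is not part of the regular expression itself, and characters such as <, >, ( and " are also special to most shells. If the importer is started from Python, passing the arguments as a list sidesteps shell quoting. The sketch below only builds and prints the command. The build_import_command helper is hypothetical, and the URL, site ID and file name are the ones from the command used elsewhere in this thread.

```python
import shlex

# Extended pattern from the test above: 'common' plus a user_agent group.
EXTENDED_REGEX = (
    r'(?P<ip>[\w*.:-]+)\s+\S+\s+(?P<userid>\S+)\s+'
    r'\[(?P<date>.*?)\s+(?P<timezone>.*?)\]\s+'
    r'"(?P<method>\S+)\s+(?P<path>.*?)\s+\S+"\s+'
    r'(?P<status>\d+)\s+(?P<length>\S+)'
    r'\s+(?P<user_agent>.+)'
)


def build_import_command(regex, log_file, url, idsite):
    """Hypothetical helper: return the importer call as an argument list."""
    return [
        'python3', 'import_logs.py',
        f'--url={url}',
        f'--idsite={idsite}',
        # The bare pattern is passed, without the Python r'...' wrapper.
        f'--log-format-regex={regex}',
        log_file,
    ]


command = build_import_command(
    EXTENDED_REGEX, '443-access_log.txt', 'http://localhost/y-analytics', 8
)

# shlex.join shows a shell-safe rendering of the same argument list.
print(shlex.join(command))
```

The list could then be handed to subprocess.run(command) unchanged; whether this removes the error depends on what the original error actually was.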
So I modified the script misc/log-analytics/import_logs.py.
I added a new format to the regular expressions:
_TEST_EXTENDED_LOG_FORMAT = (
    _COMMON_LOG_FORMAT +
    r'\s+(?P<user_agent>.+)'
)
Then I added the new entry to FORMATS:
FORMATS = {
    'common': RegexFormat('common', _COMMON_LOG_FORMAT),
    'test': RegexFormat('test', _TEST_EXTENDED_LOG_FORMAT),
    'common_vhost': RegexFormat('common_vhost', _HOST_PREFIX + _COMMON_LOG_FORMAT),
    'ncsa_extended': RegexFormat('ncsa_extended', _NCSA_EXTENDED_LOG_FORMAT),
    'common_complete': RegexFormat('common_complete', _HOST_PREFIX + _NCSA_EXTENDED_LOG_FORMAT),
    'w3c_extended': W3cExtendedFormat(),
    'amazon_cloudfront': AmazonCloudFrontFormat(),
    'incapsula_w3c': IncapsulaW3CFormat(),
    'iis': IisFormat(),
    'shoutcast': ShoutcastFormat(),
    's3': RegexFormat('s3', _S3_LOG_FORMAT),
    'icecast2': RegexFormat('icecast2', _ICECAST2_LOG_FORMAT),
    'elb': RegexFormat('elb', _ELB_LOG_FORMAT, '%Y-%m-%dT%H:%M:%S'),
    'nginx_json': JsonFormat('nginx_json'),
    'ovh': RegexFormat('ovh', _OVH_FORMAT),
    'haproxy': RegexFormat('haproxy', _HAPROXY_FORMAT, '%d/%b/%Y:%H:%M:%S.%f'),
    'gandi': RegexFormat('gandi', _GANDI_SIMPLE_HOSTING_FORMAT, '%d/%b/%Y:%H:%M:%S')
}
Then I forced the use of this new format during import:
log-analytics % python3 import_logs.py --url=http://localhost/y-analytics 443-access_log.txt --idsite=8 --log-format-name="test"
Now the import works and bots are detected:
Logs import summary
-------------------
79 requests imported successfully
10 requests were downloads
780 requests ignored:
3 HTTP errors
2 HTTP redirects
21 invalid log lines
0 filtered log lines
0 requests did not match any known site
0 requests did not match any --hostname
39 requests done by bots, search engines...
715 requests to static resources (css, js, images, ico, ttf...)
0 requests to file downloads did not match any --download-extensions
Is there a way to do the same without hard-coding the change in the script?
@sandrocantagallo Hi!
Have you tried this key "--enable-bots"?
I want to exclude bots from the Matomo stats. With this flag, bots would be included in the Matomo stats. My problem is that the log importer can't find the bots in my log. I resolved it by modifying the Python script.
@sandrocantagallo The log importer only detects a very basic set of common bots, to prevent their visits from being sent to the tracker at all. In addition to that, Matomo itself uses device detector to detect a lot more bots. If Matomo detects a bot, the tracker will also drop the visit. So even if the log importer doesn't detect some bots, Matomo might still do it and drop their visits.
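To make the first of those two stages concrete, here is a rough sketch of a basic user-agent check. It is not the importer's actual code: the marker list, the function name and the case-insensitive substring approach are all assumptions for illustration, and the second user agent is a made-up browser string. It also shows why a missing user agent means nothing can be detected at this stage.

```python
# Illustrative markers only; the importer's real list is longer and may differ.
BASIC_BOT_MARKERS = ('googlebot', 'bingbot', 'bot/', 'crawler', 'spider')


def looks_like_basic_bot(user_agent):
    """Hypothetical first-stage check: case-insensitive substring match."""
    if not user_agent:
        # No user agent was captured from the log line, so nothing can match.
        return False
    lowered = user_agent.lower()
    return any(marker in lowered for marker in BASIC_BOT_MARKERS)


# User agent taken from the log line quoted earlier in this thread.
googlebot_ua = (
    'Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) '
    'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/118.0.5993.117 '
    'Mobile Safari/537.36 (compatible; Googlebot/2.1; '
    '+http://www.google.com/bot.html)'
)
# Made-up desktop browser string, for contrast.
browser_ua = (
    'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 '
    '(KHTML, like Gecko) Chrome/118.0.0.0 Safari/537.36'
)

print(looks_like_basic_bot(googlebot_ua))  # True
print(looks_like_basic_bot(browser_ua))    # False
print(looks_like_basic_bot(None))          # False
```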
I’m here to list the problems I’m still experiencing.
A first analysis suggested that the problem was the crontab, which contained parameters telling the importer to import everything.
The new crontab is therefore like this:
0 22 python3 /var/www/html/matomo/misc/log-analytics/import_logs.py --url=http://19...1/matomo/ --idsite=2 /var/log/httpd/443-access_log > /home/ /***logs/matomo_import.log
But the system still does not work, and bots continue to show up in the visit logs.
We then enabled user-agent logging in the access log file:
192.168.32.229 - - [22/Oct/2023:03:28:20 +0200] "GET /templates/jsn_solid_pro/js/jsn_link_profession_selected.js?ver=1697760000 HTTP/1.1" 200 3440 Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/118.0.5993.70 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
But even so, the importer still does not detect the bots.
Logs import summary
129793 requests imported successfully
4453 requests were downloads
193049 requests ignored:
625 HTTP errors
1116 HTTP redirects
13443 invalid log lines
0 filtered log lines
0 requests did not match any known site
0 requests did not match any --hostname
0 requests done by bots, search engines...
177865 requests to static resources (css, js, images, ico, ttf...)
0 requests to file downloads did not match any --download-extensions
At the moment I don’t know how to solve the problem. It seems that importing via logs does not work well.
Do you have any suggestions?
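One possible line of investigation, sketched under assumptions: the sample access-log line in this post carries the user agent unquoted and has no referrer field, whereas Apache's standard "combined" LogFormat writes both as quoted strings. The sketch assumes that an extended log format expects those two quoted fields after the common part. EXTENDED_TAIL is an illustration of that assumption, not a pattern taken from the importer, and the "-" referrer in the second line is made up.

```python
import re

# 'common' pattern as printed by --dump-log-regex.
COMMON = (
    r'(?P<ip>[\w*.:-]+)\s+\S+\s+(?P<userid>\S+)\s+'
    r'\[(?P<date>.*?)\s+(?P<timezone>.*?)\]\s+'
    r'"(?P<method>\S+)\s+(?P<path>.*?)\s+\S+"\s+'
    r'(?P<status>\d+)\s+(?P<length>\S+)'
)
# Assumed tail: quoted referrer and quoted user agent, as in Apache "combined".
EXTENDED_TAIL = r'\s+"(?P<referrer>.*?)"\s+"(?P<user_agent>.*?)"'

common_re = re.compile(COMMON)
extended_re = re.compile(COMMON + EXTENDED_TAIL)

REQUEST = (
    '192.168.32.229 - - [22/Oct/2023:03:28:20 +0200] '
    '"GET /templates/jsn_solid_pro/js/jsn_link_profession_selected.js'
    '?ver=1697760000 HTTP/1.1" 200 3440 '
)
BOT_UA = 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)'

# As in the post: user agent appended without quotes, no referrer.
unquoted_line = REQUEST + BOT_UA
# Same request as a combined-style line would carry it.
quoted_line = REQUEST + '"-" "' + BOT_UA + '"'


def captured_user_agent(line):
    """Return the user agent if the extended pattern matches, else None."""
    match = extended_re.match(line)
    return match.group('user_agent') if match else None


print(common_re.match(unquoted_line) is not None)  # True: common part matches
print(captured_user_agent(unquoted_line))          # None: no user agent seen
print(captured_user_agent(quoted_line))            # the Googlebot string
```

If that assumption holds for the importer, a line like the one in this post would fall back to the common format and yield "0 requests done by bots", and logging the referrer and user agent as quoted fields would be worth trying before modifying the script.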