matomo-org / matomo

Empowering people ethically with the leading open source alternative to Google Analytics, one that gives you full control over your data. Matomo lets you easily collect data from websites & apps, visualise it, and extract insights. Privacy is built in. Liberating web analytics. Star us on GitHub? +1. And we love pull requests!
https://matomo.org/
GNU General Public License v3.0

Log analytics list of improvements #3163

Closed mattab closed 9 years ago

mattab commented 12 years ago

In Piwik 1.8 we released a great new feature: importing access logs to generate statistics.

The V1 release works very well (it was tracked in #703), but there are ideas to improve it. This ticket is a placeholder of all ideas and discussions related to the Log Analytics feature!

New features

PERFORMANCE

How do you debug performance? First, run the script with --dry-run to see how many log lines per second are parsed; the rate should typically be between 2,000 and 5,000. Without --dry-run, the script also inserts the new pageviews and visits by calling the Piwik API.
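As a rough illustration of the kind of parse-rate measurement --dry-run reports, here is a minimal, hypothetical sketch; the sample line and regex are assumptions for illustration, not the importer's actual format definitions:

```python
import re
import time

# Hypothetical common-format access log line and pattern; the real importer
# ships much more complete format definitions.
LINE = '1.2.3.4 - - [10/Feb/2012:16:42:07 -0500] "GET / HTTP/1.0" 200 1204'
PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<date>.*?)\] '
    r'"(?P<request>.*?)" (?P<status>\d+) (?P<length>\S+)'
)

def lines_per_second(lines):
    """Count lines matching the pattern and return (matched, parse rate)."""
    start = time.time()
    matched = sum(1 for line in lines if PATTERN.match(line))
    elapsed = (time.time() - start) or 1e-9  # guard against a zero timer
    return matched, matched / elapsed

matched, rate = lines_per_second([LINE] * 10000)
```

A pure-parse loop like this should comfortably reach the quoted 2,000-5,000 lines/sec range; the usual bottleneck in a real run is the API inserts, not the parsing.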

Other tickets

mattab commented 11 years ago

There was another user in the forums reporting an error: view post

Could we explain the bug when it happens, and fail with a relevant error/notice message?

anonymous-matomo-user commented 11 years ago

Here is my CustomLog line (line breaks for better reading):

CustomLog "|/var/www/piwik.skweez.net/piwik/misc/log-analytics/import_logs.py
--url=http://piwik.skweez.net/ --add-sites-new-hosts
--output=/var/www/update.skweez.net/logs/piwik.log --recorders=4
--log-format-name=common_vhost -dd -" vhost_combined

Here is the log that is generated:

...
2012-11-23 22:35:07,759: [DEBUG] Launched recorder
2012-11-23 22:35:07,761: [DEBUG] Launched recorder
2012-11-23 22:35:07,762: [DEBUG] Launched recorder
2012-11-23 22:35:07,763: [DEBUG] Launched recorder
2012-11-24 06:30:01,375: [DEBUG] Site ID for hostname update.skweez.net not in cache
2012-11-24 06:30:01,378: [DEBUG] Site ID for hostname update.skweez.net not in cache
2012-11-24 06:30:01,633: [DEBUG] Accepted hostnames: all
2012-11-24 06:30:01,633: [DEBUG] Piwik URL is: http://piwik.skweez.net/
2012-11-24 06:30:01,633: [DEBUG] No token-auth specified
2012-11-24 06:30:01,633: [DEBUG] No credentials specified, reading them from "/var/www/piwik.skweez.net/piwik/config/config.ini.php"
2012-11-24 06:30:01,648: [DEBUG] Using credentials: (login = piwikadmin, password = ...)
2012-11-24 06:30:02,065: [DEBUG] Site ID for hostname update.skweez.net: 7
2012-11-24 06:30:02,709: [DEBUG] Site ID for hostname update.skweez.net: 7
Purging Piwik archives for dates: 2012-11-23 2012-11-24
2012-11-24 06:30:02,935: [DEBUG] Authentication token token_auth is: ...
2012-11-24 06:30:02,935: [DEBUG] Resolver: dynamic
2012-11-24 06:30:02,936: [DEBUG] Launched recorder
2012-11-24 06:30:02,938: [DEBUG] Launched recorder
2012-11-24 06:30:02,940: [DEBUG] Launched recorder
2012-11-24 06:30:02,941: [DEBUG] Launched recorder

Logs import summary
-------------------

    5 requests imported successfully
    14 requests were downloads
    15 requests ignored:
        0 invalid log lines
        0 requests done by bots, search engines, ...
        1 HTTP errors
        0 HTTP redirects
        14 requests to static resources (css, js, ...)
        0 requests did not match any known site
        0 requests did not match any requested hostname

Website import summary
----------------------

    5 requests imported to 1 sites
        1 sites already existed
        0 sites were created:

    0 distinct hostnames did not match any existing site:

Performance summary
-------------------

    Total time: 28495 seconds
    Requests imported per second: 0.0 requests per second

2012-11-25 06:33:02,723: [DEBUG] Site ID for hostname update.skweez.net not in cache
2012-11-25 06:33:02,723: [DEBUG] Site ID for hostname update.skweez.net not in cache
2012-11-25 06:33:02,724: [DEBUG] Site ID for hostname update.skweez.net not in cache
2012-11-25 06:33:03,104: [DEBUG] Site ID for hostname update.skweez.net: 7
2012-11-25 06:33:03,136: [DEBUG] Site ID for hostname update.skweez.net: 7
2012-11-25 06:33:03,141: [DEBUG] Site ID for hostname update.skweez.net: 7
2012-11-25 06:33:03,372: [DEBUG] Accepted hostnames: all
2012-11-25 06:33:03,372: [DEBUG] Piwik URL is: http://piwik.skweez.net/
2012-11-25 06:33:03,372: [DEBUG] No token-auth specified
2012-11-25 06:33:03,372: [DEBUG] No credentials specified, reading them from "/var/www/piwik.skweez.net/piwik/config/config.ini.php"
2012-11-25 06:33:03,373: [DEBUG] Using credentials: (login = piwikadmin, password = ...)
2012-11-25 06:33:03,492: [DEBUG] Authentication token token_auth is: ...
2012-11-25 06:33:03,492: [DEBUG] Resolver: dynamic
2012-11-25 06:33:03,493: [DEBUG] Launched recorder
2012-11-25 06:33:03,494: [DEBUG] Launched recorder
2012-11-25 06:33:03,495: [DEBUG] Launched recorder
2012-11-25 06:33:03,495: [DEBUG] Launched recorder
Purging Piwik archives for dates: 2012-11-25 2012-11-24

Logs import summary
-------------------

    9 requests imported successfully
    42 requests were downloads
    42 requests ignored:
        0 invalid log lines
        0 requests done by bots, search engines, ...
        3 HTTP errors
        0 HTTP redirects
        39 requests to static resources (css, js, ...)
        0 requests did not match any known site
        0 requests did not match any requested hostname

Website import summary
----------------------

    9 requests imported to 1 sites
        1 sites already existed
        0 sites were created:

    0 distinct hostnames did not match any existing site:

Performance summary
-------------------

    Total time: 86580 seconds
    Requests imported per second: 0.0 requests per second

Logs import summary
-------------------

    0 requests imported successfully
    0 requests were downloads
    0 requests ignored:
        0 invalid log lines
        0 requests done by bots, search engines, ...
        0 HTTP errors
        0 HTTP redirects
        0 requests to static resources (css, js, ...)
        0 requests did not match any known site
        0 requests did not match any requested hostname

Website import summary
----------------------

    0 requests imported to 0 sites
        0 sites already existed
        0 sites were created:

    0 distinct hostnames did not match any existing site:

Performance summary
-------------------

    Total time: 12 seconds
    Requests imported per second: 0.0 requests per second

2012-11-25 06:33:16,016: [DEBUG] Accepted hostnames: all
2012-11-25 06:33:16,016: [DEBUG] Piwik URL is: http://piwik.skweez.net/
2012-11-25 06:33:16,016: [DEBUG] No token-auth specified
2012-11-25 06:33:16,016: [DEBUG] No credentials specified, reading them from "/var/www/piwik.skweez.net/piwik/config/config.ini.php"
2012-11-25 06:33:16,017: [DEBUG] Using credentials: (login = piwikadmin, password = ...)
2012-11-25 06:33:16,156: [DEBUG] Authentication token token_auth is: ...
2012-11-25 06:33:16,156: [DEBUG] Resolver: dynamic
2012-11-25 06:33:16,157: [DEBUG] Launched recorder
2012-11-25 06:33:16,157: [DEBUG] Launched recorder
2012-11-25 06:33:16,159: [DEBUG] Launched recorder
2012-11-25 06:33:16,159: [DEBUG] Launched recorder

So it is getting the logs when apache is reloading, which it does at night after logrotate.

aspectra commented 11 years ago

Hi, I would be glad if you could add a new option to the script: it should import only the log lines whose path matches a specified pattern, i.e. exactly the opposite of the --exclude-path-from option. As far as I understand, we could copy the check_path function and swap the True and False return values. I have posted the part with the changes.

    def check_path(self, hit):
        for included_path in config.options.included_paths:
            if fnmatch.fnmatch(hit.path, included_path):
                return True
        return False

Unfortunately I don't know where to modify the script to add this option.

Many thanks for your help.
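The idea above can be sketched as a standalone function; `included_paths` here is a hypothetical stand-in for config.options.included_paths, not actual importer code:

```python
import fnmatch

def check_path(path, included_paths):
    """Keep a hit only if its path matches at least one include pattern.

    This inverts the exclude logic: a line matching no pattern is dropped."""
    for pattern in included_paths:
        if fnmatch.fnmatch(path, pattern):
            return True
    return False
```

For example, check_path('/app/index.html', ['/app/*']) keeps the hit, while check_path('/static/logo.png', ['/app/*']) drops it.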

anonymous-matomo-user commented 11 years ago

Hi all, I am new to Piwik. I installed Piwik on an Apache web server and tried to import a log file from a Tomcat web server, but I get the following error:

    Fatal error: Cannot guess the logs format. Please give one using either the --log-format-name or --log-format-regex option

This is the command that I used:

    python /var/www/piwik/misc/log-analytics/import_logs.py --url=http://192.168.1.100/piwik/ /home/user/app1/catalina.2012-12-10.log --idsite=1 --recorders=1 --enable-http-errors --enable-http-redirects --enable-static --enable-bots

And this is what the log file contains:

    Dec 10, 2012 12:02:50 AM org.apache.catalina.core.StandardWrapperValve invoke
    INFO: 2012-12-10 00:02:50,000 - DEBUG InOutCallableStatementCreator#<init> - Call: AdminReports.GETAPPLICATIONINFO(?)

I tried googling it but didn't find much, and the Piwik forum was the same. Can you help me? What parameter should I use with the --log-format-name or --log-format-regex option?

mattab commented 11 years ago

In trunk, when I CTRL+C the script, it does not exit immediately; it takes 5-10 seconds before the software stops running and then outputs the log. I think it is a recent regression?

anonymous-matomo-user commented 11 years ago

Suggestion - Bandwidth Usage

I used to see it in AWStats... http://forum.piwik.org/read.php?2,98279,98330#msg-98330

There is no size information in the logs, but I guess AWStats looks up the files accessed in the logs and counts their sizes.

mattab commented 11 years ago

For piwik.php performance improvements and asynchronous data imports, see #3632

anonymous-matomo-user commented 11 years ago

Has anyone found a solution to this yet? I'm having the same problem with my IIS logs not importing.

0 lines parsed, 0 lines recorded, 0 records/sec (avg), 0 records/sec (current)
Parsing log Z:\logs\W3SVC14\u_ex121218.log...
1648 lines parsed, 0 lines recorded, 0 records/sec (avg), 0 records/sec (current)
1648 lines parsed, 0 lines recorded, 0 records/sec (avg), 0 records/sec (current)
1648 lines parsed, 43 lines recorded, 14 records/sec (avg), 43 records/sec (current)
1648 lines parsed, 43 lines recorded, 10 records/sec (avg), 0 records/sec (current)
Fatal error: None
You can restart the import of "Z:\logs\W3SVC14\u_ex121218.log" from the point it failed by specifying --skip=3 on the command line.

Replying to unaidswebmaster:

I'm trying to import our IIS logs using import_logs.py but it keeps hitting a snag somewhere in the middle. The message simply says:

Fatal error: None
You can restart the import of "d:\tmp\logfiles\ex120803.log" from the point it failed by specifying --skip=215201 on the command line.

When I restart it with the skip parameter, it would not record any more lines and fail again a few lines down (see output below)

C:\Python27>python "d:\websites\piwik\misc\log-analytics\import_logs.py" --url=http://piwikpre.unaids.org/ "d:\tmp\logfiles\ex120803.log" --idsite=2 --skip=215201
0 lines parsed, 0 lines recorded, 0 records/sec (avg), 0 records/sec (current)
Parsing log d:\tmp\logfiles\ex120803.log...
182921 lines parsed, 0 lines recorded, 0 records/sec (avg), 0 records/sec (current)
218630 lines parsed, 0 lines recorded, 0 records/sec (avg), 0 records/sec (current)
222550 lines parsed, 0 lines recorded, 0 records/sec (avg), 0 records/sec (current)
227111 lines parsed, 0 lines recorded, 0 records/sec (avg), 0 records/sec (current)
231539 lines parsed, 0 lines recorded, 0 records/sec (avg), 0 records/sec (current)
235666 lines parsed, 0 lines recorded, 0 records/sec (avg), 0 records/sec (current)
240261 lines parsed, 0 lines recorded, 0 records/sec (avg), 0 records/sec (current)
244780 lines parsed, 0 lines recorded, 0 records/sec (avg), 0 records/sec (current)
Fatal error: None
You can restart the import of "d:\tmp\logfiles\ex120803.log" from the point it failed by specifying --skip=215225 on the command line.

The format we are using is W3C Extended Log File Format and we are tracking extended properties, such as Host, Cookie, and Referer. I'd like to send the log file that I used for this example, but it's too big to be attached (20Mb even when zipped). Can I send it by some other means?

Thanks a lot! -Jo

anonymous-matomo-user commented 11 years ago

Checking in on the IIS logs not importing issue. I'm having the same issue as Jo reported here. The errors are the same.

anonymous-matomo-user commented 11 years ago

I am running into the same problem as Jo as well. Please let me know if there are any suggestions or possible solutions. We have been trying to diagnose the problem for a couple days but still have not found a solution. Thanks.

anonymous-matomo-user commented 11 years ago

Replying to wpballard:

Checking in on the IIS logs not importing issue. I'm having the same issue as Jo reported here. The errors are the same.

One thing I've noticed is that --dry-run works perfectly. That might help narrow down where the problem is. Likely in the code that commits the changes to the DB.

anonymous-matomo-user commented 11 years ago

Hey Folks,

Glad to see there is good interest in the log file processing.

The first feature I would like to see added is the opposite of --exclude-path: an --include-path option.

In our architecture we have MANY web assets under a single domain, and web logs are kept per domain. This is out of our control. The assets include multiple applications, APIs, and web services, so it would be nice to process the log files by including only the paths we want. The exclusion route is cumbersome, as each call would require 5-10 excludes instead of a single include.

anonymous-matomo-user commented 11 years ago

The second feature I would like to see is support for the XFERLOG format (http://www.castaglia.org/proftpd/doc/xferlog.html) for handling FTP logs.

Much of our business is based on downloading data and files via FTP, so these kinds of stats and analyses are valuable.

anonymous-matomo-user commented 11 years ago

The third feature I would like to see added is the ability to process log files rotated on a monthly basis. I know this goes contrary to the recommendations, but in our business we do not manage the IT infrastructure, only the line-of-business services and apps on top of it.

Currently I handle this with a Bash script. Before processing the log file I count its lines (using wc -l) and store the count in a loglines.log file. The next time the script runs, I tail loglines.log, grab the last line count, and use it to populate the --skip parameter.

To handle the monthly log rotation: if the current wc -l is less than the stored count, I reset --skip to zero (0).

It is crude, but it works. Building this natively in Python would be fairly straightforward and would support monthly rotation.

The added bonus is that the same log file can be processed multiple times a day, even for daily-rotated logs. This is a happy compromise between real-time JavaScript tracking and daily log processing, especially for high-volume sites with huge log files.

Cron is handy for this.
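The Bash workflow described above can be sketched in Python roughly like this; the state-file handling and function names are made up for illustration and are not part of import_logs.py:

```python
import os

def count_lines(path):
    """Equivalent of wc -l: count the lines currently in the log file."""
    with open(path) as f:
        return sum(1 for _ in f)

def compute_skip(log_path, state_file):
    """Return a --skip value for the next import run.

    Stores the current line count in state_file. If the log now has fewer
    lines than last time, it was rotated, so restart from zero."""
    current = count_lines(log_path)
    previous = 0
    if os.path.exists(state_file):
        with open(state_file) as f:
            previous = int(f.read().strip() or 0)
    skip = 0 if current < previous else previous
    with open(state_file, 'w') as f:
        f.write(str(current))
    return skip
```

Run from cron, this lets the same log be imported several times a day: each run skips exactly the lines already handled, and a rotation resets the counter.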

cbay commented 11 years ago

Those having errors with IIS: please upload a log file with the lines causing the error. A single line is probably causing it, so it would be better to upload just that line rather than a big file. The skip value will help you find that line.
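To pull out the line the importer stopped on, something like this can help; it is a hypothetical helper, assuming the failing line is the first one after the skipped lines:

```python
import itertools

def line_after_skip(path, skip):
    """Return the (skip+1)-th line of the file, or None past end of file."""
    with open(path) as f:
        return next(itertools.islice(f, skip, skip + 1), None)
```

For example, line_after_skip('u_ex121218.log', 3) would show the candidate line for the --skip=3 failure reported above.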

cbay commented 11 years ago

dsampson: agree for the --include-path suggestion. I'll add it later.

FTP logs: that's definitely not something that should be included in Piwik. You can define your own log format with a regexp; have you tried?

Log rotating: not easy. Right now, the Python script has no memory, so it can't store data (such as the latest position for log files). Besides, how would the script know when the log file has been rotated and we must reset the position?

The real solution, to me, would be that Piwik (the PHP/MySQL part) would know if a log line has already been imported, so that you can basically reimport any log file at any time, and it would skip lines already imported. It cannot be as fast as --skip=n, but it would be safe and easy to use.
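The dedup idea could look roughly like this on the Piwik side; this is a toy sketch only, and a real implementation would persist the digests in the MySQL datastore rather than an in-memory set:

```python
import hashlib

seen = set()  # stand-in for a digest table in the Piwik database

def import_line(line):
    """Record a log line only once; re-imports of the same line are skipped."""
    digest = hashlib.sha1(line.encode('utf-8')).hexdigest()
    if digest in seen:
        return False  # duplicate: already imported earlier
    seen.add(digest)
    return True       # new line: record it
```

One caveat: two genuinely identical hits (the same line occurring twice in a log) would also be skipped, so a real implementation would likely mix the file name or a line offset into the digest.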

anonymous-matomo-user commented 11 years ago

See comments inline...

Replying to Cyril:

dsampson: agree for the --include-path suggestion. I'll add it later.

Thanks for this. Appreciated

FTP logs: that's definitely not something that should be included to Piwik. You can define your own log format with a regexp, have you tried?

For those of us in the big data business, a FOSS solution offering all the features of Piwik for FTP would be great. An unlikely fork, so I thought it could be a possible feature.

I'm working on the regex for XFERLOG, but having trouble composing a new regex group from the values of other groups. For instance, the date field is not a clean YYYY-MM-DD, so I need to figure out how to create a date from the values of three other regex groups. I am a regex greenhorn for sure.

Log rotating: not easy. Right now, the Python script has no memory, so it can't store data (such as the latest position for log files). Besides, how would the script know when the log file has been rotated and we must reset the position?

I do it by comparing the last line count to the new one: #linesyesterday will be greater than #linestoday if the log file has been rotated. I have done log tracking in Python using plain text files in the past; they get big, but the head can be trimmed when the file grows too large. A NoSQL or data-object approach could also work.

The real solution, to me, would be that Piwik (the PHP/MySQL part) would know if a log line has already been imported, so that you can basically reimport any log file at any time, and it would skip lines already imported. It cannot be as fast as --skip=n, but it would be safe and easy to use.

This would be a good alternative with some hit on performance.

Thanks again for the reply

anonymous-matomo-user commented 11 years ago

Did either of these features make it into the latest 1.10.1 release?

Replying to dsampson:

See comments inline...

Replying to Cyril:

dsampson: agree for the --include-path suggestion. I'll add it later.

Log Rotation: The real solution, to me, would be that Piwik (the PHP/MySQL part) would know if a log line has already been imported, so that you can basically reimport any log file at any time, and it would skip lines already imported. It cannot be as fast as --skip=n, but it would be safe and easy to use.

anonymous-matomo-user commented 11 years ago

Working on the regex for XFERLOG.

Here is my first cut; however, the DATE field is not recognized. Dates in XFERLOG are not like those in Apache logs, and I am not sure how to concatenate a date group from the other named groups.

I included some test strings. Yes, I used the public Google DNS addresses as IPs for privacy reasons.

I captured everything I could according to the XFERLOG documentation. Perhaps overkill, but it was the best way I knew to work through the expression. The manpage for XFERLOG is here (http://www.castaglia.org/proftpd/doc/xferlog.html).

I also provided the example script call and the output from the script.

It looks like the issue is the DATE group. No surprise, but again I am not sure how to construct it from the input.

Any thoughts are appreciated

--------------TEST STRINGS-------------------

    Mon Nov 1 04:18:56 2012 4 8.8.4.4 1628134 /pub/geobase/official/cded/250kdem/026/026a.zip b o a User@ ftp 0 *
    Thu Nov 10 04:18:56 2012 4 8.8.4.4 1628134 /pub/geobase/official/cded/250kdem/026/026a.zip b o a User@ ftp 0 * c
    Tue Jan 1 14:12:36 2013 1 8.8.4.4 88048 /pub/cantopo/250k_tif/MCR201001.tif b o a ftp@example.com ftp 0 * i
    Tue Jan 1 14:15:57 2013 4 8.8.4.4 8769852 /pub/geott/ess_pubs/211/211354/gscof_3759r_b_2000mn01.pdf b o a googlebot@google.com ftp 0 * c
    Tue Jan 1 16:06:49 2013 11 8.8.4.4 7198877 /pub/toporama/50k_geo_tif/095/d/toporama_095d02geo.zip b o a user@server.com ftp 0 * c
    Tue Jan 1 17:10:54 2013 1 8.8.4.4 168502 /pub/geott/eo_imagery/gcdb/W102/N49/N49d50mW102d12m2.tif b o a googlebot@google.com ftp 0 * c
    Tue Jan 1 17:10:54 2013 1 8.8.4.4 168502 /pub/geott/eo_imagery/gcdb/W102/N49/N49d50mW102d12m2.tif b o a googlebot@google.com ftp 0 * c
    Tue Jan 1 06:59:59 2013 1 8.8.4.4 1679 /pub/geott/eo_imagery/gcdb/W073/N60/N60d50mW073d40m1.summary b o a googlebot@google.com ftp 0 * c
    Tue Jan 1 07:02:53 2013 1 8.8.4.4 168087 /pub/geott/eo_imagery/gcdb/W108/N50/N50d58mW108d28m3.tif b o a googlebot@google.com ftp 0 * c
    Tue Jan 1 07:04:39 2013 1 8.8.4.4 16958 /pub/geott/cli_1m/e00pro/fcomfins.gif b o a googlebot@google.com ftp 0 * c

--------------REGEX Expression-----------------

    (?x)
    (?P<weekday>Mon|Tue|Wed|Thu|Fri|Sat|Sun)\s
    (?P<month>Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\s\s
    (?P<day>[\d]{1,})\s
    (?P<time>[\d+:]+)\s
    (?P<year>[\d]{4})\s
    (?P<unknown>[\d]+)\s
    (?P<ip>[\d]{1,3}.[\d]{1,3}.[\d]{1,3}.[\d]{1,3})\s
    (?P<length>[\d]{,})\s
    (?P<path>/[\w+/]+)/
    (?P<file>[\w\d-]+.\w+)\s
    (?P<type>[a|b])\s
    (?P<action>[C|U|T|_])\s
    (?P<direction>[o|i|d])\s
    (?P<mode>[a|g|r])\s
    (?P<user>[\w\d]+@|[\w\d]+@[\w\d.]+)\s
    (?P<service>[\w]+)\s
    (?P<auth>[0|1])\s
    (?P<userid>[*])\s
    (?P<status>[c|i])
    (?P<stuff>)

----------------Script Call----------------

    ./misc/log-analytics/import_logs.py --url=http://PIWIKSERVER --token-auth=AUTHSTRING --output=proclogs/procFtpPiwik.log --enable-reverse-dns --idsite=17 --skip=0 --dry-run --log-format-regex="(?x)(?P<weekday>Mon|Tue|Wed|Thu|Fri|Sat|Sun)\s(?P<month>Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\s\s(?P<day>[\d]{1,})\s(?P<time>[\d+:]+)\s(?P<year>[\d]{4})\s(?P<unknown>[\d]+)\s(?P<ip>[\d]{1,3}.[\d]{1,3}.[\d]{1,3}.[\d]{1,3})\s(?P<length>[\d]{,})\s(?P<path>/[\w+/]+)/(?P<file>[\w\d-]+.\w+)\s(?P<type>[a|b])\s(?P<action>[C|U|T|])\s(?P<direction>[o|i|d])\s(?P<mode>[a|g|r])\s(?P<user>[\w\d]+@|[\w\d]+@[\w\d.]+)\s(?P<service>[\w]+)\s(?P<auth>[0|1])\s(?P<userid>[*])\s(?P<status>[c|i])(?P<stuff>)"

----------------Script output----------------

    0 lines parsed, 0 lines recorded, 0 records/sec (avg), 0 records/sec (current)
    Parsing log logs/ftpLogsJunco/xferlog2...
    Traceback (most recent call last):
      File "./misc/log-analytics/import_logs.py", line 1411, in <module>
        main()
      File "./misc/log-analytics/import_logs.py", line 1375, in main
        parser.parse(filename)
      File "./misc/log-analytics/import_logs.py", line 1299, in parse
        date_string = match.group('date')
    IndexError: no such group
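The IndexError happens because the importer looks for a group literally named date, while the regex above defines separate weekday/month/day/time/year groups. One way around it, sketched here under the assumption that the whole timestamp can be captured as a single group and parsed afterwards (this is a standalone illustration, not a patch to import_logs.py):

```python
import re
from datetime import datetime

# Capture the entire xferlog timestamp as one 'date' group instead of five
# separate groups; '%a %b %d %H:%M:%S %Y' parses e.g. 'Tue Jan 1 14:12:36 2013'.
XFER_DATE = re.compile(r'(?P<date>\w{3} \w{3}\s+\d{1,2} [\d:]{8} \d{4})')

def parse_date(line):
    """Return the timestamp of an xferlog line, or None if it doesn't match."""
    match = XFER_DATE.match(line)
    if match is None:
        return None
    return datetime.strptime(match.group('date'), '%a %b %d %H:%M:%S %Y')
```

Whether import_logs.py at the time also required its own date format is a separate question; at minimum, the capture group must be called date.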

anonymous-matomo-user commented 11 years ago

@ottodude125 and @elm: I have the same issue and reported it as a separate ticket: #3757

anonymous-matomo-user commented 11 years ago

How can I exclude the visits of more than 150 users on a site?

anonymous-matomo-user commented 11 years ago

Replying to Cyril:

Those having errors with IIS: please upload a log file with lines causing the error. A single line is probably causing it, so it'd be better to upload that single line(s) rather than a big file. The skip value will help you find that line.

My web logs have additional fields logged. Some of them resolve/transfer over when using AWStats; others are excluded in AWStats with %other% values. I tried to exclude the additional field data by adding new but unused lines to the IIS format section of the importer, but could not get past the error "'IisFormat' object has no attribute 'regex'". Forum/web searches show this is a common problem, but I haven't found a fix. Any suggestions? Sample log file inline.

#Software: Microsoft Internet Information Services 7.5
#Version: 1.0
#Date: 2013-02-23 00:00:01
#Fields: date time s-ip cs-method cs-uri-stem cs-uri-query s-port cs-username c-ip cs(User-Agent) sc-status sc-substatus sc-win32-status time-taken
2013-02-23 00:00:01 192.168.1.202 GET /pages/AllItems.aspx - 443 DOMAIN\username 2.3.4.5 Mozilla/4.0+(compatible;+MSIE+7.0;+Windows+NT+6.1;+WOW64;+Trident/4.0;+chromeframe/24.0.1312.57;+SLCC2;+.NET+CLR+2.0.50727;+.NET+CLR+3.5.30729;+.NET+CLR+3.0.30729;+Media+Center+PC+6.0;+.NET4.0C;+.NET4.0E;+InfoPath.3) 200 0 0 499
2013-02-23 00:00:01 192.168.1.202 GET /pages/logo.jpg - 443 DOMAIN\username 2.3.4.5 Mozilla/4.0+(compatible;+MSIE+7.0;+Windows+NT+6.1;+WOW64;+Trident/4.0;+chromeframe/24.0.1312.57;+SLCC2;+.NET+CLR+2.0.50727;+.NET+CLR+3.5.30729;+.NET+CLR+3.0.30729;+Media+Center+PC+6.0;+.NET4.0C;+.NET4.0E;+InfoPath.3) 304 0 0 312

mattab commented 11 years ago

Piwik Log Analytics is now used by hundreds of users and seems to be working well! We are always interested in new feature requests and suggestions. You can post them here, and if you are a developer, please consider opening a pull request.

anonymous-matomo-user commented 11 years ago

Hi,

The log analytics script does not accept any time argument. Is it therefore assumed that the log files to be processed have already been filtered to a timestamp range, in order to avoid duplicate processing?

Thanks.

anonymous-matomo-user commented 11 years ago

Hi

I've been trying to import some logs from a tomcat/valve access log.

According to this documentation (http://tomcat.apache.org/tomcat-5.5-doc/config/valve.html), my app's server.xml defines

<Valve className="org.apache.catalina.valves.AccessLogValve" directory="/sillage/logs/performances" pattern="%h %l %u %t %r %s %b %D Referer=[%{Referer}i]" prefix="access." resolveHosts="false" suffix=".log"/>

Here are a couple of lines from one of my access-datetime.log files:

10.10.40.85 - - [08/Apr/2013:11:02:49 +0200] POST /...t.do HTTP/1.1 200 39060 629 Referer=[http://.....jsp]
10.10.40.60 - - [08/Apr/2013:11:02:49 +0200] GET /...e&typ_appel=json HTTP/1.1 200 2895 2 Referer=[-]
10.10.40.85 - - [08/Apr/2013:11:02:48 +0200] POST /...r.jsp?cmd=tracer HTTP/1.1 200 90 63 Referer=[http://....jsp]

In short, trying to get the proper --log-format-regex has been a nightmarish failure. Improving the documentation on this complex but sometimes unavoidable option is necessary. A simple table mapping the usual format specifiers, such as

%h => (?P<host>[\\\\w\\\\-\\\\.\\\\/]*)(?::\\\\d+)?

(guessed from reading the README examples...) would help. Maybe...

oliverhumpage commented 11 years ago

Replying to lyrrr:

In short, trying to get the proper --log-format-regex has been a nightmarish failure. Improving the documentation on this complex but sometimes unavoidable option is necessary. A simple table mapping the usual format specifiers, such as

%h => (?P<host>[\\\\w\\\\-\\\\.\\\\/]*)(?::\\\\d+)?

(guessed from reading the README examples...) would help. Maybe...

If you're using --log-format-regex on the command line then I don't think the escaping is necessary. It's only if you're piping directly to piwik via (in my case) apache's ability to send logs to programmes that you need to work out how to do the multiple-escape thing.

anonymous-matomo-user commented 11 years ago

I'll try tomorrow, but I'm skeptical: I copied the \ stuff from the README.md example.

oliverhumpage commented 11 years ago

I've just double-checked the README.md, and the only time I can see that weird escaping is in the bit I wrote called "Apache configuration source code". It's meant to be apache config, not CLI - apologies if that's not clear.

You may need to put a bit of escaping in depending on your shell, but nowhere near the amount that apache requires (since you've got to escape the initial parsing of the config file, then the shell escaping as it runs the command, and still be left with backslashes).

I think if you single quote it's mostly OK, i.e. with tcsh or bash

--log-format-regex='(?P<host>[\w...])'

would pass the regex in unscathed, or with my copy of ancient sh you just need one extra backslash, i.e.

--log-format-regex='(?P<host>[\\w...])'

etc.

HTH

mattab commented 11 years ago

Maybe we are missing a few examples in the docs for how to call the script. Would you mind sharing your examples, if you're reading this?

We will add such help text to the README.

anonymous-matomo-user commented 11 years ago

Okay, finally this worked:

python misc/log-analytics/import_logs.py --url=http://localhost/piwik log_analysis/access.2013-04-02.log --idsite=1 --log-format-regex='(?P<ip>\S+) (?P<host>\S+) (?P<user_agent>\S+) \[(?P<date>.*?) (?P<timezone>.*?)\] (?P<query_string>\S*) (?P<path>\S+) HTTP\/1\.1 (?P<status>\S+) (?P<length>\S+) (?P<time>.*?) (?P<referrer>.*?)'

This would be an interesting example for your doc I guess

I now have to play with Piwik to ponder the tool's relevance to my use case (analyzing clients' calls to a server managing schedules, client information, etc., to get a big picture of network/database/CPU usage). I guess I'm not being very clear, and I'm twisting Piwik away from its intended "web analysis" usage. Any suggestion on this topic is welcome.

One last technical thing for this post: my time field is in milliseconds, not seconds. How do I specify that?

Thanks for the help!

anonymous-matomo-user commented 11 years ago

I have set this up on a Varnish server that logs through varnishncsa. However, the requests that Varnish logs include the host name as part of the "request".

123.456.78.9 - - [23/Apr/2013:07:05:51 -0400] "GET http://asite.org/thing/471 HTTP/1.1" 200 13970 "http://www.google.com/" "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)"

When I imported this with import_logs.py, Piwik registered hits at http://asite.org/http://asite.org/thing/471, so I worked around it by using the --log-format-regex parameter.

--log-format-regex='(?P<ip>\S+) \S+ \S+ \[(?P<date>.*?) (?P<timezone>.*?)\] "\S+ https?://asite\.org(?P<path>.*?) \S+" (?P<status>\S+) (?P<length>\S+) "(?P<referrer>.*?)" "(?P<user_agent>.*?)"'

It would be great if this (varnishncsa tracking through import_logs.py) were more directly supported and documented. I suspect my method isn't ideal when more than one site is cached by Varnish and visitors to those sites are logged by Piwik; it probably only works with one domain.

oliverhumpage commented 11 years ago

Hi bangpound,

I'm not a piwik dev so I can't comment on including a varnishncsa in the import_logs.py itself, but if you change your regex slightly to replace

    https?://asite\.org

with

    (?P<host>https?://[^/]+)

then that will pick up the hostname of the site and therefore work well with multiple vhosts (either define them in piwik in advance, or use --add-sites-new-hosts to add them automatically).

Hope that helps.
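For what it's worth, the suggested substitution applied to bangpound's regex can be checked against the sample line from the earlier comment; this is a quick standalone verification sketch, not importer code:

```python
import re

# bangpound's --log-format-regex with the hard-coded https?://asite\.org
# replaced by a (?P<host>...) group, per the suggestion above.
LOG_REGEX = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<date>.*?) (?P<timezone>.*?)\] '
    r'"\S+ (?P<host>https?://[^/]+)(?P<path>.*?) \S+" '
    r'(?P<status>\S+) (?P<length>\S+) "(?P<referrer>.*?)" "(?P<user_agent>.*?)"'
)

# The varnishncsa sample line from the earlier comment.
line = ('123.456.78.9 - - [23/Apr/2013:07:05:51 -0400] '
        '"GET http://asite.org/thing/471 HTTP/1.1" 200 13970 '
        '"http://www.google.com/" '
        '"Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)"')

match = LOG_REGEX.match(line)
```

The host group comes out as http://asite.org and the path as /thing/471, so each vhost can be resolved separately (or created via --add-sites-new-hosts).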

anonymous-matomo-user commented 11 years ago

Similar to cdgraff's request. Feature request: support WMS (Windows Media Services) logs. Currently we use AWStats, but it would be great to be able to move to Piwik.

I have attached a sample of WMS version 9.0 log file: WMS_20130523.log

anonymous-matomo-user commented 11 years ago

I've noticed that the local time for imported logs is not set correctly. Is this expected, or am I doing something wrong?

It seems as if Piwik uses the timezone of the web server that created the logs to set the local visitor time. I don't know if this is part of the importer or of Piwik itself, but I would like the local visitor time to be derived from the timezone the visitor is actually in, based on their GeoIP location. It should be possible either by approximating from longitude and latitude or by using a database like GeoNames.

anonymous-matomo-user commented 11 years ago

Hey Folks,

Thought I would let this thread know that I have been working on a batch-loading script for those of us who need some extra features, such as remembering how many lines of a log have already been processed. The main use case is people running cron jobs against log files rotated monthly who want to run the stats daily, or at least more frequently than monthly.

You can check out the branch development of batch-loader.py for piwik here:

https://github.com/drsampson/piwik/tree/batch-loader/misc/log-analytics/batch-loader

I would love some testers and feedback. Read the readme here for an overview: https://github.com/drsampson/piwik/blob/batch-loader/misc/log-analytics/batch-loader/readme.md

Developer notes: this work is a branch of a forked version of Piwik. My goal is to someday make a pull request to integrate it into Piwik, so Piwik developers are encouraged to comment so I can prepare.

cbay commented 11 years ago

dsampson: I've had a very quick look at your script. The core feature, which is keeping track of already imported log lines, should be done in Piwik itself, as detailed by Matt on this ticket. Using a local SQLite database is an inferior solution.

Your Python code could be better. A few suggestions:

anonymous-matomo-user commented 11 years ago

Thanks for the feedback.

As for developing in Piwik: Python is the extent of this geographer's hacking skills. Since this wasn't being done within Piwik, I created a homebrew solution, then convinced myself to offer it back to the community for those who could use it.

Perhaps it will inspire someone to do it the right way within piwik, which would be awesome. Right now it keeps me out of the piwik internals, which is probably best for everyone (smile).

cbay commented 11 years ago

String formatting was a general tip to avoid multiple concatenations. Indeed, it should NOT be used for SQL requests with unfiltered input.

As for having a proper solution to your problem, you might try harassing Matt so that he implements it into Piwik :) Just kidding, but I would LOVE to have it!

mattab commented 11 years ago

Thanks for submitting this tool; it enhances the log analytics use cases.

As for the particular "log line skip" feature: why in core? Because if several servers call Piwik, you run into trouble with the SQLite database. Better to reuse the Piwik datastore to keep track of dupes :)

Here is my updated proposal implementation.

anonymous-matomo-user commented 11 years ago

Matt,

I agree with you that getting it into core would be best. Having this solution would mean I could dissolve my forked project. Again, if I were a PHP and MySQL developer I would love to help; as a geographer, scripting is done on the side to handle special use cases.

For clarification of the use case for this script: it is launched independently of Piwik. By that I mean the script will likely reside on a log server somewhere, not the Piwik server, and is likely called through a cron job. Since only a single instance of the script runs on any server, you won't run into collisions from multiple servers using it. If you need multiple instances, each has its own independent SQLite DB. That is why I used SQLite: only one client accesses the database at any one time.
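A minimal sketch of that bookmark idea (illustrative table and function names, not batch-loader's actual schema): remember how many lines of each log file were already imported, so a later cron run can resume where it left off.

```python
import sqlite3

# In-memory DB for the sketch; batch-loader would use a file on disk.
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE progress (logfile TEXT PRIMARY KEY, lines_done INTEGER)')

def lines_already_done(logfile):
    """How many lines of this log file were imported in previous runs."""
    row = conn.execute('SELECT lines_done FROM progress WHERE logfile = ?',
                       (logfile,)).fetchone()
    return row[0] if row else 0

def record_progress(logfile, lines_done):
    """Store the new high-water mark after a successful import."""
    conn.execute('INSERT OR REPLACE INTO progress VALUES (?, ?)',
                 (logfile, lines_done))
    conn.commit()

record_progress('/var/log/apache2/access.log', 1500)
print(lines_already_done('/var/log/apache2/access.log'))  # → 1500
print(lines_already_done('/var/log/apache2/other.log'))   # → 0
```

Since a single process owns the database, SQLite's single-writer model is not a limitation here; the multi-server case is exactly what Matt's "do it in Piwik core" proposal addresses.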

Let me know when these features are added to core and I will dissolve my fork.

Good luck.

mattab commented 11 years ago

Updated description, adding:

anonymous-matomo-user commented 11 years ago

Request for support of "x-forwarded-for" in cases where load balancing is placed in front of web server when importing log.

The Apache log format is as follows:

```
LogFormat "%v %{X-Forwarded-For}i %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\"" cplus
```

A sample log line:

```
smartstore.oomph.co.id 10.159.117.216, 202.70.56.129 - - +0700 "GET /index.php/nav/get_menu/1/ HTTP/1.1" 200 2391 "-" "Apache-HttpClient/UNAVAILABLE (java 1.4)"
```

Note that there are two IPs for the remote host (the X-Forwarded-For field): the first is the virtual/local IP, and the second is the proxy used on a mobile network.

The regular expression used when importing the log is:

```
--log-format-regex='(?P<host>[\w\-\.]*)(?::\d+)? (?P<ip>\S+) \S+ \S+ \[(?P<date>.*?) (?P<timezone>.*?)\] "\S+ (?P<path>.*?) \S+" (?P<status>\S+) (?P<length>\S+) "(?P<referrer>.*?)" "(?P<useragent>.*?)"'
```

This works for regular log lines that contain only one IP address...

The current workaround is to modify import_logs.py to accept an additional proxy field, and run the import again with a new regex:

```
--log-format-regex='(?P<host>[\w\-\.]*)(?::\d+)? (?P<proxy>\S+), (?P<ip>\S+) \S+ \S+ \[(?P<date>.*?) (?P<timezone>.*?)\] "\S+ (?P<path>.*?) \S+" (?P<status>\S+) (?P<length>\S+) "(?P<referrer>.*?)" "(?P<useragent>.*?)"'
```

It would be nice if X-Forwarded-For were supported directly instead.
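For reference, picking one address out of such a comma-separated X-Forwarded-For value is straightforward in Python (an illustrative helper, not part of import_logs.py; which entry to trust depends on your proxy chain):

```python
def client_ip(forwarded_for):
    """Return the first entry of a comma-separated X-Forwarded-For value.
    In the commenter's logs the first entry is the client/local IP and
    the second is the mobile-network proxy."""
    return forwarded_for.split(',')[0].strip()

print(client_ip('10.159.117.216, 202.70.56.129'))  # → 10.159.117.216
```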

cbay commented 11 years ago

If you're using a reverse proxy, you really should use something like mod_rpaf so that the recorded IP address for Apache is the correct one (the client, not the proxy). And then you can use the standard log formats.

anonymous-matomo-user commented 11 years ago

Correct me if I am wrong... I'm pretty new to Piwik; I used AWStats previously.

That would only work for logs created from now on. We are talking about importing existing logs, not new ones; it makes little sense to me to ask users to use mod_rpaf when their aim is to import older logs that were created without it.

The aim of the import is to import older logs; current tracking can already be done by Piwik itself.

Replying to Cyril:

If you're using a reverse proxy, you really should use something like mod_rpaf so that the recorded IP address for Apache is the correct one (the client, not the proxy). And then you can use the standard log formats.

cbay commented 11 years ago

I don't get why that won't work with a custom regexp?

oliverhumpage commented 11 years ago

Assuming you want the last IP in the list (and also that you trust the last IP in the list; this is why mod_rpaf is the best idea, since it prevents clients spoofing IPs):

```
--log-format-regex='(?P<host>[\w\-\.]*)(?::\d+)? (?:\S+?, )*(?P<ip>\S+) ...'
```

If you want to capture proxy information, I don't think piwik supports that, so you'd need to set up a separate site with an import regex that captures the first IP in the list instead.
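To illustrate the "last IP wins" behaviour of such a pattern (a standalone check, assuming a `(?:\S+?, )*` prefix before the `ip` group as in the regex above):

```python
import re

# The non-capturing prefix consumes any earlier "addr, " entries, so the
# named <ip> group is left matching the last address in the list.
pattern = re.compile(r'(?:\S+?, )*(?P<ip>\S+)')

m = pattern.match('10.159.117.216, 202.70.56.129')
print(m.group('ip'))  # → 202.70.56.129
```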

anonymous-matomo-user commented 11 years ago

I think the main point here is to import existing logs. For new logs it can be handled easily, as it is all done in JavaScript.

As for "I don't get why that won't work with a custom regexp?": any idea what the regexp could be? Sorry, I am no regex expert, which is why I ended up having to process the log twice and modify the Python script.

anonymous-matomo-user commented 10 years ago

Hi, I'm testing the import and ran the Python script twice on the same log file. It looks like the log file was processed twice.

Does that mean I have to handle the log file history on my own? In other words, can you confirm that the Piwik log processor does not remember the start and end dates of the log files?

Thanks, Axel

mattab commented 10 years ago

In other words, can you confirm that the Piwik log processor does not remember the start and end dates of the log files?

Correct. We would like to add this feature at some point. If you can sponsor it, get in touch!