matomo-org / matomo

Empowering People Ethically with the leading open source alternative to Google Analytics that gives you full control over your data. Matomo lets you easily collect data from websites & apps and visualise this data and extract insights. Privacy is built-in. Liberating Web Analytics. Star us on Github? +1. And we love Pull Requests!
https://matomo.org/
GNU General Public License v3.0

Piwik, an alternative to AWStats and Urchin: build a server log import script #703

Closed anonymous-matomo-user closed 12 years ago

anonymous-matomo-user commented 15 years ago

Urchin Alternative: Import your server logs in Piwik, the Free web analytics platform!

See blog post Piwik alternative to Urchin for more information.

Piwik is an alternative to Urchin, and also to Webalizer and AWStats: with a Python script, you can now import web server logs (Apache, IIS, and more) into Piwik, instead of using the JavaScript tracking.

Description

A Python script available in piwik/misc/log-analytics/ will parse server logs efficiently and automatically call the Piwik Tracking API to inject the visits/pageviews/downloads into Piwik.
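As background, the script effectively replays each log line as a Piwik Tracking API request against piwik.php. A minimal sketch of building one such request (the base URL, site ID, and token below are placeholder assumptions, not values from this ticket):

```python
# Minimal sketch of what the importer does per parsed log line: build a
# Piwik Tracking API request for piwik.php. All concrete values here are
# placeholders, not taken from the ticket.
from urllib.parse import urlencode

def build_tracking_request(piwik_base, idsite, url, user_agent, ip, token_auth):
    """Return the full piwik.php URL for one hit."""
    params = {
        'idsite': idsite,          # target site ID in Piwik
        'rec': 1,                  # ask Piwik to record the hit
        'url': url,                # page URL from the log line
        'ua': user_agent,          # user agent from the log line
        'cip': ip,                 # override visitor IP (requires token_auth)
        'token_auth': token_auth,  # API token of an admin user
    }
    return piwik_base + '/piwik.php?' + urlencode(params)

request = build_tracking_request(
    'http://piwik.example.com', 1,
    'http://example.com/page.html', 'Mozilla/5.0', '203.0.113.7',
    'placeholder-token')
```

One such HTTP request is made per recorded log line, which is why the thread below spends so much time on request throughput.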

How to install / how to use

SEE FOLLOW UP TICKET #3163

How you can help

Tasks to do before final release

Feature requests for V2 or later

SEE FOLLOW UP TICKET #3163

mattab commented 12 years ago

First release of the script committed in [6046]

mattab commented 12 years ago

(In [6051]) Refs #703

robocoder commented 12 years ago

(In [6053]) refs #703 - propset eol-style

oliverhumpage commented 12 years ago

Performance-wise: I've set up piwik in its own jail now, turned off unnecessary PHP extensions, tweaked apache, and enabled APC. If I use --recorders=48 I get good import speeds (at least at first) without the load average going too high. However, something odd happens, and some way through importing a log file the recorders drop off (I can see fewer and fewer apache processes too, so clearly it's just not being hit as much):

2846 lines parsed, 233 lines recorded, 233 records/sec
4372 lines parsed, 506 lines recorded, 273 records/sec
[...]
8300 lines parsed, 7570 lines recorded, 9 records/sec
8300 lines parsed, 7579 lines recorded, 9 records/sec
8300 lines parsed, 7588 lines recorded, 9 records/sec
8300 lines parsed, 7598 lines recorded, 10 records/sec

I don't think I have any weird throttling going on - any ideas what might be up? There's nothing else being output during the processing even with debugging on. The drop-off seems to start roughly half way through any given logfile.

cbay commented 12 years ago

oliverhumpage: 48 is almost certainly too high, unless you have a 48-core machine. You shouldn't exceed the number of cores in your system; even a bit lower is better, as the import script and MySQL run at the same time.

As for why your performance decreases over time, I don't know. What does 'top' say? You'd have to find the bottleneck; it may be Apache, PHP, or MySQL. On my system, I sustain 300 req/s for more than 3 hours.

Regarding the excluded static files, we'll add an option to include them (disabled by default). I'm sure the whole importing process will get better over time; it's only the beginning :)

mattab commented 12 years ago

(In [6070]) Refs #703 Removing images from "downloads", and improving TIP message in output debug

mattab commented 12 years ago

(In [6071]) Refs #703 Improving help message as per Cyril feedback

mattab commented 12 years ago

(In [6074]) Refs #703 Display response output when tracking request failed (this happens for example when debug is enabled in piwik.php)

oliverhumpage commented 12 years ago

Replying to Cyril:

oliverhumpage: 48 is almost certainly too high, unless you have a 48-core machine. You shouldn't exceed the number of cores in your system; even a bit lower is better, as the import script and MySQL run at the same time.

I did quite a few experiments, and eventually found that 40 is about right. This is a VM running on a high powered Dell R710, so although the OS only thinks it has 4 CPUs I don't know how things actually pan out. All I know is that the number of records/sec increases pretty much linearly with --recorders up until 40. E.g. if I run at 32, I get more like 200r/sec rather than 250+r/sec. A single recorder manages around 6-7r/sec. After 40 the benefits tail off.

I also tried a few experiments to see where the bottleneck might lie, for instance I stuck in a mod_rewrite to send the importer to a basic PHP file that just returned the .gif without doing any processing, but weirdly the performance was about the same. However, running with --dry-run (or just removing the line which actually calls the script) means the python script runs at around 4000r/sec, so I can only conclude the limit is in apache/php (putting in APC definitely helped). I also tried hacking the script to run a PHP wrapper script that called piwik.php directly on the command line, but it went horribly slowly, presumably because of the lag in loading up PHP.

Anyway, I'm happy with 250-300r/sec. I may set up a separate VM with a tweaked kernel and optimised apache to deal with log imports anyway, so I'm sure I can improve on that figure.

Regarding the steady tailing-off, what I'm wondering is: when you specify lots of recorders, do they each grab an equal number of log lines at the start then work through them? That would explain why some finish earlier than others (if e.g. one gets a lot with non-loggable lines it'd finish sooner). I notice the number of apache processes starts tailing off around half to 2/3 of the way through the log, and then just steadily decline until only 1 recorder is left.

Regarding the excluded static files, we'll add an option to include them (disabled by default). I'm sure the whole importing process will get better over time; it's only the beginning :)

That'd be brilliant, thank you. Thanks to you all for being so responsive in general too.

Oliver.

mattab commented 12 years ago

FYI the new 1.7.2-rc4 was released which includes the most up to date code: Download from: http://builds.piwik.org/?C=N;O=D

mattab commented 12 years ago

oliverhumpage, thanks for your comments, they're very interesting! Since you seem keen, maybe you could consider running XHProf, the Facebook PHP profiler: http://pecl.php.net/package/xhprof

I haven't run it for a long time, and never under high load such as 300 req/s, so it would be very interesting. If you install it, I would love to see the generated reports! The last time we ran XHProf on Piwik we found 2-3 quick fixes that made things a lot better. I'm sure we can make the tracker faster in many ways.

It would also be good to know the relative CPU consumption of Apache/PHP vs. MySQL (though I'm not sure of the best way to measure this).

cbay commented 12 years ago

oliverhumpage: regarding the recorders, each request is dispatched to a specific recorder based on its IP address. This means that if the IP address distribution in your log files isn't even, some recorders will have more work to do than others, which could explain the performance issues you're seeing, especially near the end of the import process.

This dispatching was required to make sure requests are imported in the correct order.
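The dispatching Cyril describes can be sketched as a simple model (an illustration of the idea only, not the actual import_logs.py code; NUM_RECORDERS and the sample hits are made up):

```python
# Simplified model of per-IP dispatching: each hit is routed to one
# fixed recorder by hashing its IP, so hits from a given visitor stay
# in order; a skewed IP distribution leaves some queues far longer than
# others, and those recorders keep working after the rest go idle.
NUM_RECORDERS = 4
queues = [[] for _ in range(NUM_RECORDERS)]

def dispatch(ip, hit):
    index = hash(ip) % NUM_RECORDERS   # same IP -> same recorder, always
    queues[index].append(hit)

for ip, hit in [('1.1.1.1', 'hit A'), ('2.2.2.2', 'hit B'),
                ('1.1.1.1', 'hit C')]:
    dispatch(ip, hit)
# 'hit A' and 'hit C' end up on the same queue, in arrival order
```

This matches Oliver's observation: queues drain at different rates, so the number of busy recorders (and Apache processes) tails off towards the end of a log file.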

oliverhumpage commented 12 years ago

Actually, I do have one small request for piwik itself.

Would it be possible to choose between multiple database configurations on the fly? You see, I'm using one physical install of Piwik at two different URLs: one for JS-based sites and one for log-based sites, and therefore also two different sets of DB tables, so that --add-sites-new-hosts on the log-based system doesn't interfere with the JS websites (they'd have the same URLs). What I've done at the moment is set an environment variable in Apache and patch core/Config.php to set $config->database to either $config->database_weblog or $config->database_js depending on that variable.

However, being able to define a constant like DATABASE_CONFIG_SECTION_NAME in bootstrap.php, which Config.php would then use to work out which section of the config file to read, would be much easier and more robust. I could of course just run two separate installs of Piwik, but then I'd have to update both with each release. It's probably not worth enlarging the codebase just for my unusual setup, but I thought I'd ask; I can easily submit a patch if you're interested.

cbay commented 12 years ago

(In [6092]) Refs #703 import-logs.py renamed to import_logs.py and added a mini test suite which tests the format autodetection.

cbay commented 12 years ago

(In [6093]) Refs #703 Many improvements:

cbay commented 12 years ago

(In [6094]) Refs #703 Added option --output to redirect output to a file.

mattab commented 12 years ago

(In [6100]) Refs #703

mattab commented 12 years ago

(In [6102]) Refs #703 Add license notice, Shuffle help messages order, remove short notation for clarity, improve help messages, adding Java/ + bot- + bot/ + robot as a bot

mattab commented 12 years ago

(In [6108]) Refs #703 I'm learning Python (NOT!)

cbay commented 12 years ago

(In [6128]) Refs #703 Now works with Python 2.5.

cbay commented 12 years ago

(In [6129]) Refs #703 Show the summary when CTRL+C is pressed.

cbay commented 12 years ago

(In [6130]) Refs #703 Fixed bug with --log-format-regex (thanks oliverhumpage).

cbay commented 12 years ago

(In [6131]) Refs #703 Disable buffering when using --output.

cbay commented 12 years ago

(In [6132]) Refs #703 Added --query-string-delimiter

cbay commented 12 years ago

(In [6133]) Refs #703 Added --enable-http-errors and --enable-http-redirects

cbay commented 12 years ago

(In [6134]) Refs #703 Pretty print archives dates.

cbay commented 12 years ago

oliverhumpage: thanks for the bug report and the suggestions; I should have committed everything you asked for :)

Regarding the persistent connections, I haven't patched anything. It's a built-in feature of PHP/mysqli, see:

http://www.php.net/manual/en/mysqli.construct.php

"Prepending host by p: opens a persistent connection."

mattab commented 12 years ago

(In [6135]) Refs #703

mattab commented 12 years ago

(In [6137]) Refs #703 README update + fixing --enable-reverse-dns now works + adding common bot names

cbay commented 12 years ago

(In [6140]) Refs #703 Catch URL exceptions during configuration

mattab commented 12 years ago

(In [6155]) Adding advanced use case in the README. Thanks Oliver for your help and submission!! Refs #703

oliverhumpage commented 12 years ago

Cyril:

I've tested using - instead of /dev/stdin; it seems to work fine.

Re the regex, I think that's explained in the comments: because I want it to pick up hostnames that are subsites and so contain slashes (e.g. I want the hostname 'domain.com/subsite' to be picked up and created under that name in Piwik), I needed to amend the normal vhost regex to allow "/" in the host character class. It's also a very good example of which shell special characters to escape, and how, in Apache log pipes :)

(I spent a fun morning with a test apache installation and a perl script testing each special character in turn until I got it working... then a fun afternoon wondering why it wasn't working with import_logs.py, until I realised there wasn't a .compile for the custom regex!)

I did originally put things like "domain.com.subsite" in the hostname so the standard regex would work, but it looks ugly and non-user-friendly in piwik.
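The approach Oliver describes can be sketched like this (the regex and log line are illustrative, not his exact ones; note that, as his anecdote shows, a custom pattern must be compiled before use):

```python
import re

# Illustrative common-log-style regex whose host character class also
# allows '/', so a vhost logged as "domain.com/subsite" is captured
# whole instead of stopping at the slash. Both the pattern and the
# sample line are assumptions for demonstration.
pattern = re.compile(
    r'(?P<host>[\w\-./]+) (?P<ip>\S+) \S+ \S+ \[(?P<date>[^\]]+)\] '
    r'"\S+ (?P<path>\S+) \S+" (?P<status>\d+) (?P<length>\S+)'
)
line = ('domain.com/subsite 203.0.113.7 - - '
        '[19/Apr/2012:10:00:00 +0200] "GET /page.html HTTP/1.1" 200 1234')
match = pattern.match(line)
```

With this, the whole "domain.com/subsite" token lands in the host group, so Piwik can create the site under that name.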

mattab commented 12 years ago

(In [6157]) refs #703 Updating README as per feedback. Please comment if the code does not work; I haven't tested it myself

mattab commented 12 years ago

Updated the ticket with suggestions to improve script performance (i.e. we should bulk-send 50 requests at once in a POST, to make 50 times fewer HTTP requests)!
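The batching idea can be sketched as follows. At the time of this comment it is only a proposal, so treat the JSON body shape (a "requests" array of query strings plus "token_auth", which is what Piwik's bulk tracking later used) as an assumption:

```python
import json

# Sketch of batching: instead of one HTTP request per hit, group hits
# into chunks of 50 and send each chunk as a single POST body. The JSON
# layout is illustrative of the proposal, not shipped code.
BATCH_SIZE = 50

def make_batches(query_strings, token_auth):
    """Yield one JSON POST body per chunk of up to BATCH_SIZE hits."""
    for start in range(0, len(query_strings), BATCH_SIZE):
        chunk = query_strings[start:start + BATCH_SIZE]
        yield json.dumps({'requests': chunk, 'token_auth': token_auth})

hits = ['?idsite=1&rec=1&url=http://example.com/%d' % i for i in range(120)]
bodies = list(make_batches(hits, 'placeholder-token'))  # 120 hits -> 3 POSTs
```

Cutting 120 HTTP round-trips down to 3 is where the hoped-for speedup comes from.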

oliverhumpage commented 12 years ago

Just regarding the persistent database connections: "p:localhost" only works for mysqli on PHP 5.3 and later. It didn't work for me since we're still on 5.2 (going to upgrade soon...).

cbay commented 12 years ago

Matt: that should do it I guess. I'll try to make the changes ASAP.

Sending bulk requests would be great, I'm sure that would improve the performance a lot!

anonymous-matomo-user commented 12 years ago

The script doesn't parse IIS6 or IIS7 log files (I haven't tried IIS8). I tried the following regex, which matches the log lines in kiki, but no luck with the script. Any pointers?

Also, a minor change to line 1068:

' the --format option'

needs updating to

' either the --log-format-name or --log-format-regex option'

_IIS6_FORMAT = (
    '(?P<date>^\d+[-\d+]+ [\d+:]+) '
    '\S+ \S+ [\d*.]+ \S+ '
    '(?P<path>\S+) '
    '\S+ \d+ \S+ '
    '(?P<ip>[\d*.]*) '
    '\S+ '
    '(?P<user_agent>\S+) '
    '\S+ '
    '(?P<referrer>\S+) '
    '\S+ '
    '(?P<status>\d+) '
    '\S+ \S+ '
    '(?P<length>\S+)'
)
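For what it's worth, the regex above does compile and matches a line shaped like the usual IIS6 W3C layout; a quick check (the sample line below is hypothetical, constructed to fit the pattern, not taken from the poster's logs):

```python
import re

# Sanity check of the IIS6 regex posted above against a hypothetical
# W3C-style line shaped to fit it (not a real log line).
pattern = re.compile(
    r'(?P<date>^\d+[-\d+]+ [\d+:]+) '
    r'\S+ \S+ [\d*.]+ \S+ '
    r'(?P<path>\S+) '
    r'\S+ \d+ \S+ '
    r'(?P<ip>[\d*.]*) '
    r'\S+ '
    r'(?P<user_agent>\S+) '
    r'\S+ '
    r'(?P<referrer>\S+) '
    r'\S+ '
    r'(?P<status>\d+) '
    r'\S+ \S+ '
    r'(?P<length>\S+)'
)
line = ('2012-04-19 00:00:01 W3SVC1 WEB1 192.168.1.10 GET /index.html '
        '- 80 - 10.0.0.1 HTTP/1.1 Mozilla/5.0 - http://example.com/ '
        'example.com 200 0 0 1234')
match = pattern.match(line)
```

So the regex itself looks sound for this field order; the trouble is presumably that IIS field order is configurable per site, so a fixed pattern won't fit every log (see the sample logs linked below).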

anonymous-matomo-user commented 12 years ago

+1

The only way I got it to work with IIS7 (just to test out the script) was to convert the logs to NCSA extended format.

mattab commented 12 years ago

Can you please post an example log format that does not work?

cbay commented 12 years ago

(In [6165]) Refs #703 Fixed bug: stats.piwik_sites should not have None items.

cbay commented 12 years ago

(In [6166]) Refs #703 Only show tips in summary if necessary.

cbay commented 12 years ago

(In [6167]) Refs #703 Added --exclude-path and --exclude-path-from.

cbay commented 12 years ago

(In [6168]) Refs #703 Replaced tabs with spaces.

anonymous-matomo-user commented 12 years ago

Replying to matt:

Can you please post example log format that does not work ?

About 20 lines from two logs.

http://mike.org.uk/iis6_iis7_log.txt

cbay commented 12 years ago

(In [6169]) Refs #703 Added custom variable Not-Bot.

cbay commented 12 years ago

(In [6170]) Refs #703 Updated error string.

anonymous-matomo-user commented 12 years ago

The trunk version also works well without --idsite-fallback.

mattab commented 12 years ago

Cyril, thanks for the recent fixes, very nice!!

anonymous-matomo-user commented 12 years ago

For some years I was looking for an alternative to AWStats, and with your import script I think I've found it. Great work so far!

But I'm having trouble with the log file. We use a Lotus Notes cluster, and for each server in the cluster we have a separate log file per day. The import works, but the result isn't OK, and I think it's because of the log file format.

It looks like this:

192.168.1.1 bene.com - +0200 "GET /mobiliario-de-oficina/news-filo-design-preis-2009.html HTTP/1.1" 200 20719 "" "Mozilla/5.0 (compatible; Ezooms/1.0; ezooms.bot@gmail.com)" 1453 "" "D:/Notes/Data/benecom/cont_es.nsf"

In AWStats I can describe the log file format like this:

LogFormat=%host %virtualname %lognamequot %time1 %methodurl %code %bytesd %refererquot %uaquot %other %other %other

HTTP errors and redirects are also not found:

17758 requests imported successfully
542 requests were downloads
0 requests ignored:
    0 invalid log lines
    0 requests done by bots, search engines, ...
    0 HTTP errors
    0 HTTP redirects
    0 requests to static resources (css, js, ...)
    0 requests did not match any known site
    0 requests did not match any requested hostname

See more log at: http://pastebin.com/zSMXqEpu

mattab commented 12 years ago

Traceback (most recent call last):
  File "C:\Python26\lib\threading.py", line 522, in __bootstrap_inner
    self.run()
  File "C:\Python26\lib\threading.py", line 477, in run
    self.__target(*self.__args, **self.__kwargs)
  File "c:\wamp\www\piwik\misc\log-analytics\import-logs.py", line 756, in _run
    self._record_hit(hit)
  File "c:\wamp\www\piwik\misc\log-analytics\import-logs.py", line 794, in _reco
rd_hit
    'url': main_url + hit.path[:1024],
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 148: ordinal not in range(128)
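That traceback is most likely the classic Python 2 implicit ASCII decode: main_url is a unicode string while hit.path holds raw UTF-8 bytes from the log, so the concatenation on line 794 tries to decode the bytes as ASCII and fails on 0xc3 (the lead byte of many UTF-8 sequences). A minimal illustration of the cause and a fix sketch, shown in Python 3 syntax with made-up values:

```python
# The crash: 'url': main_url + hit.path[:1024] mixes text and raw
# bytes. In Python 2 that triggers an implicit ASCII decode of the
# bytes, which blows up on 0xc3. Fix sketch (values are illustrative):
# decode the raw path explicitly before concatenating.
raw_path = b'/caf\xc3\xa9/page.html'               # bytes read from the log
path = raw_path.decode('utf-8', errors='replace')  # explicit, safe decode
main_url = 'http://example.com'
url = main_url + path[:1024]                       # both operands are text now
```

Using errors='replace' keeps the importer alive even on lines that aren't valid UTF-8, at the cost of substituting the undecodable bytes.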