matomo-org / matomo

Empowering People Ethically with the leading open source alternative to Google Analytics that gives you full control over your data. Matomo lets you easily collect data from websites & apps and visualise this data and extract insights. Privacy is built-in. Liberating Web Analytics. Star us on Github? +1. And we love Pull Requests!
https://matomo.org/
GNU General Public License v3.0

Log analytics list of improvements #3163

Closed mattab closed 9 years ago

mattab commented 12 years ago

In Piwik 1.8 we released the great new feature to import access logs and generate statistics.

The V1 release works very well (it was tracked in #703), but there are ideas to improve it. This ticket is a placeholder of all ideas and discussions related to the Log Analytics feature!

New features

Performance

How do you debug performance? First of all, you can run the script with --dry-run to see how many log lines per second are parsed. It should typically be between 2,000 and 5,000. When you don't do a dry run, the script inserts new pageviews and visits by calling the Piwik API.
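What --dry-run measures can be illustrated with a small, self-contained Python sketch (this is not code from import_logs.py; the regex and sample line below are simplified stand-ins):

```python
import re
import time

# Simplified Apache common-log regex; import_logs.py ships its own,
# more complete format definitions.
LINE_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<date>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d+) (?P<length>\S+)'
)

sample = ('1.2.3.4 - - [12/Nov/2012:11:16:24 -0500] '
          '"GET /index.html HTTP/1.1" 200 1234')
lines = [sample] * 100_000

start = time.perf_counter()
matched = sum(1 for line in lines if LINE_RE.match(line))
elapsed = time.perf_counter() - start

print(f"{matched} lines parsed, {matched / elapsed:.0f} lines/sec")
```

Pure parsing speed like this is an upper bound; a real import also pays for the HTTP requests to the Piwik tracker.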

Other tickets

anonymous-matomo-user commented 12 years ago

Some IIS logs do the same as bjrubble mentioned in the comment above - for their c-ip section, a host name may be found instead of just an IP address.

This causes the regex (which only accepts digits) to fail when parsing that line, and I believe the line gets thrown out, resulting in a bad import.

anonymous-matomo-user commented 12 years ago

Because Piwik lacks the capability to track news feed subscribers (and I don't want to use FeedBurner), I would like to import that particular information from the Apache logs. All other web requests are already tracked successfully by Piwik, and I want the feed users' information merged into the same Piwik website. For instance, my news feed is located at www.domain.com/rss.xml; how can I import only that particular information into Piwik?

anonymous-matomo-user commented 12 years ago

Hi guys,

We found one odd case.

On 2 servers (one dedicated and one VPS), each new visit gets a new idvisitor (despite the same configId).

BUT with the same log file and the same Piwik (fresh download and installation) on localhost on Mac OS X, unique visitors are counted correctly.

Do you have any ideas why, and how is it supposed to work? I've spent some time in visit.php: when there is no cookie and the visit is less than 30 minutes old, a new idvisitor is assigned.

mattab commented 12 years ago

BUT with the same log file and the same Piwik (fresh download and installation) on localhost on Mac OS X, unique visitors are counted correctly.

Could you somehow find an example of the log file showing the problem on both installations, with a few lines (3 or 4), so we can replicate the bug? That would help us find a fix. Thanks!

anonymous-matomo-user commented 12 years ago

Yes Matt, I will have them tomorrow (day off today). But how should it work? Should log parsing count unique visitors or not?

geosone commented 12 years ago

I have activated log import via an Apache macro to get live stats, but we have about 20 sites with high load, and the problem we have now is that access via the URL is blocking (30 or more import_logs.py processes accessing Piwik). Could we get a direct log import that does not go through the HTTP interface and instead runs through a console PHP load?

Thanks, and keep up the great work! Mario

aspectra commented 12 years ago

Hi all,

We are testing the Python import_logs.py script. Currently we are not able to import IIS log files that are compressed with WinZip or 7-Zip. If we unzip the archive before running the script, it works quite well.

It seems the python script is not able to uncompress the files...

Attached an example archive

diosmosis commented 12 years ago

(In [6734]) Refs #3163, add integration tests (in PHP) for log importer.

diosmosis commented 12 years ago

(In [6737]) Refs #3163, modified log importer to use bulk tracking capability.

Notes:

mattab commented 12 years ago

(In [6739]) Refs #3163 - clarifying this option shouldn't be used by default

diosmosis commented 12 years ago

(In [6740]) Refs #3163, made size of parsing chunk == to max payload size * recorder count.

mattab commented 12 years ago

(In [6743]) Refs #3163

TODO:

mattab commented 12 years ago

(In [6745]) Fixing build? Refs #3163

mattab commented 12 years ago

(In [6749]) Refs #3163

diosmosis commented 12 years ago

(In [6756]) Refs #3163, show average records/s along w/ current records/s in log importer.

mattab commented 12 years ago

Replying to jamesvl011:

Some IIS logs do the same as bjrubble mentioned in the comment above - for their c-ip section, a host name may be found instead of just an IP address.

This causes the regex (which only accepts digits) to fail when parsing that line, and I believe the line gets thrown out, resulting in a bad import.

@jrbubble and james, could you please submit the correct regex? We would be glad to commit the fix. Thanks!

mattab commented 12 years ago

Adding "Heuristics to not track bot visits" in the ticket description.

If you have a suggestion or request for the script - or any problem or bug, please post a new comment here.

mattab commented 12 years ago
anonymous-matomo-user commented 12 years ago

Replying to matt:

@jrbubble and james, could you please submit the correct REGEX? we would be glad to commit the fix, thanks.

Matt -

The regex for c-ip (line 134 of import_logs.py when I looked at svn) ought to be like the line for User-Agent, allowing any text string without spaces:

'c-ip': '(?P<ip>\S+)'

I'm assuming the Piwik API can handle a host name supplied in place of an IP address? If not, Python will have to do hostname lookups (preferably with its own mini-cache) as it parses the file.

I'll attach a file to this ticket with an example IIS log file that you can use for testing - it will have four rows, three with host names in the c-ip field and one with an IP address.
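The difference between the two patterns is easy to check in isolation. A minimal sketch (the digits-only pattern below is a stand-in for the old behaviour, not the exact line from import_logs.py):

```python
import re

# Stand-in for the old digits-only c-ip pattern.
ip_only = re.compile(r'(?P<ip>\d+\.\d+\.\d+\.\d+)$')
# Proposed fix: accept any non-whitespace token, like the User-Agent field.
ip_or_host = re.compile(r'(?P<ip>\S+)$')

for value in ('203.0.113.7', 'crawler.example.net'):
    print(value, bool(ip_only.match(value)), bool(ip_or_host.match(value)))
```

The hostname value fails the digits-only pattern but matches `\S+`, which is exactly the failure mode described above.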

oliverhumpage commented 12 years ago

I've just tried a fresh install of 1.8.3 (to make sure it works before I move everything over from my current 1.7.2rc4 install).

When I import a sample log (for just one vhost) using --add-sites-new-hosts, I get the same "website" created multiple times. It seems that if you set --recorders to something greater than 1, then several recorders will independently create the new vhost's website for you. Changing --recorder-max-payload-size doesn't seem to affect this behaviour, it's just --recorders.

I'm sure this didn't happen in the older 1.7.2 version.

Can you replicate, and if so, is there an easy fix?

Thanks.
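The symptom is consistent with a classic check-then-create race: with several recorders running, each one can see the site as missing and create it. A hedged sketch of the pattern and the usual fix (names are invented; this is not the importer's actual code):

```python
import threading

sites = {}          # hostname -> site id, stand-in for Piwik's site list
create_lock = threading.Lock()
creations = []

def get_or_create_site(hostname):
    # Without the lock, two recorders can both find the site missing
    # and both create it, yielding duplicate websites.
    with create_lock:
        if hostname not in sites:
            sites[hostname] = len(sites) + 1
            creations.append(hostname)
        return sites[hostname]

threads = [threading.Thread(target=get_or_create_site, args=('example.com',))
           for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(len(creations))  # the site is created exactly once
```

This also explains why --recorder-max-payload-size has no effect: the race is between recorder processes, not within a payload.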

diosmosis commented 12 years ago

(In [6824]) Refs #3163, fix concurrency bug in import script where sites get created more than once when --add-sites-new-hosts is used.

diosmosis commented 12 years ago

Replying to oliverhumpage:

I've just tried a fresh install of 1.8.3 (to make sure it works before I move everything over from my current 1.7.2rc4 install).

When I import a sample log (for just one vhost) using --add-sites-new-hosts, I get the same "website" created multiple times. It seems that if you set --recorders to something greater than 1, then several recorders will independently create the new vhost's website for you. Changing --recorder-max-payload-size doesn't seem to affect this behaviour, it's just --recorders.

I'm sure this didn't happen in the older 1.7.2 version.

Can you replicate, and if so, is there an easy fix?

Just committed a fix for this bug. Can you use the file in svn?

diosmosis commented 12 years ago

(In [6826]) Refs #3163, added more integration tests for log importer & removed some unnecessary xml files.

oliverhumpage commented 12 years ago

Replying to capedfuzz:

Just committed a fix for this bug. Can you use the file in svn?

Perfect, that's fixed it - thank you.

Oliver.

diosmosis commented 12 years ago

(In [...]) Refs #3163, #3227: make sure no exception is thrown in the tracker when there is no 'ua' parameter and no HTTP_USER_AGENT (fix for bug in [6737]).

anonymous-matomo-user commented 12 years ago

I'm trying to import our IIS logs using import_logs.py but it keeps hitting a snag somewhere in the middle. The message simply says:

Fatal error: None
You can restart the import of "d:\tmp\logfiles\ex120803.log" from the point it failed by specifying --skip=215201 on the command line.

When I restart it with the skip parameter, it does not record any more lines and fails again a few lines further down (see the output below):

C:\Python27>python "d:\websites\piwik\misc\log-analytics\import_logs.py" --url=http://piwikpre.unaids.org/ "d:\tmp\logfiles\ex120803.log" --idsite=2 --skip=215201
0 lines parsed, 0 lines recorded, 0 records/sec (avg), 0 records/sec (current)
Parsing log d:\tmp\logfiles\ex120803.log...
182921 lines parsed, 0 lines recorded, 0 records/sec (avg), 0 records/sec (current)
218630 lines parsed, 0 lines recorded, 0 records/sec (avg), 0 records/sec (current)
222550 lines parsed, 0 lines recorded, 0 records/sec (avg), 0 records/sec (current)
227111 lines parsed, 0 lines recorded, 0 records/sec (avg), 0 records/sec (current)
231539 lines parsed, 0 lines recorded, 0 records/sec (avg), 0 records/sec (current)
235666 lines parsed, 0 lines recorded, 0 records/sec (avg), 0 records/sec (current)
240261 lines parsed, 0 lines recorded, 0 records/sec (avg), 0 records/sec (current)
244780 lines parsed, 0 lines recorded, 0 records/sec (avg), 0 records/sec (current)
Fatal error: None
You can restart the import of "d:\tmp\logfiles\ex120803.log" from the point it failed by specifying --skip=215225 on the command line.

The format we are using is the W3C Extended Log File Format, and we are tracking extended properties such as Host, Cookie, and Referer. I'd like to send the log file I used for this example, but it's too big to attach (20 MB even when zipped). Can I send it by some other means?

Thanks a lot! -Jo

anonymous-matomo-user commented 12 years ago

Hi,

Nice module; we're currently assessing it. I have 2 questions:

1/ We have several load-balanced servers. Each server generates its own log files, but for the same FQDN. How can we process and aggregate the log files into the same website, given that the log files need to be ordered by date?

2/ The log files contain consumed bandwidth. Would it be feasible to enhance this module to parse and log this information? Or, if we need this information, should we consider creating a plugin?

Thanks for your feedback.

anonymous-matomo-user commented 12 years ago

The import_logs.py script should be able to handle and order the dates of your different logs when computing statistics. That's the main purpose of the "invalidate" function within the script.

The best approach would be to import all your logs at once and then run the archive job, so that it can compute statistics for the "invalidated" dates.
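Independently of the invalidation logic, per-server logs that are each already in time order can also be pre-merged into one chronological stream before import. A sketch under that assumption (the timestamp format and sample lines are made up):

```python
import heapq
from datetime import datetime

def ts(line):
    # Assumes Apache-style "[12/Nov/2012:10:00:00 -0500]" timestamps.
    raw = line.split('[', 1)[1].split(']', 1)[0]
    return datetime.strptime(raw.split(' ')[0], '%d/%b/%Y:%H:%M:%S')

server_a = ['a [12/Nov/2012:10:00:00 -0500] "GET /x" 200',
            'a [12/Nov/2012:10:02:00 -0500] "GET /y" 200']
server_b = ['b [12/Nov/2012:10:01:00 -0500] "GET /z" 200']

# heapq.merge keeps the combined stream sorted by timestamp without
# loading whole files into memory.
merged = list(heapq.merge(server_a, server_b, key=ts))
print([line[0] for line in merged])
```

This only works when each individual file is already sorted, which is normally true for access logs.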

anonymous-matomo-user commented 12 years ago

Hi,

I'm trying to use import_logs.py to parse logs from Java Play. A sample log line follows:

15.185.97.217 127.0.0.1 - - Sep 04 18:28:38 PDT 2012 "/facedetect?url_pic=http%3A%2F%2Ffarm4.staticflickr.com%2F3047%2F2699553168_325fb5509b.jpg" 200 345 "" "Jakarta Commons-HttpClient/3.1" 5683 ""

But the Python script reports: "invalid log lines".

The Java Play log file is actually similar to Lighttpd's access.log. Is there an easy way to adapt this script to parse other log formats?
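For a line like the sample above, a custom pattern along these lines might be a starting point for --log-format-regex (untested against the script itself; the group names mirror those import_logs.py uses for its built-in formats, and the non-standard date field will likely still need extra handling):

```python
import re

# Hypothetical pattern for the Play log sample; adjust to your real lines.
PLAY_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \S+ '
    r'(?P<date>\w+ \d+ [\d:]+) (?P<timezone>\w+) \d{4} '
    r'"(?P<path>[^"]*)" (?P<status>\d+) (?P<length>\d+) '
    r'"(?P<referrer>[^"]*)" "(?P<user_agent>[^"]*)" \d+ ""'
)

line = ('15.185.97.217 127.0.0.1 - - Sep 04 18:28:38 PDT 2012 '
        '"/facedetect?url_pic=http%3A%2F%2Ffarm4.staticflickr.com'
        '%2F3047%2F2699553168_325fb5509b.jpg" 200 345 "" '
        '"Jakarta Commons-HttpClient/3.1" 5683 ""')

m = PLAY_RE.match(line)
print(m.group('ip'), m.group('status'), m.group('user_agent'))
```

Testing the regex in a Python shell against a handful of real lines, as above, is a quick way to rule out "invalid log lines" errors before running a full import.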

anonymous-matomo-user commented 12 years ago

It was suggested by Matt that I add my issue to this ticket:

I'm running Piwik 1.8.3 on IIS 7. I've installed the GeoIP plugin, and also tweaked based on http://forum.piwik.org/read.php?2,71788. It is working. However, my installation is only tracking US-based visits.

My IIS instance archives its log hourly. I've attached one recent log for review, on the chance that it will contain clues as to why I'm only seeing US-based visits.

anonymous-matomo-user commented 12 years ago

Attached log file is named u_ex12091212.log.

diosmosis commented 12 years ago

[7030] refs this ticket.

anonymous-matomo-user commented 11 years ago

I have a log with the following format, where www.website.com represents the hostname of one of the websites hosted on the server. I get an error that the log format doesn't include the hostname.

188.165.230.147 www.website.com - -0400 "GET / HTTP/1.1" 200 10341 "http://www.orangeask.com/" "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1) ; .NET CLR 2.0.50727; .NET CLR 3.0.4506.2152; .NET CLR 3.5.30729)" "-"

I have tried a series of tests with --log-format-regex= and I can't get it to work. Any help would be greatly appreciated.

Thanks

mattab commented 11 years ago

To everyone with questions in this ticket, thank you for your bug reports. You can try to modify the Python script to make it work for your log files. The relevant code at the start of the script is really simple.

If you are stuck and need help, Piwik experts can help with any issue related to the log import script. Contact them at: http://piwik.org/consulting/

Otherwise, we may fix some of the requests posted here, but it might take a while...

We hope you enjoy Log Analytics!

anonymous-matomo-user commented 11 years ago

Replying to jason:

I have a log with the following format, where www.website.com represents the hostname of one of the websites hosted on the server. I get an error that the log format doesn't include the hostname.

188.165.230.147 www.website.com - -0400 "GET / HTTP/1.1" 200 10341 "http://www.orangeask.com/" "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1) ; .NET CLR 2.0.50727; .NET CLR 3.0.4506.2152; .NET CLR 3.5.30729)" "-"

I have tried a series of tests with --log-format-regex= and I can't get it to work. Any help would be greatly appreciated.

Thanks

Last time, I successfully adapted the code in import_logs.py, specifically for parsing Java Play log files. I think you could hard-code removing the hostname pattern with the "http://" prefix, or strip it with a string replace.

anonymous-matomo-user commented 11 years ago

Had a very minor problem with the script today: I have daily log rotation enabled, and when no user visits a site on a given day, the log file for that day will be empty. This means the log format guessing fails, leading to an error. Preferably, when a log file is empty, one would like to skip the file without throwing an error. This is easily achieved by changing the line that checks for log file existence to also check if the log file has contents:

        `if not os.path.exists(filename) or os.path.getsize(filename) == 0:`

mattab commented 11 years ago

@cyril in the next update, can you please include this patch from @phikai: "Log Parser README Update with Nginx Log Format for Common Complete"

To everyone else: please consider submitting patches, README improvements, or new log formats for the script; we will make an update in a few days.

mattab commented 11 years ago

(In [7313]) Refs #3163 Adding libwww in excluded user agents, since libwww-perl is a common bot As reported in: http://forum.piwik.org/read.php?3,95844

cbay commented 11 years ago

(In [7382]) Refs #3163: Log Parser README Update with Nginx Log Format for Common Complete, thanks to phikai.

cbay commented 11 years ago

(In [7383]) Refs #3163: don't fail to autodetect the format for empty files.

mattab commented 11 years ago

Hey guys, there have been many updates to the script in trunk; please let us know if your suggestion or report hasn't yet been committed.

Kudos to Cyril for the updates!

edit: Check also this ticket: #3558

cbay commented 11 years ago

For the record, with the current trunk I can sustain 2,000 requests/second in dry-run mode on a 2.7 GHz Xeon, and 1,000 requests/second without dry-run, with --recorder=10 and the default payload (Piwik is installed on another server, with 4 cores).

That's not to say you should get the same numbers, as it depends on a LOT of factors (raw processing power, number of recorders, payload, PHP configuration, log files, network, etc.), but if you only get 50 requests/second on a strong machine, something is probably wrong.

Running with --dry-run is a good way to know how fast the Python script can go without really importing to Piwik, which already excludes many factors.

anonymous-matomo-user commented 11 years ago

I am running Piwik 1.9.2 on a RHEL 5.7 server running Apache.

I am trying to implement the Apache CustomLog that directly imports into Piwik, as described in the README (https://github.com/piwik/piwik/blob/master/misc/log-analytics/README). I am not sure whether I have a problem with my configuration or whether there is a potential bug in the Piwik import_logs.py script. After some poking around on the command line, it seems that the script works perfectly when it is given an entire file, but crashes when you try to feed it a single line from a log file. I have included my command output below. Any help would be greatly appreciated; if you need any additional information, please let me know!

Firstly let me pull the first line of my logfile to show its syntax:

[katonj@mimir2:log-analytics ] $ head -1 boarddev-beta.teradyne.com.log
boarddev-beta.teradyne.com 131.101.52.31 - - [12/Nov/2012:11:16:24 -0500] "GET /boarddev/ HTTP/1.1" 200 10541 "-" "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.4 (KHTML, like Gecko) Chrome/22.0.1229.94 Safari/537.4"

Now, when I run the file as the Apache configuration suggests, I get the following (note: if I do not put the "-" at the end of the command, the line from the logfile is ignored and the script simply outputs the README):

[katonj@mimir2:log-analytics ] $ head -1 boarddev-beta.teradyne.com.log | ./import_logs.py  --add-sites-new-hosts --config=../../config/config.ini.php --url='http://boarddev-beta.teradyne.com/analytics/' -
0 lines parsed, 0 lines recorded, 0 records/sec (avg), 0 records/sec (current)
Parsing log (stdin)...
Traceback (most recent call last):
  File "./import_logs.py", line 1462, in <module>
    main()
  File "./import_logs.py", line 1426, in main
    parser.parse(filename)
  File "./import_logs.py", line 1299, in parse
    file.seek(0)
IOError: [Errno 29] Illegal seek

And finally, if I run the file itself through the script, I get the following, showing that it happily processes the logfile as long as it is fed an entire file all at once:

[katonj@mimir2:log-analytics ] $ ./import_logs.py  --add-sites-new-hosts --config=../../config/config.ini.php --url='http://boarddev-beta.teradyne.com/analytics/' boarddev-beta.teradyne.com.log
0 lines parsed, 0 lines recorded, 0 records/sec (avg), 0 records/sec (current)
Parsing log boarddev-beta.teradyne.com.log...
Purging Piwik archives for dates: 2012-11-12
To re-process these reports with your new update data, execute the piwik/misc/cron/archive.php script, or see: http://piwik.org/setup-auto-archiving/ for more info.

Logs import summary
-------------------

    8 requests imported successfully
    0 requests were downloads
    0 requests ignored:
        0 invalid log lines
        0 requests done by bots, search engines, ...
        0 HTTP errors
        0 HTTP redirects
        0 requests to static resources (css, js, ...)
        0 requests did not match any known site
        0 requests did not match any requested hostname

Website import summary
----------------------

    8 requests imported to 1 sites
        1 sites already existed
        0 sites were created:

    0 distinct hostnames did not match any existing site:

Performance summary
-------------------

    Total time: 0 seconds
    Requests imported per second: 24.01 requests per second

cbay commented 11 years ago

ottodude125: log format detection combined with reading from stdin is actually not supported; you have to pick one. I'll fix the bug later on, though.

anonymous-matomo-user commented 11 years ago

When you set up the Apache CustomLog, you are piping the log messages into the script as soon as they appear. That is the same as stdin, right? I was just trying to simulate that process by running head -1 on a log file to get a log message and piping that into the script.

oliverhumpage commented 11 years ago

Since auto format detection relies on having several lines to decode, it doesn't work on stdin (it tries to seek to points in the file, hence the "bug" - seek obviously fails on stdin).

When using stdin as the log source you have to use either --log-format-name or --log-format-regex flags on the command line to force a particular format. You might find --log-format-name="common_vhost" is what you want.
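The underlying distinction is directly observable in Python: a regular file can be rewound after the detector has read a few lines, while a pipe (which is what stdin is under a piped Apache CustomLog) cannot. A minimal illustration:

```python
import io
import os

# An in-memory file behaves like a regular file: it can seek back,
# so read-a-few-lines-then-rewind autodetection is possible.
regular = io.StringIO("line1\nline2\n")
regular_ok = regular.seekable()

# A pipe cannot seek, which is why autodetection fails on stdin.
read_fd, write_fd = os.pipe()
pipe = os.fdopen(read_fd)
pipe_ok = pipe.seekable()

print(regular_ok, pipe_ok)

pipe.close()
os.close(write_fd)
```

Forcing the format with --log-format-name or --log-format-regex sidesteps the rewind entirely, which is why it works on stdin.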

anonymous-matomo-user commented 11 years ago

You are completely right. Adding --log-format-name='common_vhost' to the command now allows a logfile to be read in from stdin. So the following command works great from the command line:

[katonj@mimir2:applications ] $ head -8 babyfat | /hwnet/dtg_devel/web/beta/applications/piwik/misc/log-analytics/import_logs.py --add-sites-new-hosts --url='http://mimir2.icd.teradyne.com/analytics' --log-format-name='common_vhost' --output=/tmp/junk.log -

As a side note, I've tried the common_complete name, and I tried using the --log-format-regex included in the README; neither of them had any magical side effects either.

Unfortunately, porting that exact same thing into the Apache http.conf file does not work. I have the configuration below, and while the logfile "babyfat" gets populated, Piwik doesn't seem to process any input.

LogFormat "%v %h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\"" baby

CustomLog "|/hwnet/dtg_devel/web/beta/applications/piwik/misc/log-analytics/import_logs.py --add-sites-new-hosts --url='http://mimir2.icd.teradyne.com/analytics' --log-format-name='common_vhost' --output=/tmp/junk.log -" baby

CustomLog logs/babyfat baby

Lastly, the output logfile junk.log gets written when the command is run from the command line, but the only time it gets populated from Apache is when you add several -d flags to the CustomLog command and restart Apache, at which point you get:

2012-11-13 15:44:12,517: [DEBUG] Accepted hostnames: all
2012-11-13 15:44:12,517: [DEBUG] Piwik URL is: http://mimir2.icd.teradyne.com/analytics
2012-11-13 15:44:12,517: [DEBUG] No token-auth specified
2012-11-13 15:44:12,517: [DEBUG] No credentials specified, reading them from "/hwnet/dtg_devel/web/beta/applications/piwik/config/config.ini.php"
2012-11-13 15:44:12,520: [DEBUG] Using credentials: (login = piwik, password = a0a582ec5eda9c506a6f30dc8b2bbcf3)
2012-11-13 15:44:13,249: [DEBUG] Accepted hostnames: all
2012-11-13 15:44:13,249: [DEBUG] Piwik URL is: http://mimir2.icd.teradyne.com/analytics
2012-11-13 15:44:13,249: [DEBUG] No token-auth specified
2012-11-13 15:44:13,249: [DEBUG] No credentials specified, reading them from "/hwnet/dtg_devel/web/beta/applications/piwik/config/config.ini.php"
2012-11-13 15:44:13,251: [DEBUG] Using credentials: (login = piwik, password = a0a582ec5eda9c506a6f30dc8b2bbcf3)
2012-11-13 15:44:14,341: [DEBUG] Authentication token token_auth is: 582b588b9568840fa6f1e208a8702b93
2012-11-13 15:44:14,342: [DEBUG] Resolver: dynamic
2012-11-13 15:44:14,342: [DEBUG] Launched recorder
mattab commented 11 years ago

(In [7490]) Fixes #3548 Refs #3163 Any visitor with a user agent containing "spider" will be classified a bot

anonymous-matomo-user commented 11 years ago

I have the same issue as ottodude125. Piping a single line from access.log into import_logs.py works, but using the same command directly from Apache, nothing gets logged.

EDIT: I noticed the log messages appear in the import_logs log when I restart Apache. So it seems this triggers either Apache to send the messages to stdin, or import_logs to read from stdin.

2nd EDIT: CustomLog with rotatelogs works, so the issue must be in import_logs.py.

oliverhumpage commented 11 years ago

@elm @ottodude125

I noticed that in ottodude125's CustomLog there's no path to the config file and no auth token; that would explain the errors shown in junk.log. You need to specify one or the other so that import_logs.py can authenticate itself to the Piwik PHP scripts.

I'm wondering if the same problem is happening with elm's logs too? @elm, if that doesn't fix it, could you paste your CustomLog section here too?