matomo-org / matomo

Empowering People Ethically with the leading open source alternative to Google Analytics that gives you full control over your data. Matomo lets you easily collect data from websites & apps and visualise this data and extract insights. Privacy is built-in. Liberating Web Analytics. Star us on Github? +1. And we love Pull Requests!
https://matomo.org/
GNU General Public License v3.0

Log analytics list of improvements #3163

Closed mattab closed 9 years ago

mattab commented 12 years ago

In Piwik 1.8 we released the great new feature to import access logs and generate statistics.

The V1 release works very well (it was tracked in #703), but there are ideas to improve it. This ticket is a placeholder of all ideas and discussions related to the Log Analytics feature!

New features

PERFORMANCE

How to debug performance? First of all, you can run the script with --dry-run to see how many log lines per second are parsed. It should typically be between 2,000 and 5,000. Without --dry-run, the script inserts new pageviews and visits by calling the Piwik API.

Other tickets

anonymous-matomo-user commented 12 years ago

Attachment: Document the debian vhost_combined format vhost_combined.patch

anonymous-matomo-user commented 12 years ago

Attachment: Force hostname patch force_hostname.patch

oliverhumpage commented 12 years ago

Attachment: README.apache_log_recorders.patch

aspectra commented 12 years ago

Attachment: WinZip compressed file u_ex120813.zip

anonymous-matomo-user commented 12 years ago

Attachment: Sample IIS file for testing variations of c-ip field test_c-ip_iis_log.log

anonymous-matomo-user commented 11 years ago

Attachment: Log Parser README Update with Nginx Log Format for Common Complete README_nginx_log_format.diff

anonymous-matomo-user commented 11 years ago

Attachment: Log for WMS 9.0 WMS_20130523.log

oliverhumpage commented 12 years ago

Could I just re-ask an unanswered problem from ticket #703? If instead of specifying a file you do

cat /path/to/log | import_logs.py [options] -

then does it work for you, or do you just get 0 lines imported? Because with the latest version I'm getting 0 lines imported, and that means I can't log straight from apache (and hence the README is wrong too).

cbay commented 12 years ago

oliverhumpage: I couldn't reproduce this issue. Do you get it with --dry-run too? Could you send a minimal log file?

ddeimeke commented 12 years ago

Counting Downloads:

In a podcast project I want to count only the downloads of file types "mp3" and "ogg". In another project it would be nice to count only the PDF downloads.

Another topic in this area is how downloads are counted. Not every occurrence of the file in the logs is a download. For instance, I am using an HTML5 player. Users might hear one part of the podcast on their first visit and other parts on succeeding visits. All together that would be one download.

A possible "solution" (or may be a workaround): Sum up all the "bytes transferred" and divide it by the largest "bytes transferred" for a certain file.
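The workaround above can be sketched in a few lines of Python. This is only an illustration of the arithmetic: the function name `estimate_downloads`, the `(path, bytes)` input shape, and the extension filter are assumptions made for the example and are not part of import_logs.py.

```python
from collections import defaultdict


def estimate_downloads(hits, extensions=(".mp3", ".ogg")):
    """Estimate full downloads per file.

    hits: iterable of (path, bytes_transferred) tuples taken from an access log.
    The estimate is sum(bytes) / max(bytes) per file, as proposed in the comment.
    """
    total_bytes = defaultdict(int)
    largest = defaultdict(int)
    for path, sent in hits:
        # Only count the file types we care about (hypothetical filter).
        if not path.lower().endswith(extensions):
            continue
        total_bytes[path] += sent
        largest[path] = max(largest[path], sent)
    # Skip files whose every transfer was 0 bytes to avoid dividing by zero.
    return {p: total_bytes[p] / largest[p] for p in total_bytes if largest[p] > 0}


hits = [
    ("/ep1.mp3", 4_000_000),  # one complete transfer
    ("/ep1.mp3", 1_000_000),  # partial transfers from the HTML5 player
    ("/ep1.mp3", 3_000_000),
    ("/ep2.ogg", 2_000_000),
    ("/index.html", 5_000),   # ignored: not a tracked file type
]
estimates = estimate_downloads(hits)
```

Note that the largest transfer is only a stand-in for the real file size, so the estimate is too high if no single request ever fetched the whole file.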

anonymous-matomo-user commented 12 years ago

Feature Request: Support Icecast logs. We currently use AWStats, but it would be great to be able to move to Piwik.

oliverhumpage commented 12 years ago

@Cyril

Having spent some time looking into it, and working out exactly which revision caused the problem, I think it's down to the regex I used in --log-format-regex not working any more. Turns out the regex format in import_logs.py has had the group <timezone> added to it, which seems to be required by code further down the script.

Could you update the readme so the middle of the regex changes from:

\[(?P<date>.*?)\]

to

\[(?P<date>.*?) (?P<timezone>.*?)\]

This will then make it all work.
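As a quick illustration of why the extra group matters, here is a minimal sketch of a date pattern that carries both a `date` and a `timezone` named group. It assumes the timezone follows the date after a single space, as in Apache's default timestamp; the pattern is only a fragment written for this example, not the full regex from the README.

```python
import re

# Fragment only: captures the bracketed timestamp of a log line and splits it
# into the two named groups the thread says the script expects.
DATE_WITH_TIMEZONE = re.compile(r'\[(?P<date>.*?) (?P<timezone>.*?)\]')

sample = '127.0.0.1 - - [10/Oct/2012:13:55:36 +0200] "GET / HTTP/1.1" 200 2326'
match = DATE_WITH_TIMEZONE.search(sample)

date = match.group('date')          # text before the space
timezone = match.group('timezone')  # UTC offset after the space
```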

Thanks,

Oliver.

cbay commented 12 years ago

(In [6471]) Refs #3163 Fixed regexp in README.

cbay commented 12 years ago

Oliver: indeed, I've just fixed it, thanks.

anonymous-matomo-user commented 12 years ago

I've been fiddling with this tool and it looks really nice. The biggest issue I've found is with --add-sites-new-hosts: in my case (using a control panel) it's quite difficult to add the required %v:%p fields to the custom log format. What I do have is one log per domain, so being able to specify the hostname manually would do the trick for me.

In the current situation launching this:

python /var/www/piwik/misc/log-analytics/import_logs.py \
  --url=https://server.example.com/tools/piwik --recorders=4 --enable-http-errors \
  --enable-http-redirects --enable-static --enable-bots --add-sites-new-hosts /var/log/apache2/example.com-combined.log

Just produces this:

Fatal error: the selected log format doesn't include the hostname: 
  you must specify the Piwik site ID with the --idsite argument

Having an option such as --hostname example.com (the same as the filename in my case) that fixes the hostname (similar to --idsite-fallback=) would solve my issue.

oliverhumpage commented 12 years ago

I'm not a piwik dev, but what I think you're trying to do is:

For every logfile, get its filename (which is also the hostname), check if a site with that hostname exists in piwik: if it does exist, import the logfile to it; if it doesn't exist, create it, then import the logfile to it.

The way I'd do this is to write an import script which:

http://piwik.org/docs/analytics-api/reference/ gives the various API calls, looks like SitesManager.getAllSites and SitesManager.addSite will do the job (e.g. call http://your.piwik.install/?module=API&method=SitesManager.getAllSites&format=xml&token_auth=xxxxxxxxxx to get all current sites, etc).
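A rough Python sketch of that approach follows. The two API methods are the ones named above; everything else is an assumption made for illustration: the helper names, the `siteName` and `urls` parameters passed to `SitesManager.addSite`, and the shape of the JSON responses (a list or dict of sites with `idsite` and `main_url`, and a `{"value": id}` reply from `addSite`). Check them against the API reference before relying on this.

```python
import json
from urllib.parse import urlencode, urlparse
from urllib.request import urlopen


def api_url(base_url, method, token_auth, **params):
    """Build a Piwik HTTP API URL for the given method."""
    query = {"module": "API", "method": method,
             "format": "JSON", "token_auth": token_auth}
    query.update(params)
    return base_url.rstrip("/") + "/?" + urlencode(query)


def find_site_id(sites, hostname):
    """Return the idsite whose main_url host equals hostname, else None."""
    for site in sites:
        if urlparse(site.get("main_url", "")).hostname == hostname:
            return site["idsite"]
    return None


def call_api(url):
    """Fetch a URL and decode the JSON body (performs a network request)."""
    with urlopen(url) as response:
        return json.load(response)


def site_id_for_hostname(base_url, token_auth, hostname):
    """Look up the site for hostname, creating it if it does not exist."""
    sites = call_api(api_url(base_url, "SitesManager.getAllSites", token_auth))
    if isinstance(sites, dict):  # assumed: some versions key sites by id
        sites = list(sites.values())
    site_id = find_site_id(sites, hostname)
    if site_id is None:
        created = call_api(api_url(base_url, "SitesManager.addSite", token_auth,
                                   siteName=hostname, urls="http://" + hostname))
        site_id = created["value"]  # assumed response shape
    return site_id
```

The returned ID could then be passed to import_logs.py with --idsite for the matching logfile.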

HTH (a real piwik person might have a better idea)

Oliver.

anonymous-matomo-user commented 12 years ago

Thanks for your answer, Oliver. Your process is perfectly fine, but I'd rather avoid writing code for something that could be handled by slightly extending the functionality of --add-sites-new-hosts. And thanks for the links too, I'll have a look.

anonymous-matomo-user commented 12 years ago

It would be nice to document the standard format provided (at the moment only debian/ubuntu) that would give piwik the hostname that is required.

The format is this:

LogFormat "%v:%p %h %l %u %t \"%r\" %>s %O \"%{Referer}i\" \"%{User-Agent}i\"" vhost_combined

You can see the latest version from debian's apache2.conf [http://anonscm.debian.org/gitweb/?p=pkg-apache/apache2.git;a=blob;f=debian/config-dir/apache2.conf;h=50545671cbaeb1f170d5f3f1acd20ad3978f36ea;hb=HEAD]

See attached a small change to the README file.

anonymous-matomo-user commented 12 years ago

After looking at the code I created a patch to add a new option called --force-hostname that expects a string with the hostname. If it's set, the value of host will ALWAYS be the one passed to --force-hostname. This makes it possible to handle logfiles in ncsa_extended or common format as if they were complete formats (creating idsites when needed, and so on).

cbay commented 12 years ago

(In [6474]) Refs #3163 Added the --log-hostname option, thanks to aseques.

cbay commented 12 years ago

(In [6475]) Refs #3163 Added a reference in README to the Debian/Ubuntu default vhost_combined, thanks aseques.

cbay commented 12 years ago

Thanks aseques, both your feature request and your patch were fine, I've just committed it. Attention: I renamed the option to --log-hostname to keep coherence with the --log prefix.

anonymous-matomo-user commented 12 years ago

Great, this will be so useful for me :)

anonymous-matomo-user commented 12 years ago

Hi,

I'm not sure whether this is the right place for a ticket or problem. I have a problem importing access logs from my shared webspace. I've copied the text from here: http://forum.piwik.org/read.php?2,90313


Hi,

I'm on a shared webspace with SSH support. I'm trying your import script to analyse my Apache logs. I got it to work, but sometimes there are "Fatal errors" and I have no idea why. If I restart it without --skip, it fails at the same skip line every time.

Example:

4349 lines parsed, 85 lines recorded, 0 records/sec

4349 lines parsed, 85 lines recorded, 0 records/sec

4349 lines parsed, 85 lines recorded, 0 records/sec

4349 lines parsed, 85 lines recorded, 0 records/sec

Fatal error: Forbidden

You can restart the import of "/home/log/access_log_piwik" from the point it failed by specifying --skip=326 on the command line.


I tried to figure out on which line the script ends with that fatal error, but I can't. If I restart it with --skip=327, it runs to the end and all works fine. The same problem occurs with some other access logs ("access_log_1.gz" and so on), but I'm not sure why it stops. Is it a malformed line in the access log? Which line should I check?

Regards

cbay commented 12 years ago

Hexxer: you're getting an HTTP Forbidden from your Piwik install when importing the logs, you need to find out why.

anonymous-matomo-user commented 12 years ago

How do you know that? It stops every time at the same line, and if I skip that line it runs for 10 or 15 minutes without a problem (it needs 2 minutes or so to get up to this line).

Regards

mattab commented 12 years ago

Do you know the exact line that causes a problem? if you put only this line, does it also fail directly? thanks!

mattab commented 12 years ago

Benaka is implementing Bulk tracking in the ticket #3134 - The python script will simply have to send a JSON object:

{"requests":[url1,url2,url3],"token_auth":"xyz"}

I suppose we can do some basic test to see which value works best? Maybe 50 or 100 tracking requests at once? :)
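A minimal sketch of how the import script might batch requests for such a test. Key/value pairs in valid JSON need an object (curly braces), so the payload is built as one. The function name `bulk_payloads` and the default batch size are assumptions for illustration; the key names `requests` and `token_auth` come from the comment above.

```python
import json


def bulk_payloads(request_urls, token_auth, batch_size=100):
    """Yield one JSON payload per batch of tracking request URLs."""
    for start in range(0, len(request_urls), batch_size):
        batch = request_urls[start:start + batch_size]
        yield json.dumps({"requests": batch, "token_auth": token_auth})


# 250 placeholder tracking requests split into batches of 100.
urls = ["?idsite=1&url=http://example.com/page%d" % i for i in range(250)]
payloads = list(bulk_payloads(urls, "xyz", batch_size=100))
```

Trying a few values of `batch_size` (50, 100, and so on) against a real install would show which one gives the best throughput.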

anonymous-matomo-user commented 12 years ago

Hi,

Replying to matt:

Do you know the exact line that causes a problem? if you put only this line, does it also fail directly? thanks!

No, that's my problem. It stops (see above) with the hint to restart with "--skip=326", but I don't know what that means. Line 326 in the access log looks like all the others.

Replying to matt:

I suppose we can do some basic test to see which value works best? Maybe 50 or 100 tracking requests at once? :)

Do you mean me? I can't test during the day because I'm sitting behind a proxy at work. I can do something in the evening - but, sorry, I have a 5-month-old young lady who needs my love and attention :-)

oliverhumpage commented 12 years ago

Could I submit a request for an alteration to the README? I've just had a massive spike in traffic, and --recorders=1 just doesn't cut it when piping directly from apache's customlog :) Because each apache process hangs around waiting to log its request before moving onto the next request, it started jamming the server.

Setting a higher --recorders seems to have eased it, and there are no side effects that I can see so far.

Suggested patch attached to this ticket.

anonymous-matomo-user commented 12 years ago

Hi,

Is there a doc about the regex format for import_logs.py ?

We would like to import a file with awstat logFormat :

%time2 %other %cluster %other %method %url %query %other %logname %host %other %ua %referer %virtualname %code %other %other %bytesd %other %other

Thanks for your help,

Ludovic

anonymous-matomo-user commented 12 years ago

I am trying to set up a daily import of the previous day's log. My issue is that my host date-stamps the log file. How can I set it to import the log file with yesterday's date on it?

Here is the format of my log files access.log.%Y-%m-%d.log
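One way to handle this from a wrapper script is to compute yesterday's filename before calling the importer. A minimal sketch, assuming the filename pattern quoted above; the function name `yesterdays_log` is made up for the example.

```python
from datetime import date, timedelta


def yesterdays_log(today=None, pattern="access.log.%Y-%m-%d.log"):
    """Return the date-stamped log filename for the day before `today`."""
    today = today or date.today()
    return (today - timedelta(days=1)).strftime(pattern)


# Fixed date for a reproducible example; omit the argument in a cron job.
filename = yesterdays_log(date(2012, 9, 1))
```

The resulting name can then be passed to import_logs.py as the log file argument.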

anonymous-matomo-user commented 12 years ago

Thanks a lot for all your great work! The server log file analytics works great on my server.

I am using a lighttpd server and added the Accept-Language header to accesslog.format:

accesslog.format = "%h %V %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\" \"%{Accept-Language}i\""

(see http://redmine.lighttpd.net/projects/lighttpd/wiki/Docs:ModAccessLog)

I wonder if it would be possible to add support for the Accept-Language header to import_logs.py? So that the country could then be guessed from the Accept-Language header when GeoIP isn't installed.

anonymous-matomo-user commented 12 years ago

Replying to Cyril:

(In [6474]) Refs #3163 Added the --log-hostname option, thanks to aseques.

Thanks for making it possible to import logs, and also for the log-hostname patch. I'm not sure whether it's the patch or it's caused by using --recorders > 1, but on the first run with --add-sites-new-hosts I got 13 sites created for the same hostname.

anonymous-matomo-user commented 12 years ago

I'm having a similar problem to Hexxer. When I do a --dry-run I get no errors, but when adding to Piwik it falls over at about the same spot. It's not one offending log file or line of a log file that's causing it. I'll attach the output with debugging on below. I've run the script multiple times, removing the line where the script fell over, removing the log file where it fell over, etc. It always dies at around line 9000-10000 in the 3rd log file.

I'm not sure if this is of interest, but when doing a dry run the script does ~600 lines/sec; when importing to Piwik it does ~16.

anonymous-matomo-user commented 12 years ago

The output file is here. Akismet was marking the attachment as spam

cbay commented 12 years ago

(In [6509]) Refs #3163 updated README to suggest increasing the --recorders value.

cbay commented 12 years ago

oliverhumpage: thanks, I've committed your diff.

ludopaquet: no doc yet, I suggest you take a look at the code, taking _COMMON_LOG_FORMAT as example.

lewmat21: I suppose each log line has its own date anyway, so it doesn't matter what the filename is.

sc_: I don't think using Accept-Language to guess the country is a good idea. As the header name says, it's about languages (locales), not countries. First, many languages are spoken in several countries. If the Accept-Language says you accept English, what country would you pick? Second, people can have Accept-Languages that don't match their country. I personally surf using English as Accept-Language, whereas I'm French and live in France.

law: can you reproduce the issue? If so, can you give me the access log as well as the full command line you used?

andrewc: can you edit line 741 and increase the 200 value to something like 10000? It will print the full error message instead of only the first 200 characters, which is not enough to get the Piwik error.

anonymous-matomo-user commented 12 years ago

@Cyril: As far as I know Piwik does the same when the GeoIP plugin isn't used: http://piwik.org/faq/troubleshooting/#faq_65 The location is then guessed from en-us, fr-fr etc.

But the more important point is that it would be useful for website development to know what languages people use who visit my website. So it would be great if support for the Accept-Language header could be added.

anonymous-matomo-user commented 12 years ago

Sorry for the wrong formatting (the preview didn't work). Here is the correct link:

http://piwik.org/faq/troubleshooting/#faq_65

anonymous-matomo-user commented 12 years ago

@Cyril: Here's the output file with the full error messages.

anonymous-matomo-user commented 12 years ago

Replying to andrewc:

@Cyril: Here's the output file with the full error messages.

Sorry, this is the link: https://www.dropbox.com/sh/zat1m6lqphndpny/wH6n4mDaD6/output0907.txt

cbay commented 12 years ago

sc_: OK, I didn't know about this. Considering GeoIP will be integrated into Piwik soon (see #1823), which is a much better solution, I don't think we should modify the import script to use Accept-Language headers.

andrewc: your Piwik install (the PHP part) is returning errors: Only one usage of each socket address (protocol/network address/port) is normally permitted

You need to find out why and fix it. It's unrelated to the import script.

anonymous-matomo-user commented 12 years ago

Thanks for your great work. We've been trying out the log import for some time now and have a few ideas and problems.

I don't know whether I should open new tickets or write here.

One major thing is how to bring number of visitors/unique visitors down to make it more similar to javascript tracking and google analytics.

I understand that we don't have cookies and other config information to identify visitor.

We've managed to bring the number of pageviews/actions down several-fold (from 5 times to 2 times the JavaScript tracking figure), or much more in a few cases (such as from 100 times the JavaScript figure).

Our ideas and changes include (we assumed that we should get numbers as close as in javascript tracking):

few workarounds :)

We ended up with a number of actions (pageviews) about twice the JavaScript figure, without affecting the number of visitors (about 50% higher than JavaScript).

Our extreme case is 300 views with JavaScript tracking versus 30 000 views with the import script; after our changes, the import script reports about 570 views.

cbay commented 12 years ago

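fjohn:

  • why shouldn't we count POST requests? HEAD, I can agree, but POST are legitimate requests made by regular browsers
  • what kind of user-agent doesn't have OS data? Aren't they bots anyway?
  • limiting actions: that's on the PHP-side, I'll let matt answer this

Regarding excluding some specific paths (index.php?minimize_js, img_thumb.php, etc.): there are a gazillion "popular" paths that could be excluded, but I don't think it's a good idea to include those by default, for several reasons:

  • it would be a cumbersome list to maintain, and people could argue what paths deserve to be included or not, depending on how popular the script is
  • there would be false positives (what if I have a legitimate img_thumb.php that should be included in page views?)
  • most importantly, such a list would be quite large, and that would really slow down the importing process (as each hit would have to be compared with all excluded paths).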

So it's not something that we should do by default. We have --exclude-path and --exclude-path-from options that allow you to create your own list of paths to exclude, depending on your site.

What we may do is create such a list in Piwik (in an external file), but not enable it by default. People that want to use this could add --exclude-path-from=common_excluded_paths.txt (for instance). What do you think of this, matt?
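To make the idea concrete, here is a small sketch of how a list of excluded paths could be matched against hits. It assumes shell-style wildcard patterns, one per line; whether import_logs.py matches its --exclude-path values exactly this way is an assumption, and the helper names and sample patterns are made up for the example.

```python
import fnmatch


def load_patterns(lines):
    """Turn the lines of an exclusion file into a list of patterns."""
    return [line.strip() for line in lines if line.strip()]


def is_excluded(path, patterns):
    """Return True if the request path matches any exclusion pattern."""
    # fnmatchcase avoids platform-dependent case folding of the path.
    return any(fnmatch.fnmatchcase(path, pattern) for pattern in patterns)


# Hypothetical contents of common_excluded_paths.txt.
patterns = load_patterns([
    "*/img_thumb.php*",
    "*/index.php?minimize_js*",
    "",
])
```

Every hit is compared with every pattern, so the cost grows with the length of the list, which is the performance concern raised above.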

anonymous-matomo-user commented 12 years ago

Replying to Cyril:

fjohn:

  • why shouldn't we count POST requests? HEAD, I can agree, but POST are legitimate requests made by regular browsers

But POST is also used by ajax requests all the time (and this is not what we would count with JS). We've just simplified that to drop anything other than GET.

  • what kind of user-agent doesn't have OS data? Aren't they bots anyway?

For me the question is: does a "real user" always send OS data? In our logs there were, for example, curl, Python libs, XRumer, scrapers and many more odd requests that weren't on the bot list.

  • limiting actions: that's on the PHP-side, I'll let matt answer this

Yep, it is. But we have a lot of bots that were not on the list. I don't know how it works, but they showed up in the import-log profile and not in the JavaScript profile.

Regarding excluding some specific paths (index.php?minimize_js, img_thumb.php, etc.): there are a gazillion "popular" paths that could be excluded, but I don't think it's a good idea to include those by default, for several reasons:

  • it would be a cumbersome list to maintain, and people could argue what paths deserve to be included or not, depending on how popular the script is

I agree with you, we identified 2 of them (thumbs and minimizers) and we have very universal code for it - example (if picture and &w and &h) those identify 3 most popular thumb scripts (including those in wordpress and oscommerce).

We did it because on oscommerce shop we had 1000 more page views than on javascript - should we accept that?

  • there would be false positives (what if I have a legitimate img_thumb.php that should be included in page views?)

Does that happen with JavaScript tracking? Not in our tests.

  • most importantly, such a list would be quite large, and that would really slow down the importing process (as each hit would have to be compared with all excluded paths).

We have only 2 more "if statements" on current FOR loops. Still you're right, that can grow :)

So it's not something that we should do by default. We have --exclude-path and --exclude-path-from options that allow you to create your own list of paths to exclude, depending on your site.

What we may do is create such a list in Piwik (in an external file), but not enable it by default. People that want to use this could add --exclude-path-from=common_excluded_paths.txt (for instance). What do you think of this, matt?

That could be a good idea, would be nice to test this on larger number of websites/scripts, we've tested 5 regular websites and few other scripts.

cbay commented 12 years ago

Ajax requests do not use POST all the time at all. For instance, jQuery (the most popular Javascript library) uses GET by default: http://api.jquery.com/jQuery.ajax/

Regarding the rest of the comments: just to make things clear, I wasn't advocating against what you did for your specific site, but against doing this by default in the script. I very much prefer to add options to the import script (it has quite a few already) to allow users to customize it for their own needs rather than try to have sane defaults, which we really can't do as there's too much diversity on the Web :)

anonymous-matomo-user commented 12 years ago

Cyril:

About Ajax: that is why we set a limit of 100 page views per visitor. We found a case where one user generated 700 to 1000 views through Ajax GET requests.
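The two rules described in this exchange (keep only GET requests, cap each visitor at 100 page views) could look roughly like the sketch below. It is not the code fjohn used: the dict-based hit format, the function name, and identifying a visitor by IP plus user agent are all assumptions for illustration.

```python
from collections import Counter


def filter_hits(hits, max_actions_per_visitor=100):
    """Keep GET hits only, and at most N hits per (ip, user_agent) pair."""
    seen = Counter()
    kept = []
    for hit in hits:
        if hit["method"] != "GET":
            continue  # drop POST, HEAD, etc.
        visitor = (hit["ip"], hit["user_agent"])  # assumed visitor identity
        if seen[visitor] >= max_actions_per_visitor:
            continue  # visitor already reached the cap
        seen[visitor] += 1
        kept.append(hit)
    return kept


# One visitor issuing 150 Ajax GET requests plus a single POST.
hits = [{"method": "GET", "ip": "192.0.2.1", "user_agent": "UA", "path": "/ajax"}
        for _ in range(150)]
hits.append({"method": "POST", "ip": "192.0.2.1", "user_agent": "UA", "path": "/form"})
kept = filter_hits(hits)
```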

About whole thing. Sure, I understand that. But we wanted to use it for hosting company, and we are not making any "special case" we are trying to test log import on as many websites as we can.

So we just wanted to share some of our tests and ideas. In most cases everything works good, but wordpress or oscommerce are very popular.

Showing customers 30k views instead of 300 is not the best way to prove that log import is working fine. On an IPB forum we had 5 times more pageviews; now it's less than twice the JS figure.

mattab commented 12 years ago

@oliverhumpage and to all listening in this ticket: is there any other pending bug or important missing feature in this script?

Are you all happy with it? Note: we are working on performance next.

anonymous-matomo-user commented 12 years ago

My Apache log gives hostnames rather than IP addresses. It looks like the import script sends the hostname, which the server side tries to interpret as a numeric IP value, with the result that all hostnames translate to 0.0.0.0. I added a call to socket.gethostbyname() in the import script, but it's undone all the performance gains I got through the bulk request patch.

Is there some simple fix that I'm missing here?
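One possible mitigation, offered as a sketch rather than a confirmed fix: cache the lookups, so that each distinct hostname is resolved only once per run instead of once per log line. The function name `resolve_host` and the cache size are assumptions. Note also that a forward lookup of a reverse-resolved hostname is not guaranteed to return the visitor's original IP address.

```python
import socket
from functools import lru_cache


@lru_cache(maxsize=100_000)
def resolve_host(hostname):
    """Resolve a hostname to an IPv4 address, memoising the result.

    Returns None when the name cannot be resolved, so the caller can
    decide whether to skip the hit or fall back to a placeholder.
    """
    try:
        return socket.gethostbyname(hostname)
    except socket.gaierror:
        return None
```

Repeat visitors then cost a dictionary lookup rather than a DNS query; how much this helps depends on how many distinct hostnames the log contains.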