Closed mattab closed 9 years ago
Attachment: Document the debian vhost_combined format vhost_combined.patch
Attachment: Force hostname patch force_hostname.patch
Attachment: README.apache_log_recorders.patch
Attachment: WinZip compressed file u_ex120813.zip
Attachment: Sample IIS file for testing variations of c-ip field test_c-ip_iis_log.log
Attachment: Log Parser README Update with Nginx Log Format for Common Complete README_nginx_log_format.diff
Attachment: Log for WMS 9.0 WMS_20130523.log
Could I just re-ask an unanswered question from ticket #703? If, instead of specifying a file, you do
cat /path/to/log | log_import.py [options] -
then does it work for you, or do you just get 0 lines imported? Because with the latest version I'm getting 0 lines imported, and that means I can't log straight from apache (and hence the README is wrong too).
oliverhumpage: I couldn't reproduce this issue. Do you get it with --dry-run too? Could you send a minimal log file?
Counting Downloads:
In a podcast project I want to count only the downloads of file types "mp3" and "ogg". In another project it would be nice to count only the PDF downloads.
Another topic in this area is how downloads are counted. Not every occurrence of the file in the logs is a download. For instance, I am using an HTML5 player. Users might hear one part of the podcast on their first visit and other parts on succeeding visits. All together that would be one download.
A possible "solution" (or maybe a workaround): sum up all the "bytes transferred" and divide the total by the largest "bytes transferred" for a certain file.
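The workaround above could be sketched roughly as follows. This is a minimal sketch, assuming the log has already been parsed into (path, bytes transferred) pairs; the function name estimate_downloads and the extension filter are hypothetical, and treating the largest single transfer as the full file size is only an approximation.

```python
from collections import defaultdict


def estimate_downloads(hits, extensions=(".mp3", ".ogg")):
    """Estimate downloads per file as total bytes / largest single transfer.

    `hits` is an iterable of (path, bytes_transferred) pairs (assumed input).
    """
    total = defaultdict(int)
    largest = defaultdict(int)
    for path, sent in hits:
        # Only count the file types we care about.
        if not path.lower().endswith(extensions):
            continue
        total[path] += sent
        largest[path] = max(largest[path], sent)
    # Skip files whose transfers were all empty to avoid dividing by zero.
    return {p: total[p] / largest[p] for p in total if largest[p] > 0}
```

Partial transfers from a streaming player then add up to fractions of one download instead of each being counted as a full one.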
Feature request: support Icecast logs. Currently we use AWStats, but it would be great if we could move to Piwik.
@Cyril
Having spent some time looking into it, and working out exactly which revision caused the problem, I think it's down to the regex I used in --log-format-regex not working any more. Turns out the regex format in import_logs.py has had the group <timezone> added to it, which seems to be required by code further down the script.
Could you update the readme so the middle of the regex changes from:
\[(?P<date>.*?)\]
to
This will then make it all work.
Thanks,
Oliver.
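To illustrate what a separate timezone group means in practice, here is a minimal sketch. The pattern below is a simplified, hypothetical one written for this example; it is not the regex from the README or from import_logs.py, and only the group names date and timezone are taken from the discussion above.

```python
import re

# Hypothetical, simplified pattern: the bracketed timestamp is split into
# a "date" group and a "timezone" group instead of a single "date" group.
LOG_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ '
    r'\[(?P<date>.*?) (?P<timezone>.*?)\] '
    r'"\S+ (?P<path>.*?) \S+" (?P<status>\d+) (?P<length>\S+)'
)


def parse_line(line):
    """Return the named groups of a log line as a dict, or None if no match."""
    match = LOG_RE.match(line)
    return match.groupdict() if match else None
```

A custom regex that only defines a date group would leave timezone undefined, which is consistent with the failure described above.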
(In [6471]) Refs #3163 Fixed regexp in README.
Oliver: indeed, I've just fixed it, thanks.
I've been fiddling with this tool and it looks really nice. The biggest issue I've found is with --add-sites-new-hosts: in my case (using a control panel) it's quite difficult to add the required %v:%p fields to the custom log format. What I do have is a log for every domain, so being able to specify the hostname manually would do the trick for me.
In the current situation launching this:
python /var/www/piwik/misc/log-analytics/import_logs.py \
  --url=https://server.example.com/tools/piwik --recorders=4 --enable-http-errors \
  --enable-http-redirects --enable-static --enable-bots --add-sites-new-hosts /var/log/apache2/example.com-combined.log
Just produces this:
Fatal error: the selected log format doesn't include the hostname:
you must specify the Piwik site ID with the --idsite argument
Having a --hostname example.com option (the same as the filename in my case) that fixes the hostname (much like --idsite-fallback= does for the site ID) would solve my issue.
I'm not a piwik dev, but what I think you're trying to do is:
For every logfile, get its filename (which is also the hostname), check if a site with that hostname exists in piwik: if it does exist, import the logfile to it; if it doesn't exist, create it, then import the logfile to it.
The way I'd do this is to write an import script which:
http://piwik.org/docs/analytics-api/reference/ lists the various API calls; it looks like SitesManager.getAllSites and SitesManager.addSite will do the job (e.g. call http://your.piwik.install/?module=API&method=SitesManager.getAllSites&format=xml&token_auth=xxxxxxxxxx to get all current sites, etc).
HTH (a real piwik person might have a better idea)
Oliver.
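A minimal sketch of building such API URLs in Python follows. The helper name api_url is hypothetical, and the siteName and urls parameter names used for SitesManager.addSite in the usage note are an assumption that should be checked against the API reference linked above. The sketch only builds the URL and does not send any request.

```python
from urllib.parse import urlencode


def api_url(base_url, method, token_auth, **params):
    """Build a Piwik API URL in the same shape as the example above."""
    query = {
        "module": "API",
        "method": method,
        "format": "xml",
        "token_auth": token_auth,
    }
    # Extra method-specific parameters (assumed names) are appended as-is.
    query.update(params)
    return base_url.rstrip("/") + "/?" + urlencode(query)
```

For example, api_url("http://your.piwik.install", "SitesManager.addSite", "xxxxxxxxxx", siteName="example.com", urls="http://example.com") would produce the URL for creating a site whose name is taken from the log filename.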
Thanks for your answer, Oliver. Your process is perfectly fine, but I'd rather avoid having to code something when extending the functionality of --add-sites-new-hosts just a little would do. Thanks for the links too, I'll have a look.
It would be nice to document the standard format provided (at the moment only debian/ubuntu) that would give piwik the hostname that is required.
The format is this:
LogFormat "%v:%p %h %l %u %t \"%r\" %>s %O \"%{Referer}i\" \"%{User-Agent}i\"" vhost_combined
You can see the latest version from debian's apache2.conf [http://anonscm.debian.org/gitweb/?p=pkg-apache/apache2.git;a=blob;f=debian/config-dir/apache2.conf;h=50545671cbaeb1f170d5f3f1acd20ad3978f36ea;hb=HEAD]
See attached a small change to the README file.
After looking at the code I created a patch to add a new option called --force-hostname that expects a string with the hostname. If it is set, the value of host will ALWAYS be the one passed to --force-hostname. This makes it possible to handle log files in ncsa_extended or common format as if they were complete formats (creating idsites when needed and so on).
(In [6474]) Refs #3163 Added the --log-hostname option, thanks to aseques.
(In [6475]) Refs #3163 Added a reference in README to the Debian/Ubuntu default vhost_combined, thanks aseques.
Thanks aseques, both your feature request and your patch were fine, and I've just committed the patch. Note: I renamed the option to --log-hostname to keep it consistent with the --log prefix.
Great, this will be so useful for me :)
Hi,
I'm not sure whether this is the right place for a ticket or a problem. I have a problem importing access logs from my shared webspace. I'm copying the text from here: http://forum.piwik.org/read.php?2,90313
Hi,
I'm on a shared webspace with SSH support. I'm trying your import script to analyse my Apache logs. I got it to work, but sometimes there are "Fatal errors" and I have no idea why. If I restart it without "skip", it is the same "skip line" every time.
Example:
4349 lines parsed, 85 lines recorded, 0 records/sec
4349 lines parsed, 85 lines recorded, 0 records/sec
4349 lines parsed, 85 lines recorded, 0 records/sec
4349 lines parsed, 85 lines recorded, 0 records/sec
Fatal error: Forbidden
You can restart the import of "/home/log/access_log_piwik" from the point it failed by specifying --skip=326 on the command line.
I tried to figure out at which line the script ends with that fatal error, but I can't. If I restart it with "skip=327", it runs to the end and all works fine. The same problem occurs with some other access logs ("access_log_1.gz" and so on), but I'm not sure why it ends. Is it a misconfigured line in the access log? Which line should I check?
Regards
Hexxer: you're getting a HTTP Forbidden from your Piwik install when importing the logs, you need to find out why.
How do you know that? It stops every time at the same line, and if I skip that it runs for 10 or 15 minutes without a problem (up to this line it needs 2 minutes or so).
Regards
Do you know the exact line that causes a problem? if you put only this line, does it also fail directly? thanks!
Benaka is implementing bulk tracking in ticket #3134. The Python script will simply have to send a JSON object:
{"requests":[url1,url2,url3],"token_auth":"xyz"}
I suppose we can do some basic test to see which value works best? Maybe 50 or 100 tracking requests at once? :)
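As a rough illustration of trying different batch sizes, here is a minimal sketch that groups tracking requests into JSON payloads with the two keys mentioned above ("requests" and "token_auth"). The function name bulk_payloads and the default batch size are hypothetical; the exact payload the bulk tracking API expects is defined in ticket #3134, not here.

```python
import json


def bulk_payloads(requests, token_auth, batch_size=100):
    """Yield one JSON string per batch of at most `batch_size` requests."""
    for start in range(0, len(requests), batch_size):
        batch = requests[start:start + batch_size]
        yield json.dumps({"requests": batch, "token_auth": token_auth})
```

Changing batch_size to 50 or 100 and timing the import would be one way to run the basic test suggested above.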
Hi,
............. Do you know the exact line that causes a problem? if you put only this line, does it also fail directly? thanks! .............
No, that's my problem. It stops (see above) with the hint to restart with "--skip=326", but I don't know what it means. Line 326 in the access log looks like all the others.
Replying to matt:
I suppose we can do some basic test to see which value works best? Maybe 50 or 100 tracking requests at once? :)
Do you mean me? I can't test during the day because I'm sitting behind a proxy at work. I can do something in the evening, but, sorry, I have a 5-month-old little lady who needs my love and attention :-)
Could I submit a request for an alteration to the README? I've just had a massive spike in traffic, and --recorders=1 just doesn't cut it when piping directly from apache's customlog :) Because each apache process hangs around waiting to log its request before moving onto the next request, it started jamming the server.
Setting a higher --recorders seems to have eased it, and there are no side effects that I can see so far.
Suggested patch attached to this ticket.
Hi,
Is there any documentation about the regex format for import_logs.py?
We would like to import a file with this AWStats LogFormat:
%time2 %other %cluster %other %method %url %query %other %logname %host %other %ua %referer %virtualname %code %other %other %bytesd %other %other
Thanks for your help,
Ludovic
I am trying to set up a daily import of the previous day's log. My issue is that my host date-stamps the log file. How can I set it up to import a log file with yesterday's date in its name?
Here is the format of my log file names: access.log.%Y-%m-%d.log
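One way to get yesterday's file name is to compute it in a small wrapper before calling the import script. This is a minimal sketch using only the Python standard library; the function name yesterdays_log is hypothetical, and the name format is the one given above.

```python
from datetime import date, timedelta

# File name pattern taken from the question above.
LOG_NAME_FORMAT = "access.log.%Y-%m-%d.log"


def yesterdays_log(today=None):
    """Return the log file name carrying the date of the day before `today`."""
    today = today or date.today()
    return (today - timedelta(days=1)).strftime(LOG_NAME_FORMAT)
```

The returned name can then be passed to import_logs.py as the log file argument from a daily cron job.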
Thanks a lot for all your great work! The server log file analytics works great on my server.
I am using a lighttpd server and added the Accept-Language header to accesslog.format:
accesslog.format = "%h %V %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\" \"%{Accept-Language}i\"" (see http://redmine.lighttpd.net/projects/lighttpd/wiki/Docs:ModAccessLog)
I wonder if it would be possible to add support for the Accept-Language header to import_logs.py? So that the country could then be guessed from the Accept-Language header when GeoIP isn't installed.
Replying to Cyril:
(In [6474]) Refs #3163 Added the --log-hostname option, thanks to aseques.
Thanks for making it possible to import logs, and thanks also for the log-hostname patch. I'm not sure whether it is the patch or whether it is caused by using --recorders > 1, but on the first run with --add-sites-new-hosts I got 13 sites created for the same hostname.
I'm having a similar problem to Hexxer. When I do a --dry-run I get no errors, but when adding to Piwik it falls over at about the same spot. It's not one offending log file or line of a log file that's causing it. I'll attach the output with debugging on below. I've run the script multiple times, removing the line where the script fell over, removing the log file where it fell over, etc. It always dies around line 9000-10000 of the 3rd log file.
I'm not sure if this is of interest, but when doing a dry run the script does ~600 lines/sec; when importing to Piwik it does ~16.
The output file is here. Akismet was marking the attachment as spam
(In [6509]) Refs #3163 updated README to suggest increasing the --recorders value.
oliverhumpage: thanks, I've committed your diff.
ludopaquet: no doc yet, I suggest you take a look at the code, taking _COMMON_LOG_FORMAT as example.
lewmat21: I suppose each log line has its own date anyway, so it doesn't matter what the filename is.
sc_: I don't think using Accept-Language to guess the country is a good idea. As the header name says, it's about languages (locales), not countries. First, many languages are spoken in several countries. If the Accept-Language says you accept English, what country would you pick? Second, people can have Accept-Languages that don't match their country. I personally surf using English as Accept-Language, whereas I'm French and live in France.
law: can you reproduce the issue? If so, can you give me the access log as well as the full command line you used?
andrewc: can you edit line 741 and increase the 200 value to something like 10000? It will print the full error message instead of only the first 200 characters, which is not enough to get the Piwik error.
@Cyril: As far as I know Piwik does the same when the GeoIP plugin isn't used: http://piwik.org/faq/troubleshooting/#faq_65 The location is then guessed from en-us, fr-fr etc.
But the more important point is that it would be useful for website development to know what languages people use who visit my website. So it would be great if support for the Accept-Language header could be added.
Sorry for the wrong formatting (the preview didn't work). Here is the correct link:
@Cyril: Here's the output file with the full error messages.
Replying to andrewc:
@Cyril: Here's the output file with the full error messages.
Sorry, this is the link: https://www.dropbox.com/sh/zat1m6lqphndpny/wH6n4mDaD6/output0907.txt
sc_: OK, I didn't know about this. Considering GeoIP will be integrated into Piwik soon (see #1823), which is a much better solution, I don't think we should modify the import script to use Accept-Language headers.
andrewc: your Piwik install (the PHP part) is returning errors: Only one usage of each socket address (protocol/network address/port) is normally permitted
You need to find out why and fix it. It's unrelated to the import script.
Thanks for your great work. We have been using the log import for some time now and have a few ideas/problems.
I don't know whether I should open new tickets or write here.
One major thing is how to bring the number of visitors/unique visitors down to make it more similar to JavaScript tracking and Google Analytics.
I understand that we don't have cookies and other config information to identify visitors.
We've managed to bring the number of pageviews/actions down several-fold (from 5 times to 2 times more than JavaScript tracking), or by much more in a few cases (such as from 100 times more than JavaScript).
Our ideas and changes include (we assumed that we should get numbers as close as in javascript tracking):
few workarounds :)
We ended up with a number of actions (pageviews) about twice that of JavaScript tracking, without influencing the number of visitors (about 50% higher than JavaScript).
Our extreme case is 300 views (JavaScript tracking) versus 30,000 views with the import script; after our changes, it is about 570 views with the import script.
fjohn:
Regarding excluding some specific paths (index.php?minimize_js, img_thumb.php, etc.): there are a gazillion "popular" paths that could be excluded, but I don't think it's a good idea to include those by default, for several reasons:
So it's not something that we should do by default. We have --exclude-path and --exclude-path-from options that allow you to create your own list of paths to exclude, depending on your site.
What we may do is create such a list in Piwik (in an external file), but not enable it by default. People that want to use this could add --exclude-path-from=common_excluded_paths.txt (for instance). What do you think of this, matt?
Replying to Cyril:
fjohn:
- why shouldn't we count POST requests? HEAD, I can agree, but POST are legitimate requests made by regular browsers
But POST is also used by ajax requests all the time (and this is not what we would count with JS). We've just simplified that to drop anything other than GET.
- what kind of user-agent doesn't have OS data? Aren't they bots anyway?
For me the question is: does a "real user" always send OS data? In our logs there were, for example, curl, Python libs, XRumer, scrapers and many more odd requests that weren't on the bot list.
- limiting actions: that's on the PHP-side, I'll let matt answer this
Yep, it is. But we have a lot of bots that were not on the list. I don't know how it works, but they showed up in the log import profile, not in the JavaScript profile.
Regarding excluding some specific paths (index.php?minimize_js, img_thumb.php, etc.): there are a gazillion "popular" paths that could be excluded, but I don't think it's a good idea to include those by default, for several reasons:
- it would be a cumbersome list to maintain, and people could argue what paths deserve to be included or not, depending on how popular the script is
I agree with you. We identified 2 of them (thumbs and minimizers) and we have very universal code for it. For example, (if picture and &w and &h) identifies the 3 most popular thumb scripts (including those in WordPress and osCommerce).
We did it because on an osCommerce shop we had 1000 more page views than with JavaScript tracking. Should we accept that?
- there would be false positives (what if I have a legitimate img_thumb.php that should be included in page views?)
Is it counted with JavaScript tracking? From our tests, it isn't.
- most importantly, such a list would be quite large, and that would really slow down the importing process (as each hit would have to be compared with all excluded paths).
We have only 2 more "if statements" on current FOR loops. Still you're right, that can grow :)
So it's not something that we should do by default. We have --exclude-path and --exclude-path-from options that allow you to create your own list of paths to exclude, depending on your site.
What we may do is create such a list in Piwik (in an external file), but not enable it by default. People that want to use this could add --exclude-path-from=common_excluded_paths.txt (for instance). What do you think of this, matt?
That could be a good idea. It would be nice to test this on a larger number of websites/scripts; we've tested 5 regular websites and a few other scripts.
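The thumbnail check mentioned in this reply ("if picture and &w and &h") might look something like the sketch below. This is a guess at the idea, not fjohn's actual code: the function name looks_like_thumbnail, the list of image extensions, and the choice to look for an image name in both the path and the query values are all assumptions.

```python
from urllib.parse import parse_qs, urlsplit

# Assumed set of "picture" extensions for this illustration.
IMAGE_EXTENSIONS = (".jpg", ".jpeg", ".png", ".gif")


def looks_like_thumbnail(url):
    """Return True if the request names an image and carries w and h parameters."""
    parts = urlsplit(url)
    params = parse_qs(parts.query)
    has_size = "w" in params and "h" in params
    # The image may be the requested path itself or a query value (e.g. src=...).
    names = [parts.path] + [v for values in params.values() for v in values]
    has_image = any(name.lower().endswith(IMAGE_EXTENSIONS) for name in names)
    return has_size and has_image
```

A check like this costs two conditions per hit, in contrast to comparing every hit against a long list of excluded paths.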
Ajax requests do not use POST all the time at all. For instance, jQuery (the most popular Javascript library) uses GET by default: http://api.jquery.com/jQuery.ajax/
Regarding the rest of comments: just to make things clear, I wasn't advocating against what you did for your specific site, but against doing this by default in the script. I very much prefer to add options to the import script (it has quite many already) to allow users to customize it for their own needs rather than try to have sane defaults, which we really can't do as there's too much diversity on the Web :)
Cyril:
About Ajax: that is why we set a limit of 100 page views per visitor. We found a case where one user made from 700 to 1000 views thanks to Ajax GET requests.
About the whole thing: sure, I understand that. But we want to use it for a hosting company, and we are not making any "special case"; we are trying to test the log import on as many websites as we can.
So we just wanted to share some of our tests and ideas. In most cases everything works well, but WordPress and osCommerce are very popular.
Showing customers 30k views instead of 300 is not the best way to prove that the log import is working fine. On an IPB forum we had 5 times more pageviews; now it is less than twice the JS figure.
@oliverhumpage and to all listening in this ticket: is there any other pending bug or important missing feature in this script?
Are you all happy with it? Note: we are working on performance next.
My Apache log gives hostnames rather than IP addresses. It looks like the import script sends the hostname, which the server side tries to interpret as a numeric IP value, with the result that all hostnames translate to 0.0.0.0. I added a call to socket.gethostbyname() in the import script, but it's undone all the performance gains I got through the bulk request patch.
Is there some simple fix that I'm missing here?
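One possible mitigation, offered only as a sketch and not as something the import script does, is to cache lookups so that each hostname is resolved at most once per run. functools.lru_cache and socket.gethostbyname are standard library calls; the function name resolve_host and the cache size are hypothetical.

```python
import functools
import socket


@functools.lru_cache(maxsize=65536)
def resolve_host(host):
    """Resolve a hostname to an IPv4 address, caching results (and failures)."""
    try:
        return socket.gethostbyname(host)
    except OSError:
        # Unresolvable host: cache None so the lookup is not retried.
        return None
```

Since log files tend to contain many hits from the same client, repeated hostnames are served from the cache and only the first occurrence pays for a DNS query. Note that gethostbyname only returns IPv4 addresses.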
In Piwik 1.8 we released the great new feature to import access logs and generate statistics.
The V1 release works very well (it was tracked in #703), but there are ideas to improve it. This ticket is a placeholder for all ideas and discussions related to the Log Analytics feature!
New features
Track non-bot activity only. When --enable-bots is not specified, it would be a nice improvement if we:
After that bots & crawlers detection would be much better.
Performance
How to debug performance? First of all, you can run the script with --dry-run to see how many log lines per second are parsed. It should typically be between 2,000 and 5,000. When you don't do a dry run, it will insert new pageviews and visits by calling the Piwik API.
Other tickets
#3867: cannot resume with line number reported by skip for ncsa_extended log format
#4045: autodetection hangs on a weird formatted line