matomo-org / matomo

Empowering People Ethically with the leading open source alternative to Google Analytics that gives you full control over your data. Matomo lets you easily collect data from websites & apps and visualise this data and extract insights. Privacy is built-in. Liberating Web Analytics. Star us on Github? +1. And we love Pull Requests!
https://matomo.org/
GNU General Public License v3.0

Log analytics list of improvements #3163

Closed mattab closed 9 years ago

mattab commented 12 years ago

In Piwik 1.8 we released a great new feature for importing access logs and generating statistics.

The V1 release works very well (it was tracked in #703), but there are ideas to improve it. This ticket is a placeholder for all ideas and discussions related to the Log Analytics feature!

New features

PERFORMANCE

How to debug performance? First of all, you can run the script with --dry-run to see how many log lines per second are parsed. Typically this should be between 2,000 and 5,000. Without --dry-run, the script inserts new pageviews and visits by calling the Piwik API.
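As a rough illustration of what a lines-per-second figure measures, here is a minimal sketch that times a parser over a list of log lines. The regex is a simplified stand-in for the common access log format and `measure_parse_rate` is a hypothetical helper; neither is taken from import_logs.py, so the numbers it prints are not comparable with the script's own.

```python
import re
import time

# Simplified stand-in for the "common" access log format (an assumption,
# not the regex used by import_logs.py).
COMMON_LOG = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<date>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) (?P<length>\S+)'
)


def measure_parse_rate(lines):
    """Return (parsed, invalid, lines_per_second) for an iterable of log lines."""
    parsed = invalid = 0
    start = time.perf_counter()
    for line in lines:
        if COMMON_LOG.match(line):
            parsed += 1
        else:
            invalid += 1
    elapsed = time.perf_counter() - start
    total = parsed + invalid
    # Guard against a zero elapsed time on very small inputs.
    rate = total / elapsed if elapsed > 0 else float("inf")
    return parsed, invalid, rate


if __name__ == "__main__":
    sample = '127.0.0.1 - - [12/Mar/2014:22:41:23 +0100] "GET / HTTP/1.1" 200 3593'
    parsed, invalid, rate = measure_parse_rate([sample] * 10000 + ["garbage"])
    print(f"{parsed} parsed, {invalid} invalid, {rate:.0f} lines/sec")
```

Comparing a parse-only rate like this with the rate of a real import gives a feel for how much time goes into parsing versus sending hits to the server.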

Other tickets

mattab commented 10 years ago

There was a patch submitted to keep track of imported files

anonymous-matomo-user commented 10 years ago

Hi,

My box won't properly process log entries passed to stdin of import_logs.py. When I read the exact same entries from a file, everything works great. I am using nginx_json formatted entries. I have tried in dry-run mode and in normal mode; each time I read from stdin I get the following output (nothing imported). Can anyone get this setup to work via stdin?

Thank you for your help!

Test data:

{"ip": "41.11.12.41","host": "www.mywebsite.com","path": "/","status": "200","referrer": "http://www.mywebsite.com/previous","user_agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/32.0.1700.107 Chrome/32.0.1700.107 Safari/537.36","length": 3593,"generation_time_milli": 0.275,"date": "2014-03-12T22:41:23+01:00"}

Python script parameters: --url=http://piwik.mywebsite.com --idsite=1 --recorders=1 --enable-http-errors --enable-reverse-dns --enable-bots --log-format-name=nginx_json

--output
2014-03-12 23:29:37,251: [DEBUG] Accepted hostnames: all
2014-03-12 23:29:37,252: [DEBUG] Piwik URL is: http://piwik.mywebsite.com
2014-03-12 23:29:37,252: [DEBUG] No token-auth specified
2014-03-12 23:29:37,252: [DEBUG] No credentials specified, reading them from "the config file"
2014-03-12 23:29:37,374: [DEBUG] Authentication token token_auth is: a really beautiful token :)
2014-03-12 23:29:37,375: [DEBUG] Resolver: static
0 lines parsed, 0 lines recorded, 0 records/sec (avg), 0 records/sec (current)
2014-03-12 23:29:37,532: [DEBUG] Launched recorder
Parsing log (stdin)...
0 lines parsed, 0 lines recorded, 0 records/sec (avg), 0 records/sec (current)
0 lines parsed, 0 lines recorded, 0 records/sec (avg), 0 records/sec (current)

Logs import summary

0 requests imported successfully
0 requests were downloads
0 requests ignored:
    0 invalid log lines
    0 requests done by bots, search engines, ...
    0 HTTP errors
    0 HTTP redirects
    0 requests to static resources (css, js, ...)
    0 requests did not match any known site
    0 requests did not match any requested hostname

Website import summary

0 requests imported to 1 sites
    1 sites already existed
    0 sites were created:

0 distinct hostnames did not match any existing site:

Performance summary

Total time: 10 seconds
Requests imported per second: 0.0 requests per second

oliverhumpage commented 10 years ago

Jadeham,

Try setting --recorder-max-payload-size=1. I remember having issues myself when testing with very small data sets (e.g. just 1 line).
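For context, the option name suggests that parsed hits are grouped into payloads before being sent. Below is a minimal sketch of that kind of batching, assuming nothing about how import_logs.py actually implements it; `batches` is a hypothetical helper written only for illustration.

```python
from itertools import islice


def batches(hits, max_payload_size):
    """Yield lists of at most max_payload_size hits from any iterable."""
    if max_payload_size < 1:
        raise ValueError("max_payload_size must be at least 1")
    iterator = iter(hits)
    while True:
        batch = list(islice(iterator, max_payload_size))
        if not batch:
            return
        yield batch


if __name__ == "__main__":
    hits = [f"hit-{n}" for n in range(5)]
    for size in (1, 2, 200):
        # Print the batch size followed by the length of each batch produced.
        print(size, [len(batch) for batch in batches(hits, size)])
```

In this sketch a size of 1 hands over every hit as soon as it is read, while a large size holds hits back until the batch is full or the input ends.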

estemendoza commented 10 years ago

I have a similar problem to Jadeham.

I have configured nginx to log in JSON format and created the following script, which reads from access.log (in JSON format) and passes every line to stdin:

import sh
from sh import tail

run = sh.Command("/usr/bin/python")
run = run.bake("/var/www/piwik/misc/log-analytics/import_logs.py")
run = run.bake("--output=/home/XXX/piwik_live_importer/piwik.log")
run = run.bake("--url=http://X.X.X.X:8081/piwik/")
run = run.bake("--idsite=1")
run = run.bake("--recorders=1")
run = run.bake("--recorder-max-payload-size=1")
run = run.bake("--enable-http-errors")
run = run.bake("--enable-http-redirects")
run = run.bake("--enable-static")
run = run.bake("--enable-bots")
run = run.bake("--log-format-name=nginx_json")
run = run.bake("-")

for line in tail("-f", "/var/log/nginx/access_json.log", _iter=True):
    run(_in=line)

The problem that I'm having is that it seems that every record is saved, but when I go to the main panel, today's history is not shown. This is the output when saving every line:

Parsing log (stdin)...
Purging Piwik archives for dates: 2014-04-16
To re-process these reports with your new update data, execute the piwik/misc/cron/archive.php script, or see: http://piwik.org/setup-auto-archiving/ for more info.

Logs import summary
-------------------

    1 requests imported successfully
    2 requests were downloads
    0 requests ignored:
        0 invalid log lines
        0 requests done by bots, search engines, ...
        0 HTTP errors
        0 HTTP redirects
        0 requests to static resources (css, js, ...)
        0 requests did not match any known site
        0 requests did not match any requested hostname

Website import summary
----------------------

    1 requests imported to 1 sites
        1 sites already existed
        0 sites were created:

    0 distinct hostnames did not match any existing site:

Performance summary
-------------------

    Total time: 0 seconds
    Requests imported per second: 44.04 requests per second

0 lines parsed, 0 lines recorded, 0 records/sec (avg), 0 records/sec (current)

Besides that, when running archive.php, it's slower than parsing the default nginx log format, and a lot of lines are marked as invalid:

Logs import summary
-------------------

    94299 requests imported successfully
    145340 requests were downloads
    84140 requests ignored:
        84140 invalid log lines
        0 requests done by bots, search engines, ...
        0 HTTP errors
        0 HTTP redirects
        0 requests to static resources (css, js, ...)
        0 requests did not match any known site
        0 requests did not match any requested hostname

Website import summary
----------------------

    94299 requests imported to 1 sites
        1 sites already existed
        0 sites were created:

    0 distinct hostnames did not match any existing site:

Performance summary
-------------------

    Total time: 1147 seconds
    Requests imported per second: 82.21 requests per second

Is there any way to know why these records are not shown, and which records are being marked as invalid?

estemendoza commented 10 years ago

Ok, I figured out why the requests were invalid. It was because the user_agent had a strange character. So maybe the script should be aware of Unicode characters.
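For anyone hunting for such lines, here is a minimal sketch that scans a JSON-formatted log and reports lines that either fail to parse as JSON or contain non-ASCII characters. `find_suspect_lines` is a hypothetical helper, and it only guesses at which lines might be rejected; it does not reproduce the validation done by import_logs.py.

```python
import json
import sys


def find_suspect_lines(lines):
    """Return (line_number, reason) pairs for lines that may be rejected."""
    suspects = []
    for number, line in enumerate(lines, start=1):
        text = line.strip()
        if not text:
            continue  # ignore blank lines
        try:
            json.loads(text)
        except ValueError as error:  # json.JSONDecodeError subclasses ValueError
            suspects.append((number, f"invalid JSON: {error}"))
            continue
        if not text.isascii():
            suspects.append((number, "contains non-ASCII characters"))
    return suspects


if __name__ == "__main__":
    # errors="replace" turns undecodable bytes into U+FFFD, which is then
    # reported as non-ASCII instead of aborting the scan.
    with open(sys.argv[1], encoding="utf-8", errors="replace") as log_file:
        for number, reason in find_suspect_lines(log_file):
            print(f"line {number}: {reason}")
```

Run it with the path to the log file as the only argument; a single reported line is also a convenient thing to attach to a bug report.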

mattab commented 10 years ago

To see the data in the dashboard, execute the piwik/misc/cron/archive.php script, or see: http://piwik.org/setup-auto-archiving/ for more info.

Ok, I figured out why the requests were invalid. It was because the user_agent had a strange character. So maybe the script should be aware of Unicode characters.

Sure, please create a new ticket for this bug and attach a log file with 1 line that showcases the bug. Thanks

anonymous-matomo-user commented 10 years ago

Replying to Hexxer:

Hi,

Do you know the exact line that causes a problem? If you put only this line, does it also fail directly? Thanks!

No, that's my problem. It stops (see above) with the hint to restart with "--skip=326". But I don't know what it means. Line 326 in the access log looks like all the others.

Replying to matt:

I suppose we can do some basic test to see which value works best? Maybe 50 or 100 tracking requests at once? :)

Do you mean me? I can't test during the day because I'm sitting behind a proxy at work. I can do something in the evening, but, sorry, I have a 5-month-old young lady who needs my love and attention :-)

Wow. 23 months have passed, and still no solution to this problem???

I'm getting the same error, and there's no docco anywhere to tell me how to fix it:

The url is correct (I copy and paste it into my browser, and it gives me the Piwik login screen), and the apache error logs show nothing from today. Here's my console output:

$ ./import_logs.py --url=https://www.mysite.com/pathto/piwik/ /var/log/apache/access.log --debug
2014-04-28 00:10:29,205: [DEBUG] Accepted hostnames: all
2014-04-28 00:10:29,205: [DEBUG] Piwik URL is: http://www.mysite.com/piwik/
2014-04-28 00:10:29,205: [DEBUG] No token-auth specified
2014-04-28 00:10:29,205: [DEBUG] No credentials specified, reading them from ".../config/config.ini.php"
2014-04-28 00:10:29,347: [DEBUG] Authentication token token_auth is: REDACTED
2014-04-28 00:10:29,347: [DEBUG] Resolver: dynamic
0 lines parsed, 0 lines recorded, 0 records/sec (avg), 0 records/sec (current)
2014-04-28 00:10:29,349: [DEBUG] Launched recorder
Parsing log [...]/log/apache/access.log...
2014-04-28 00:10:29,350: [DEBUG] Detecting the log format
2014-04-28 00:10:29,350: [DEBUG] Check format icecast2
2014-04-28 00:10:29,350: [DEBUG] Format icecast2 does not match
2014-04-28 00:10:29,350: [DEBUG] Check format iis
2014-04-28 00:10:29,350: [DEBUG] Format iis does not match
2014-04-28 00:10:29,351: [DEBUG] Check format common
2014-04-28 00:10:29,351: [DEBUG] Format common does not match
2014-04-28 00:10:29,351: [DEBUG] Check format common_vhost
2014-04-28 00:10:29,351: [DEBUG] Format common_vhost matches
2014-04-28 00:10:29,351: [DEBUG] Check format nginx_json
2014-04-28 00:10:29,351: [DEBUG] Format nginx_json does not match
2014-04-28 00:10:29,351: [DEBUG] Check format s3
2014-04-28 00:10:29,352: [DEBUG] Format s3 does not match
2014-04-28 00:10:29,352: [DEBUG] Check format ncsa_extended
2014-04-28 00:10:29,352: [DEBUG] Format ncsa_extended does not match
2014-04-28 00:10:29,352: [DEBUG] Check format common_complete
2014-04-28 00:10:29,352: [DEBUG] Format common_complete does not match
2014-04-28 00:10:29,352: [DEBUG] Format common_vhost is the best match
2014-04-28 00:10:29,424: [DEBUG] Site ID for hostname www.mysite.com not in cache
2014-04-28 00:10:29,563: [DEBUG] Error when connecting to Piwik: HTTP Error 403: Forbidden
2504 lines parsed, 0 lines recorded, 0 records/sec (avg), 0 records/sec (current)
2504 lines parsed, 0 lines recorded, 0 records/sec (avg), 0 records/sec (current)
2014-04-28 00:10:31,612: [DEBUG] Error when connecting to Piwik: HTTP Error 403: Forbidden
2504 lines parsed, 0 lines recorded, 0 records/sec (avg), 0 records/sec (current)
2504 lines parsed, 0 lines recorded, 0 records/sec (avg), 0 records/sec (current)
2014-04-28 00:10:33,657: [DEBUG] Error when connecting to Piwik: HTTP Error 403: Forbidden
Fatal error: Forbidden
You can restart the import of "[...]/var/log/apache/access.log" from the point it failed by specifying --skip=5 on the command line.

And of course, trying with --skip=5 produces the same error.

I have googled, I have searched the archives, the bug tracker contains no clue. Would really appreciate some kind soul taking mercy on me here.

mattab commented 10 years ago

Piwik: HTTP Error 403: Forbidden

Please check your webserver error logs, there should be an error 403 logged in there that will maybe tell you why the Piwik API is failing to return data (maybe a server misconfiguration?).
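If the webserver logs are hard to reach, one way to see what the server sends back with the 403 is to request a URL from a small script and print the status and the start of the response body. This is a minimal sketch using only the Python standard library; `probe` is a hypothetical helper, and which URL to pass (for example an API URL under your Piwik installation) is left to you.

```python
import sys
import urllib.error
import urllib.request


def probe(url, timeout=10):
    """Fetch url and return (status_code, first 500 characters of the body)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            status = response.status
            body = response.read()
    except urllib.error.HTTPError as error:
        # 4xx/5xx responses raise HTTPError; the error object still carries
        # the status code and the response body.
        status = error.code
        body = error.read()
    return status, body.decode("utf-8", errors="replace")[:500]


if __name__ == "__main__":
    status, body = probe(sys.argv[1])
    print(f"HTTP {status}")
    print(body)
```

Connection-level failures (DNS, refused connection, timeout) are not caught here and will surface as exceptions. A 403 body sometimes names the module or rule that blocked the request, which is useful evidence when talking to a hosting provider.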

anonymous-matomo-user commented 10 years ago

Replying to matt:

Piwik: HTTP Error 403: Forbidden

Please check your webserver error logs, there should be an error 403 logged in there that will maybe tell you why the Piwik API is failing to return data (maybe a server misconfiguration?).

Apache error log shows only a restart once every hour. I am unable to configure Apache directly, as I am running Piwik on Gandi.net's "Simple Hosting" service. I have repeatedly begged gandi support to look into this matter, but their attitude is (and not unreasonably) that their job is not to support user installation issues like this. If you can give me ammunition that shows it really is Gandi's fault, then maybe we can move forward here.

Or maybe it's just a Piwik bug. Or I'm doing something wrong. I don't know.

f

mattab commented 10 years ago

@foobard I suggest you create a new ticket for your particular issue, and we will try to help you troubleshoot it (we may need access to the server to reproduce and investigate). Cheers!

mattab commented 10 years ago

Please do not comment on this ticket anymore. Instead, create a new ticket and assign it to the component "Log Analytics (import_logs.py)".

Here is the list of all tickets related to Log Analytics improvements: http://dev.piwik.org/trac/query?status=!closed&component=Log+Analytics+(import_logs.py)

mattab commented 9 years ago

Issue was moved to the new repository for Piwik Log Analytics: https://github.com/piwik/piwik-log-analytics/issues

refs #7163