matomo-org / matomo-log-analytics

Import any kind of server log into Matomo for powerful log analytics. Universal log file parsing and reporting.
https://matomo.org/log-analytics/
GNU General Public License v3.0

Log Analytics could detect log lines that were already imported and skip them automatically #144

Open mattab opened 8 years ago

mattab commented 8 years ago

Log Analytics is a powerful tool of the Piwik platform, used by thousands of people in many interesting use cases. It is relatively easy to use and offers many options and features. We like to make our tools as easy as possible to use... this issue is about making Log Analytics easier to use and even more flexible.

Issue: the log data is not deduplicated

When you import logs in Piwik, Piwik will always import and track all the logs. When you import the same log file into Piwik again, it will be imported again by the Tracking API, and the data will end up duplicated in the Piwik database.

Why this is not good enough

Our users rightfully expect Piwik to be easy to use and to do the right thing. Recently @Synchro reported this issue and did not expect Log Analytics to import the data again and again. See the description at: https://github.com/piwik/piwik/issues/10248#issuecomment-232085596

Over the years many users have reported experiencing this issue.

Existing workaround

So far most people manage to use Log Analytics despite this limitation. The common workaround is to create one log file per hour or per day and import each log file only once. Typically, people write a script which makes sure each log file is imported exactly once, for example by ingesting log files into Piwik while/after they are rotated.

Solution

Ideally, we do not want people to worry whether they have imported a given log file, or even whether a log file was partially imported before and is re-imported again. We want Piwik to automatically deduplicate the tracking API data.

So far I see two possible ways to fix this issue:

1. New Piwik Tracking API feature: request ID deduplicator

The Tracking API could introduce a new feature that lets Tracking API users specify a request ID for a given request. Piwik would store the request ID for each request and use it as a unique key. If a tracking request with a given request ID has already been tracked/imported for a given date, the new request would be skipped: each request ID would be imported at most once per day.

The Log Analytics tool would then simply create a request ID for each parsed log line and pass it with the tracking request, letting the Tracking API deduplicate the requests. Log Analytics could create this request ID as a hash of the log line, for example.
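
For illustration, a minimal sketch of how the importer might derive such a request ID from a raw log line (the `rid` parameter name is hypothetical, this is not an existing Tracking API parameter):

```python
import hashlib

def request_id_for(raw_line):
    # Identical log lines always map to the same ID, so re-importing the
    # same file would produce the same request IDs and get deduplicated.
    return hashlib.sha1(raw_line.encode('utf-8')).hexdigest()

# The importer would then attach it to each tracking request, e.g.
# params['rid'] = request_id_for(raw_line)   # 'rid' is a made-up parameter name
```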

2. Deduplication implemented in Log Analytics only

Alternatively, we could implement this feature exclusively in Log Analytics, and make the tool clever enough to send each log line's tracking data to the Piwik Tracking API only once.

The Log Analytics Python app could, for example, keep track of the list of log files that were imported before, as well as the request IDs/hashes of all the log lines that were imported before, indexed by date, maybe in an SQLite database.
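
A rough sketch of what that local bookkeeping could look like, assuming an SQLite file and a per-line hash (none of this exists in import_logs.py today):

```python
import hashlib
import sqlite3

conn = sqlite3.connect('imported_lines.sqlite')   # invented local state file
conn.execute(
    'CREATE TABLE IF NOT EXISTS imported_lines ('
    ' line_hash TEXT PRIMARY KEY,'
    ' import_date TEXT)'
)

def already_imported(raw_line, date):
    line_hash = hashlib.sha1(raw_line.encode('utf-8')).hexdigest()
    try:
        conn.execute(
            'INSERT INTO imported_lines (line_hash, import_date) VALUES (?, ?)',
            (line_hash, date),
        )
        conn.commit()
        return False          # first time this line is seen: import it
    except sqlite3.IntegrityError:
        return True           # hash already recorded: skip the line
```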

This feature would be awesome to have, and would make Log Analytics much more flexible and easier to use and set up.

What do you think?

cweiske commented 7 years ago

Simply keeping track of the last imported log line of a file would suffice. Only records after this log line would need to be imported.

To solve the issue of log file rotation (the last tracked line is not contained in the log file anymore, so nothing is imported at all), the first line of the log file would also need to be tracked. If the first line is not the same as the tracked first line, then the log file has been rotated and the whole file needs to be imported. Alternatively, the creation date of the log file could be tracked; if that changes, it would indicate a rotation too. Not sure if all log rotators create new files though.

Maybe look into how the since command works.
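
A minimal sketch of that idea, assuming a small per-file state file kept by the importer (the file name and format are made up):

```python
import json
import os

STATE_FILE = 'import_state.json'   # hypothetical side file

def read_new_lines(log_path):
    state = {}
    if os.path.exists(STATE_FILE):
        with open(STATE_FILE) as f:
            state = json.load(f)

    with open(log_path) as f:
        lines = f.readlines()

    previous = state.get(log_path, {})
    if lines and previous.get('first_line') == lines[0]:
        # Same file as last time: only the lines added since then are new.
        new_lines = lines[previous.get('count', 0):]
    else:
        # First line changed (or file is new): treat it as rotated, import everything.
        new_lines = lines

    state[log_path] = {'first_line': lines[0] if lines else '', 'count': len(lines)}
    with open(STATE_FILE, 'w') as f:
        json.dump(state, f)
    return new_lines
```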

mattab commented 7 years ago

> Simply keeping track of the last imported log line of a file would suffice. Only records after this log line would need to be imported.

Yes, it would already be very useful to implement this simpler solution...

glatzenarsch commented 7 years ago

Is this solution already implemented in 2.17, or is it just planned for a future version that you are working on? ty

fwolfst commented 7 years ago

@mattab Thanks for the clear issue description.

I am all for an improvement. I recall that in my tests not so long ago, importing the same file twice did not result in a doubled view count, but I might be mistaken. Is there any news on this issue?

In my use case it is important to be able to import log files that left "gaps", due to network or other issues. Say I import weekly via cron and realize that one week in the last month was skipped. That is why @cweiske's approach would not help ME very much.

I think one can line up the use cases and solutions and sort them roughly from "not bad" to "clean" (imho @cweiske's fair proposition is on the "not bad" end).

Obviously the problem is that many different use cases exist, and the cleanest implementation has probably been laid out by @mattab already. For my use case I would chip in another "not bad" approach/workaround: keeping a hash of the entire log file in Log Analytics only. That would already solve my issue. Bonus points for storing the first and last request time with it, so that it would be easy to DELETE all the visits of a particular log file (say, the log format changed/was extended and we want to add more information in hindsight).

In any case, Log Analytics should implement a --force-XYZ flag (XYZ should be a specific name) to override any clever logic that suddenly is not as clever as needed.
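
For what it's worth, a purely illustrative sketch of that whole-file-hash workaround (the bookkeeping file and its fields are invented):

```python
import hashlib
import json
import os

SEEN_FILE = 'imported_files.json'   # invented bookkeeping file

def should_import(log_path):
    with open(log_path, 'rb') as f:
        digest = hashlib.sha256(f.read()).hexdigest()

    seen = {}
    if os.path.exists(SEEN_FILE):
        with open(SEEN_FILE) as f:
            seen = json.load(f)

    if digest in seen:
        return False                      # exact same file content already imported
    seen[digest] = {'path': log_path}     # could also store first/last request time here
    with open(SEEN_FILE, 'w') as f:
        json.dump(seen, f)
    return True
```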

ilmtr commented 7 years ago

This feature is very much needed. In my setup, (semi) real-time logs are appended to a file about every hour. After a while this file is archived to a gzip file. This archiving happens unpredictably, anywhere from every 6 hours to every 3 days.

AlexeyKosov commented 6 years ago

When can we expect this feature implemented?

glatzenarsch commented 6 years ago

I'm not sure if this feature is implemented yet, but I'm interested in when it will be. I am running Matomo 3.3 on circa 50 websites with cron jobs going every hour, importing the same active Apache access log, and it would be nice to know that the results are reliable and not duplicated :)

thank you

mhow2 commented 6 years ago

Hi, let's speak in terms of workarounds, as I can feel this feature is not close to being fulfilled yet. Say one made the bad move of importing the same log file twice: how do you fix the mistake? Can you delete the report for a given period of time? Might core:invalidate-report-data be of any help? Could anyone with good experience of this share it and publish an entry in the FAQ?

Synchro commented 6 years ago

I didn't solve it - I stopped using Matomo. I'm not at all interested in JS-based tools, I only want offline log analysis.

One really good tool that helps with log files is AWStats' logresolvemerge.pl. This utility can take any number of log files (compressed or uncompressed) and merge them together, removing duplicates, sorting lines by timestamp, and performing DNS lookups. It really "just works", and is also pretty fast. Matomo could do with a tool like that, so it could either be ported or simply used as is, despite the language mismatch.
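
Not a replacement for logresolvemerge.pl, but a rough Python sketch of the same merge-dedupe-sort idea for common/combined-format logs (no DNS resolution, no error handling):

```python
import re
import sys
from datetime import datetime

TS_RE = re.compile(r'\[([^\]]+)\]')   # timestamp field of common/combined log format

def ts(line):
    return datetime.strptime(TS_RE.search(line).group(1), '%d/%b/%Y:%H:%M:%S %z')

def merge(paths):
    seen = set()
    merged = []
    for path in paths:
        for line in open(path):
            if line not in seen:        # drop exact duplicate lines
                seen.add(line)
                merged.append(line)
    return sorted(merged, key=ts)       # order by request timestamp

if __name__ == '__main__':
    sys.stdout.writelines(merge(sys.argv[1:]))
```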

I once wrote some utilities in PHP for processing logs, allowing split/merge/filtering by date, including using strtotime, which lets you do nice things like "find all entries from the last 2 days" without having to worry too much about the timestamp format. I'll see if I can find them again.

mhow2 commented 5 years ago

Thanks @Synchro for the reference. However, I don't see how this AWStats Perl script solves the issue for log entries that are already recorded in Matomo's database. @mattab do you have any hints/input about this? Otherwise I'll be forced to track somewhere (outside Matomo) which files have been successfully imported (if I can even assert what a "success" is).

Synchro commented 5 years ago

It doesn't solve it directly, but it can help avoid problems - for example, you can make a real mess of your Matomo database if you import log files in the wrong order, but passing them through that tool first avoids the problem.

mackuba commented 5 years ago

I've added a PR that introduces a new --timestamp-file option, which solves this problem by tracking the last processed timestamp in a chosen file and then ignoring log lines up to that timestamp: https://github.com/matomo-org/matomo-log-analytics/pull/232
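
Roughly, the idea is this (only a sketch of the concept under those assumptions, not the actual code in the PR):

```python
import os

def load_last_timestamp(path):
    # Timestamp (epoch seconds) of the last hit imported in a previous run.
    if os.path.exists(path):
        with open(path) as f:
            return float(f.read().strip() or 0)
    return 0.0

def save_last_timestamp(path, timestamp):
    with open(path, 'w') as f:
        f.write(str(timestamp))

# In the import loop, skip anything at or before the saved timestamp,
# then persist the newest timestamp seen at the end of the run:
#
#   last = load_last_timestamp(timestamp_file)
#   for hit in hits:
#       if hit.timestamp <= last:
#           continue            # already imported in an earlier run
#       track(hit)
#       newest = max(newest, hit.timestamp)
#   save_last_timestamp(timestamp_file, newest)
```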

mddvul22 commented 3 years ago

Having the ability to re-import old logs for one very specific purpose would be very useful. If one is deleting old raw data to save space, it is my understanding that you lose Transitions and unique-visitor metrics for the time periods in which the raw data has been deleted. If we could re-import those old logs, should we need access to older Transitions data, without altering existing report data, it would be very beneficial.

danemacmillan commented 3 years ago

Darn. In upgrading to Matomo 4.2.1 from 3.14.1, I put the site in maintenance mode and disabled tracking during the DB upgrade, which took seven hours. I had nginx's access log turned on, so I collected all the information. I ran the importer for the first time, and out of 600k log entries, only 4k were imported (corresponding to the brief period where the upgrade had completed but I had not yet turned off the access log). Figuring it had something to do with the fact that most rows were recorded as 503, I ran it again on a small sample of five 503s and five 200s to see what would happen. The results led me to believe I was correct, so I checked the available options, saw the --enable-http-errors option, ran it with that, and now all the rows are being inserted.

I assume this means I will have at least 4k duplicated rows for the day?

Erwane commented 2 years ago

I'm pretty sure AWStats already does that, by logging the last analyzed line number in a file and skipping those lines if the same file is re-analyzed. This allows "real-time" analysis by a cron job, without duplication.

This data could be stored per site, with filename and line-number fields, and returned by an API to import_logs.py.
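
As a sketch of that bookkeeping on the importer side (the shape of the state returned by the API is hypothetical):

```python
# State the API could hand back to import_logs.py before a run, e.g.:
#   {"site_id": 1, "filename": "access.log", "last_line": 48213}
def lines_to_import(log_path, state):
    with open(log_path) as f:
        for number, line in enumerate(f, start=1):
            if number > state.get('last_line', 0):
                yield number, line          # only lines not yet analyzed
```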

strager commented 2 years ago

I've been happy with #232 as a solution to this problem. (Thanks @mackuba!)

> I'm pretty sure AWStats already does that

Does Matomo use AWStats? Or are you suggesting that Matomo should use AWStats?

Erwane commented 2 years ago

Neither :) I said AWStats can do that, and Matomo could look at how it does it and try to include something similar.

Synchro commented 2 years ago

Curious to see this old ticket come back to life! I don't think AWStats keeps any additional files; it just relies on its log preprocessor being very fast.

fabianboerner commented 1 month ago

Has this really been open since 2016? This is a big issue; why has it not been picked up?

mackuba commented 1 month ago

FYI, I've been stuck on Matomo 3.x for a long time, but I finally migrated my setup to 4.x and 5.x recently, so I'll try to update my PR #232 at some point soon.

ohkjames commented 2 weeks ago

+1 here. I've been importing huge weekly logs into Matomo for a new client, and somewhere in between, my SSH terminal got closed. It would be very helpful if I could just re-run the entire batch import and Matomo would skip the lines already imported.