log import: avoid importing duplicates

sebalis commented 6 years ago

If the import script is called on the same logfile more than once, entries imported in the first run and still present in the file during the second run are imported twice (at least when I tested it on Apache logs). This creates a few problems: running the script has to be tied to the web server rotating the file, and new entries can not be seen in Matomo until the log has been rotated and the importer run again.

It would be much better if it were possible to run the importer frequently, maybe even every minute. In order to do this, the importer needs to be able to know which entries have already been imported. I don’t know how to implement this in detail. I suppose it should be reasonable to expect that the log entries have time data with one-second granularity. So by remembering the time of the latest entry already imported and perhaps adding some logic for re-identifying entries from that last second, it would become feasible to run the importer as often as one would like without messing up the data. What do you think? For me this might make the difference between choosing Matomo or some other log analysis software.

fdellwing commented 6 years ago

There is no other open source analystic software with this many features, but I'm with you non the less.

Personally I'm importing access.log.1 every day at 0:00 and log rotating is happening at 5:00. So I have 2 day old logs, but no duplicates,

sebalis commented 6 years ago

Matomo looks very good, I would like to use it – although for privacy reasons I will restrict myself to the log importer, which reduces the difference to other products. Also this makes it all the more important to get as much accuracy and timeliness out of the importer as possible.

fdellwing commented 6 years ago

To respect privacy you should imho use the JS tracker because he respects DNT and other tracking blockers while log import will not do that.

sebalis commented 6 years ago

Using JS trackers is out of the question – I do see your point about DNT but let’s not even begin to discuss that. And with my concerns I do of course anonymise my logs (by zeroing the final two bytes of the IP).

mackuba commented 6 years ago

I see that the import_logs script has an --exclude-older-than option (added in December) - would that work, with some kind of "last import" flag that's kept in a file and updated whenever the log is parsed, and then passed to that option? Anyway, I'm planning to set it up this way myself :)

sebalis commented 5 years ago

Sorry for the late response. I havent’t tested it as my workaround was to set up a job to import the ‘.1’ log file (the first ‘rotated’ one). This was possible since the rotation at this site takes place at regular intervals. But it does seem that this option would work. I might be interested in using it for another case where the logs do not rotate so regularly but don’t know when I will get round to it. Feel free to close the issue.

sebalis commented 5 years ago

One minor quibble: --exclude-older-than t₁ would appear to restrict the import to records with a time t ≥ t₁. If I have imported a logfile I know the time t₀ of the latest record I have imported, so it would be convenient to restrict to t > t₀. It seems like I will have to calculate t₁ = t₀ + 1s in order to use this option. Something like --only-newer-than would be slightly better.

mackuba commented 5 years ago

I've actually implemented a PR doing something like this in the meantime: https://github.com/matomo-org/matomo-log-analytics/pull/232

And yes, I tried to use --exclude-older-than first until I noticed I need to do > and not ≥ 😉

My first approach was to find the last timestamp using a separate script in Ruby, but then I realized that the import_logs.py is already going through all lines and parsing dates and stuff from them, so it makes more sense if it finds the timestamp during the import.

tsteur commented 4 years ago

This is a duplicate of https://github.com/matomo-org/matomo-log-analytics/issues/144 I think.

sandrocantagallo commented 11 months ago

There is a parameter in the script:

--exclude-older-than

So if you run crontab once at day you can put in param the yesterday date and hour to exclude data in log older (for example becouse log restart every 7 day).

For print yesterday date on linux: date -d "yesterday" +"%Y-%m-%d %H:%M:%S %z"

matomo-org / matomo-log-analytics

log import: avoid importing duplicates #264