logstash-plugins / logstash-filter-date

Apache License 2.0

Default year, month and date when not specified #51

Open ppf2 opened 8 years ago

ppf2 commented 8 years ago

Currently in LS 2.1, the following is the behavior when the date filter encounters a datetime field without the full "date" portion of the datetime.

Example 1:

When no date information is provided, and only the time is available, we default the year to 1970, month to 01 and day to 01.

15:52:54,138 -> 1970-01-01T23:52:54.138Z

Example 2:

When no year information is provided, and only the month and day are available, we default the year to be the current year (or based on the logic here)

10-03 15:52:54,138 -> 2015-10-03T22:52:54.138Z

Example 3:

When no year and month information are provided, and only the day is available, we default the year to be the current year and the month to be 01

03 15:52:54,138 -> 2015-01-03T23:52:54.138Z

It may be nice to let the user define what the year and month default to. The 2nd and 3rd cases above seem to have sensible defaults: year to the current year (or based on the logic in PR 4), and month to 01. But the behavior in the 1st case seems inconsistent, since it defaults to 1970 instead of the current year. Perhaps we can make it configurable, so the end user decides what a datetime with no actual date information defaults to?

ppf2 commented 8 years ago

I am going to tag this as a bug also. Making it configurable is probably a feature request, but defaulting to 1970-01-01 for the use case of having only time (not date) information doesn't seem to make sense.

jordansissel commented 8 years ago

I agree the behaviors are weird. On the other hand, I do not like guessing. I will explain.

Some background: We currently do one kind of guessing to fudge the date in hopes of making it correct, and that's across year boundaries. For example, if Logstash's current time is Dec 31 2015 and it receives a time with no year but "Jan 1" as the month and day, it will assume the year is 2016. The reverse is true, too (if Logstash thinks it is Jan 1 but receives a log with a month and day of Dec 31).
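The year-boundary guess described above can be sketched in Python as follows. This is only an illustration of the rule as stated in this comment, not the plugin's actual code; the function name `guess_year` is hypothetical.

```python
from datetime import datetime


def guess_year(now: datetime, log_month: int) -> int:
    """Guess the year for a log timestamp that has a month but no year."""
    # Clock says December, log says January: assume the log is from next year.
    if now.month == 12 and log_month == 1:
        return now.year + 1
    # Clock says January, log says December: assume the log is from last year.
    if now.month == 1 and log_month == 12:
        return now.year - 1
    # Otherwise assume the current year.
    return now.year
```

Note that the guess depends entirely on the system clock at processing time, which is why it is less meaningful for batch imports of old logs.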

Logstash is used in both real-time and batch import scenarios, and it feels like this "time guessing" is only really meaningful for real-time processing when we also hope that the system clock is set correctly on Logstash and other systems.

The guessing becomes complex, but perhaps if we enable this behavior and make it such that people can disable it, it could help. If we also define exactly what hopeful-corrections are made, and when, it could help clarify the behavior.

As an example of how this can go wrong if we guess wrong -- Logstash believes the current hour to be 23 and receives a log with hour 00 - is the day "today" or "tomorrow" for this log? How large are the tolerances for these boundary conditions? Thoughts?

ppf2 commented 8 years ago

Certainly agree on the guesswork being complicated. It is probably sufficient to simply document our current "guess" logic. I think the guesswork for most of these examples is sensible today, other than the first example where there is no date at all (just time info). For this case, it seems better to default to the current date vs. 1970-01-01, since new log lines that write out only time information are more likely to be from today. Having it default to 1970 means that these log events will likely be missed by user queries, since most users will not be looking for events that old.

jordansissel commented 8 years ago

@ppf2 I kind of agree with you, but I'm almost leaning towards maybe rejecting date filter configurations that are missing too much data (no month, day, year, etc) rather than guessing "now" as the correct time (and still be incorrect in many situations).

This is a funky situation. Maybe the first step is to document that behavior is undefined for time formats that are missing important data. The second step may be to add behavior that rejects time formats without enough precision, and to add a setting that allows Logstash to guess (and document the guessing mechanisms).

suyograo commented 8 years ago

@jordansissel @ppf2 I like giving the user an option to configure the default year if no date component is present in the logs. Something like missing_year_default with values of current or 1970. The default can continue to be the existing behavior (initialize the year to 1970), but this option could initialize it to the current year. This will also preserve backward compatibility.

Thoughts?

jordansissel commented 8 years ago

@suyograo my hesitation in that solution is that we already have had bug fixes to "assume current year" - our current behavior for a missing year is to try and guess what the year was supposed to be -- like if the real time is January but the log says December, we say year is "previous year" (2015).

We can do similar for boundary cases, where if there's a missing day, we assume the day is "today" unless the hour appears to be too far in the future (like if the current hour is 0 and the received hour is 23, we can assume maybe the actual day for this event is "yesterday"). Boundary cases are tricky where the real time clock progresses at a different rate than the clock reported in each log.
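A minimal Python sketch of the missing-day case described above: assume "today", and fall back to "yesterday" when the resulting time is too far ahead of the clock. The function name `guess_day` and the `tolerance` parameter (with its one-hour default) are assumptions for illustration; the thread does not settle how large the tolerance should be.

```python
from datetime import datetime, time, timedelta


def guess_day(now: datetime, log_time: time,
              tolerance: timedelta = timedelta(hours=1)) -> datetime:
    """Attach a date to a time-only log timestamp."""
    # Start by assuming the event happened today.
    candidate = datetime.combine(now.date(), log_time)
    # If that lands further ahead of the clock than the tolerance allows,
    # treat the event as having happened yesterday instead.
    if candidate - now > tolerance:
        candidate -= timedelta(days=1)
    return candidate
```

This sketch only corrects times that appear to be in the future. The opposite boundary (clock at hour 23, log at hour 00) is left as "today", which is one of the open questions raised earlier in the thread.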

ppf2 commented 8 years ago

What about missing_year_default with 3 options:

And document that "current" option is really a "best effort" guess - which is why we also provide a custom option for the user to specify the default year to set to when no year, date and month information are available. This will give it a better behavior than what we have today (1970) which is not that useful. Thoughts? :) @jordansissel @suyograo

suyograo commented 8 years ago

like if the real time is January but the log says December, we say year is "previous year"

I agree that our guessed year might be off for scenarios like above, but having everything default to 1970 will mean all those logs are not searchable (especially if you use Kibana and other UI). Like @ppf2, I agree that having a best guessed year is more useful.

How about we meet in the middle :) - tag the log event to say we guessed it (year_defaulted_current) and default to the current calendar year without any boundary cases? Some logs may end up in the future (case: December in logs, but the current date is Jan, so it becomes Dec 2016), but that may still be OK compared to having it in 1970.
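As a rough Python sketch of this "meet in the middle" idea: stamp the current calendar year onto the parsed time and tag the event so the guess is visible downstream. The event is modeled as a plain dict with `@timestamp` and `tags` keys, and the function name is hypothetical; none of this is the plugin's real API.

```python
from datetime import datetime


def default_to_current_year(event: dict, parsed: datetime, now: datetime) -> dict:
    """Replace the placeholder year in `parsed` with the current year and tag the event."""
    # No boundary handling: a December log processed in January lands in the future.
    # Note: replace() raises ValueError for Feb 29 if now.year is not a leap year.
    event["@timestamp"] = parsed.replace(year=now.year)
    event.setdefault("tags", []).append("year_defaulted_current")
    return event
```

For example, a log line parsed as December 25 (placeholder year 1970) and processed on 2016-01-05 ends up stamped 2016-12-25, which is the "logs in the future" case mentioned above.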

@ppf2 custom format is dangerous because if you make this 2015 and leave it in your config file, you'll always default to this year, even if the calendar year moves to 2017. Users leave things in config and forget about it, so IMO make it simple -- current or 1970 :)

jordansissel commented 8 years ago

I don't even think it should be configurable, my opinion on values:

For month and day, we want similar things:

(*) The known cases, to me, are where real time clock and log clock are different, such as when log clock is in the future or log transmission is delayed and real time clock is ahead of the log clock. In both cases, we can add tolerance to allow for rollover of units.

Let's do two things to fix this: 1) Always use "current time" as the basis for our clock if the log clock is missing significant data such as year, month, or day. 2) Let's make a table of all the edge and rollover cases and implement those. For example:

| real time | log time | computed time | condition |
| --- | --- | --- | --- |
| 2016/01/01 00:00:00 | 23:00:01 | 2015/12/31 23:00:01 | Missing day, month, and year. If we assume current-time, then the hour (23) puts us far in the future, and we could guess this case means that it really means "Yesterday" so the computed values (year, month, day) are from the previous day. |

Let's make a table of the following cases:

jordansissel commented 8 years ago

One other constraint: with @guyboertje exploring a file input that can read files once (for batch input), there is an additional difficulty. For logs missing important data (day, month, etc.), we are even more likely to guess incorrectly.

suyograo commented 8 years ago

+1, i like the idea of:

1) Always use "current time" as the basis for our clock if the log clock is missing significant data such as year, month, or day.

with no configuration and try to deduce ambiguity/drifts

ppf2 commented 8 years ago

+1 thx!

cyril-steimer commented 8 years ago

I think this should certainly be configurable. In our setup, we essentially have a date field repeated three times: Once as the full date + time, once as the date only and once as the time only. As we have the timestamp set to date + time, there is no issue with not finding old data. However, using the time only field allows us to easily look at performance depending on time of day (which to my knowledge can't be done with e.g. scripted fields in Kibana using the date + time field). With Logstash 2.2.0, suddenly these are now parsed to be in the year 2016 instead of 1970 as previously - breaking that visualization.

foonix commented 7 years ago

How about determining the ambiguity from the format string, and add an option to control disambiguation behavior?

| Format string | Unknown interval |
| --- | --- |
| "MMM dd YYYY HH:mm:ss" | none |
| "MMM dd HH:mm:ss" | year |
| "dd HH:mm:ss" | month |
| "HH:mm:ss" | day |
| "mm:ss" | hour |
| "ss" | minute |

| option | default | allowed |
| --- | --- | --- |
| default_time | now | Time or field reference to a field containing a time. |
| disambiguate | "nearest_default" | "nearest_default", "from_default", or "discard" |
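One way to derive the "unknown interval" column from a format string is to find the largest time unit the pattern contains and report the unit just above it. The Python sketch below assumes Joda-style pattern letters (y/Y, M, d, H/h, m, s) and ignores quoted literals inside patterns; the names `UNITS` and `unknown_interval` are hypothetical.

```python
from typing import Optional

# Units from largest to smallest, with the pattern letters that denote them.
# Matching is case-sensitive: "M" is month, "m" is minute.
UNITS = [
    ("year", "yY"),
    ("month", "M"),
    ("day", "d"),
    ("hour", "Hh"),
    ("minute", "m"),
    ("second", "s"),
]


def unknown_interval(fmt: str) -> Optional[str]:
    """Return the smallest unit missing above the largest unit in `fmt`, or None."""
    previous = None
    for name, letters in UNITS:
        if any(letter in fmt for letter in letters):
            # The first (largest) unit present: the unit above it is unknown.
            return previous
        previous = name
    raise ValueError("format string contains no recognized time fields")
```

With this approach the ambiguity is a property of the configured pattern, so it can be computed once when the filter is set up rather than per event.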

In the case of disambiguate => "nearest_default", if the incoming timestamp is more than half the unknown interval into the future, consider it to be a part of the previous interval. This creates a sliding window where times too far in the future are considered to actually be from the recent past. For example, if the date is unknown, and the time parsed (assuming the same date as default_time) is more than half a day ahead of default_time, then subtract one day. Let date math libraries sort out date boundary edge cases. Then, a message sent at 23:59 on one day and parsed at 00:01 the next day will be correctly evaluated as belonging to the previous day, even on a leap day.
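The sliding-window rule can be sketched in Python like this. It assumes the candidate time has already been built by filling the missing fields from default_time, and it only covers fixed-length intervals (day, hour, minute); month and year intervals would need calendar-aware arithmetic, which the standard `timedelta` does not provide. The function name is hypothetical.

```python
from datetime import datetime, timedelta


def nearest_default(candidate: datetime, default_time: datetime,
                    interval: timedelta) -> datetime:
    """Shift `candidate` back one interval if it is too far ahead of `default_time`."""
    # More than half the unknown interval into the future: treat it as
    # belonging to the previous interval.
    if candidate - default_time > interval / 2:
        return candidate - interval
    return candidate
```

For the leap-day case mentioned above, a 23:59 message parsed at 00:01 on 2016-03-01 is first placed at 2016-03-01 23:59, which is more than half a day ahead, so it is moved back to 2016-02-29 23:59 by ordinary date arithmetic.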

In the case of disambiguate => "from_default", any missing fields outside of the unknown interval are copied from default_time. This is meant for bulk loading historical data where the ambiguous parts are known by some other means.
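A minimal Python sketch of the "from_default" behavior: the fields that were actually parsed overwrite the corresponding fields of default_time, and everything else is copied from it. Representing the parsed fields as a dict keyed by `datetime.replace` argument names is an assumption made for this example, as is the function name.

```python
from datetime import datetime


def from_default(parsed_fields: dict, default_time: datetime) -> datetime:
    """Fill fields missing from the parsed timestamp using `default_time`."""
    # Sub-second precision is reset unless the parsed timestamp supplies it.
    fields = {"microsecond": 0, **parsed_fields}
    # replace() raises ValueError if the combination is not a valid date,
    # e.g. day=31 combined with a 30-day month taken from default_time.
    return default_time.replace(**fields)
```

For instance, parsing "57:00" with the "mm:ss" format against a default_time of 2016-01-01 10:00:03 keeps the date and hour and yields 2016-01-01 10:57:00.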

disambiguate => "discard" means that default_time is used directly if the date can't be parsed or is ambiguous.

Some examples of "nearest_default". This supports use cases where the incoming message is probably close to the current time.

| format | incoming string | default_time | window cutoff | guessed time |
| --- | --- | --- | --- | --- |
| MMM dd HH:mm:ss | Dec 31 23:57:00 | 2016-01-01 10:00:03 | 2016-07-02 10:00:03 | 2015-12-31 23:57:00 |
| "HH:mm:ss" | 23:57:00 | 2016-01-01 10:00:03 | 2016-01-16 10:00:03 | 2015-12-31 23:57:00 |
| "mm:ss" | 57:00 | 2016-01-01 10:00:03 | 2016-01-01 11:00:03 | 2016-01-01 9:57:00 |
| "mm:ss" | 57:00 | 2016-01-01 00:00:03 | 2016-01-01 00:30:03 | 2015-12-31 23:57:00 |

Same example inputs but with disambiguate => "from_default".

| format | incoming string | default_time | guessed time |
| --- | --- | --- | --- |
| MMM dd HH:mm:ss | Dec 31 23:57:00 | 2016-01-01 10:00:03 | 2016-12-31 23:57:00 |
| "HH:mm:ss" | 23:57:00 | 2016-01-01 10:00:03 | 2016-01-01 23:57:00 |
| "mm:ss" | 57:00 | 2016-01-01 10:00:03 | 2016-01-01 10:57:00 |
| "mm:ss" | 57:00 | 2016-01-01 00:00:00 | 2016-01-01 00:57:00 |