logstash-plugins / logstash-output-elasticsearch

https://www.elastic.co/guide/en/logstash/current/plugins-outputs-elasticsearch.html
Apache License 2.0

How should output-elasticsearch deal with a missing @timestamp? #779

Open webmat opened 6 years ago

webmat commented 6 years ago

Events without @timestamp can be difficult to process in certain situations. According to code that dates back to 2013 (and probably before), its presence should be guaranteed. Logstash has never enforced this by preventing the removal of @timestamp per se, however.

So whenever a problem arises from a missing @timestamp, people may simply see an innocuous error like the following, and conclude that there's a bug in Logstash:

```
[2018-02-22T13:49:42,425][FATAL][logstash.runner ] An unexpected error occurred! {:error=>#<LogStash::Error: timestamp field is missing>, :backtrace=>[
  "org/logstash/ext/JrubyEventExtLibrary.java:205:in `sprintf'",
  "/Users/Gabriel/Documents/ElasticSearch/logstash-5.6.2/vendor/bundle/jruby/1.9/gems/logstash-output-elasticsearch-7.4.0-java/lib/logstash/outputs/elasticsearch/common.rb:169:in `event_action_params'",
  "/Users/Gabriel/Documents/ElasticSearch/logstash-5.6.2/vendor/bundle/jruby/1.9/gems/logstash-output-elasticsearch-7.4.0-java/lib/logstash/outputs/elasticsearch/common.rb:44:in `event_action_tuple'",
  "/Users/Gabriel/Documents/ElasticSearch/logstash-5.6.2/vendor/bundle/jruby/1.9/gems/logstash-output-elasticsearch-7.4.0-java/lib/logstash/outputs/elasticsearch/common.rb:38:in `multi_receive'",
  "org/jruby/RubyArray.java:2414:in `map'",
  "/Users/Gabriel/Documents/ElasticSearch/logstash-5.6.2/vendor/bundle/jruby/1.9/gems/logstash-output-elasticsearch-7.4.0-java/lib/logstash/outputs/elasticsearch/common.rb:38:in `multi_receive'",
  "/Users/Gabriel/Documents/ElasticSearch/logstash-5.6.2/logstash-core/lib/logstash/output_delegator_strategies/shared.rb:13:in `multi_receive'",
  "/Users/Gabriel/Documents/ElasticSearch/logstash-5.6.2/logstash-core/lib/logstash/output_delegator.rb:49:in `multi_receive'",
  "/Users/Gabriel/Documents/ElasticSearch/logstash-5.6.2/logstash-core/lib/logstash/pipeline.rb:436:in `output_batch'",
  "org/jruby/RubyHash.java:1342:in `each'",
  "/Users/Gabriel/Documents/ElasticSearch/logstash-5.6.2/logstash-core/lib/logstash/pipeline.rb:435:in `output_batch'",
  "/Users/Gabriel/Documents/ElasticSearch/logstash-5.6.2/logstash-core/lib/logstash/pipeline.rb:381:in `worker_loop'",
  "/Users/Gabriel/Documents/ElasticSearch/logstash-5.6.2/logstash-core/lib/logstash/pipeline.rb:342:in `start_workers'"]}
```

One point of view is that the event should actually be considered invalid, and that whatever led to this situation is likely to cause data loss.

There are two schools of thought, or approaches, we can take for these:

  1. We can try to support these events as best as possible
  2. We can take a hard line and not try to support these events, and assume that the customer's configuration has a problem that will lead to data loss. We can adjust the level of "severity" we take with this approach.

In either case, to prevent needless confusion and support requests, we should at the very least output a log message that's more verbose than the one above and explains what the problem actually is: the event is missing @timestamp, and the customer is trying to index it into a timestamped index (like logstash-%{+YYYY.MM.dd}).
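For reference, a minimal pipeline that reproduces the error (a sketch; the generator input is only there to produce an event):

```
input { generator { count => 1 } }

filter {
  # nothing stops a pipeline from removing @timestamp
  mutate { remove_field => ["@timestamp"] }
}

output {
  # fails at sprintf time: %{+YYYY.MM.dd} needs @timestamp to resolve
  elasticsearch { index => "logstash-%{+YYYY.MM.dd}" }
}
```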

Are non-timestamped events legitimate?

The current behaviour (unless the above situation happens) is to successfully index the event. I would argue there's a legitimate use case for removing @timestamp in Logstash pipelines. It's counterintuitive, because Logstash was born to process time-based data. But once someone has a Logstash installation, Logstash is a great way to periodically ingest non time-based data into their Elasticsearch cluster.

Examples of non time-based data to reimport regularly:

Another supporting argument for the legitimacy of non time-based support in Logstash is the fact that our plugins are gradually getting better at supporting non time-sensitive scenarios:

The last point demonstrating the legitimacy of processing non-timestamped data with Logstash is the very first Logstash configuration one sees during the Elasticsearch training: "Remove noise from Logstash events". Here we're ingesting a bunch of data, but we're not interested in keeping @timestamp (among others).

Approach 1: Support these events as best as possible

The current behaviour is that events with no @timestamp work in most cases. People can do this, and are doing it. There are still some improvements we can make:

In any case, nobody denies that we can improve the error message when the timestamp error happens.

Now, do we want to take a harder line to the effect that we do not want to support a missing @timestamp at all? If so, the current behaviour is not strict enough. We may want to take the following additional steps:

Why this discussion?

This issue came out of the discussion here #777, which in turn came from reports such as #739. Requesting feedback on #777 in Slack immediately led to discussions of going towards approach #2.

jsvd commented 6 years ago

Just for reference, this issue happens because sprintf time interpolation behaves differently from normal interpolation:

```
2.3.0 :006 > e = LogStash::Event.new; e.sprintf("%{[hello]}")
 => "%{[hello]}"
2.3.0 :007 > e = LogStash::Event.new; e.remove("@timestamp"); e.sprintf("%{+YYYY}")
LogStash::Error: timestamp field is missing
    from org/logstash/ext/JrubyEventExtLibrary.java:168:in `sprintf'
    from (irb):7:in `<eval>'
```
webmat commented 6 years ago

In case anyone is watching this issue, I just posted a massive update to the description of the issue. The focus is now solely on @timestamp, to keep the discussion focused on the problem at hand. Mentions of @version have been removed.

Also, I brought back my original approach to the matter on the table (trying to accommodate for the missing @timestamp). After initial feedback, I thought this approach was actually not gonna fly with anyone. But after the short discussion in the ingest meeting, I realized that I may not be alone in thinking that's a valid approach, after all :-)

webmat commented 6 years ago

In my mind, approaches 1 and 3 are tied as my favourites. I'd start with 1 because it's less work (it's essentially the current state + possible improvements), then if we want, we can continue working towards option 3.

My 3rd favourite approach is #2, but I can live with it. However, if we want to go the opinionated route, we'll have to be clear with our users that use cases removing @timestamp will no longer be supported, and that this behaviour will get more severe in upcoming versions.

andrewvc commented 6 years ago

Thanks for the excellent writeup @webmat .

Whatever solution we come up with here, we should consider consistency with sprintf generically as @jsvd mentioned.

My suspicion is that there is no one right answer here. What are your thoughts WRT creating two separate sprintf implementations for plugins to use, sprintf_strict and sprintf_lax? sprintf_strict raises exceptions on missing fields (at which point the event could be DLQed or whatever the author decides), while sprintf_lax always succeeds but may leave unresolved %{...} patterns in the result. This would make behaviour consistent per function call.

You'd use sprintf_strict for things like the ES index name interpolation, and sprintf_lax for everything else. We'd obsolete plain sprintf in a major.
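A sketch of the proposed split, simulated here with a plain Hash instead of LogStash::Event. The names sprintf_strict / sprintf_lax come from this proposal and are not an existing Logstash API; the field-ref regex is deliberately simplified.

```ruby
class InterpolationError < StandardError; end

# Lax variant: an unresolved reference is left verbatim in the result.
def sprintf_lax(fields, template)
  template.gsub(/%\{\[?([\w@]+)\]?\}/) do |match|
    fields.key?($1) ? fields[$1].to_s : match
  end
end

# Strict variant: an unresolved reference raises, so the plugin can DLQ
# the event or fail it, as the plugin author decides.
def sprintf_strict(fields, template)
  template.gsub(/%\{\[?([\w@]+)\]?\}/) do
    raise InterpolationError, "field [#{$1}] is missing" unless fields.key?($1)
    fields[$1].to_s
  end
end
```

With this split, consistency comes from the call site: index-name interpolation in output-es would call the strict variant, everything else the lax one.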

colinsurprenant commented 6 years ago

Opposing concepts:

Field removal

A proposal

WDYT?

andrewvc commented 6 years ago

@colinsurprenant I agree that @timestamp should be removable.

I'm worried about that approach due to the amount of magic: it seems to violate the principle of least surprise if you can remove something but it still exists in some contexts.

What did you think of my proposal to let plugin authors decide what level of strictness was needed for interpolation on a case by case basis?

colinsurprenant commented 6 years ago

@andrewvc That seems to add more complexity to me. OTOH, what I propose basically surfaces a fact that is already true today: the timestamp has special behaviour (we even have an Event.getTimestamp method). I think this can be documented. As for the principle of least surprise, I'd argue my proposal satisfies it: the intention behind removing the field is for it not to appear in the serialized form, which covers the majority of cases where the field is removed, and no change would be required in plugins.

It seems to me that conceptually it will be easy to explain to users that there is one special field in the Event, related to timestamp handling: some features and plugins expect a timestamp to be present, but regardless, the field does not need to appear in the event's serialized form, by removing it.

andrewvc commented 6 years ago

@colinsurprenant That's a good point. I see that it does solve the problems you mention. It does have the cost of being yet another 'special' thing to document.

Here's a thought: what if we deprecate @timestamp and say that the canonical place for it is [@metadata][timestamp], which doesn't serialize by default? @metadata would then be the single place where you put data that doesn't serialize. Thoughts?

We'd also adjust sprintf for timestamps to treat that like an ordinary field.

andrewvc commented 6 years ago

To clarify what I mean by 'like an ordinary field', I mean that we'd follow the same interpolation rules. I do think we need a strict interpolation variant regardless of the timestamp decision to let plugin authors decide if interpolation issues are fatal.

andrewvc commented 6 years ago

An alternative approach would be if we had a rule that any field that starts with @ is not serialized. That would be another way to have simple event behavior without multiple rules.

colinsurprenant commented 6 years ago

Having the timestamp in [@metadata] is also worth considering, and is basically what I verbally suggested in the last sync. The disadvantage is that it breaks the current behaviour and will require adjustments in various places; this is why, after giving it some thought, I did not suggest it here.

The problem we face is that over time @timestamp has become a special-case field, and rightfully so: we enforce that an Event is always initialized with a valid timestamp, and timestamp handling has always been considered "a special case in itself" to deal with. Many plugins reference @timestamp in various forms (as a hardcoded reference, as a default configuration option, etc.), and changing the interface to the Event timestamp will certainly require lots of code changes and more explanation to the users.

This is why I still think we should not change the interface for getting the Event timestamp via the [@timestamp] field ref: it should simply always return a value, while providing the ability to exclude it from the serialized form using remove_field. Again, the only exception this introduces is that after doing a remove_field on [@timestamp] you'd still get a valid value, because we'd now consider [@timestamp] the always-working interface to the Event timestamp. Introducing this single exception would probably not break anything or require changes. For the vast majority of cases this would be seamless for users, because it satisfies the intent. It may surprise the few users who actually test that [@timestamp] is correctly removed after a remove_field on it, but frankly I think this will be exceptional, and they'd only need to rework that test to verify the serialized output rather than the field ref for [@timestamp].
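A sketch of this "hide-able timestamp" behaviour: remove marks @timestamp as excluded from serialization, but reads keep working. The class and method names here are illustrative only, not the real LogStash::Event API.

```ruby
require "json"
require "time"

class SketchEvent
  def initialize
    # an Event is always created with a valid timestamp
    @fields = { "@timestamp" => Time.now.utc.iso8601 }
    @hide_timestamp = false
  end

  # [@timestamp] remains the always-working read interface
  def get(field)
    @fields[field]
  end

  def remove(field)
    if field == "@timestamp"
      @hide_timestamp = true   # keep the value, drop it from serialization
    else
      @fields.delete(field)
    end
  end

  def to_json(*)
    out = @fields.dup
    out.delete("@timestamp") if @hide_timestamp
    out.to_json
  end
end
```

The intent of remove_field (keep @timestamp out of the output) is preserved, while plugins that read the timestamp keep working unchanged.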

webmat commented 6 years ago

This comment is not a response to the recent comments (that's coming after).

If anyone strongly in favour of Option 2 wants to flesh out that section more in the issue body, you're welcome to do so. That's never how I used Logstash personally, so while I understand the point of view, I'm probably not making all of the arguments that can be made.

webmat commented 6 years ago

I feel like the current discussion always goes really deep right away.

But I'd like to shed light on one assumption that seems to underlie all of this: that @timestamp is special, and that "things will break" if it's not there (those quotes represent a cloud of mystery :-) ).

I'd like to flesh out clearly what it is that breaks. I can start with the obvious one, which triggered all of this.

Breaks:

  1. Trying to index an event that doesn't have a @timestamp to a timestamped index name will fail.
     1.1. This applies to output-es, and possibly to other plugins that need @timestamp to figure out what to do with the event.

If we fix this (assuming we try the lenient route), what else could this break? Is this it, or are there other things I'm not seeing?

webmat commented 6 years ago

@andrewvc If we were to introduce new sprintf implementations, perhaps we could consider only introducing a strict one.

Right now the behaviour is:

The new behaviour, if we take that route, could be:

----------

Reset.

Now that I phrase it this way, I wonder if multiple implementations of sprintf are the solution to this. Part of this discussion is "should these situations crash Logstash or not?". Presumably, Logstash users are the ones who should decide this, rather than us.

Introducing multiple versions of sprintf would mean that potentially many plugins need to be updated the right way to actually give Logstash users this choice. Each plugin needs to 1) start using both versions of sprintf, and 2) give users an option to decide severity.

Perhaps we should be considering adding a Logstash-level option to decide behaviour on interpolation failures, instead?

Options:

The behaviour we have in current versions of Logstash (at least with respect to output-es) is:

```
# logstash.yml
interpolation.field: none
interpolation.time: crash
```

Given the different roles of different plugins, interpolation.field and interpolation.time could also become Common Options, where users can selectively override the behaviour. This way, someone building a really strict pipeline can still get leniency around some parts of their pipeline.

I haven't fully thought through the implications here. But I think this more directly addresses "should we crash" or not. With this approach, the answer becomes the beloved "It Depends ™" (but you decide).
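As a sketch of the Common Options idea, a per-plugin override might look like this (the option name is hypothetical, mirroring the logstash.yml keys above):

```
output {
  elasticsearch {
    index => "logstash-%{+YYYY.MM.dd}"
    # hypothetical Common Option overriding the global setting:
    # tag the event instead of crashing when @timestamp is missing
    interpolation_time => "tag"
  }
}
```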

colinsurprenant commented 6 years ago

We have definitely broadened the discussion away from only the output-es problem, but that's a good thing. Maybe we should move the discussion into a logstash issue.

Generally speaking, when I referenced things that will break in the context of trying to fix the larger issue by getting rid of the @timestamp field, I meant that:

This is why I believe that the @timestamp field ref as the interface to read/write the Event timestamp should be kept as-is and always working regardless if remove_field was done on it.

webmat commented 6 years ago

One thing's for sure, moving @timestamp to @metadata.timestamp would be a breaking change that affects a lot of plugins. The sum of plugins that would suffer from this would be a big hindrance for people considering upgrading. We may be diligent in fixing all supported plugins, but a portion of the unsupported plugins would be forever left behind...

webmat commented 6 years ago

@colinsurprenant If we made @timestamp resilient to deletion by hiding it, what would the behaviour be, here? ;-)

  if [@timestamp] {
    # ...
  } else {
    # ...
  }
andrewvc commented 6 years ago

OK, so new thought having chewed through all the excellent points both @colinsurprenant and @webmat have made.

I particularly like your proposal for per-field support @webmat, but I think an extra file is awkward. What if we implemented it at the LS config language level? Say %{} acted as it does today, but #{} was the strict variant. That puts all the power in the hands of the user. For timestamps, you'd write either %{+...} or #{+...} to get the lax or strict variant. To solve the serialization issue, we could extend the syntax to support timestamp interpolation of arbitrary fields, so you could do #{[@metadata][timestamp]+yy.mm.dd} to get strict interpolation of a non-serialized field, for instance.

So, the guidance for a user not wanting to serialize the timestamp, but to use it, would be to use that sprintf format, and to move the timestamp from @timestamp to @metadata, which is already the place where things are not serialized.
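A sketch of what the two syntaxes might look like side by side (the #{} strict form is hypothetical; it does not exist in the config language today):

```
output {
  elasticsearch {
    # lax: a missing @timestamp leaves the pattern unresolved
    index => "logstash-%{+YYYY.MM.dd}"

    # strict (hypothetical): a missing @timestamp fails the event
    # index => "logstash-#{+YYYY.MM.dd}"

    # strict interpolation of a non-serialized field (hypothetical syntax)
    # index => "logstash-#{[@metadata][timestamp]+YYYY.MM.dd}"
  }
}
```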

webmat commented 6 years ago

Yes! I like the idea of having two ways of interpolation. A strict one and a lax one.

Tongue in cheek caveat: Now Rubyists will get confused by two things, rather than one. The hash syntax without the commas, and now the look-alike string interpolation delimiters that actually have no relation with Ruby's 😆

Kidding aside, I like this. It gives maximum granularity (both can even be used in one string template, at the extreme), without introducing anything scattered all over the place (logstash.yml, plugin args). It's just two different delimiters. Love it!

andrewvc commented 6 years ago

@webmat that's a good point, we already have %{} and ${}, so adding a third character is inherently tricky. I suppose it could be any of @^&#*~ as a sigil. I don't have a particular preference. It's too bad $ is taken, because you could remember it as $trict. WDYT?

The strict behavior, BTW, should be to either log and warn, or go to the DLQ IMHO.

colinsurprenant commented 6 years ago

@andrewvc @webmat hypothetical question: let's say that @timestamp is always readable, would the strict interpolation still be necessary/useful?

colinsurprenant commented 6 years ago

@webmat the behavior with

  if [@timestamp] {
    # ...
  } else {
    # ...
  }

would be to never take the else branch, obviously. I'd be extremely surprised to see anything like that in the wild, though: first, [@timestamp] always exists at Event creation, and second, someone removing @timestamp does so to keep it out of the serialized output, so there is no real point in doing such a check in a config.

andrewvc commented 6 years ago

@colinsurprenant I think it's a useful feature on its own yes. It'd actually still be useful for @timestamp fields.

Let's imagine a situation where the real timestamp is in [custom][foo]. The user might have logic like: overwrite @timestamp with the value in [custom][foo]. If that value is null, however, and we always guarantee a timestamp, what then? Do we just not overwrite it? Log a warning?
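That scenario might look like this in a config (a sketch; [custom][foo] and the ISO8601 format are assumptions for illustration):

```
filter {
  if [custom][foo] {
    # the date filter writes to [@timestamp] by default
    date { match => ["[custom][foo]", "ISO8601"] }
  } else {
    # @timestamp is auto-populated, so "no timestamp" can only be
    # signalled out-of-band, e.g. with a tag
    mutate { add_tag => ["missing_timestamp"] }
  }
}
```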

The exciting thing about the approach of having the user pick the syntax is that it puts the choice of behavior in the hands of the user. It also is a generally useful enhancement, which is great because it reduces the total complexity of the system. Most importantly, however, it keeps the usage of field removal/writing consistent. I'm not against having exceptional cases, but exceptional cases often indicate an impedance mismatch in the modeling of a problem. If we can solve a problem without that and put the user in control, that feels like a win.

colinsurprenant commented 6 years ago

@andrewvc let's keep in mind that we already have different behaviour for @timestamp handling:

andrewvc commented 6 years ago

@colinsurprenant it makes sense to reduce the number of special cases, and the amount of complexity. We should probably remove all those special guarantees. IMHO we should just make @timestamp a normal field, and make the auto-population of it configurable.

Let's imagine we're teaching a course on Logstash, we could either explain:

  1. By default, Logstash will create a Timestamp value called @timestamp for you.
  2. Logstash will always create a Timestamp value called @timestamp. This field is not removable and can only contain a Timestamp type, setting it to anything else is an error. If you invoke a remove on the field it will still be present, for reads, but will not be serialized.

Moving to it being a normal field would actually clean up a lot of weirdness in how examples work. In some examples we read a string version of a timestamp into [timestamp], and then use the date filter to move it into [@timestamp], then finally delete [timestamp]. It'd make more sense to read the string straight into [@timestamp] and then immediately convert the type in place with the date filter IMHO.
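The example pattern described above looks roughly like this (a sketch; the grok pattern is an assumption):

```
filter {
  # read the string timestamp into a temporary field...
  grok { match => { "message" => "%{TIMESTAMP_ISO8601:timestamp}" } }
  # ...move it into [@timestamp] via the date filter (its default target)...
  date { match => ["timestamp", "ISO8601"] }
  # ...then delete the temporary field
  mutate { remove_field => ["timestamp"] }
}
```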

One other thought: have you considered that the 'hide-able' approach makes the logical nullity of timestamps weird to express? If you want to express "I don't have a timestamp", you need to make some other field for that, like [no_timestamp]. Imagine a user who expects the origin of requests to always provide the timestamp: our 'default' behaviour then gets in the way, and they need to hack around it.

colinsurprenant commented 6 years ago

I guess it is a question of how much we need to change, to solve which scope of problem, in what timeframe, taking the BWC issues into account. The point of my proposed solution is that it is simple, applicable today, backward compatible, and does not require any modifications in the plugins or user configurations. The tradeoff is an exceptional behaviour when removing the @timestamp field (which I tried to show is a minor tradeoff IMO, because the intent of removing that field is preserved, but that's fine if opinions diverge here).

It's all good if we agree to go wide and completely redefine timestamp handling and behaviour in Logstash and the plugins too.

jsvd commented 6 years ago

Considering Logstash 7.0 is far away, but not too far, this is a great time to be considering breaking changes and the upgrade+deprecation path for them in the 6.4/6.5 minors.

IMO, moving the timestamp field to @metadata (or another meta namespace) and making all sprintf calls lax would remove surprises in sprintf behaviour, solve the default serialization issue, and keep the current timestamp-related APIs in the event class working as before.

BWC-wise, this shouldn't break almost any plugin, as they should be using event.timestamp = or event.set_timestamp anyway. Grepping for @timestamp in our default plugins shows a very small list of hits. One thing we'd have to consider is the default schema we push to Elasticsearch, which currently contains @timestamp, but that could be solved at the es output plugin level.

Tbh, I don't see any urgency in solving this in any immediate time frame. This situation pops up once in a while, and we could just improve the wording in the exception we output; otherwise, we have the breathing room to make "the right decision" without pressure to deliver a fix.