vectordotdev / vrl

Vector Remap Language
Mozilla Public License 2.0
133 stars 62 forks source link

Vector stops with panic at to_unix_timestamp format:nanoseconds VRL function #978

Open r3code opened 3 months ago

r3code commented 3 months ago

A note for the community

Problem

Vector panics at timestamp conversation to_unix_timestamp and crashes. Log messages:

vector[582913]: thread 'vector-worker' panicked at 'value can not be represented in a timestamp with nanosecond precision.', /cargo/registry/src/index.crates>
vector[582913]: note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
ox2-vctr-srv01 vector[582913]: thread 'vector-worker' panicked at 'internal error: entered unreachable code: join error or bad poll', src/topology/builder.rs:890:30
ERROR transform{component_kind="transform" component_id=check_unix_timestamp component_type=remap component_n
INFO vector: Vector has stopped.

We found out that it was a very small date in some of the logs. This one message has crashed all our 3 aggregators. How: we have Kafka between agents and aggregators, one message crashed first vector and was not consumed, then other two after rebalancing tried to consume this message and also crashed. All 3 aggregators were continuously restarting every 10 seconds, until our fix in VRL.

Configuration

Vector 0.33.1 
Streaming model (agent - kafka -aggregators)

if exists(.ts) {
      timestampTs , err = parse_timestamp(del(.ts), "%+")
      if err != null {
          .err = err
      } else {
         .Timestamp = to_unix_timestamp(timestampTs, "nanoseconds")
      }
    }

Version

0.33.1

Debug Output

No response

Example Data

Input: "ts": "1677-09-21 00:12:43.145224192Z" - OK. VRL Playground "ts": "1677-09-21 00:12:43.145224191Z" - Fail. thread 'vector-worker' panicked at 'value can not be represented in a timestamp with nanosecond precision.' VRL Playground "ts": "0000-09-21 00:12:43.145224192Z" - Fail. thread 'vector-worker' panicked at 'value can not be represented in a timestamp with nanosecond precision.' VRL Playground

Reproduced in VRL Playground изображение

Additional Context

No response

References

https://github.com/vectordotdev/vrl/blob/16889935bf08268547dd494054dc906705a1dfc7/src/stdlib/to_unix_timestamp.rs#L4

### Tasks
r3code commented 3 months ago

We have some workarounds:

  1. Replace nanoseconds with milliseconds, it helps. VRL Playground Workaround Demo, with 0000 year
  2. As we expect dates in RFC3339 than year goes first, we work with logs so it's not possible to receive a log from the past Vrl Playground Workaround Demo

But we still want to use nanoseconds precision.

iFurySt commented 3 months ago

This was done on purpose, reference this: https://github.com/chronotope/chrono/pull/1123

1s = 1e9ns

i64 range: -2^63 ~ +2^63 = -9223372036854775808 ~ 9223372036854775807

range in ts:
-9223372036854775808/1e9=-9223372036.854775808s=1677-09-21T00:12:44.854775808Z
9223372036854775807/1e9=9223372036.854775807s=2262-04-11T23:47:16.854775807Z

The timestamp has elapsed since 1970-01-01T00:00:00Z, so it can only use the i64 to represent the ts in ns from 1677-09-21T00:12:44.854776Z to 2262-04-11T23:47:16.854776Z. This means we can only count up or down around 300yr from 1970-01-01T00:00:00Z 🙂

The next Y2K bug will strike in 2262. 🤣

iFurySt commented 3 months ago

But it may be worth throwing errors rather than panic in vector. @jszwedko WDYT?

jszwedko commented 3 months ago

But it may be worth throwing errors rather than panic in vector. @jszwedko WDYT?

Agreed, I think we should catch this and return an error from to_unix_timestamp.

I'll transfer this issue to the VRL repo.

pront commented 3 months ago

Good catch, thanks! I will prepare a fix. I will also do a grep for usages of timestamp_nanos. All those need to be replaced with timestamp_nanos_opt.