Better handling of timestamps for native JSON codec

jszwedko commented 2 years ago

We discussed just parsing any strings that are ISO timestamps into a timestamp type.

jeromekleinen commented 2 years ago

As a user whose logs are fairly fixed in structure, I think I would rather shortlist the fieldnames where timestamp parsing should be attempted automatically. Don't get me wrong, I do think this is a good addition and it will save me some calls to parse_timestamp in my vrl programs, but perhaps this could be controlled by some global setting like timestamp_mode with settings like auto, manual or field_list. The last option would of course require an additional setting to configure the field list where timestamp parsing should be attempted, f.e. ["@timestamp"].

lukesteensen commented 2 years ago

The soak test on #12252 seem to show that transparently parsing timestamps out of strings is feasible from a performance perspective, so the remaining question seems to be around backwards compatibility and whether or not that's a desirable behavior in general.

Compatibility

As far as I can tell, there is one potentially large incompatibility issue: fields containing timestamp string were previously typed as strings and therefore had all string operations and functions available to them, while if they are parsed into actual timestamps that will no longer be the case. The important exception here is the parse_timestamp string function, which will already no-op if the value is a timestamp and therefore is backwards compatible.

A smaller incompatibility issue is that timestamps strings would no longer be guaranteed to be passed through Vector with the exact same representation. When parsed at a source, the timestamps would be normalized to UTC and stored as a native representation, and then serialized again to a string at a sink. The value will still be "correct", but it may be surprising for users to see the values modified in any way.

It's worth noting that both of these apply in relatively isolated circumstances: when a string value in the body of a JSON message being parsed into a log event exactly matches the RFC3339 timestamp representation. Any other format will not be affected, timestamps on metrics are not affected, etc.

Alternatives

The problem that we're trying to solve here is that timestamps are a gap in the native JSON codec. The goal of native codecs is efficient, lossless representation of Vector's data model, but JSON's lack of timestamps in its own data model make those fields fall back to strings when we don't have a way to know they should be timestamps (e.g. metric timestamps are fine, but timestamps in log fields are ambiguous). This means that solutions involving configuration of specific fields to be interpreted as timestamps don't quite work in this case (/cc @jeromekleinen), even if they could be useful for other cases.

With that in mind, there are a few ways to go about it:

Optimistically parse all string values

This is what's implemented in #12252. We adjust the deserialization of the Value type to try to parse all strings as RFC3339 timestamps, falling back to returning them as strings.

Pros:

Simplest implementation
Could be considered a convenient feature more broadly
Soak tests show no noticeable performance impact

Cons

Changes the behavior of more than just the native codecs
Behavior changes are at least somewhat backwards-incompatible (see above)
Still just a heuristic, so fields that were legitimately strings can turn into timestamps
No way for users to opt out of this behavior which could feel too clever

Parse all strings only in the native codec

This would be similar to the above, except that we would need to duplicate much of Value's deserialization code and then apply the above changes to the new version. There would need to be some indirection (we can't have two Deserialize implementations for the same type), and we'd need to maintain two mostly-identical versions of the same logic.

Pros:

Limits impact to just the native codec
Should have similarly small performance impact (same as above plus layer of indirection)

Cons:

Still just a heuristic, so not 100% accurate
Duplicate implementation could be a small maintenance headache

Walk the event after parsing

A third option would be to leave the deserialization logic unchanged, and simply add a pass to the native codec that goes over each field in the event after parsing to look for strings that match the timestamp format.

Pros:

Limits impact to native codec
Less redundant implementation

Cons:

Performance of an after-the-fact recursive scan could be disqualifying
Still just a heuristic

Include timestamp field pointers in serialized representation

The only non-heuristic option would be to directly work around JSON's lack of timestamps by including some out-of-band information in our serialized representation that indicate which fields should be converted. This could looks something like:

{
    log: {
        message: "testing",
        foo: "1985-04-12T23:20:50.52Z",
        not_ts: "1985-04-12T23:20:50.52Z",
        bar: { baz: "1985-04-12T23:20:50.52Z" }
    },
    timestamp_fields: ["foo", "bar.baz"]
}

Pros:

Actually allows perfect accuracy in conveying proper types

Cons:

Implementation could be quite complex and/or require duplicated serializer
Not very space-efficient
Unknown performance impact
Somewhat strange JSON schema

Do nothing

It's always an option 😄

Pros:

Zero implementation or performance cost, just documentation
Very simple

Cons:

Timestamps will still become strings when transferred via native_json encoding

bruceg commented 2 years ago

I'll add another alternative:

Use special encoding for timestamps

This would put some in-band data into the timestamp that is unlikely to show up in normal data. This is effectively a modified form of the "optimistic" alternative with a non-traditional encoding of the data. Some examples:

"timestamp": "VECTOR-TIMESTAMP: 1985-04-12T23:20:50.52Z"
"timestamp": "VECTOR-TIMESTAMP: MTk4NS0wNC0xMlQyMzoyMDo1MC41Mlo="
"timestamp": { "VECTOR-TIMESTAMP": "1985-04-12T23:20:50.52Z" }
"timestamp": { "VECTOR-TYPE": "timestamp", "timestamp": "1985-04-12T23:20:50.52Z" }

Pros:

Most of the benefits of the optimistic alternative, at least for the pure string formatting.
Less backwards-incompatible than the optimistic alternative.
Extensible to other data types in the future if we need to encode other values that don't have a direct JSON representation.
Allows for accuracy in conveying proper types (modulo the unlikely chance of hitting the additional tags)

Cons:

More complexity than the optimistic variant for the structure variants, and likely lower performance.

I would recommend staying away from the pure "optimistic" solution, leaning towards a separate parser for the native codec. I would think that could mostly be a slim wrapper for the default implementation, with the one hook for strings applied.

I do wonder, though, if this is going to be a problem with more data types in the future than just timestamps.

JeanMertz commented 2 years ago

fields containing timestamp string were previously typed as strings and therefore had all string operations and functions available to them, while if they are parsed into actual timestamps that will no longer be the case.

While true, it's worth noting that to_string can convert a timestamp back to a string format, so there's a way "back" to the old behaviour, but it does require extra steps by the operator.

Include timestamp field pointers in serialized representation

It's worth noting that we could look into using the semantic meaning feature to assign timestamp meaning to field(s), which can then be used to know which fields to convert. I'm not yet sure how this would work exactly, but it at least potentially solves the "Implementation could be quite complex and/or require duplicated serializer" downside, as part of the scaffolding is already there.

jszwedko commented 2 years ago

We ended up deciding to ship without timestamp parsing for this initial version of the native_json codec, but we can keep this issue open to follow-up later, and as a place for users to leave thoughts if they run into its lack.

vectordotdev / vector