BharatKJain commented 1 month ago

Pull request

Description

Changed message-id from auto-increment ID to randomized ID in postprocessor/gelf_chunking.rs

HELP NEEDED: I have avoided adding hostname while producing message-id, I am not sure how can we handle adding hostname, please suggest.

Checklist

[ ] The RFC, if required, has been submitted and approved
[ ] Any user-facing impact of the changes is reflected in docs.tremor.rs
[x] The code is tested
[ ] Use of unsafe code is reasoned about in a comment
[x] Update CHANGELOG.md appropriately, recording any changes, bug fixes, or other observable changes in behavior
[ ] The performance impact of the change is measured (see below)

Performance

codecov[bot] commented 1 month ago

Codecov Report

Attention: Patch coverage is 97.54098% with 3 lines in your changes missing coverage. Please review.

Project coverage is 91.34%. Comparing base (f1f2b7f) to head (116a48f). Report is 8 commits behind head on main.

Files with missing lines	Patch %	Lines
...mor-interceptor/src/postprocessor/gelf_chunking.rs	97.54%	3 Missing :warning:

Additional details and impacted files

[![Impacted file tree graph](https://app.codecov.io/gh/tremor-rs/tremor-runtime/pull/2662/graphs/tree.svg?width=650&height=150&src=pr&token=d1bhuZGcOK&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=tremor-rs)](https://app.codecov.io/gh/tremor-rs/tremor-runtime/pull/2662?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=tremor-rs) ```diff @@ Coverage Diff @@ ## main #2662 +/- ## ========================================== + Coverage 91.27% 91.34% +0.07% ========================================== Files 308 308 Lines 60116 60229 +113 ========================================== + Hits 54868 55016 +148 + Misses 5248 5213 -35 ``` | [Flag](https://app.codecov.io/gh/tremor-rs/tremor-runtime/pull/2662/flags?src=pr&el=flags&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=tremor-rs) | Coverage Δ | | |---|---|---| | [e2e-command](https://app.codecov.io/gh/tremor-rs/tremor-runtime/pull/2662/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=tremor-rs) | `11.27% <0.00%> (-0.02%)` | :arrow_down: | | [e2e-integration](https://app.codecov.io/gh/tremor-rs/tremor-runtime/pull/2662/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=tremor-rs) | `50.52% <0.00%> (+0.09%)` | :arrow_up: | | [e2e-unit](https://app.codecov.io/gh/tremor-rs/tremor-runtime/pull/2662/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=tremor-rs) | `12.56% <0.00%> (-0.02%)` | :arrow_down: | | [e2etests](https://app.codecov.io/gh/tremor-rs/tremor-runtime/pull/2662/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=tremor-rs) | `52.85% <0.00%> (+0.09%)` | :arrow_up: | | [tremorapi](https://app.codecov.io/gh/tremor-rs/tremor-runtime/pull/2662/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=tremor-rs) | `14.50% <0.00%> (-0.03%)` | :arrow_down: | | [tremorcodec](https://app.codecov.io/gh/tremor-rs/tremor-runtime/pull/2662/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=tremor-rs) | `63.11% <ø> (ø)` | | | [tremorcommon](https://app.codecov.io/gh/tremor-rs/tremor-runtime/pull/2662/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=tremor-rs) | `63.04% <ø> (ø)` | | | [tremorconnectors](https://app.codecov.io/gh/tremor-rs/tremor-runtime/pull/2662/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=tremor-rs) | `28.81% <0.00%> (-0.05%)` | :arrow_down: | | [tremorconnectorsaws](https://app.codecov.io/gh/tremor-rs/tremor-runtime/pull/2662/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=tremor-rs) | `11.25% <0.00%> (-0.03%)` | :arrow_down: | | [tremorconnectorsazure](https://app.codecov.io/gh/tremor-rs/tremor-runtime/pull/2662/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=tremor-rs) | `4.68% <0.00%> (-0.02%)` | :arrow_down: | | [tremorconnectorsgcp](https://app.codecov.io/gh/tremor-rs/tremor-runtime/pull/2662/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=tremor-rs) | `25.36% <0.00%> (+0.05%)` | :arrow_up: | | [tremorconnectorsobjectstorage](https://app.codecov.io/gh/tremor-rs/tremor-runtime/pull/2662/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=tremor-rs) | `0.06% <ø> (ø)` | | | [tremorconnectorsotel](https://app.codecov.io/gh/tremor-rs/tremor-runtime/pull/2662/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=tremor-rs) | `12.55% <0.00%> (-0.03%)` | :arrow_down: | | [tremorconnectorstesthelpers](https://app.codecov.io/gh/tremor-rs/tremor-runtime/pull/2662/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=tremor-rs) | `68.25% <ø> (ø)` | | | [tremorinflux](https://app.codecov.io/gh/tremor-rs/tremor-runtime/pull/2662/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=tremor-rs) | `87.71% <ø> (ø)` | | | [tremorinterceptor](https://app.codecov.io/gh/tremor-rs/tremor-runtime/pull/2662/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=tremor-rs) | `55.34% <97.54%> (+0.98%)` | :arrow_up: | | [tremorpipeline](https://app.codecov.io/gh/tremor-rs/tremor-runtime/pull/2662/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=tremor-rs) | `31.15% <ø> (ø)` | | | [tremorruntime](https://app.codecov.io/gh/tremor-rs/tremor-runtime/pull/2662/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=tremor-rs) | `47.20% <0.00%> (-0.05%)` | :arrow_down: | | [tremorscript](https://app.codecov.io/gh/tremor-rs/tremor-runtime/pull/2662/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=tremor-rs) | `55.06% <ø> (ø)` | | | [tremorsystem](https://app.codecov.io/gh/tremor-rs/tremor-runtime/pull/2662/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=tremor-rs) | `5.78% <ø> (ø)` | | | [tremorvalue](https://app.codecov.io/gh/tremor-rs/tremor-runtime/pull/2662/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=tremor-rs) | `69.52% <ø> (-0.04%)` | :arrow_down: | | [unittests](https://app.codecov.io/gh/tremor-rs/tremor-runtime/pull/2662/flags?src=pr&el=flag&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=tremor-rs) | `89.20% <97.54%> (+0.07%)` | :arrow_up: | Flags with carried forward coverage won't be shown. [Click here](https://docs.codecov.io/docs/carryforward-flags?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=tremor-rs#carryforward-flags-in-the-pull-request-comment) to find out more. | [Files with missing lines](https://app.codecov.io/gh/tremor-rs/tremor-runtime/pull/2662?dropdown=coverage&src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=tremor-rs) | Coverage Δ | | |---|---|---| | [...mor-interceptor/src/postprocessor/gelf\_chunking.rs](https://app.codecov.io/gh/tremor-rs/tremor-runtime/pull/2662?src=pr&el=tree&filepath=tremor-interceptor%2Fsrc%2Fpostprocessor%2Fgelf_chunking.rs&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=tremor-rs#diff-dHJlbW9yLWludGVyY2VwdG9yL3NyYy9wb3N0cHJvY2Vzc29yL2dlbGZfY2h1bmtpbmcucnM=) | `96.00% <97.54%> (+2.89%)` | :arrow_up: | ... and [10 files with indirect coverage changes](https://app.codecov.io/gh/tremor-rs/tremor-runtime/pull/2662/indirect-changes?src=pr&el=tree-more&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=tremor-rs) ------ [Continue to review full report in Codecov by Sentry](https://app.codecov.io/gh/tremor-rs/tremor-runtime/pull/2662?dropdown=coverage&src=pr&el=continue&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=tremor-rs). > **Legend** - [Click here to learn more](https://docs.codecov.io/docs/codecov-delta?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=tremor-rs) > `Δ = absolute (impact)`, `ø = not affected`, `? = missing data` > Powered by [Codecov](https://app.codecov.io/gh/tremor-rs/tremor-runtime/pull/2662?dropdown=coverage&src=pr&el=footer&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=tremor-rs). Last update [f1f2b7f...116a48f](https://app.codecov.io/gh/tremor-rs/tremor-runtime/pull/2662?dropdown=coverage&src=pr&el=lastupdated&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=tremor-rs). Read the [comment docs](https://docs.codecov.io/docs/pull-request-comments?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=tremor-rs).

BharatKJain commented 1 month ago

Okay, will do.
Trying to think-out-loud, so let's say if we create a hash of a message but repetitive logs will create same message-id which is a problem because message-id has to be unique in nature, collisions can cause problems when we're decoding the message on the server side. I am not completely sure but ideally we would want to have uniqueness in the message-id to make sure that we are not breaking the server GELF-decoding.

(How it will break server-side due to collision? So message-id is a way of determining if the UDP packet is associated with already existing log or it's for a new log, when we're sending same message-id for multiple logs then server behaviour will be to merge the data-together which will end-up breaking the log)

TBH I am also trying to figure this out, please share any suggestions, am I thinking right? 😅

Licenser commented 1 month ago

Ja just the message content would not work, I'm still considering if message content + ingest_ns (nanosecond when the message was registered at tremor) would be enough, if a server produces the same log twice in the same nanosecond that'd be very odd (but not impossible) OTOH having two random generated numbers be the same is also odd (but not impossible) it would also one a more deterministic failure case "When messages with the same content arrive at exactly the same time they will get duplicated message ids" instead of "if the RNG hates you, you'll get duplicated message ids"

Licenser commented 1 month ago

Sorry for all the forth and back, so I've been thinking about a good way to make this:

1) fast 2) deterministic 3) non-coliding

and wanted to throw out a suggestion using the ingest_ns + an incremental id as a message ID. This is:

1) fast since we just have to look at two integers, no RNG, no hashing 2) deterministic since the ingest_ns is settable, and the incremental counter is well, incremental so determinstic as well 3) it avoids duplicates by not creating duplicates on the same system due to incremtal id's and makes them extremely unlikely on multiple systems due to nano second timestamps involved.

BharatKJain commented 1 month ago

Sorry for all the forth and back, so I've been thinking about a good way to make this:
1. fast

2. deterministic

3. non-coliding
and wanted to throw out a suggestion using the ingest_ns + an incremental id as a message ID. This is:
1. fast since we just have to look at two integers, no RNG, no hashing

2. deterministic since the ingest_ns is settable, and the incremental counter is well, incremental so determinstic as well

3. it avoids duplicates by not creating duplicates on the same system due to incremtal id's and makes them extremely unlikely on multiple systems due to nano second timestamps involved.

Can I keep thread-id just to reduce the collision probability more? 😅

BharatKJain commented 1 month ago

Okay, I have made the changes as suggested in the latest comment.

(epoch_timestamp & BITS_13) | (auto_increment_id & !BITS_13) | (thread_id_u64 & BITS_13)

(epoch_timestamp is ingest_ns when process() is called, in finish() there's no ingest_ns so I am using current timestamp.)

Added a unit-test for it.
Renamed id to auto_increment_id

Let me know your thoughts!

Licenser commented 1 month ago

I would remove the thread ID. It is breaking the distribution and making the date less unique also it prevents event id's to be replayable. Basiclly it will result in the lower 13 bit to ave 3 times as many 1's as 0's making them more likely to collide while also making the id not determinstic.

The second problem I spot is that by removing the lower 13 bit from increment, that part will have no effect for the first 8192 messages and then only change every 8192 messages. My suggestion would be to shift it by 13 bits instead of truncating them.

Lastly I'd probably pull in a few more bits form the timestap, 13 bit are only 0,0000082 secs even if you bump it to 16 that means you get a window of 0,000066 secs + a 48bit counter (or 281.474.976.710.656 events to count)

You'd end up with something like this:


(epoch_timestamp & 0xFF_FF) | (auto_increment_id << 16 )
 ``

 do you think that would solve the original problem?

BharatKJain commented 1 month ago

I will fix the clippy checks and DCO, please allow me sometime.

Honestly I didn't anticipate this to happen but I also created otel gelf-exporter. :)

Opentelemetry PR for gelf-exporter (WIP)

BharatKJain commented 1 month ago

Line 190:

let current_epoch_timestamp = u64::try_from(SystemTime::now().duration_since(UNIX_EPOCH).expect("SystemTime before UNIX EPOCH!").as_nanos())?;

Not sure how to handle this better, is this okay?

Licenser commented 1 month ago

We'd want to avoid the expect. Let me explain why.

expect and unwrap cause a crash, in some applications that's okay since, say you have a CLI if something goes wrong you might just want to error out and restart from scratch. Now tremor is a bit more complex there might be more then one pipeline running so we don't want one pipeline affect the progress of another. If we'd crash in pipeline A we'd take pipeline B down with it - that's not desirable. So we handle errors and let them bubble up so only the pipeline causing the issue is affected.

To do that there are a few ways ? in many places work, but if the error type can't be mapped something like map_err(...)? is a good alternative.

BharatKJain commented 1 month ago

@Licenser @darach Does it look good now?

(I have fixed the DCO, format, clippy-check, code quality checks)

darach commented 1 month ago

@BharatKJain LGTM now. I think you just need to click on resolve conversation for the open review comment and we're all good!

tremor-rs / tremor-runtime

Fix: message-id in postprocessor/gelf-chunking #2662

Pull request

Description

Related

Checklist

Performance

Codecov Report