vectordotdev / vector

A high-performance observability data pipeline.
https://vector.dev
Mozilla Public License 2.0

Kafka sink failure with 0.35.0 #19627

Closed. michMartineau closed this issue 7 months ago.

michMartineau commented 9 months ago


Problem

We've upgraded our Vector pods from v0.30.0 (chart 0.26.0) to 0.35.0 (chart: 0.30.0). Vector fails and the pods restart (CrashLoopBackOff).

It seems related to the Kafka sink (see the logs below). I've tested 0.34.2 (chart: 0.29.3) successfully.

Configuration

---
log_schema:
  source_type: x-source_type
  timestamp_key: '@timestamp'

timezone: Etc/UTC

sources:
  gateway:
    type: logstash
    address: 0.0.0.0:5044

transforms:
  remap_source_gateway:
    type: remap
    inputs: [gateway]
    source: |
      %kafka_topic = "systemlogs"

sinks:
  kafka:
    type: kafka
    inputs: [remap_source_gateway]
    bootstrap_servers: ${KAFKA_HOST}
    topic: '{{ %kafka_topic }}'
    compression: none
    encoding:
      codec: json
    healthcheck:
      enabled: false

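For context on the configuration above: the remap transform stores the target topic in event metadata (%kafka_topic), and the sink's topic option is a template that resolves that metadata field for each event. A minimal sketch of the same pattern extended to per-event routing, assuming a hypothetical service field on the incoming events:

transforms:
  remap_source_gateway:
    type: remap
    inputs: [gateway]
    source: |
      # Pick a topic per event; fall back to the default used above when the
      # (hypothetical) service field is missing or not a string.
      service = string(.service) ?? ""
      if service != "" {
        %kafka_topic = "logs-" + service
      } else {
        %kafka_topic = "systemlogs"
      }

sinks:
  kafka:
    type: kafka
    inputs: [remap_source_gateway]
    bootstrap_servers: ${KAFKA_HOST}
    topic: '{{ %kafka_topic }}'   # template resolved from event metadata per event
    encoding:
      codec: json
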
Version

0.35.0

Debug Output

2024-01-15T15:01:48.241263Z DEBUG source{component_kind="source" component_id=gateway component_type=logstash}:connection{peer_addr=10.89.18.95:56433}: vector::sources::util::net::tcp: Connection closed.
thread 'vector-worker' panicked at 'called `Option::unwrap()` on a `None` value', /cargo/registry/src/index.crates.io-6f17d22bba15001f/lockfree-object-pool-0.1.4/src/linear_page.rs:48:30
stack backtrace:
   0:          0x128a36c - std::backtrace_rs::backtrace::libunwind::trace::hc7cb25c924fb6eba
                               at /rustc/d5c2e9c342b358556da91d61ed4133f6f50fc0c3/library/std/src/../../backtrace/src/backtrace/libunwind.rs:93:5
   1:          0x128a36c - std::backtrace_rs::backtrace::trace_unsynchronized::h5c4ce001d8eb6854
                               at /rustc/d5c2e9c342b358556da91d61ed4133f6f50fc0c3/library/std/src/../../backtrace/src/backtrace/mod.rs:66:5
   2:          0x128a36c - std::sys_common::backtrace::_print_fmt::h14d9ae8d6479967a
                               at /rustc/d5c2e9c342b358556da91d61ed4133f6f50fc0c3/library/std/src/sys_common/backtrace.rs:65:5
   3:          0x128a36c - <std::sys_common::backtrace::_print::DisplayBacktrace as core::fmt::Display>::fmt::h02c30a315588511a
                               at /rustc/d5c2e9c342b358556da91d61ed4133f6f50fc0c3/library/std/src/sys_common/backtrace.rs:44:22
   4:           0xa13948 - core::fmt::rt::Argument::fmt::hd5ccf3711fc53a4f
                               at /rustc/d5c2e9c342b358556da91d61ed4133f6f50fc0c3/library/core/src/fmt/rt.rs:138:9
   5:           0xa13948 - core::fmt::write::h2876fd52ca55dbfa
                               at /rustc/d5c2e9c342b358556da91d61ed4133f6f50fc0c3/library/core/src/fmt/mod.rs:1094:21
   6:          0x12523e4 - std::io::Write::write_fmt::h3391f6b02b293028
                               at /rustc/d5c2e9c342b358556da91d61ed4133f6f50fc0c3/library/std/src/io/mod.rs:1714:15
   7:          0x128cfc4 - std::sys_common::backtrace::_print::h23980dcdfc2fe28a
                               at /rustc/d5c2e9c342b358556da91d61ed4133f6f50fc0c3/library/std/src/sys_common/backtrace.rs:47:5
   8:          0x128cfc4 - std::sys_common::backtrace::print::h49b97a66db4af001
                               at /rustc/d5c2e9c342b358556da91d61ed4133f6f50fc0c3/library/std/src/sys_common/backtrace.rs:34:9
   9:          0x128cbcc - std::panicking::default_hook::{{closure}}::h7138645749f92527
                               at /rustc/d5c2e9c342b358556da91d61ed4133f6f50fc0c3/library/std/src/panicking.rs:269:22
  10:          0x128db04 - std::panicking::default_hook::h691846cb3f034db3
                               at /rustc/d5c2e9c342b358556da91d61ed4133f6f50fc0c3/library/std/src/panicking.rs:288:9
  11:          0x128db04 - std::panicking::rust_panic_with_hook::hb3c04ea23b04721e
                               at /rustc/d5c2e9c342b358556da91d61ed4133f6f50fc0c3/library/std/src/panicking.rs:705:13
  12:          0x128d604 - std::panicking::begin_panic_handler::{{closure}}::hc4edb1e6fd5a6b0f
                               at /rustc/d5c2e9c342b358556da91d61ed4133f6f50fc0c3/library/std/src/panicking.rs:595:13
  13:          0x128d59c - std::sys_common::backtrace::__rust_end_short_backtrace::hb6a6277e20eb73e1
                               at /rustc/d5c2e9c342b358556da91d61ed4133f6f50fc0c3/library/std/src/sys_common/backtrace.rs:151:18
  14:          0x128d590 - rust_begin_unwind
                               at /rustc/d5c2e9c342b358556da91d61ed4133f6f50fc0c3/library/std/src/panicking.rs:593:5
  15:           0x424870 - core::panicking::panic_fmt::hfcaa20ee741d1cc5
                               at /rustc/d5c2e9c342b358556da91d61ed4133f6f50fc0c3/library/core/src/panicking.rs:67:14
  16:           0x42498c - core::panicking::panic::h31e7a12887ab7d97
                               at /rustc/d5c2e9c342b358556da91d61ed4133f6f50fc0c3/library/core/src/panicking.rs:117:5
  17:          0x3348b64 - <tracing_subscriber::layer::layered::Layered<L,S> as tracing_core::subscriber::Subscriber>::new_span::h9a00643ebd4f75b8
  18:          0x334b124 - <tracing_subscriber::layer::layered::Layered<L,S> as tracing_core::subscriber::Subscriber>::new_span::h2c1f9010dda681f0
  19:          0x13e1128 - tracing::span::Span::make_with::h0946854b8046d955
  20:          0x1c5eda4 - <vector::sinks::kafka::sink::KafkaSink as vector_core::sink::StreamSink<vector_core::event::Event>>::run::{{closure}}::h21f35518df6934ca
  21:          0x1c5c8e4 - <vector_core::sink::EventStream<T> as vector_core::sink::StreamSink<vector_core::event::array::EventArray>>::run::{{closure}}::ha31ca5ad7a9a3402
  22:          0x320ecc8 - vector::topology::builder::Builder::build_sinks::{{closure}}::{{closure}}::h03ab451c62056c46
  23:          0x31c5948 - <tracing::instrument::Instrumented<T> as core::future::future::Future>::poll::heb356ec0acaec9ef
  24:          0x31c53bc - tokio::runtime::task::raw::poll::hd3c4a0678998574d
  25:          0x12d2128 - tokio::runtime::scheduler::multi_thread::worker::Context::run_task::hb43a1ab74e3daf92
  26:          0x12d78e4 - tokio::runtime::task::raw::poll::h9fd75a3a647deed6
  27:          0x12bba50 - std::sys_common::backtrace::__rust_begin_short_backtrace::h2b66fe48017742c2
  28:          0x12bb690 - core::ops::function::FnOnce::call_once{{vtable.shim}}::hedd6b07b50b25d09
  29:          0x129069c - <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once::h407d789be6f3d62c
                               at /rustc/d5c2e9c342b358556da91d61ed4133f6f50fc0c3/library/alloc/src/boxed.rs:1993:9
  30:          0x129069c - <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once::hf45fc529abdd7c00
                               at /rustc/d5c2e9c342b358556da91d61ed4133f6f50fc0c3/library/alloc/src/boxed.rs:1993:9
  31:          0x129069c - std::sys::unix::thread::Thread::new::thread_start::hcd656e077ad34229
                               at /rustc/d5c2e9c342b358556da91d61ed4133f6f50fc0c3/library/std/src/sys/unix/thread.rs:108:17
2024-01-15T15:01:48.333995Z DEBUG source{component_kind="source" component_id=gateway component_type=logstash}:connection{peer_addr=10.89.54.188:36347}: vector::sources::util::net::tcp: Accepted a new connection. peer_addr=10.89.54.188:36347
2024-01-15T15:01:48.340134Z DEBUG source{component_kind="source" component_id=gateway component_type=logstash}:connection{peer_addr=10.89.54.188:36347}: vector::sources::util::net::tcp: Connection closed.
2024-01-15T15:01:48.453506Z ERROR sink{component_kind="sink" component_id=kafka component_type=kafka}: vector::topology: An error occurred that Vector couldn't handle: the task panicked and was aborted.
2024-01-15T15:01:48.453568Z  INFO vector: Vector has stopped.
2024-01-15T15:01:48.453640Z DEBUG source{component_kind="source" component_id=gateway component_type=logstash}:connection{peer_addr=10.89.37.245:7823}: vector::sources::util::net::tcp: Start graceful shutdown.
2024-01-15T15:01:48.453652Z DEBUG source{component_kind="source" component_id=gateway component_type=logstash}:connection{peer_addr=10.89.62.209:14033}: vector::sources::util::net::tcp: Start graceful shutdown.
2024-01-15T15:01:48.453739Z DEBUG source{component_kind="source" component_id=gateway component_type=logstash}: vector::topology::builder: Source finished normally.
2024-01-15T15:01:48.454775Z  INFO vector::topology::running: Shutting down... Waiting on running components. remaining_components="remap_source_gateway, gateway" time_remaining="59 seconds left"
2024-01-15T15:01:49.421134Z ERROR source{component_kind="source" component_id=gateway component_type=logstash}:connection{peer_addr=10.89.37.245:7823}: vector::internal_events::tcp: Error writing acknowledgement, dropping connection. error=Broken pipe (os error 32) error_code="ack_failed" error_type="writer_failed" stage="sending" internal_log_rate_limit=true
2024-01-15T15:01:49.421383Z DEBUG transform{component_kind="transform" component_id=remap_source_gateway component_type=remap}: vector::topology::builder: Synchronous transform finished with an error.
2024-01-15T15:01:49.421399Z ERROR transform{component_kind="transform" component_id=remap_source_gateway component_type=remap}: vector::topology: An error occurred that Vector couldn't handle: receiver disconnected.
2024-01-15T15:01:53.454051Z  INFO vector::topology::running: Shutting down... Waiting on running components. remaining_components="gateway" time_remaining="54 seconds left"
2024-01-15T15:01:58.454189Z  INFO vector::topology::running: Shutting down... Waiting on running components. remaining_components="gateway" time_remaining="49 seconds left"
2024-01-15T15:02:03.454918Z  INFO vector::topology::running: Shutting down... Waiting on running components. remaining_components="gateway" time_remaining="44 seconds left"
2024-01-15T15:02:08.454973Z  INFO vector::topology::running: Shutting down... Waiting on running components. remaining_components="gateway" time_remaining="39 seconds left"
2024-01-15T15:02:13.454684Z  INFO vector::topology::running: Shutting down... Waiting on running components. remaining_components="gateway" time_remaining="34 seconds left"
2024-01-15T15:02:18.454653Z  INFO vector::topology::running: Shutting down... Waiting on running components. remaining_components="gateway" time_remaining="29 seconds left"
2024-01-15T15:02:18.454660Z  INFO source{component_kind="source" component_id=gateway component_type=logstash}:connection{peer_addr=10.89.62.209:14033}: vector::sources::util::net::tcp: Resetting connection (still open after seconds). seconds=30s
2024-01-15T15:02:18.454729Z DEBUG source{component_kind="source" component_id=gateway component_type=logstash}: vector::topology::builder: Source pump finished normally.
2024-01-15T15:02:18.454758Z DEBUG source{component_kind="source" component_id=gateway component_type=logstash}: vector::topology::builder: Source pump supervisor task finished normally.

Example Data

No response

Additional Context

No response

References

No response

jszwedko commented 8 months ago

Hi @michMartineau !

Apologies for the delay in reproducing this. I tried today but unfortunately wasn't able to reproduce this issue using a local Kafka instance and Logstash to send data. Do you run into it consistently? If so, could you try enabling Kafka debug logging to try to troubleshoot? This can be done by doing both of the following:
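(A sketch of what those two steps presumably were, reconstructed from the librdkafka_options and VECTOR_LOG settings mentioned in the follow-up comments below; the exact values are assumptions:)

# 1. Enable librdkafka's own debug output on the kafka sink:
sinks:
  kafka:
    type: kafka
    # ... existing options ...
    librdkafka_options:
      debug: "all"

# 2. Raise Vector's log verbosity for the Kafka modules via the VECTOR_LOG
#    environment variable on the pod, e.g.:
#      VECTOR_LOG="librdkafka=trace,rdkafka::client=debug"
#    (as noted below, this particular directive did not appear to be accepted,
#    so the attached log ended up containing only the librdkafka output)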

michMartineau commented 8 months ago

Hi @jszwedko, thanks for the tips, I will test that today. In our non-production environments, the issue appears randomly. In our production environments, where there is more activity, the issue was more constant, so we rolled back the deployment there. I haven't found a root cause.

michMartineau commented 8 months ago

Hi @jszwedko, I've reproduced the issue with the configuration you gave me. I've also added the environment variable RUST_BACKTRACE=full. Here is the log file: vector-kafka-sink-issue-20240207.log
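(For anyone reproducing this: a minimal sketch of how those environment variables can be passed to the pod through the Helm chart values, assuming the chart exposes an env list; the values shown are assumptions:)

# values.yaml (sketch; key names assume the chart's standard `env` passthrough)
env:
  - name: RUST_BACKTRACE
    value: "full"
  - name: KAFKA_HOST
    value: "kafka.example.svc:9092"   # hypothetical broker address used by ${KAFKA_HOST}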

michMartineau commented 8 months ago

I wonder if this is more an issue with internal tracing than a Kafka issue.

michMartineau commented 8 months ago

Just a remark: VECTOR_LOG="librdkafka=trace,rdkafka::client=debug" does not seem to be a valid log level configuration, so the log file above only contains output from librdkafka_options.debug = "all".

jszwedko commented 8 months ago

Thanks for that, @michMartineau! That is interesting. To confirm, you are only seeing this issue on 0.35.0 and never saw it on 0.34.2?

Assuming that's true, two changes that were in the 0.35.0 release jump out to me.

The tracing changes seem like a more likely culprit based on the backtrace, but I'm not seeing any smoking guns.

michMartineau commented 8 months ago

Indeed, I've never seen this issue with 0.34.2.

jszwedko commented 8 months ago

Indeed, I've never seen this issue with 0.34.2.

Would you be willing to try out a custom image? I can create one that reverts the tracing changes to see if that fixes it for you.

michMartineau commented 8 months ago

Good idea, I can try.

jszwedko commented 8 months ago

@michMartineau I published a custom build that is v0.35.0 with the kafka tracing changes reverted. Could you try one of these images and let me know how it goes?

https://hub.docker.com/r/timberio/vector/tags?page=1&name=0.35.0.custom.059fb1b
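(For reference, a sketch of how a custom tag like this can be pinned through the chart values; the image keys and the exact tag suffix are assumptions, pick a tag from the Docker Hub page above:)

# values.yaml (sketch)
image:
  repository: timberio/vector
  tag: "0.35.0.custom.059fb1b-debian"   # assumed suffix; use a tag listed on the page above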

michMartineau commented 8 months ago

Hi @jszwedko, I've tested it and I was wrong: it failed. My bad. Last time I tested 0.34.2 it was OK, but the issue seems related to activity load, so I've tested all versions from 0.33.0 to 0.35.0:

Chart  | Binary | Result
0.30.0 | 0.35.0 | KO
0.29.1 | 0.34.2 | KO
0.29.0 | 0.34.1 | KO
0.28.0 | 0.34.0 | KO
0.27.0 | 0.33.1 | KO
0.26.0 | 0.33.0 | OK

So it seems due to a change in 0.33.1. Here are the logs with version 0.33.1: vector-0.33.1.log

michMartineau commented 8 months ago

I wonder if it could be related to this commit https://github.com/vectordotdev/vector/commit/fa09de37c735bec57a67d78641b9db13c17097d8

jszwedko commented 8 months ago

Thanks for doing that analysis, @michMartineau! Let me spin up a custom build that is v0.33.1 without that commit and we can see how that looks. I agree, it seems like the most suspicious change.

jszwedko commented 8 months ago

Pushed another custom build. Can you try one of https://hub.docker.com/r/timberio/vector/tags?page=1&name=0.33.1.custom.a1b6913 ?

michMartineau commented 8 months ago

Yes, sure. Thanks again. I can't today, but I will do it next Monday.

michMartineau commented 8 months ago

Hi, I've tested version 0.33.1 and the custom 0.33.1 build (without this commit) several times.

jszwedko commented 8 months ago

Thanks for testing! We'll take a closer look at that commit and see if anything obviously jumps out.

yjagdale commented 7 months ago

Looks like this has been showing up more frequently now. Thanks for pointing to the exact commit; awaiting more findings, @jszwedko. Thanks @michMartineau!

ninjulkar commented 7 months ago

Any luck with identifying and fixing this bug?

yjagdale commented 7 months ago

It seems to happen during rebalancing, or when some change on the Kafka side triggers a change in Vector. (This is just a guess based on closer debugging of the issue; I may be wrong, but it's my initial understanding.)

jszwedko commented 7 months ago

We think we may have found the issue and are fixing it in https://github.com/vectordotdev/vector/pull/20001. I created another set of custom Docker images to try: https://hub.docker.com/r/timberio/vector/tags?page=1&name=0.35.1.custom.34bc8a. @michMartineau (or others), do you mind giving those a try and letting us know if they fix the issue for you?

michMartineau commented 7 months ago

@jszwedko Sure, I will test that.

michMartineau commented 7 months ago

@jszwedko I did some tests yesterday and today, switching between 0.36.0 and the custom version. With 0.36.0-alpine I saw some restarts; nothing with the custom version, so it looks good to me.

jszwedko commented 7 months ago

@jszwedko I did some tests yesterday and today, switching between 0.36.0 and the custom version. With 0.36.0-alpine I saw some restarts; nothing with the custom version, so it looks good to me.

Thanks for verifying!