vectordotdev / vector

A high-performance observability data pipeline.
https://vector.dev
Mozilla Public License 2.0
17.42k stars 1.51k forks source link

New `aws_kinesis_streams` source #33

Open zsherman opened 5 years ago

zsherman commented 5 years ago

It would be nice if Vector could ingest logs from a AWS Kinesis data stream (not Firehose which is covered in #3566).

Requirements

sam701 commented 4 years ago

@binarylogic How would you recommend to handle this scenario in spring 2020? Having a lambda + vector with HTTP source? I saw, the HTTP source is not production ready yet.

binarylogic commented 4 years ago

Hi @sam701, this source is a little more tricky since:

  1. Vector must implement continuous shard discovery.
  2. Vector must handle various shard states (ex: a closed stream).
  3. Vector must coordinate with other Vector instances to ensure only one Vector instance is reading a shard. (distributed locking).
  4. Vector must maintain checkpoint state.

Kafka, for example, handles a lot of this bookkeeping, making the integration much easier. That's why this source is not done yet. It's questionable if all of the above fits within the scope of Vector. Especially stream exclusivity, which would require distributed locking. That would obviously need to be delegated to a system designed for that.

The best solution, imo, is the wrap Vector in a system that handles the above. Which is easier said than done. But, for example, if Vector was integrated into an AWS Lambda function you could leverage AWS' kinesis -> lambda integration, which handles all of this for you.

joseluisjimenez1 commented 5 months ago

Any news on this?

RichardoC commented 5 months ago

I'd also be appreciative of this feature being added. In terms of the complexity, logstash already supports this so perhaps its solutions to the locking problem/etc can be reused? https://www.elastic.co/guide/en/logstash/current/plugins-inputs-kinesis.html

joseluisjimenez1 commented 5 months ago

I'd also be appreciative of this feature being added. In terms of the complexity, logstash already supports this so perhaps its solutions to the locking problem/etc can be reused? https://www.elastic.co/guide/en/logstash/current/plugins-inputs-kinesis.html

Looks like logstash is able to support cause is using the aws client, but there is no rust one at this moment: https://github.com/awslabs?q=kinesis-client&type=all&language=&sort=

I guess that we can call some java or python code from rust but I seems bad to me.

RichardoC commented 5 months ago

There is https://crates.io/crates/aws-sdk-kinesis linked to from https://awslabs.github.io/aws-sdk-rust/ so perhaps that have most of what's needed?

nikolay-te commented 5 months ago

There is a client for Rust, already being used by the kinesis sink, but the issue is in the complexity of synchronization and shard discovery. The Logstash input uses the AWS KCL that needs a DynamoDB table (e.g. similar to how Kafka uses zookeeper for state). @binarylogic's comment explains well the complexity behind this.

I think however another option could be to run the AWS Java KCL MultiLangDaemon and pipe the output to a Vector stdin source.