vectordotdev / vector

A high-performance observability data pipeline.
https://vector.dev
Mozilla Public License 2.0

Support ETL use-cases in Vector #11095

Open jszwedko opened 2 years ago

jszwedko commented 2 years ago

We've had a number of different requests to support ETL-like use-cases in Vector and so I figured it'd be useful to create this issue to track them all in one place.

Currently Vector is architected for stream processing and doesn't support ETL execution very well. This is primarily due to the lack of source support for bulk execution where the source shuts down after all input has been processed.

Users have asked for this functionality for the file source and the aws_s3 source, but it is easy to see that it could be desirable for any archive-like source. It could even be useful for sources like kafka where it would drain a topic and then shut down.

Refs:

jszwedko commented 2 years ago

There is an implementation of batch handling in the file source here: https://github.com/vectordotdev/vector/pull/11667

davidjericho commented 1 year ago

I presently solve this using stdin and some excessive cat.

Running (cat logfile; kill -s TERM 0) | ./vector lets me run a metrics source concurrently while the logfile is processed, and exits vector once the logfile has been completely handled. If you do this inside a bash script, you need to run set -m first.
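
The workaround above can be sketched as a small bash script. This is only an illustration under a few assumptions: the helper name run_batch is hypothetical, Vector is assumed to be configured with a stdin source, and the check on $- is an extra safety guard that is not part of the original comment. As far as I understand it, set -m enables job control so that the pipeline runs in its own process group, which means kill -s TERM 0 signals the pipeline rather than the calling script.

```shell
#!/usr/bin/env bash
# Sketch of the stdin workaround; run_batch is a hypothetical helper name.

# Enable job control (monitor mode) so each pipeline gets its own process group.
set -m

# run_batch LOGFILE CMD [ARGS...]
# Feeds LOGFILE to CMD on stdin, then sends SIGTERM to the pipeline's process group.
run_batch() {
  local logfile=$1
  shift

  # "$-" lists the active shell options; "m" means job control is on.
  # Without it, "kill 0" would also signal the calling script, so bail out.
  case $- in
    *m*) ;;
    *) echo "run_batch: job control is not enabled (set -m)" >&2; return 1 ;;
  esac

  # The subshell streams the file, then signals process group 0, i.e. its own
  # group, which includes CMD. CMD is expected to shut down gracefully on SIGTERM.
  ( cat "$logfile"; kill -s TERM 0 ) | "$@"
}
```

It could then be called as run_batch app.log ./vector --config vector.yaml, where both file names are placeholders.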

obourdon commented 11 months ago

As suggested in some of the links referenced above, I tried adding remove_after_secs: 0 to my file source, thinking that maybe the use of inotify would induce some different behaviour, but it made no difference: I had to Ctrl-C vector to end the process after all files were processed and removed :-(

prein commented 1 month ago

I have a use case with http_client source. No clear "end" but maybe a timeout can be considered?

leandrojmp commented 1 week ago

@jszwedko

Hello, is this still being considered?

We have exactly this need. We are running vector inside a pod where another container gets data from an API endpoint and writes it into JSON files; vector runs in a separate container, reading these files and shipping them to kafka or an HTTP endpoint.

After the container running the collector script terminates, the container with vector keeps running. We are now looking into how we can kill this container, but it would be nice if vector supported this natively.

jszwedko commented 1 week ago

I think we are still open to it, but there is no concrete timeline.