vectordotdev / vector

A high-performance observability data pipeline.
https://vector.dev
Mozilla Public License 2.0
16.98k stars · 1.46k forks

Consider Vector as a library #4085

Open leebenson opened 3 years ago

leebenson commented 3 years ago

I think there's a huge opportunity to explore in offering Vector as a library. Not just for Rust, but as a first-class library for a range of languages, via native FFI.

Right now, Vector serves its agent role well. It's a go-between for source(s) and sink(s). It's fast, efficient, and can be deployed almost everywhere. But in the current paradigm, it's not a source itself. Something else fills the source. It's Vector's job to pick it up, transform it, and ship it.

By offering Vector as a library, a language/framework such as Node.js or Python could run Vector directly. Via a native API, they could configure transforms and sinks in code, and ship data by just calling (in this case) a regular JavaScript or Python function. No agent required, no sysadmin setup; it's a library that runs whenever the server is running, with the performance characteristics of (almost) pure Rust.

Here's an example of how that might look in JavaScript:

import { Client, ConsoleSink } from "@timberio/vector";

const vector = new Client({
  transforms: [],
  sinks: [
    new ConsoleSink({
      format: "json",
    })
  ],
})

// Then to log some data
const ack = await vector.log(["One line", "Another line"]);

Roughly equivalent to the following vector.toml:

[sources.generator]
  type = "generator"
  lines = ["One line", "Another line"]

[sinks.my_sink_id]
  # General
  type = "console"
  inputs = ["generator"]
  target = "stdout"

  # Encoding
  encoding.codec = "json" # required

Similarly, Python could look almost 1:1:

from vector import Client, ConsoleSink

vector = Client(sinks=[ConsoleSink(format="json")])

vector.log(['One line', 'Another line'])

Each lib would be native to its platform, with an API that respects the conventions/idioms of its host, but is ultimately a facade to the common Rust lib that is called via FFI.
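To make the proposed API shape concrete, here is a pure-Python stub that imitates what such a facade might look like. None of these names exist in a published library; a real implementation would forward `log()` calls to the shared Rust core over FFI rather than running the sink logic in Python.

```python
# Hypothetical sketch of the proposed Python facade. The Client/ConsoleSink
# names are illustrative only; a real library would delegate to the Rust
# core via FFI instead of implementing sinks in Python.
import json


class ConsoleSink:
    """Writes each event to stdout, JSON-encoded when format='json'."""

    def __init__(self, format="text"):
        self.format = format

    def emit(self, event):
        line = json.dumps({"message": event}) if self.format == "json" else event
        print(line)
        return line


class Client:
    """Facade that would hand events to the Rust core over FFI."""

    def __init__(self, transforms=None, sinks=None):
        self.transforms = transforms or []
        self.sinks = sinks or []

    def log(self, events):
        delivered = []
        for event in events:
            for transform in self.transforms:
                event = transform(event)
            for sink in self.sinks:
                delivered.append(sink.emit(event))
        return delivered  # stand-in for a delivery acknowledgement


vector = Client(sinks=[ConsoleSink(format="json")])
ack = vector.log(["One line", "Another line"])
```

The point of the stub is the surface area, not the behavior: construct once, reconfigure in code, and get an acknowledgement back from each send.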

Potential benefits

Considerations

pepoluan commented 3 years ago

I'd love to pump my logs directly from my Python program to Vector. Has anyone created a logging handler for Vector?
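A minimal handler along these lines can be built on the standard `logging` module, shipping each record to a running Vector instance over HTTP. This is a sketch, not a published handler: it assumes a Vector instance configured with an HTTP source listening at `endpoint`, and it does no batching, retries, or TLS.

```python
# Sketch of a logging.Handler that POSTs records to a Vector HTTP source.
# Assumes Vector is already running with an HTTP source at `endpoint`.
import json
import logging
import urllib.request


class VectorHandler(logging.Handler):
    def __init__(self, endpoint="http://localhost:8080"):
        super().__init__()
        self.endpoint = endpoint

    def emit(self, record):
        # One JSON object per record; Vector's HTTP source can parse this.
        payload = json.dumps({
            "message": self.format(record),
            "level": record.levelname,
            "logger": record.name,
        }).encode("utf-8")
        req = urllib.request.Request(
            self.endpoint,
            data=payload,
            headers={"Content-Type": "application/json"},
        )
        try:
            urllib.request.urlopen(req, timeout=5)
        except OSError:
            self.handleError(record)
```

A synchronous `urlopen` per record is the simplest possible design; anything production-grade would want a background queue (e.g. `logging.handlers.QueueHandler`) so log calls don't block on the network.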

Bigomby commented 3 years ago

I find this proposal quite interesting, but I have one question:

I think that similar tools like fluentd and fluent-bit (and currently Vector) need to be restarted in order to pick up configuration changes. Will this feature allow configuring Vector at runtime?

Following your Node.js example, someone could go even further and use Express to manage sinks or transforms via an API. Even if that involves tearing down the previous instance and creating a new one, it would be much more convenient than stopping the agent, updating the configuration file, and then starting the agent again.

leebenson commented 3 years ago

> I think that similar tools like fluentd, fluentbit (and currently Vector) need to be restarted in order to update some configuration. Will this feature allow configuring Vector at runtime?

The scenario described in this issue is that the app itself is Vector. Its runtime is that of the application. There's no separate agent; it exists for as long as the app is running. Updating configuration in this scenario would just be a case of calling vector.updateConfig() (or similar) and the new config would apply immediately.

There's a second option, which is for Vector to exist independently, but to interface with it (possibly over gRPC or similar) by providing logs. In this way, the app is effectively just a source, shipping logs/event data to Vector. This could still be written in Rust and exposed via FFI; it would theoretically still be more performant in many scenarios where processing or transmission of event data is CPU bound, but practically may be constrained more by I/O. This would be a much slimmer runtime, since its only concern would be shipping data to a separate, running Vector instance that exists outside of the app, locally or remote.
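For what it's worth, this second option can be approximated today with a standalone Vector instance exposing an HTTP source that the app posts events to. A sketch of such a config (source type and option names may differ between Vector versions, so treat this as illustrative):

```toml
# Sketch: a standalone Vector instance acting as the app's log endpoint.
# The app POSTs JSON events to the HTTP source; no library needed yet.
[sources.app]
  type = "http"
  address = "0.0.0.0:8080"
  encoding = "json"

[sinks.out]
  type = "console"
  inputs = ["app"]
  encoding.codec = "json"
```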

I think that's an interesting option too. One of the primary motivations for having Vector live within an app, as a library exposed via a foreign-function interface, is removing the need to configure and deploy it as a separate agent. I'm imagining a common use-case would be spinning up, say, a web app whose only source is the app itself, with maybe a transform or two and 1-2 sinks.

Currently, this means deploying Vector as a standalone agent, and configuring a source that the app would itself need to dump out to. The scenario I'm describing in this issue effectively removes the need to deploy an agent and create a separate source. The app becomes both the agent and source.

Whether or not this is a viable or interesting option remains to be seen. I'm very interested in hearing use-cases and gathering feedback.

> Following your Node.js example, someone could go even further and use Express to manage sinks or transforms via API. Even if that involves tearing down the previous instance and creating a new one, it would be much more convenient than stopping the agent, updating the configuration file, and starting the agent again.

Yes, this is exactly what I imagined. Since the app itself 'houses' a running agent via its runtime, it could be configured on-the-fly via a language-native API. There needn't be any SIGHUP, or secondary processes to manage. If this were a Node.js or Python app, you'd just import the vector library, configure it through methods, and send event data directly to it.
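A sketch of what that on-the-fly reconfiguration could look like, again with hypothetical names (`Client`, `update_config`, `MemorySink` are all illustrative, not a real API): swapping the sink set is just a method call, with no SIGHUP or process restart.

```python
# Hypothetical sketch of runtime reconfiguration: update_config() swaps
# the sink set atomically, so the next log() call uses the new config.
import threading


class MemorySink:
    """Toy sink that collects events in memory."""

    def __init__(self):
        self.events = []

    def emit(self, event):
        self.events.append(event)


class Client:
    def __init__(self, sinks=None):
        self._lock = threading.Lock()
        self._sinks = list(sinks or [])

    def update_config(self, sinks):
        # New config applies immediately; no restart, no SIGHUP.
        with self._lock:
            self._sinks = list(sinks)

    def log(self, events):
        with self._lock:
            for event in events:
                for sink in self._sinks:
                    sink.emit(event)


first, second = MemorySink(), MemorySink()
client = Client(sinks=[first])
client.log(["before"])
client.update_config(sinks=[second])  # e.g. triggered by an Express-style route
client.log(["after"])
```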

wperron commented 3 years ago

Hey, I'm actually considering something very similar at the moment for Deno. Since we are effectively wrapping the V8 engine in Rust and already have the infrastructure in place to hook extensions up to V8, I figured this should be relatively easy, provided of course that I can import Vector as a Rust crate.

In this case I wouldn't need to go through any FFI or gRPC layer, just straight in-memory calls. To start, I was thinking of keeping the size of the dependency to a minimum by only importing the log sinks; I believe that's already possible using Cargo features.
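If Vector were importable as a crate, trimming it down with feature flags might look roughly like this. Vector isn't published on crates.io, so this uses a Git dependency, and the feature name below is an assumption drawn from the repo's Cargo.toml layout (one flag per component), not a documented API:

```toml
# Sketch: depending on Vector straight from Git, build trimmed down to
# a single sink via Cargo features (feature name is an assumption).
[dependencies.vector]
git = "https://github.com/vectordotdev/vector"
default-features = false
features = ["sinks-console"]
```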

Anyway, if there's interest there, just ping me in the Discord server.

leebenson commented 3 years ago

@wperron, we'd love to support your work with Deno. Interested to hear more about your intended use-case, and how we can help. Let's catch up on Discord, or ping me at lee.benson@datadoghq.com

avnerbarr commented 3 years ago

Is this similar in any way to this? https://vector.dev/guides/advanced/wasm-multiline/

jszwedko commented 3 years ago

@avnerbarr that is related, but in the other direction: letting Vector run arbitrary user code (compiled into WASM) for transforms, rather than letting user code run Vector code.

niyue commented 2 years ago

I am interested in this proposal, and posted a discussion thread here

In my use case, I would like to integrate Vector into my C++ project because:

1. the sink connection information is only known at runtime (it is provided by my users), so I need an API to create sinks dynamically
2. I would like finer-grained feedback through the API; for example, after sending a new event to Vector, I would like to know whether it was successfully delivered to the sink
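On the delivery-feedback point, Vector's end-to-end acknowledgements may already cover part of this at the agent level: the source only acknowledges an event back to the client once the sink has accepted it. A sketch (option names taken from Vector's docs, worth verifying against your version):

```toml
# Sketch: end-to-end acknowledgements make the HTTP source's response
# reflect whether the event actually reached the sink.
[sources.app]
  type = "http"
  address = "0.0.0.0:8080"
  acknowledgements.enabled = true
```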

djrodgerspryor commented 2 years ago

I'm quite interested in this. At Stile, we run lots of different services and are planning to switch to Vector (in place of Filebeat) to collect our syslogs on each host, but we also run a lot of little Rust binaries: as services, in CI, and as CLIs on operator laptops, etc.

It would be lovely for these little self-contained binaries to be able to log directly to the same sinks as our main application, without us having to run some kind of gateway Vector instance.

I'd be interested in contributing to this if it's an accepted direction.

ambroserb3 commented 1 year ago

I'm also interested in this! I'm an MLOps Engineer and we have a lot of Python tooling where we run one-off jobs with a lot of data in motion. Being able to pull in data from a source using Vector, transform it with Python, and send it to a sink has a lot of potential for logging model metrics, predictions, and other data during training and other ML jobs. I especially see value in adding this to our running inference services for easily logging predictions.

galah92 commented 1 year ago

Also interested in this. I'm working on an on-prem IoT solution and currently managing it with docker-compose and K3s. Vector runs there as well. It's fairly complex, and I'm thinking of simplifying it down to a single binary to ease deployments. Using Vector as a library would help here a lot.

marsupialtail commented 1 year ago

I am the developer of a new distributed dataframe library for time series ETL/analytics: https://github.com/marsupialtail/quokka.

I am considering adding support for VRL and stumbled upon this thread. If I get ten upvotes for this comment I will commit to developing Python bindings for Vector.

spencergilbert commented 1 year ago

@marsupialtail - it's worth noting that if you're interested in VRL specifically, we've recently pulled that into its own repo with the aim of releasing it as a single crate.

marsupialtail commented 1 year ago

This is amazing. I will be keeping track of it.

marsupialtail commented 1 year ago

For now, I am going to try integrating Vector into my library by running it as a sidecar. I will explore the crate + Python bindings once that happens.

Quick question @spencergilbert: is there a way to tell what "end schema" a VRL definition will result in?

fuchsnj commented 1 year ago

> Quick question @spencergilbert is there a way for you to tell what the "end schema" of a VRL definition will result in?

After compiling a VRL program, you will get a result that contains a Program struct. This has a final_type_state() function that gives you access to the type definitions.

https://github.com/vectordotdev/vrl/blob/650547870a16c66dcfab01ec382cfdc23415d85b/lib/compiler/src/program.rs#L26

gaby commented 2 months ago

Was any progress made on this proposal? The last comment was about a year ago. Using Vector as a library would be huge for languages like Python, which already has a lot of Rust-based libraries (Ruff, Pydantic, OpenDAL, etc.)