vectordotdev / vector

A high-performance observability data pipeline.
https://vector.dev
Mozilla Public License 2.0

Delta Lake Source (eventually sink) #6317

Open rwaweber opened 3 years ago

rwaweber commented 3 years ago

Current Vector Version

0.11.1 -- sorta N/A

Use-cases

Delta Lake is a "storage layer that provides ACID guarantees for Apache Spark and Big Data workloads." Typically those big data workloads are hosted on cloud storage systems (S3, Azure Blob Storage) or Hadoop.

Using Vector as a means to read from Delta (and eventually write, once that's supported in their Rust bindings) provides a compelling way to consume information via their streaming API (I know it's very Scala-heavy, bear with me!). In this form, these events can be used in nearly the same way as events from other queueing systems (Kafka, Kinesis, SQS, etc.).

Attempted Solutions

N/A

Proposal

It would be super exciting to have a streaming integration between Vector and Delta. While the documentation seems very Scala/Spark-heavy, they do have Rust bindings that appear to be in active development (though marked experimental).

From my cursory pass at some of the samples, I think the bindings may be a little too unstable to use at the time of writing, but I mostly wanted to get the idea down somewhere. Something to circle back to once the Delta Rust bindings stabilize a bit more -- let me know what you all think!
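For concreteness, here's a rough sketch of what the read side could look like against the experimental Rust bindings. This is a sketch under assumptions, not working Vector code: the `deltalake` crate's `open_table`, `version`, `update`, and `get_file_uris` calls reflect my reading of the current docs and may shift as the bindings stabilize, and the table path is made up.

```rust
// Sketch only: poll a Delta table's transaction log for new versions,
// roughly what a streaming source would do. Assumes the experimental
// `deltalake` crate and tokio; the table path below is hypothetical.
use std::time::Duration;

#[tokio::main]
async fn main() -> Result<(), deltalake::DeltaTableError> {
    // open_table reads the _delta_log directory to reconstruct the
    // latest committed snapshot of the table.
    let mut table = deltalake::open_table("./data/events").await?;
    let mut last_seen = table.version();

    loop {
        // Refresh the snapshot to the newest committed version.
        table.update().await?;
        if table.version() > last_seen {
            // A real source would read only the files added since
            // `last_seen`; here we just list the current data files.
            for uri in table.get_file_uris()? {
                println!("data file: {uri}");
            }
            last_seen = table.version();
        }
        tokio::time::sleep(Duration::from_secs(10)).await;
    }
}
```

The polling loop stands in for whatever change-data mechanism the bindings eventually expose; the appealing part for Vector is that each new table version identifies exactly which data files were added, much like offsets in a queueing system.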


lucperkins commented 3 years ago

@rwaweber It definitely does seem like an intriguing platform and we're always game to consider new sources and sinks. What I think we should make clear at the outset, though, is that Vector is 100% focused on the observability sphere and thus on sources/sinks that are commonly used for logs and metrics. In order for us to consider adding an integration for Delta Lake, we'd need to see traction in that specific domain. But if people are indeed using Delta Lake for that, we're all ears.

rwaweber commented 3 years ago

Hey @lucperkins! Thanks for the reply -- sorry it took me so long to get around to responding!

I think I see what you're getting at, but feel free to correct me if I'm misunderstanding:

Since Delta Lake is largely a big-data tool and more of a general-purpose one than an observability-specific one, it's kind of hard to pin down. Is that right?

I think I get that, as it would be somewhat similar to building something like a Hadoop source, or more generally an RDBMS-type source, right? As a source, that concern definitely makes sense given the ambiguous structure of the origin data, which could make it hard to define a consistent event schema.

Though as a sink, I do see it being a little easier to flesh out, more or less as a general-purpose data warehouse, along the same lines as ClickHouse or Elasticsearch in some cases.

Today, we're writing events out to S3, which are then read by Spark and serialized into Delta tables as part of a scheduled job.

Most of my motivation in exploring Delta as a Vector sink is to remove that stage of our pipeline, since it mostly consists of reading and uncompressing archives from S3 and casting newline-delimited JSON objects to rows in Delta.
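To make that stage concrete, here's a minimal sketch of the decode-and-cast step in Rust, assuming gzipped newline-delimited JSON archives like the ones Vector's `aws_s3` sink can produce. The `flate2` and `serde_json` crates are assumed, and fetching the archive bytes from S3 is omitted.

```rust
// Sketch only: turn one gzipped NDJSON archive into JSON values, the
// per-archive work the scheduled Spark job is doing today.
use std::io::{BufRead, BufReader};

use flate2::read::GzDecoder;
use serde_json::Value;

fn rows_from_archive(gz_bytes: &[u8]) -> serde_json::Result<Vec<Value>> {
    // Stream-decompress the archive and split it back into lines.
    let reader = BufReader::new(GzDecoder::new(gz_bytes));
    reader
        .lines()
        .filter_map(|line| line.ok())
        .filter(|line| !line.is_empty())
        // Each non-empty line is one event, destined for one Delta row.
        .map(|line| serde_json::from_str::<Value>(&line))
        .collect()
}
```

A native Delta sink would skip all of this: the events are already structured inside Vector, so they could be appended to the table directly instead of round-tripping through compressed archives.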

But on the other hand, I definitely understand avoiding bloat from a sink that isn't terribly popular! FWIW, I wouldn't be opposed to approaching this as a WASM sink module at some point in the future!