Venice

Derived Data Platform for Planet-Scale Workloads

Venice is a derived data storage platform, providing the following characteristics:

  1. High throughput asynchronous ingestion from batch and streaming sources (e.g. Hadoop and Samza).
  2. Low latency online reads via remote queries or in-process caching.
  3. Active-active replication between regions with CRDT-based conflict resolution.
  4. Multi-cluster support within each region with operator-driven cluster assignment.
  5. Multi-tenancy, horizontal scalability and elasticity within each cluster.

The above makes Venice particularly suitable as the stateful component backing a Feature Store, such as Feathr. AI applications feed the output of their ML training jobs into Venice and then query the data for use during online inference workloads.

Overview

Venice is a system which straddles the offline, nearline and online worlds, as illustrated below.

High Level Architecture Diagram

Write Path

The Venice write path can be broken down into three granularities: full dataset swap, insertion of many rows into an existing dataset, and updates of some columns of some rows. All three granularities are supported by Hadoop and Samza. In addition, any service can produce single-row inserts and updates asynchronously, using the Online Producer library. The table below summarizes the write operations supported by each platform:

                                                    Hadoop   Samza   Any Service
  Full dataset swap                                    ✓        ✓
  Insertion of some rows into an existing dataset      ✓        ✓         ✓
  Updates to some columns of some rows                 ✓        ✓         ✓
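
To make the single-row write path more concrete, here is a minimal Java sketch of a service producing asynchronous upserts. The SingleRowProducer interface, its method names, and the ProfileWriter wrapper are hypothetical stand-ins for the Online Producer library, not its actual API.

```java
import java.util.concurrent.CompletableFuture;

// Hypothetical producer interface standing in for the Online Producer library;
// the real client's class and method names may differ.
interface SingleRowProducer<K, V> {
  CompletableFuture<Void> put(K key, V value);
  CompletableFuture<Void> delete(K key);
}

public final class ProfileWriter {
  private final SingleRowProducer<String, String> producer;

  public ProfileWriter(SingleRowProducer<String, String> producer) {
    this.producer = producer;
  }

  // The write is acknowledged asynchronously, matching the asynchronous
  // ingestion model described above.
  public CompletableFuture<Void> upsertProfile(String memberId, String profileJson) {
    return producer.put(memberId, profileJson);
  }

  public CompletableFuture<Void> deleteProfile(String memberId) {
    return producer.delete(memberId);
  }
}
```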

Hybrid Stores

Moreover, the three granularities of write operations can all be mixed within a single dataset. A dataset which gets full dataset swaps in addition to row insertion or row updates is called hybrid.

As part of configuring a store to be hybrid, an important concept is the rewind time, which defines how far back recent real-time writes should be rewound and applied on top of the new generation of the dataset being swapped in.

Leveraging this mechanism, it is possible to overlay the output of a stream processing job on top of that of a batch job. If using partial updates, then it is possible to have some of the columns be updated in real-time and some in batch, and these two sets of columns can either overlap or be disjoint, as desired.
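
The following toy Java sketch illustrates the rewind-time semantics described above. It is not Venice code, and anchoring the rewind window at the start of the batch push is an assumption made purely for illustration.

```java
import java.time.Duration;
import java.time.Instant;

// Toy illustration of the rewind-time concept: when a new dataset version is
// swapped in, real-time writes newer than (pushStart - rewindTime) are
// replayed on top of it.
public final class RewindWindow {
  static boolean shouldReplay(Instant realTimeWriteTimestamp,
                              Instant batchPushStart,
                              Duration rewindTime) {
    Instant replayFrom = batchPushStart.minus(rewindTime);
    return !realTimeWriteTimestamp.isBefore(replayFrom);
  }

  public static void main(String[] args) {
    Instant pushStart = Instant.parse("2024-01-01T00:00:00Z");
    Duration rewind = Duration.ofHours(24);
    // Within the 24-hour rewind window: replayed on top of the new version.
    System.out.println(shouldReplay(Instant.parse("2023-12-31T12:00:00Z"), pushStart, rewind)); // true
    // Older than the window: already reflected in (or superseded by) the batch data.
    System.out.println(shouldReplay(Instant.parse("2023-12-29T12:00:00Z"), pushStart, rewind)); // false
  }
}
```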

Write Compute

Write Compute includes two kinds of operations, which can be performed on the value associated with a given key:

  1. Partial update: set the content of one or more fields within the value.
  2. Collection merging: add or remove entries in a set or map field.

N.B.: Currently, write compute is only supported in conjunction with active-passive replication. Support for active-active replication is under development.
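
As a conceptual illustration of partial updates (not the Venice implementation), the sketch below merges an update containing only a subset of fields into the current value, leaving all other fields untouched:

```java
import java.util.HashMap;
import java.util.Map;

// Conceptual sketch of what a partial update does to a stored value:
// only the fields present in the update are overwritten; every other field
// of the current value is preserved.
public final class PartialUpdateExample {
  static Map<String, Object> applyPartialUpdate(Map<String, Object> currentValue,
                                                Map<String, Object> update) {
    Map<String, Object> merged = new HashMap<>(currentValue);
    merged.putAll(update);
    return merged;
  }

  public static void main(String[] args) {
    Map<String, Object> current = new HashMap<>();
    current.put("batchFeature", 0.42);   // produced by the batch job
    current.put("realtimeFeature", 7L);  // produced by the stream processor
    Map<String, Object> update = Map.of("realtimeFeature", 8L);
    // batchFeature stays 0.42, realtimeFeature becomes 8
    System.out.println(applyPartialUpdate(current, update));
  }
}
```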

Read Path

Venice supports the following read APIs:

  1. Single get: retrieve the value associated with a single key.
  2. Batch get: retrieve the values associated with a set of keys.
  3. Read compute: project a subset of fields and/or execute computations (e.g. dot product, cosine similarity) on the fields of values associated with a set of keys.

Client Modes

There are two main modes for accessing Venice data:

  1. Classical Venice: remote queries against the Venice backend, via either the Thin Client or the Fast Client.
  2. Da Vinci: in-process caching, where some or all partitions of the dataset are eagerly loaded and queried locally.

The table below summarizes the clients' characteristics:

                                  Network Hops   Typical latency (p99)   State Footprint
  Thin Client                          2         < 10 milliseconds       Stateless
  Fast Client                          1         < 2 milliseconds        Minimal (routing metadata only)
  Da Vinci Client (RAM + SSD)          0         < 1 millisecond         Bounded RAM, full dataset on SSD
  Da Vinci Client (all-in-RAM)         0         < 10 microseconds       Full dataset in RAM

All of these clients share the same read APIs described above. This enables users to make changes to their cost/performance tradeoff without needing to rewrite their applications.
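
A minimal Java sketch of how application code might consume these read APIs follows. The VeniceReadClient interface is a hypothetical stand-in capturing the shared single-get / batch-get shape, not the concrete Venice client classes; because all client modes share this shape, switching modes is an instantiation change rather than a change to the calling code.

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.CompletableFuture;

// Hypothetical read interface modeling the shared single-get / batch-get
// APIs described above; the real client classes and signatures may differ.
interface VeniceReadClient<K, V> {
  CompletableFuture<V> get(K key);
  CompletableFuture<Map<K, V>> batchGet(Set<K> keys);
}

public final class FeatureLookup {
  private final VeniceReadClient<String, float[]> client;

  public FeatureLookup(VeniceReadClient<String, float[]> client) {
    this.client = client;
  }

  // Single get: fetch the feature vector for one member.
  public CompletableFuture<float[]> featuresFor(String memberId) {
    return client.get(memberId);
  }

  // Batch get: fetch feature vectors for a set of members in one call.
  public CompletableFuture<Map<String, float[]>> featuresFor(Set<String> memberIds) {
    return client.batchGet(memberIds);
  }
}
```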

Resources

The Open Sourcing Venice blog post and conference talk are good starting points for an overview of the use cases and scale Venice can support. For more Venice posts, talks and podcasts, see our Learn More page.

Getting Started

Refer to the Venice quickstart to create your own Venice cluster and play around with some features like creating a data store, batch push, incremental push, and single get. We recommend sticking to our latest stable release.

Community

Feel free to engage with the community using our:

  1. Slack workspace
  2. GitHub issues and discussions

Follow us to hear more about the progress of the Venice project and community:

  1. LinkedIn
  2. Twitter