Fluvio & InfinyOn Cloud users need the ability to perform operations on a time-bound window and generate a materialized view.
Motivation
Today, Fluvio supports record-by-record processing with the ability to apply transformations one record at a time. When a multi-record, stream-based operation is required, Fluvio users create a microservice that reads the records, applies an operation, and writes the result to a new stream. Unfortunately, these microservices are managed independently of Fluvio, significantly increasing the complexity of building simple real-time apps.
This PRD is a proposal to add the ability to compute aggregates inside Fluvio. This solution should eliminate the need to deploy and operate separate Microservices to perform stream-based computations.
In a larger context, time-based computations bring Fluvio closer to Flink and Spark, where our users won’t need to run multiple stacks to perform sums, averages, anomaly detections, etc.
Requirements
Fluvio's data streaming layer (aka topic/partitions) will continue operating as before. The stream processing component is an additional layer that runs on top. This stream-processing engine is defined as a separate object, as described below.
Example Use Case
We'll begin by describing a data streaming use case and a data set that we'll use to implement it.
Use Case
We want to build a data pipeline that captures the usage of cloud servers in terms of network, storage, and compute. In addition, we want to apply the price per unit and calculate the overall cost. The cost is calculated every minute and resets at each month's end.
Data Sets
The data sets are two data streams: metrics and pricing.
Metrics
Each metrics event has a key and a value. The key is the server name, and the value stores the metric type, value, and timestamp:
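The PRD does not spell out the record layout; a metrics event might look like the following sketch (field names are illustrative, the metric type and timestamp format are taken from the examples later in this document):

```json
{
  "key": "server-0001",
  "value": {
    "metric": "network",
    "value": 512,
    "ts": "2023-02-18 06:41:48"
  }
}
```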
Pricing
The Pricing data set stores the price per metric and the timestamp when it was updated:
Next, we'll use the data sets to create the materialized views.
Materialized Views
Fluvio operates on binary records, where the data interpretation is opaque to the system. However, with stream-based computation, the system must understand the data it operates on.
Defining a materialized view in Fluvio requires the following steps:
1. Define a column schema yaml definition.
2. Create a topic and apply the column schema.
3. Define a materialized view yaml file.
4. Create a view and apply the materialized view definition.
Joining materialized views is a derivative of the operations above, where a materialized view references another to derive a composite result.
Let's create the metrics materialized view that computes an aggregate for each server and metric for the current month.
1. Define a Column Schema Definition
The column schema definition file has a dual purpose: to validate and map records from the data stream into a memory representation.
The expected input format is json, and the data mapping is performed based on name. The order of the items in the file defines the order in the resulting table.
Definitions
key reads this field from the key of the record.
There can be at most one key column per schema file.
optional allows records without this field to be parsed successfully.
Optional columns are internally represented as rust Options.
validate invokes a smartmodule that ensures the record is compliant.
map invokes a smartmodule to convert an item into the desired output.
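Putting the definitions together, a metrics-columns.yaml schema file might look like the following sketch. The overall file layout and the smartmodule names are assumptions; only the key, optional, validate, and map attributes come from the definitions above:

```yaml
# metrics-columns.yaml -- hypothetical layout; attribute names follow the
# definitions above, the overall file shape is an assumption.
columns:
  - name: server
    type: string
    key: true                    # read from the record key; at most one per schema
  - name: metric                 # metric type, e.g. "network"
    type: string
    validate: check-metric-type  # smartmodule ensuring compliance (name is hypothetical)
  - name: value
    type: u64
  - name: ts
    type: timestamp
    map: parse-timestamp         # smartmodule converting the item (name is hypothetical)
  - name: region
    type: string
    optional: true               # records without this field still parse (rust Option)
```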
2. Create a Topic and Apply the Column Schema
Create a topic and apply the column schema. These topics are columnar topics.
Create a Columnar Topic
Create a columnar topic called metrics as follows:
Use the metrics.json file we defined above to load events into the topic:
Inspect the Topic
Inspect the uploaded data. While columnar topics can natively produce tables, they require --output table for backward compatibility.
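The PRD elides the exact commands. Assuming the CLI follows the same shape as fluvio view create later in this document, the workflow might look like this sketch (the --columns flag is an assumption; --output table is the flag named by the PRD):

```shell
# Hypothetical commands -- the PRD does not spell these out.
# Create the topic and attach the column schema (flag name assumed):
fluvio topic create metrics --columns metrics-columns.yaml

# Load the sample events defined earlier:
fluvio produce metrics -f metrics.json

# Inspect the uploaded data as a table (per the PRD, --output table is
# required for backward compatibility):
fluvio consume metrics -B --output table
```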
Columnar topics are identified by the COLUMNS flag:
$ fluvio topic list
NAME     COLUMNS  TYPE      PARTITIONS  REPLICAS  RETENTION  COMPRESSION  STATUS       REASON
metrics  Y        computed  1           1         7days      any          provisioned
topic-1           computed  1           1         7days      any          provisioned
3. Create a Materialized View Configuration file
Create a materialized view definition file, usage-view.yaml, to describe the target topic, operation, and output of the materialized view:
Definitions
topic - target topic for the materialized view (restricted to columnar topics).
window - defines the range (from until now) and the refresh interval:
from - python code to compute the start of the window, e.g. the first day of the month, "2023-03-01T00:00:00.000Z", converted to a millisecond timestamp: 1677628800000.
interval - the time interval after which a new window is recomputed (humanized).
groupBy - groups records for the operation specified in the output.
conditions (optional) - allows for additional query refinement: =, <, >, and, or, not.
field - the column to be displayed in the output.
operation - a computation to perform: sum, aggregate, min, max, count, combined with +, -, *, /, e.g. sum(count * price).
label - renames the column in the output.
Note: A columnar topic may have as many materialized views as desired.
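A usage-view.yaml file might look like the following sketch. The field names (topic, window, from, interval, groupBy, output, field, operation, label) are the ones this PRD defines, but the exact layout is an assumption:

```yaml
# usage-view.yaml -- hypothetical layout based on the PRD's definitions.
topic: metrics                   # restricted to columnar topics
window:
  from: "first_day_of_month()"   # python code computing the window start (hook name assumed)
  interval: 1m                   # recompute the window every minute
groupBy: [server, metric]
output:
  - field: server
  - field: metric
  - field: value
    operation: sum(value)
    label: total
```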
4. Create a View and Apply the Materialized View Definition
We are introducing a new object called view to manage materialized views.
Create a View Object
Apply the configuration file to create a materialized view object. The materialized view begins stream processing as soon as it is applied.
Inspect the View
List views:
Consume from the View
Consuming from a view is similar to consuming from a topic, except that the output is in table format.
Streaming (default)
The table is automatically updated at the refresh interval.
Snapshot
The table is retrieved, and the command exits. Reading the view multiple times retrieves the same values until the next refresh interval:
$ fluvio view consume usage -d
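The PRD describes the window's from boundary as python code that computes the first day of the month and converts it to a millisecond timestamp. A minimal sketch of that computation (the function name is illustrative):

```python
from datetime import datetime, timezone

def window_start_ms(now: datetime) -> int:
    """Return the first day of now's month as a millisecond UTC timestamp."""
    start = now.replace(day=1, hour=0, minute=0, second=0, microsecond=0)
    return int(start.timestamp() * 1000)

# A window evaluated any time during March 2023 starts at
# 2023-03-01T00:00:00.000Z, i.e. 1677628800000 ms.
print(window_start_ms(datetime(2023, 3, 17, 6, 41, 48, tzinfo=timezone.utc)))
```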
View Commands
The view object has the following commands:
$ fluvio view -h
create Provision a view
consume Read the table produced by the view
eval Invoke an API from the view
list List all views
describe Show configuration parameters
delete Delete a view
In summary, to create a materialized view, we need to:
Build a column definition schema
Create a columnar topic
Build a materialized view definition file
Create the view
Join Materialized Views
Join is the most requested operation for materialized views. In this section, we'll add a pricing view and join it with usage to compute the usage cost.
Joining materialized views has two steps:
Create the view providing the data
Create the view consuming the data
Let's get started.
1. Create the View Providing the Data
The view providing the data is a pricing view. We'll go through the same steps as above to create this view.
Create a pricing Columnar Topic
Create the column schema definition file pricing-columns.yaml, then create the pricing columnar topic and add the pricing data from the JSON file defined above to the topic.
Create a pricing View
Create the pricing-view.yaml definition file:
In addition to the output, this materialized view also offers a smartmodule called getPrice. The smartmodule takes two parameters and returns the price for the specified timestamp. The smartmodule was built by john, who published it in the smartmodule hub.
Create the pricing view:
$ fluvio view create pricing pricing-view.yaml
view created
Consume from the pricing view and test the getPrice API:
2. Create the View Consuming the Data
In our use case, the consumer of the pricing view is the usage view, with a new column for the price. Let's define a new usage-pricing-view.yaml view:
This view defines a new derivedColumn that evaluates pricing.getPrice with the metric and timestamp values from the metrics topic and returns the result in cost. The cost is used in the output to compute the final result.
Definitions
derivedColumns allows a view to reference smartmodules from other views.
field is the name of the new column.
eval is the routine to be invoked:
before the . is the view name: pricing
after the . is the API to evaluate: getPrice
parameters reference values in the metrics table:
$.metric - the metric value in the current row
$.ts - the timestamp in the current row
The result is stored in the cost column; for ts="2023-02-18 06:41:48" and metric="network", it would be 0.6.
The operation sum(cost * count) takes the cost from the derived column and multiplies it by count.
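Assembled from the definitions above, the derivedColumns portion of usage-pricing-view.yaml might look like this sketch (the exact file layout is an assumption; the field, eval, parameters, operation, and label names come from this PRD):

```yaml
# usage-pricing-view.yaml (fragment) -- hypothetical layout.
topic: metrics
derivedColumns:
  - field: cost                       # name of the new column
    eval: pricing.getPrice            # <view name>.<API to evaluate>
    parameters: ["$.metric", "$.ts"]  # values from the current metrics row
output:
  - field: server
  - field: metric
  - field: cost
    operation: sum(cost * count)      # derived cost multiplied by count
    label: total-cost
```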
Create usage-pricing view:
$ fluvio view create usage-pricing usage-pricing-view.yaml
view created
Consume from the usage-pricing view:
In summary, to create a join, we need to:
Create the view providing the data
Create the view consuming the data