Motivation:
Understanding millions of related events and how they relate within a dataflow is extremely difficult. Events span serverless functions, stream processors and microservices, so reconstructing that story and extracting instrumentation knowledge from it is virtually impossible today. It should be possible to build a queryable dataflow model that supports instrumentation. The result would be the ability to scan millions of events (historically and in real time), break them down by different attributes, identify anomalies, drill into the set of interest, and then down to a single dataflow with relevant contextual information.
For example:
- millions of IoT events -> the devices generating errors or performing slowly
- millions of payments -> system latency degradation over time, identifying the types of payments or attributes affecting performance
Proposed change
The ability to build a dataflow model that supports analytics such as the number of flows and the latency between steps; break millions of dataflows down by different attributes; and filter by analytics, attributes, latency and more. In some ways it is equivalent to distributed tracing, except that it:
- also includes related log lines against a correlation-id (providing context)
- doesn't require special infrastructure (JSON data stored on cloud storage)
- is extensible (i.e. it can be updated to support CloudEvents in future)
- has low impact on existing systems
V1 Mechanics:
Note limitations:
- supports a single correlation-id (no propagation, fan-out, etc.)
- rewrites correlation data (duplication)
Stage 1. Rewrite dataflows by time-correlation
Format: {timestamp}-corr-id.log -> records[timestamp:data]
This makes each stage of the dataflow an append-only log of that stage of the correlation's dataflow. Timestamps distinguish between the different states of the flow.
For example, the Client -> function -> client flow shown under Datamodel below yields 2 data files (Client-app1.log and ServerlessFunction.log).
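A minimal sketch of Stage 1 in Python, assuming the raw logs are newline-delimited JSON with ts and corid fields; the function name and directory layout are illustrative, not part of the proposal:

```python
import json
from collections import defaultdict

def rewrite_by_correlation(raw_log_path: str, out_dir: str) -> None:
    """Group raw JSON log lines by correlation id and append them to
    per-correlation files named {first-timestamp}-{corid}.log."""
    flows = defaultdict(list)
    with open(raw_log_path) as f:
        for line in f:
            record = json.loads(line)
            flows[record["corid"]].append(record)
    for corid, records in flows.items():
        records.sort(key=lambda r: int(r["ts"]))  # timestamps order the stages
        out_path = f"{out_dir}/{records[0]['ts']}-{corid}.log"
        with open(out_path, "a") as out:  # append-only, one log per correlation
            for record in records:
                out.write(json.dumps(record) + "\n")
```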
Stage 2. Calculate dataflow-level analysis (.index)
For each dataflow, generate span information into a time-n-corr-m.index file.
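A sketch of Stage 2 under the same assumptions; the time-n-corr-m.index naming comes from above, but the span rule (every record carrying a stage label marks a span boundary) is an assumption:

```python
import json

def build_index(flow_log_path: str, index_path: str) -> None:
    """Derive span information (stage label + timestamp) from one
    correlation log and write it as a .index file."""
    spans = []
    with open(flow_log_path) as f:
        for line in f:
            record = json.loads(line)
            if "stage" in record:  # only staged records mark span boundaries
                spans.append({"stage": record["stage"], "ts": int(record["ts"])})
    with open(index_path, "w") as out:
        json.dump({"spans": spans}, out)
```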
Stage 3. Build a per-day aggregate view of correlation data
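Stage 3 could then fold one day's .index files into a single aggregate keyed by stage pair; the glob pattern and output shape are illustrative:

```python
import glob
import json
from collections import defaultdict

def aggregate_day(index_glob: str, out_path: str) -> None:
    """Fold a day's .index files into one view: per consecutive stage
    pair, the elapsed times observed across all flows."""
    elapsed = defaultdict(list)
    for path in glob.glob(index_glob):
        with open(path) as f:
            spans = json.load(f)["spans"]
        for a, b in zip(spans, spans[1:]):  # consecutive stage pairs
            elapsed[f"{a['stage']}->{b['stage']}"].append(b["ts"] - a["ts"])
    with open(out_path, "w") as out:
        json.dump(elapsed, out)
```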
Datamodel:
I.e. Client -> function(request) -> client(result)
Client-app1.log
{ ts: '1234010', corid: 'aaaaa' }
{ ts: '1234011', corid: 'aaaaa', stage: 'request-calc' }

ServerlessFunction.log
{ ts: '1234021', corid: 'aaaaa', stage: 'bootstrap', serverlessContextId: '1234:name:version' }
{ ts: '1234022', corid: 'aaaaa', stage: 'starting', serverlessContextId: '1234' }
{ ts: '1234023', corid: 'aaaaa', msg: 'processing some stuff for things' }
{ ts: '1234024', corid: 'aaaaa', stage: 'finished', serverlessContextId: '1234' }
Lambda finished processing: memory:1234 time:2345

Client-app1.log
{ ts: '1234025', corid: 'aaaaa', stage: 'calc-finished' }
Single call attributes collected:
- Raw request start-time (corrid, timestamp, stage)
- Fn load-time (fnId, timestamp)
- Fn handler start (corrid, fnId, timestamp, stage)
- Fn handler end (corrid, fnId, timestamp, stage, fn-stats)
- Response received-time (corrid, timestamp, stage)
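As a sketch, these attributes map onto a record type like the following; the field names are assumptions based on the log examples above:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class CallRecord:
    """One collected attribute set for a single call."""
    corid: str
    ts: int
    stage: Optional[str] = None      # request/handler/response stage label
    fn_id: Optional[str] = None      # serverlessContextId on function records
    fn_stats: Optional[dict] = None  # e.g. memory/time reported at handler end
```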
Analytics
Call chain analytics:
Stage1->Stage2 elapsed (stages are labels on the log line)
Stage2->Stage3 elapsed
...etc...
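A sketch of the per-flow computation, assuming the span list produced by Stage 2 (timestamps already parsed to ints):

```python
def stage_elapsed(spans: list[dict]) -> dict[str, int]:
    """Elapsed time between consecutive stages of one flow,
    e.g. {'request-calc->bootstrap': 10, ...}."""
    return {
        f"{a['stage']}->{b['stage']}": b["ts"] - a["ts"]
        for a, b in zip(spans, spans[1:])
    }
```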
Simple timeseries analysis: represented as a candle-chart?
Individual stage analytics using percentile/max/avg: stageA->stageB elapsed percentiles (label stageA->stageB). Expression:
analytics.dataflow.stageStats(dataset-name?)
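One possible implementation behind a stageStats-style expression, operating on the per-day aggregate from Stage 3; the percentile choice is illustrative:

```python
import statistics

def stage_stats(elapsed_by_pair: dict[str, list[int]]) -> dict[str, dict]:
    """percentile/max/avg per stage pair; a stand-in for
    analytics.dataflow.stageStats(dataset-name)."""
    return {
        pair: {
            "avg": statistics.mean(values),
            "max": max(values),
            # 95th percentile; quantiles() needs at least 2 samples
            "p95": statistics.quantiles(values, n=20)[18],
        }
        for pair, values in elapsed_by_pair.items()
    }
```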
End-to-end stage analytics: Stage1->Stage4 elapsed using percentiles. Expression:
analytics.dataflow.endToEndStats(dataset-name?)
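An endToEndStats-style computation might measure first stage to last stage per flow; the percentile indexing is a deliberate simplification:

```python
def end_to_end_stats(flows: list[list[dict]]) -> dict:
    """First-stage to last-stage elapsed across flows; a stand-in for
    analytics.dataflow.endToEndStats(dataset-name)."""
    totals = sorted(
        spans[-1]["ts"] - spans[0]["ts"] for spans in flows if len(spans) > 1
    )
    if not totals:
        return {"count": 0}
    return {
        "count": len(totals),
        "p50": totals[len(totals) // 2],
        "p99": totals[int(len(totals) * 0.99)],
    }
```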
Filter stage-stage latency by value: Stage1->Stage4 elapsed > 100ms. Expression:
analytics.dataflow.stageFilter(stage1 - stage4 > 100)
analytics.dataflow.endToEndStats(dataset-name?)
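A stageFilter-style predicate could be evaluated per flow before the stats step; the stage names and 100ms threshold are the ones from the example above:

```python
def stage_filter(flows: list[list[dict]], start: str, end: str,
                 threshold_ms: int) -> list[list[dict]]:
    """Keep only flows whose start->end elapsed exceeds the threshold;
    a stand-in for analytics.dataflow.stageFilter(stage1 - stage4 > 100)."""
    def elapsed(spans):
        ts = {s["stage"]: s["ts"] for s in spans}
        return ts.get(end, 0) - ts.get(start, 0)
    return [spans for spans in flows if elapsed(spans) > threshold_ms]
```

The filtered flows can then feed end_to_end_stats, mirroring the chained expression above.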
Branch RPC
Client-app1.log
{ ts: '1234010', corid: 'aaaaa' }
{ ts: '1234011', corid: 'aaaaa', stage: 'request-calc-1' }
{ ts: '1234011', corid: 'aaaaa', stage: 'request-calc-2' }
{ ts: '1234011', corid: 'aaaaa', stage: 'request-calc-3' }

ServerFunction.logs (single source file)
{ ts: '1234021', corid: 'aaaaa', stage: 'bootstrap', serverlessContextId: '1111:name:version' }
{ ts: '1234022', corid: 'aaaaa', stage: 'starting', serverlessContextId: '1111' }
{ ts: '1234024', corid: 'aaaaa', stage: 'finished', serverlessContextId: '1111' }
{ ts: '1234021', corid: 'aaaaa', stage: 'bootstrap', serverlessContextId: '2222:name:version' }
{ ts: '1234022', corid: 'aaaaa', stage: 'starting', serverlessContextId: '2222' }
{ ts: '1234024', corid: 'aaaaa', stage: 'finished', serverlessContextId: '2222' }
{ ts: '1234021', corid: 'aaaaa', stage: 'bootstrap', serverlessContextId: '3333:name:version' }
{ ts: '1234022', corid: 'aaaaa', stage: 'starting', serverlessContextId: '3333' }
{ ts: '1234024', corid: 'aaaaa', stage: 'finished', serverlessContextId: '3333' }

Client-app1.log
{ ts: '1234025', corid: 'aaaaa', stage: 'calc-finished' }
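Although V1 explicitly excludes fan-out, the branch case can still be approximated by grouping one correlation's function records by serverlessContextId so each parallel branch is analysed separately; this grouping rule is an assumption, not part of V1:

```python
from collections import defaultdict

def split_branches(records: list[dict]) -> dict[str, list[dict]]:
    """Group one correlation's records into branches keyed by the
    serverlessContextId prefix (e.g. '1111', '2222', '3333')."""
    branches = defaultdict(list)
    for record in records:
        key = record.get("serverlessContextId", "client").split(":")[0]
        branches[key].append(record)
    return branches
```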