Motivation:
Understanding millions of related events and how they relate within a dataflow is extremely difficult. Events span serverless functions, stream processors and microservices, so reconstructing that story and extracting instrumentation knowledge from it is virtually impossible today. It should be possible to build a queryable dataflow model that supports instrumentation. The result would be the ability to scan millions of events (historically and in real time), break them down by different attributes, identify anomalies, drill into the set of interest, and then down to a single dataflow with relevant contextual information.
For example:
- millions of IoT events -> the devices generating errors or performing slowly
- millions of payments -> system latency degradation over time, identifying the types of payments or attributes affecting performance
Proposed change
The ability to build a dataflow model that supports analytics such as the number of flows and the latency between steps; break millions of dataflows down by different attributes; and filter by analytics, attributes, latency and more. In some ways it is equivalent to distributed tracing, except that it:
- also includes related log lines against a correlation-id (providing context)
- doesn't require special infrastructure (JSON data stored on cloud storage)
- is extensible (i.e. it can be updated to support CloudEvents in future)
- has low impact on existing systems
V1 Mechanics:
Note limitations:
- supports a single correlation-id (no propagation, fan-out, etc.)
- rewrites correlation data (duplication)
Stage 1. Rewrite dataflows by time-correlation
Format: {timestamp}-corr-id.log -> records[timestamp:data]
This makes each stage of the dataflow an append-only log of that stage of the correlation's dataflow. Timestamps distinguish between the different states of the flow.
For example, the Client -> function -> client flow shown under Datamodel below yields 2 data files (Client-app1.log and ServerlessFunction.log).
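A minimal sketch of Stage 1 in Python, assuming the raw logs are newline-delimited JSON with ts and corid fields; the function name and directory layout are illustrative, not part of the proposal:

```python
import json
from collections import defaultdict

def rewrite_by_correlation(raw_log_path: str, out_dir: str) -> None:
    """Group raw JSON log lines by correlation id and append them to
    per-correlation files named {first-timestamp}-{corid}.log."""
    flows = defaultdict(list)
    with open(raw_log_path) as f:
        for line in f:
            record = json.loads(line)
            flows[record["corid"]].append(record)
    for corid, records in flows.items():
        records.sort(key=lambda r: int(r["ts"]))  # timestamps order the stages
        out_path = f"{out_dir}/{records[0]['ts']}-{corid}.log"
        with open(out_path, "a") as out:  # append-only, one log per correlation
            for record in records:
                out.write(json.dumps(record) + "\n")
```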
Stage 2. Calculate dataflow-level analysis (.index)
For each dataflow, generate span information into a time-n-corr-m.index file.
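A sketch of Stage 2 under the same assumptions; the time-n-corr-m.index naming comes from above, but the span rule (every record carrying a stage label marks a span boundary) is an assumption:

```python
import json

def build_index(flow_log_path: str, index_path: str) -> None:
    """Derive span information (stage label + timestamp) from one
    correlation log and write it as a .index file."""
    spans = []
    with open(flow_log_path) as f:
        for line in f:
            record = json.loads(line)
            if "stage" in record:  # only staged records mark span boundaries
                spans.append({"stage": record["stage"], "ts": int(record["ts"])})
    with open(index_path, "w") as out:
        json.dump({"spans": spans}, out)
```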
Stage 3. Build a per-day aggregate view of correlation data
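Stage 3 could then fold one day's .index files into a single aggregate keyed by stage pair; the glob pattern and output shape are illustrative:

```python
import glob
import json
from collections import defaultdict

def aggregate_day(index_glob: str, out_path: str) -> None:
    """Fold a day's .index files into one view: per consecutive stage
    pair, the elapsed times observed across all flows."""
    elapsed = defaultdict(list)
    for path in glob.glob(index_glob):
        with open(path) as f:
            spans = json.load(f)["spans"]
        for a, b in zip(spans, spans[1:]):  # consecutive stage pairs
            elapsed[f"{a['stage']}->{b['stage']}"].append(b["ts"] - a["ts"])
    with open(out_path, "w") as out:
        json.dump(elapsed, out)
```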
Datamodel:
I.e. Client -> function(request) -> client(result)
Client-app1.log
{ ts: '1234010', corid: 'aaaaa' }
{ ts: '1234011', corid: 'aaaaa', stage: 'request-calc' }

ServerlessFunction.log
{ ts: '1234021', corid: 'aaaaa', stage: 'bootstrap', serverlessContextId: '1234:name:version' }
{ ts: '1234022', corid: 'aaaaa', stage: 'starting', serverlessContextId: '1234' }
{ ts: '1234023', corid: 'aaaaa', msg: 'processing some stuff for things' }
{ ts: '1234024', corid: 'aaaaa', stage: 'finished', serverlessContextId: '1234' }
Lambda finished processing: memory:1234 time:2345

Client-app1.log
{ ts: '1234025', corid: 'aaaaa', stage: 'calc-finished' }
Single call attributes collected:
- Raw request start-time (corrid, timestamp, stage)
- Fn load-time (fnId, timestamp)
- Fn handler start (corrid, fnId, timestamp, stage)
- Fn handler end (corrid, fnId, timestamp, stage, fn-stats)
- Response received-time (corrid, timestamp, stage)
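As a sketch, these attributes map onto a record type like the following; the field names are assumptions based on the log examples above:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class CallRecord:
    """One collected attribute set for a single call."""
    corid: str
    ts: int
    stage: Optional[str] = None      # request/handler/response stage label
    fn_id: Optional[str] = None      # serverlessContextId on function records
    fn_stats: Optional[dict] = None  # e.g. memory/time reported at handler end
```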
Analytics
Call chain analytics:
Stage1->Stage2 elapsed (stages are labels on the log line)
Stage2->Stage3 elapsed
...etc...
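A sketch of the per-flow computation, assuming the span list produced by Stage 2 (timestamps already parsed to ints):

```python
def stage_elapsed(spans: list[dict]) -> dict[str, int]:
    """Elapsed time between consecutive stages of one flow,
    e.g. {'request-calc->bootstrap': 10, ...}."""
    return {
        f"{a['stage']}->{b['stage']}": b["ts"] - a["ts"]
        for a, b in zip(spans, spans[1:])
    }
```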
Simple timeseries analysis: represented as a candle-chart?
Individual stage analytics using percentile/max/avg: stageA->stageB elapsed percentiles (label stageA->stageB). Expression:
analytics.dataflow.stageStats(dataset-name?)
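One possible implementation behind a stageStats-style expression, operating on the per-day aggregate from Stage 3; the percentile choice is illustrative:

```python
import statistics

def stage_stats(elapsed_by_pair: dict[str, list[int]]) -> dict[str, dict]:
    """percentile/max/avg per stage pair; a stand-in for
    analytics.dataflow.stageStats(dataset-name)."""
    return {
        pair: {
            "avg": statistics.mean(values),
            "max": max(values),
            # 95th percentile; quantiles() needs at least 2 samples
            "p95": statistics.quantiles(values, n=20)[18],
        }
        for pair, values in elapsed_by_pair.items()
    }
```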
End-to-end stage analytics: Stage1->Stage4 elapsed using percentiles. Expression:
analytics.dataflow.endToEndStats(dataset-name?)
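An endToEndStats-style computation might measure first stage to last stage per flow; the percentile indexing is a deliberate simplification:

```python
def end_to_end_stats(flows: list[list[dict]]) -> dict:
    """First-stage to last-stage elapsed across flows; a stand-in for
    analytics.dataflow.endToEndStats(dataset-name)."""
    totals = sorted(
        spans[-1]["ts"] - spans[0]["ts"] for spans in flows if len(spans) > 1
    )
    if not totals:
        return {"count": 0}
    return {
        "count": len(totals),
        "p50": totals[len(totals) // 2],
        "p99": totals[int(len(totals) * 0.99)],
    }
```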
Filter stage-stage latency by value: Stage1->Stage4 elapsed > 100ms. Expression:
analytics.dataflow.stageFilter(stage1 - stage4 > 100)
analytics.dataflow.endToEndStats(dataset-name?)
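A stageFilter-style predicate could be evaluated per flow before the stats step; the stage names and 100ms threshold are the ones from the example above:

```python
def stage_filter(flows: list[list[dict]], start: str, end: str,
                 threshold_ms: int) -> list[list[dict]]:
    """Keep only flows whose start->end elapsed exceeds the threshold;
    a stand-in for analytics.dataflow.stageFilter(stage1 - stage4 > 100)."""
    def elapsed(spans):
        ts = {s["stage"]: s["ts"] for s in spans}
        return ts.get(end, 0) - ts.get(start, 0)
    return [spans for spans in flows if elapsed(spans) > threshold_ms]
```

The filtered flows can then feed end_to_end_stats, mirroring the chained expression above.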
Branch RPC
Client-app1.log
{ ts: '1234010', corid: 'aaaaa' }
{ ts: '1234011', corid: 'aaaaa', stage: 'request-calc-1' }
{ ts: '1234011', corid: 'aaaaa', stage: 'request-calc-2' }
{ ts: '1234011', corid: 'aaaaa', stage: 'request-calc-3' }

ServerFunction.logs (single source file)
{ ts: '1234021', corid: 'aaaaa', stage: 'bootstrap', serverlessContextId: '1111:name:version' }
{ ts: '1234022', corid: 'aaaaa', stage: 'starting', serverlessContextId: '1111' }
{ ts: '1234024', corid: 'aaaaa', stage: 'finished', serverlessContextId: '1111' }
{ ts: '1234021', corid: 'aaaaa', stage: 'bootstrap', serverlessContextId: '2222:name:version' }
{ ts: '1234022', corid: 'aaaaa', stage: 'starting', serverlessContextId: '2222' }
{ ts: '1234024', corid: 'aaaaa', stage: 'finished', serverlessContextId: '2222' }
{ ts: '1234021', corid: 'aaaaa', stage: 'bootstrap', serverlessContextId: '3333:name:version' }
{ ts: '1234022', corid: 'aaaaa', stage: 'starting', serverlessContextId: '3333' }
{ ts: '1234024', corid: 'aaaaa', stage: 'finished', serverlessContextId: '3333' }

Client-app1.log
{ ts: '1234025', corid: 'aaaaa', stage: 'calc-finished' }
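Although V1 explicitly excludes fan-out, the branch case can still be approximated by grouping one correlation's function records by serverlessContextId so each parallel branch is analysed separately; this grouping rule is an assumption, not part of V1:

```python
from collections import defaultdict

def split_branches(records: list[dict]) -> dict[str, list[dict]]:
    """Group one correlation's records into branches keyed by the
    serverlessContextId prefix (e.g. '1111', '2222', '3333')."""
    branches = defaultdict(list)
    for record in records:
        key = record.get("serverlessContextId", "client").split(":")[0]
        branches[key].append(record)
    return branches
```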