# Gauge represents the type of a scalar metric that always exports the
# "current value" for every data point. It should be used for an "unknown"
# aggregation.
type Gauge implements MetricsData {
    points: [NumberDataPoint]
    aType: AggType
}
To support different types of metrics like Gauge, Summary, Counter... would we have different implementations as mentioned above? What is AggType here, and what values does it take?
Are we going to use GraphQL just for data type modelling and use it across the stack, or are we going to replace the existing plugin with a GraphQL server? Any idea on where we are going to host this GraphQL server?
Regarding the aggType - it is a general classifier shared by all the implementations of the MetricsData interface:
enum AggType {
Gauge
Sum
Histogram
ExponentialHistogram
Summary
}
The way we implement the actual index structure may differ a bit from its logical representation - for optimization's sake and other concerns - but in general yes, I think it makes sense for each category of metrics to have a different mapping template.
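For illustration only, a Sum metric could follow the same shape as the Gauge above; this is a sketch, not a finalized definition, and the isMonotonic field is an assumption borrowed from the OpenTelemetry data model:
# Sum reports values that are calculated as a sum over a time window.
# Sketch only - MetricsData, NumberDataPoint, and AggType are assumed to be
# defined as in the Gauge example and enum above.
type Sum implements MetricsData {
    points: [NumberDataPoint]
    aType: AggType
    # assumption: whether the sum only ever increases (OpenTelemetry-style)
    isMonotonic: Boolean
}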
Regarding the usage of GraphQL - GraphQL has several scopes:
I think that for step one we can introduce a high-level schema language which is very popular and widely adopted by both the community and top products.
Defining the Observability domain using this language will allow us to have a separate (decoupled) logical layer which is both mature and customizable...
Using the GraphQL query language for search queries (new endpoint?) will be the next step IMO...
Issue moved to the observability project
OpenSearch Observability Simple Schema Draft & Components
The purpose of this RFC is to propose a unified schema structure for the observability domain. This schema is largely based on former work done both by the OpenTelemetry project and the Elastic Common Schema (ECS) project.
Additional important aspect of this work
Links:
Introduction
The 3 Pillars of Observability
Logs, metrics, and traces are known as the three pillars of observability. While plainly having access to logs, metrics, and traces doesn’t necessarily make systems more observable, these are powerful tools that can unlock the ability to build better systems.
Event Logs
An event log is an immutable, timestamped record of discrete events that happened over time; logs almost always carry a timestamp and a payload of some context. Event logs are helpful when trying to uncover emergent and unpredictable behaviors exhibited by components of a distributed system. Unfortunately, simply by looking at the discrete events that occurred in a given system at some point in time, it is practically impossible to determine all the triggers of such behaviors.
In order to actually understand the root cause of some misbehaving functionality we need to do the following:
Infer the request lifecycle across different components of the distributed architecture
Iteratively ask questions about interactions among various parts of the system
It is also necessary to be able to infer the fate of a system as a whole (measured over a duration that is orders of magnitude longer than the lifecycle of a single request).
Traces and metrics are abstractions built on top of logs that pre-process and encode information along two orthogonal axes.
Data Ingestion
During ingestion, raw logs are almost always normalized, filtered, and processed by a tool like Logstash, fluentd, Scribe, or Heka before they are persisted in a data store. An interesting observation is that logs arrive as a stream of data and can be analyzed using streaming data analysis tools and concepts; these concepts include:
Metrics
Metrics are a numeric representation of data measured over intervals of time. Metrics use the power of mathematical modeling and prediction to derive knowledge of the behavior of a system over intervals of time.
Since numbers are optimized for storage, processing, compression, and retrieval, metrics enable longer retention of data as well as easier querying. Metrics are perfectly suited to building dashboards that reflect historical trends. Metrics also allow for gradual reduction of data resolution - after a certain period of time, data can be aggregated into daily or weekly frequency.
In a nutshell - a metric is identified by the metric name and its labels, which are also called dimensions. The actual data stored in the time series is called a sample, and it consists of two components: a float64 value and a timestamp. Metrics follow an append-only model; samples are immutable once written, and altering the labels effectively creates a new time series.
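As a rough sketch of that structure (the field names below and the Time/JSON scalars are assumptions, not part of a finalized schema):
scalar Time
scalar JSON

# A single sample of a time series: the metric name plus the labels identify
# the series; each data point carries a timestamp and a float64 value.
type NumberDataPoint {
    labels: JSON      # the dimensions identifying the series
    timestamp: Time   # when the sample was taken
    value: Float      # the sampled float64 value
}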
A large advantage of metrics-based monitoring over logs is that, unlike log generation and storage, metrics transfer and storage have a constant overhead. Metrics storage increases only with more permutations of label values (e.g., when more hosts or containers are spun up, when new services get added, or when existing services get instrumented more).
Traces
A drawback with both application logs and application metrics is that they are system scoped, making it hard to understand anything other than what's happening inside a particular system.
With logs (without using some sort of joins), a single line doesn’t give much information about what happened to a request across all components of a system.
It is possible to construct a system that correlates metrics and logs across the address space or RPC boundaries using some UID. Such systems require a metric to carry a UID as a label.
Using high cardinality values like UIDs as metric labels can overwhelm time-series databases.
When used optimally, logs and metrics give us complete omniscience into a silo, but not much more.
Distributed tracing is a technique that addresses the problem of bringing visibility into the lifetime of a request across several systems.
A trace is a representation of a series of causally related distributed events that encode the end-to-end request flow through a distributed system. Traces are a representation of logs; a single trace can provide visibility into both the path traversed by a request and the structure of a request. The path of a request allows understanding the different services involved in that path.
The basic idea behind tracing is to identify specific points (function call / RPC boundaries / threads / queues) in an application, proxy, framework, library, runtime, middleware, and anything else in the path of a request that represents the following:
Traces are used to identify the amount of work done at each layer while preserving causality by using happens-before semantics. A trace is a directed acyclic graph (DAG) of spans, where the edges between spans are called references.
Trace Parts
When a request begins, it's assigned a globally unique ID, which is then propagated throughout the request path so that each point of instrumentation is able to insert or enrich metadata before passing the ID along to the next hop in the flow of a request. Each hop along the flow is represented as a span; when the execution flow reaches an instrumented point in one of the services, a record is emitted along with metadata.
These records are usually logged to disk before being submitted to a collector, which then can reconstruct the flow of execution based on different records emitted by different parts of the system.
Traces are primarily for:
Zipkin and Jaeger are two of the most popular OpenTracing-compliant open source distributed tracing solutions.
For tracing to be truly effective, every component in the path of a request needs to be modified to propagate tracing information - directly or using augmentation based libraries.
Observability Schema
In many regards, observability and security events share many common aspects and features that are an important concern for similar or even the same stakeholders.
Many attempts at a common data format for security/observability-type events have been made over the years:
None of these formats has become the true industry standard, and while many observability tools and appliances support export into one of these data formats, it is just as common to see data being emitted by logging tools using Syslog or CSV formats. At the same time, the rise of SaaS tools and APIs means that more and more data is being shared in JSON format, which often doesn’t translate well to older, less-extensible formats.
GraphQL Schema-Definition-Language
In order to maintain a higher level of abstraction and to provide a general capability for multi-purpose usability, the popular and widely supported GraphQL language is selected. GraphQL provides a complete description of the data and gives clients the power to ask for exact, specific structured information in a simple manner. The GraphQL stack offers a rich ecosystem of polyglot support for code generation and endpoint libraries.
Representing the Observability domain using GraphQL semantics will help with all these capabilities and more.
Example
Let's review the schema of a 'network' type of log:
Network is defined as a type implementing a basic interface (BaseRecord) which encapsulates the fields common to all events. The schema also supports different field types, such as primitives and custom types like JSON and IP.
Network defines the NetworkDirection enumeration, which categorizes the network event's direction. Network also defines the Vlan object structure, which can be queried directly using the GraphQL query language. The Vlan object has a @relation directive assigned to it under the definition of the Network object.
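A minimal sketch consistent with the description above; the concrete field list, the directive argument, and the enum values are illustrative assumptions rather than the finalized schema:
scalar Time
scalar JSON
scalar IP

# aspect-based instruction interpreted by the parsing tools (see below)
directive @relation(mappingType: String) on FIELD_DEFINITION

# trimmed to a single field here for brevity
interface BaseRecord {
    timestamp: Time
}

enum NetworkDirection {
    inbound
    outbound
    internal
    external
    unknown
}

type Vlan {
    id: String
    name: String
}

type Network implements BaseRecord {
    timestamp: Time
    direction: NetworkDirection
    sourceIp: IP
    destinationIp: IP
    attributes: JSON
    vlan: Vlan @relation(mappingType: "embedded")
}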
Directives
Directives are aspect-based instructions to the model; they are intended to be interpreted by the parsing tool, which handles each directive in the manner that best reflects its meaning for the specific role.
In the @relation case - we can interpret this relational structure directive for multiple purposes:
Additional supported directives:
Custom directives can be added and their interpretation is in the hands of each parser.
Schema Structure & Domain Entities
As stated in the introduction, the 3 pillars of observability are Logs, Traces & Metrics.
Logs
The log's schema mostly follows ECS (Elastic Common Schema) methodology regarding the entities and their fields.
The base entity holds the common fields for the top level events - these include:
All the deriving events share these fields and add additional related content.
Log static classification
At a high level, the log classification technique provides fields to classify events in two different ways:
The entity that constructs the classification:
As shown in the concrete schema, the Categorization entity is composed of 4 fields which together form both the 'origin' and the 'purpose' of the event.
Each event entity contains this classification element.
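Since the log schema follows ECS, a plausible sketch of this element uses the four ECS categorization fields; the names and example values below are assumptions based on ECS rather than a finalized definition:
# Static classification element embedded in every event entity.
type Categorization {
    kind: String       # high-level origin of the event (e.g. event, metric, state, alert)
    category: String   # closest category of the data source (e.g. network, authentication)
    type: String       # sub-bucket refining the category (e.g. connection, start, end)
    outcome: String    # result/purpose of the event (e.g. success, failure, unknown)
}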
Log dynamic classification
The data stream naming scheme combines the values of the data stream fields into the name of the actual data stream in the following manner: {dataStream.type}-{dataStream.dataset}-{dataStream.namespace}.
This means the fields can only contain characters that are valid as part of data stream names.
This additional dynamic, customizable classification field simplifies distinguishing logs that arrive from specific, customer-meaningful sources. The classification of the streams is divided into 3 categories (a sketch follows the list):
stream type - the distinction of the logs
stream name - identifies the name of the stream - for example its purpose, service component, region
stream custom name - identifies some customer distinction - for example environment type (dev,test,prod)
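A sketch of this classification element and of the resulting data stream name; the field names mirror the {dataStream.type}-{dataStream.dataset}-{dataStream.namespace} scheme above, and the concrete values in the example are purely illustrative:
# Dynamic classification - the three parts combined into the physical
# data stream name: {type}-{dataset}-{namespace}
type DataStream {
    type: String        # stream type - the distinction of the logs (e.g. logs, metrics, traces)
    dataset: String     # stream name - purpose, service component, region
    namespace: String   # stream custom name - customer distinction (dev, test, prod)
}

# Example: type = "logs", dataset = "nginx.access", namespace = "prod"
# resolves to the data stream name "logs-nginx.access-prod"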
As the log data is split up per data set across multiple data streams, each data stream contains a minimal set of fields. This leads to better space efficiency and faster queries. Having the data split up by data set and namespace also allows granular control over retention and security. Finally, it offers flexibility: users can use the namespace to divide and organize data in any way they want.
Logs Record Structure
Every signal arriving from an observing entity (An observer is defined as a special network, security, or application device used to detect, observe, or create network, security, or application-related events and metrics) has the following basic composition:
This general-purpose log container reflects the different possible observations that are reported. The Event entity represents metadata-related concerns of the log itself, such as:
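A hypothetical sketch of this composition, reusing entities already introduced in this document (Categorization and DataStream are defined in the previous sections; the Event fields shown are assumptions):
scalar Time
scalar JSON

# Common container shared by every observed signal (hypothetical sketch).
interface BaseRecord {
    timestamp: Time                  # when the signal was observed
    attributes: JSON                 # free-form labels attached to the signal
    categorization: Categorization   # static classification (see above)
    dataStream: DataStream           # dynamic classification (see above)
    event: Event                     # metadata about the log record itself
}

# Metadata-related concerns of the log itself (field names are assumptions).
type Event {
    created: Time     # when the record was created by the observer
    ingested: Time    # when the record was ingested into the store
    module: String    # the module / product that produced the event
}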
Examples
TODO - add examples
Traces
As described in the introduction, a Span represents a unit of work or operation. It tracks specific operations that a request makes, describing what happened during the time in which that operation was executed. A Span contains name, time-related data, structured log messages, and other metadata (i.e. Attributes) to provide information about the operation it tracks.
Distributed Traces
A Distributed Trace, more commonly known as a Trace, records the paths taken by requests (made by an application or end-user) as they propagate through the different layers of the architecture. A Trace improves visibility into an application or system's health and allows understanding and debugging behavior that is difficult to reproduce locally. A Trace is made of one or more Spans - the first Span is the Root Span. Each Root Span represents a request from start to finish, and the Spans underneath the parent provide more in-depth context of what occurs during a request (or what steps make up a request).
Span Categorization:
SpanKind describes the relationship between the Span, its parents, and its children in a Trace. SpanKind describes two independent properties that benefit tracing systems during analysis.
A Span's Context represents all the information that identifies the Span in the Trace and MUST be propagated to child Spans and across process boundaries.
A span represents a single operation within a trace. Spans can be nested to form a trace tree. Spans may also be linked to other spans from the same or a different trace, forming graphs. Often, a trace contains a root span that describes the end-to-end latency and one or more sub-spans for its sub-operations. A trace can also contain multiple root spans, or none at all. Spans do not need to be contiguous - there may be gaps or overlaps between spans in a trace.
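A sketch of these entities in the same GraphQL form; the field names loosely follow the OpenTelemetry span definition and are assumptions, not a finalized schema:
scalar Time
scalar JSON

# Relationship of the span to its parent and children in the trace.
enum SpanKind {
    INTERNAL
    SERVER
    CLIENT
    PRODUCER
    CONSUMER
}

# Identifying information that MUST be propagated to child spans
# and across process boundaries.
type SpanContext {
    traceId: String
    spanId: String
    traceState: String
}

# A single operation within a trace.
type Span {
    name: String
    context: SpanContext
    parentSpanId: String     # empty for a root span
    kind: SpanKind
    startTime: Time
    endTime: Time
    attributes: JSON         # structured metadata about the operation
    links: [SpanContext]     # links to spans in the same or a different trace
}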
Metrics
Metrics are everywhere; they can be generated from information within logs or from summarization of numerical values and counts. As described in the introduction section, a metric comprises a set of dimensions and a list of (timestamp, value) tuples.
A metric can originate from an agent sampling some features on the observed machine, or from a statistical action performed on top of the raw logs.
The possible types of metrics are:
As stated before, every metric has a name, a type, and a list of data points.
The data container is the set of data points belonging to the specific metric; we can think of the data points as a time series of samples for a specific feature, each with a timestamp and a set of labels.
This is the GraphQL schematic representation:
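A sketch consistent with the Gauge and AggType definitions shown at the top of this document; the NumberDataPoint fields and the Metric wrapper are assumptions:
scalar Time
scalar JSON

# Common interface shared by all metric categories; Gauge (shown earlier)
# is one implementation, and AggType is the enum defined earlier.
interface MetricsData {
    points: [NumberDataPoint]
    aType: AggType
}

# A named metric: the name plus the labels on each data point identify the series.
type Metric {
    name: String
    data: MetricsData
}

# A single sample: a timestamp, a float64 value, and the identifying labels.
type NumberDataPoint {
    labels: JSON
    timestamp: Time
    value: Float
}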
OpenSearch Observability Index Generation Support
In order for the observability analytics dashboards to take full advantage of this schema, we need to support it using structured indices. Utilizing the code generation capabilities of the GraphQL-based schema, we will create a template generator based on these definitions.
Each type of generator will be activated using a CLI.
The template generator engine will work in the following steps:
1) The first step will create two intermediate file representations of each GraphQL schema element:
2) The second step will generate a set of index templates, which are composable template mappings that can be used together in a composite template. For additional information check Appendices A and B.
Index Template Mapping Composition
In order to fully utilize the composable nature of the Observability building blocks, we are using the composable index template mapping capability.
The Observability schema comes with a defined set of entities; these entities can be used as building blocks with the index-template-mapping generator to create, in advance, a structured index containing specific types of entities (logs) that can be used for many purposes:
Example
Let's use the following log type entities to compose a specific index for a particular aggregation purpose:
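As an illustration of the idea only - the entity names are reused from examples elsewhere in this document and the composition is a hypothetical sketch, not the generated result - a purpose-built log type could embed the building-block entities it needs, with each embedded entity mapping to one component template in the composite index template:
# Hypothetical composed entity: a network access log assembled from
# building blocks defined elsewhere in this document (Network, Client,
# Categorization; Time as sketched above). The generator would emit one
# component template per embedded entity plus a composite template that
# stitches them together.
type NetworkAccessLog {
    timestamp: Time                  # common BaseRecord fields trimmed for brevity
    categorization: Categorization
    network: Network
    client: Client
}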
TODO - show final result here
Conclusion
Appendix A: Ontology Definition Language
GraphQL describes the structure and composition of entities very well, including the queries and API. However, GraphQL lacks a way to explicitly define relationships between entities; relationships are only expressed implicitly, through the composition tree structure of the entities and through the specific way each parser translates the @relation directives.
Using a dedicated language to describe the ontology, we can further enrich our understanding of the schema and explicitly define these relationships as fully qualified members of the schema language.
Translating the GraphQL into an internal representation of the domain gives us an additional phase in which to add custom, domain-related transformations, and the ability to decouple the exact usage pattern of the underlying storage engine from the logical concepts.
Let's review the 'Client' event entity - first in the GraphQL schema format and then in the ontological description.
The GraphQL schema defines 3 logical directives:
The client.json generated ontology file represents these implicit concerns in a more explicit and formal manner:
The explicit relationshipTypes list states the first-class relationship entities. An additional low-level 'index-provider' instruction configuration file is auto-generated according to this SDL file. Appendix B details the low-level instructions file and how it drives the construction of the index template mappings for the above schema.
Appendix B: Index Provider Physical Storage Configuration
The index-provider is what helps OpenSearch analyze the schema instruction file and assemble the required indices and mappings.
Let's review the file to understand the instructions:
Appendix C: Logs Index Implementation Considerations
// TODO - add different log-index size, access-pattern, terms cardinality, compaction aspects
Appendix D: Metrics Index Implementation Considerations
// TODO - add different metrics-index size, dimensions cardinality, aggregations aspects