[Feature request] Add field mapping correlation type metadata concept

YANG-DB commented 1 year ago

Is your feature request related to a problem?

As part of the Integration campaign and the [Integration RFC](https://github.com/opensearch-project/OpenSearch-Dashboards/issues/3412), we have introduced the SimpleSchema for the Observability domain. It is based on the concept of a well-structured index that is backed by a schema.

Schema

A schema is associated with an index through the mapping configuration.

This mapping structure is also composable through the composed_of template capability, which is used extensively to assemble the various log types.

Another concept behind the schema is the ability to reflect relationships. Today this is represented in a proprietary way, by adding the information to the metadata of the index mapping template.

In the Observability domain, a log entity's relationship to a trace entity, (:log)-[:associated]-(:trace), through the traceId correlation field is described in the metadata section of the log's mapping:

"_meta": {
  "description": "Simple Schema For Observability",
  "catalog": "observability",
  "type": "logs",
  "correlations": [
    {
      "field": "spanId",
      "foreign-schema": "traces",
      "foreign-field": "spanId"
    },
    {
      "field": "traceId",
      "foreign-schema": "traces",
      "foreign-field": "traceId"
    }
  ]
}
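
To illustrate how a client could consume this metadata, here is a minimal Python sketch that reads the correlations out of a `_meta` section shaped like the one above. The `Correlation` tuple and the `read_correlations` helper are hypothetical names used only for illustration; they are not part of any OpenSearch API.

```python
from typing import List, NamedTuple


class Correlation(NamedTuple):
    """One declared relationship between a local field and a foreign field."""
    source_schema: str
    field: str
    foreign_schema: str
    foreign_field: str


def read_correlations(meta: dict) -> List[Correlation]:
    """Collect the correlation entries declared in a template's `_meta` section."""
    source = meta.get("type", "")
    return [
        Correlation(source, c["field"], c["foreign-schema"], c["foreign-field"])
        for c in meta.get("correlations", [])
    ]


# The `_meta` section from the example above, as a Python dict.
logs_meta = {
    "description": "Simple Schema For Observability",
    "catalog": "observability",
    "type": "logs",
    "correlations": [
        {"field": "spanId", "foreign-schema": "traces", "foreign-field": "spanId"},
        {"field": "traceId", "foreign-schema": "traces", "foreign-field": "traceId"},
    ],
}

correlations = read_correlations(logs_meta)
for c in correlations:
    # Print each relationship as source.field -> foreign.field
    print(f"{c.source_schema}.{c.field} -> {c.foreign_schema}.{c.foreign_field}")
```

Because the information lives in free-form template metadata, every consumer has to agree on this shape by convention, which is the gap the request below tries to close.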


What solution would you like?

I would like the field mapping API to be extended with this metadata information.

Recently, the conceptual operation of OpenSearch as a search engine has been extended significantly. These extensions include:

Federation of queries from different external sources (datasource)
Adding a materialized view backing external data-lake storage (Query S3 in OpenSearch Observability)
Adding the BloomFilter data sketch as a new data type (bloomFilter data-type)

The evolution of a knowledge layer on top of the data layer is an existing trend both in OpenSearch and in other storage engines.

A key part of any knowledge layer is the concept of relationships between the different entities.

P1 - The First Step

This step includes the introduction of the correlations concept into the field mapping.

Even though the concept of index relationships does exist today:

Both options imply a physical, explicit index interrelationship that has a strong side effect on index physical storage and query time. In addition, the specific field mapping does not reflect this join, which is only present at the higher index mapping level.

The new field-mapping-correlation feature addresses the metadata aspect of the relationship between well-structured entities residing in different indices.

A correlation is a weaker constraint: it does not impose a foreign key constraint as a relational DB would, but rather declares that such a correlation exists and may be joined using a query engine.

Another difference from the existing join fields is that this correlation will at first be a declarative metadata definition that is not enforced against the actual data inside the indices. Only the mapping correlation metadata will be enforced, as detailed below.

New Correlation Section in Field mapping

Field mapping for a field which has a relationship to another foreign field in the target entity's index:

GET logs/_mapping/field/traceId

Will respond with:

{
  "logs": {
    ...
    "mappings": {
      ...
      "traceId": {
        "ignore_above": 256,
        "type": "keyword"
      },
      "spanId": {
        "ignore_above": 256,
        "type": "keyword"
      },
      "traceIdFk": {
        "type": "correlation",
        "path": "traceId",
        "target_schema": "traces",
        "target_field": "traceId"
      },
      "spanIdFk": {
        "type": "correlation",
        "path": "spanId",
        "target_schema": "traces",
        "target_field": "spanId"
      }
    }
  }
}
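
As a rough illustration of how a consumer might pick the correlations out of such a response, the Python sketch below filters the field entries whose type is `correlation`. It works on a plain dict that mirrors the proposed response shape above with the elided parts left out; the `correlation` type does not exist in OpenSearch today, and `correlation_fields` is a hypothetical helper name.

```python
def correlation_fields(field_mappings: dict) -> dict:
    """Return only the entries whose (proposed) type is `correlation`."""
    return {
        name: spec
        for name, spec in field_mappings.items()
        if spec.get("type") == "correlation"
    }


# The field entries of the proposed response, as a Python dict.
logs_fields = {
    "traceId": {"ignore_above": 256, "type": "keyword"},
    "spanId": {"ignore_above": 256, "type": "keyword"},
    "traceIdFk": {
        "type": "correlation",
        "path": "traceId",
        "target_schema": "traces",
        "target_field": "traceId",
    },
    "spanIdFk": {
        "type": "correlation",
        "path": "spanId",
        "target_schema": "traces",
        "target_field": "spanId",
    },
}

found = correlation_fields(logs_fields)
for name, spec in found.items():
    # Show which local field points at which foreign field.
    print(f"{name}: {spec['path']} -> {spec['target_schema']}.{spec['target_field']}")
```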

This metadata information will be used by the SQL / PPL query engine to allow explicit correlation between different data-streams or datasources. Having this information explicitly will allow better understanding and enhance investigation capabilities.

Once a SQL / PPL correlation (join) query is submitted against the corresponding index, it will be translated into a regular SQL join query.
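
To make the idea concrete, the following Python sketch derives a join statement from one correlation entry of the proposed field mapping. The exact SQL the query engine would generate is not specified in this request, so the output format and the `build_join_query` helper are assumptions for illustration only.

```python
def build_join_query(source_index: str, correlation: dict) -> str:
    """Build a plain SQL join from a (proposed) correlation field definition."""
    target = correlation["target_schema"]
    return (
        f"SELECT * FROM {source_index} "
        f"JOIN {target} "
        f"ON {source_index}.{correlation['path']} = {target}.{correlation['target_field']}"
    )


# The traceIdFk entry from the proposed field mapping response.
trace_id_fk = {
    "type": "correlation",
    "path": "traceId",
    "target_schema": "traces",
    "target_field": "traceId",
}

query = build_join_query("logs", trace_id_fk)
print(query)
```

The point of the sketch is that the join condition comes entirely from the mapping metadata, so the user does not have to know which fields link the two indices.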

Enforcement

In the first step (P1), the mapping API would enforce the following when a field mapping correlation is requested:

Validate that the target index schema named by foreign-schema exists (in the _meta example above, "foreign-schema": "traces" implies that an index template named traces must exist).
Validate that the field named by foreign-field exists in the target index schema (in the _meta example above, "foreign-field": "traceId" implies that a field named traceId must exist).
Validate that the field type is the same for the source field and the target field.
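
The checks above can be sketched in Python as follows. This is only an illustration under stated assumptions: templates are represented as plain dicts of field name to field definition, the correlation uses the `field` / `foreign-schema` / `foreign-field` keys from the `_meta` example, and `validate_correlation` is a hypothetical helper, not an existing API.

```python
from typing import Dict, List


def validate_correlation(
    correlation: dict,
    source_fields: Dict[str, dict],
    templates: Dict[str, Dict[str, dict]],
) -> List[str]:
    """Return a list of validation errors; an empty list means the correlation is valid."""
    errors = []

    # Check 1: the target schema (index template) must exist.
    target_fields = templates.get(correlation["foreign-schema"])
    if target_fields is None:
        errors.append(f"unknown schema: {correlation['foreign-schema']}")
        return errors

    # Check 2: the foreign field must exist in the target schema.
    target_field = target_fields.get(correlation["foreign-field"])
    if target_field is None:
        errors.append(f"unknown foreign field: {correlation['foreign-field']}")
        return errors

    # Check 3: source and target field types must match.
    source_field = source_fields.get(correlation["field"])
    if source_field is None:
        errors.append(f"unknown source field: {correlation['field']}")
    elif source_field["type"] != target_field["type"]:
        errors.append(
            f"type mismatch: {source_field['type']} vs {target_field['type']}"
        )
    return errors


logs_fields = {"traceId": {"type": "keyword"}, "spanId": {"type": "keyword"}}
templates = {"traces": {"traceId": {"type": "keyword"}, "spanId": {"type": "keyword"}}}
correlation = {"field": "traceId", "foreign-schema": "traces", "foreign-field": "traceId"}

print(validate_correlation(correlation, logs_fields, templates))
```

Note that none of these checks touch the documents themselves, which matches the statement that only the mapping correlation metadata is enforced in P1.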

The correlations field may accept multiple correlations, including correlations to additional remote indices and to remote tables exposed through datasources.

P2 - The Next Step

The next phase of the correlation capability would include the actual precomputation of the correlated data using auxiliary data structures or indices. The auxiliary data structure may take the form of an eager correlation task which precomputes the join and materializes it into secondary storage. An additional skipping index can be introduced to further optimize filter-based queries using a Bloom filter or another probabilistic data sketch.

With these auxiliary structures, SQL queries would return much faster, enabling investigation-driven use cases on top of huge indices and even data-lake-based correlations.
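
As a toy illustration of the skipping-index idea, the Python sketch below keeps one small Bloom filter per data segment and uses it to skip segments that cannot contain a given traceId. The segment names, filter sizes, and the `BloomFilter` class are all hypothetical; this is not the bloomFilter data type mentioned earlier, only a sketch of the general technique.

```python
import hashlib


class BloomFilter:
    """A tiny Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, size_bits: int = 1024, num_hashes: int = 3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = 0  # bit set stored in a single Python integer

    def _positions(self, value: str):
        # Derive several bit positions by salting the hash with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{value}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, value: str) -> None:
        for pos in self._positions(value):
            self.bits |= 1 << pos

    def might_contain(self, value: str) -> bool:
        return all((self.bits >> pos) & 1 for pos in self._positions(value))


# Hypothetical segments of a traces index and the traceIds each one holds.
segments = {"segment-0": ["t1", "t2"], "segment-1": ["t3"]}

# Build the skipping index: one Bloom filter per segment.
skipping_index = {}
for name, trace_ids in segments.items():
    bloom = BloomFilter()
    for trace_id in trace_ids:
        bloom.add(trace_id)
    skipping_index[name] = bloom


def candidate_segments(trace_id: str) -> list:
    """Segments that may contain the traceId; all others can be skipped."""
    return [n for n, bloom in skipping_index.items() if bloom.might_contain(trace_id)]


print(candidate_segments("t3"))
```

A filter can report a segment that does not actually hold the value, so the result is a superset of the matching segments and the join still has to check the real data in those candidates.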

saratvemulapalli commented 1 year ago

@YANG-DB are you looking for feedback, or would you contribute these changes?

YANG-DB commented 1 year ago

I wanted to get feedback on this suggestion and how it fits with the current correlation initiative

opensearch-project / OpenSearch