opensearch-project / OpenSearch-Dashboards

📊 Open source visualization dashboards for OpenSearch.
https://opensearch.org/docs/latest/dashboards/index/
Apache License 2.0

Integrate Vega Vis into VisBuilder Proposal #7067

Open ananzh opened 2 weeks ago

ananzh commented 2 weeks ago

Background

The primary problem we are addressing is the need for more advanced and customizable data visualization capabilities in OpenSearch Dashboards. While VisBuilder reached General Availability (GA) in version 2.15, it is currently limited to a few chart types and lacks the comprehensive set of controls necessary for complex visualizations. Enhancing VisBuilder to incorporate more complex controls will provide users with powerful tools for data analysis and reporting, thereby improving the overall user experience and functionality of OpenSearch Dashboards. Additionally, from a technical perspective, we aim to streamline the visualization process by consolidating the multiple existing libraries (such as timeline, vislib, and vega) into a single, cohesive library. This unification will simplify the development and maintenance of visualizations, ensuring consistency and ease of use for developers and users alike.

Requirements and Considerations

Requirements

Technical Requirements:

Non-Technical Requirements:

Considerations and Optimizations

Optimizations:

Non-Prioritized Aspects:

Out of Scope

Current Workflow

VisLib in VisBuilder Workflow

Vega Vis Workflow

Proposed Design

Key Deliveries for 2.16

Note: This is not a complete version; it is for demo purposes only.

https://github.com/opensearch-project/OpenSearch-Dashboards/assets/79961084/c93519b8-4eb7-437b-b19a-c6f710faeffd

1. Vega Integration in VisBuilder

2. Advanced setting that lets users create visualizations with vega in VisBuilder

This includes modifying each chart type in VisBuilder to use either the visualization expression or the vega expression. The main purpose is to avoid breaking the existing user experience. New controls will be added only to the vega vis.

Screenshot 2024-06-15 at 4 42 13 PM

3. Easy migration from VisLib visualizations created by VisBuilder to vega vis. Allow both kinds of visualization to be embedded in a Dashboard.

Allow saving either a vislib vis or a vega vis: the only difference in the URL is the useVegaRendering value in the style state, which decides whether the visualization expression or the vega expression is used. When useVegaRendering is true, VisBuilder renders vega with the toggle turned on.

/vis-builder/edit/471fa110-2ba8-11ef-b457-4707dd1c36d9#?
_q=(filters:!(),query:(language:kuery,query:''))&
_a=(metadata:(editor:(errors:(),state:loading)),
style:(addLegend:!t,addTooltip:!t,legendPosition:right,type:area,useVegaRendering:!f), // different part
ui:(),visualization:(activeVisualization:(aggConfigParams:!(),name:area),
indexPattern:ff959d40-b880-11e8-a6d9-e546fe2bba5f,searchField:''))&
_g=(filters:!(),refreshInterval:(pause:!t,value:0),time:(from:now-15d,to:now))
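
As a rough illustration of how the flag in the URL above could select the rendering path, here is a minimal sketch. The helper name rendererFromAppState and the fallback to vislib when the flag is absent are assumptions made for this example, not part of VisBuilder; the sketch only relies on the rison convention that `!t` is true and `!f` is false.

```typescript
// Hypothetical helper, not part of VisBuilder: it inspects the rison-encoded
// `_a` app state, where `!t` means true and `!f` means false.
type Renderer = 'vega' | 'vislib';

function rendererFromAppState(appState: string): Renderer {
  const match = /useVegaRendering:!(t|f)/.exec(appState);
  // Assumption: a missing flag falls back to the existing vislib path.
  return match !== null && match[1] === 't' ? 'vega' : 'vislib';
}

const saved =
  'style:(addLegend:!t,addTooltip:!t,legendPosition:right,type:area,useVegaRendering:!f)';

console.log(rendererFromAppState(saved)); // vislib
console.log(
  rendererFromAppState(saved.replace('useVegaRendering:!f', 'useVegaRendering:!t'))
); // vega
```

Keeping the decision in a single pure function like this would make it easy to reuse for both the editor and the Dashboard embeddable.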

The same applies when embedding in a Dashboard: when saved with useVegaRendering set to true, the vega vis is embedded in the Dashboard.

Screenshot 2024-06-17 at 10 18 13 AM

4. More controls for the line chart (optional). Use the line chart as an example and integrate all the controls from the line visualization into the VisBuilder line vega chart. Optional: add 1-2 new controls.

Implementation Details regarding the VegaSpecBuilder Class

Method 1: Passing the Whole Aggregation (Aggs) as Input

Key Differences from Static Vega Spec Input:

1. Extend the Visualization Slice for context integration

We'll extend the existing visualization slice to include Vega-specific state and actions:

import { createSlice, PayloadAction } from '@reduxjs/toolkit';
import { CreateAggConfigParams } from '../../../../../data/common';
import { VisBuilderServices } from '../../../types';
import { setActiveVisualization } from './shared_actions';

export interface VegaState {
  dataUrl: any;
  transforms: any[];
  encoding: any;
  aggs: any;
  indexPattern: string | null;
  metrics: any[];
  buckets: any[];
  tooltip: any;
  timeField: string;
  split: any[];
  group: any[];
  segment: any[];
  type: string;
  useVegaRendering: boolean;
}

export interface VisualizationState {
  indexPattern?: string;
  searchField: string;
  activeVisualization?: {
    name: string;
    aggConfigParams: CreateAggConfigParams[];
    draftAgg?: CreateAggConfigParams;
  };
  vega: VegaState;
}

// ... (keep existing initial state and preloaded state logic)

export const slice = createSlice({
  name: 'visualization',
  initialState,
  reducers: {
    // ... (keep existing reducers)

    // Add Vega-specific reducers
    setVegaTooltip: (state, action: PayloadAction<any>) => {
       state.vega.tooltip = action.payload;
    },
    setVegaAggs: (state, action: PayloadAction<any>) => {
       state.vega.aggs = action.payload;
    },
    setVegaTransforms: (state, action: PayloadAction<any[]>) => {
       state.vega.transforms = action.payload;
    },
    setVegaEncoding: (state, action: PayloadAction<any>) => {
       state.vega.encoding = action.payload;
    },
    // Add more Vega-specific reducers as needed
  },
  // ... (keep existing extra reducers)
});

// Export actions
export const {
  // ... (keep existing action exports)
  setVegaTooltip,
  setVegaAggs,
  setVegaTransforms,
  setVegaEncoding,
} = slice.actions;

2. Data Retrieval with proper aggregations: Utilize opensearchaggs to retrieve aggs directly

Update the opensearchaggs function to return the constructed aggregations:

export const modifiedOpensearchaggs = () => ({
  // ... (keep existing properties)

  async fn(input, args, { inspectorAdapters, abortSignal }) {
    // ... (keep existing logic)

    // Return the constructed aggs using toDsl
    const constructedAggs = aggs.toDsl(args.metricsAtAllLevels);

    return constructedAggs;
  }
});

3. Data Transformation

Data transformation in vega is done through the transform property. Its role is similar to that of tabifyAggResponse, which aims to flatten nested structures for visualization. The main difference in approach is that tabifyAggResponse creates a complete tabular representation of the data, while the vega transform describes a series of steps that reshape the data on the fly during rendering. This can make the vega approach more memory-efficient and potentially faster for large datasets, since the entire flattened dataset does not need to be materialized up front. Here is a more detailed comparison:

Here we will add two utility functions:

function parseAggStructure(aggs: any, path: string[] = []): any {
  const result: any = {};

  for (const [key, value] of Object.entries(aggs)) {
    if (key === 'buckets' && Array.isArray(value)) {
      result.buckets = value[0]; // Take the first bucket as a sample
    } else if (typeof value === 'object' && value !== null) {
      result[key] = parseAggStructure(value, [...path, key]);
    } else {
      result[key] = value;
    }
  }

  return result;
}

function generateTransform(aggStructure: any, aggs: any): any[] {
  const transform: any[] = [];
  const flattenStack: string[] = [];

  function getFieldName(aggId: string): string {
    return aggs[aggId]?.terms?.field || aggs[aggId]?.date_histogram?.field || aggId;
  }

  function traverse(obj: any, path: string[] = [], depth: number = 0) {
    for (const [key, value] of Object.entries(obj)) {
      if (key === 'buckets') {
        const parentKey = path[path.length - 1];
        const fieldName = getFieldName(parentKey);

        if (parentKey !== '2') { // Skip the top-level bucket agg
          transform.push({ calculate: `datum['${parentKey}']['buckets']`, as: fieldName });
          transform.push({ flatten: [fieldName] });
          flattenStack.push(fieldName);

          // Add key calculation
          transform.push({ calculate: `datum['${fieldName}'].key`, as: fieldName });
        } else {
          // For bucket agg, use 'key' directly
          transform.push({ calculate: "datum['key']", as: fieldName });
        }

        traverse(value, [...path, key], depth + 1);

        if (parentKey !== '2') {
          flattenStack.pop();
        }
      } else if (typeof value === 'object' && value !== null) {
        traverse(value, [...path, key], depth + 1);
      } else if (key === 'value' && depth === Object.keys(aggStructure).length - 1) {
        // Only add metric calculation for the deepest level
        const metricKey = path[path.length - 2];
        const fieldName = aggs[metricKey]?.avg?.field || `${metricKey}_value`;
        const parent = flattenStack[flattenStack.length - 1] || 'datum';
        transform.push({ calculate: `${parent}['${metricKey}']['value']`, as: `avg_${fieldName}` });
      }
    }
  }

  traverse(aggStructure);
  return transform;
}

Use these functions in the Vega utility functions in the next sub-section:

const buildTransforms = (aggs: any) => {
  const aggStructure = parseAggStructure(aggs);
  // generateTransform also needs the agg definitions to resolve field names
  return generateTransform(aggStructure, aggs);
};

Example Result: Given the following aggregation:

"aggs": {
  "2": {
    "date_histogram": {
      "field": "timestamp",
      "fixed_interval": "12h",
      "time_zone": "America/Los_Angeles",
      "min_doc_count": 1
    },
    "aggs": {
      "3": {
        "terms": {
          "field": "geo.dest",
          "order": { "_count": "desc" },
          "size": 5
        },
        "aggs": {
          "1": {
            "avg": { "field": "bytes" }
          }
        }
      }
    }
  }
}

The datum structure would be:

{
  "3": {
    "buckets": [
      {
        "1": { "value": 5069.333333333333 },
        "key": "CN",
        "doc_count": 3
      },
      // ... other buckets
    ]
  },
  "key_as_string": "2024-06-30T12:00:00.000-07:00",
  "key": 1719774000000,
  "doc_count": 23
}

The generated transform would be:

[
  {
    "calculate": "datum['key']",
    "as": "timestamp"
  },
  {
    "calculate": "datum['3']['buckets']",
    "as": "geo.dest"
  },
  {
    "flatten": ["geo.dest"]
  },
  {
    "calculate": "datum['geo.dest'].key",
    "as": "geo.dest"
  },
  {
    "calculate": "datum['geo.dest']['1']['value']",
    "as": "avg_bytes"
  }
]
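
To make the effect of these steps concrete, the following sketch emulates the flattening in plain TypeScript on the sample datum above. It is not how vega executes transforms; the interfaces are simplified assumptions, and the second inner bucket is made up for illustration. Note that the sketch reads the metric while the inner bucket is still an object, before keeping only its key under the field name.

```typescript
// Simplified shapes for the sample response above; real responses carry more fields.
interface TermsBucket { key: string; doc_count: number; '1': { value: number } }
interface HistogramBucket { key: number; doc_count: number; '3': { buckets: TermsBucket[] } }
interface Row { timestamp: number; 'geo.dest': string; avg_bytes: number }

function flattenBuckets(histogram: HistogramBucket[]): Row[] {
  const rows: Row[] = [];
  for (const outer of histogram) {
    // Equivalent of the `flatten` step: one output row per inner bucket.
    for (const inner of outer['3'].buckets) {
      rows.push({
        timestamp: outer.key,
        // Read the metric while `inner` is still the bucket object ...
        avg_bytes: inner['1'].value,
        // ... and keep only the bucket key under the field name.
        'geo.dest': inner.key,
      });
    }
  }
  return rows;
}

const sample: HistogramBucket[] = [
  {
    key: 1719774000000,
    doc_count: 23,
    '3': {
      buckets: [
        { key: 'CN', doc_count: 3, '1': { value: 5069.333333333333 } },
        { key: 'IN', doc_count: 2, '1': { value: 100 } }, // made-up bucket for illustration
      ],
    },
  },
];

console.log(flattenBuckets(sample));
```

Each histogram bucket yields one row per terms bucket, which is the tabular shape the encoding step expects.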

4. Create Vega Utility Functions

Create utility functions in a separate file:

// vegaUtils.ts

export const buildDataUrl = (indexPattern: string, timeField: string, aggs: any) => {
  return {
    context: true,
    timefield: timeField,
    index: indexPattern,
    body: {
      aggs: aggs,
      size: 0,
    },
  };
};

export const buildTransforms = (metrics: any[], buckets: any[]) => {
  // Implementation of buildTransforms logic
};

export const buildEncoding = (metrics: any[], buckets: any[], fieldsMap: any) => {
  // Implementation of buildEncoding logic
};

export const buildVegaSpec = (state: VisualizationState) => {
  const { vega } = state;
  const dataUrl = buildDataUrl(vega.specBuilder.indexPattern!, vega.specBuilder.timeField, vega.specBuilder.aggs);
  const transforms = buildTransforms(vega.specBuilder.metrics, vega.specBuilder.buckets);
  const encoding = buildEncoding(vega.specBuilder.metrics, vega.specBuilder.buckets, vega.specBuilder.fieldsMap);

  return {
    $schema: "https://vega.github.io/schema/vega-lite/v5.json",
    data: { url: dataUrl },
    transform: transforms,
    mark: { type: vega.specBuilder.type, point: true },
    encoding: encoding,
  };
};
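
Since buildEncoding is left as a stub above, here is one possible shape for it, shown as a sketch only. The function name sketchEncoding, its parameters, and the three-channel layout (x, y, color) are assumptions for this example. The escaping helper reflects a Vega-Lite rule: a field name such as geo.dest is read as nested access unless the dot is escaped.

```typescript
// Hypothetical sketch: names and the three-channel layout are assumptions.
type FieldType = 'temporal' | 'quantitative' | 'nominal';
interface Channel { field: string; type: FieldType; title: string }
interface Encoding { x: Channel; y: Channel; color?: Channel }

// Vega-Lite treats "a.b" as nested access; escaping dots and brackets
// makes the reference point at a flat field with that literal name.
const escapeField = (name: string): string => name.replace(/[.\[\]]/g, '\\$&');

function sketchEncoding(timeField: string, metricField: string, groupField?: string): Encoding {
  const encoding: Encoding = {
    x: { field: escapeField(timeField), type: 'temporal', title: timeField },
    y: { field: escapeField(metricField), type: 'quantitative', title: metricField },
  };
  if (groupField !== undefined) {
    // One color per group value, e.g. one line per destination.
    encoding.color = { field: escapeField(groupField), type: 'nominal', title: groupField };
  }
  return encoding;
}

console.log(JSON.stringify(sketchEncoding('timestamp', 'avg_bytes', 'geo.dest'), null, 2));
```

For the example aggregation this would map timestamp to x, avg_bytes to y, and the escaped geo.dest field to color.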

5. Update toExpression Method

Modify the toExpression method to use the new utility functions:

const toExpression = async (params) => {
  const state = store.getState().visualization;
  if (state.vega.useVegaRendering) {
    const vegaSpec = buildVegaSpec(state);
    let vis = await createVis('vega', state.activeVisualization!.aggConfigParams, state.indexPattern!, params.searchContext);
    vis.params = {
      spec: JSON.stringify(vegaSpec),
    };

    const vega_expression = await buildPipeline(vis, {
      timefilter: params.timefilter,
      timeRange: params.timeRange,
      abortSignal: undefined,
      visLayers: undefined,
      visAugmenterConfig: undefined,
    });
    return vega_expression;
  }
  // ... (existing non-Vega rendering logic)
};

Method 2: Construct Aggs

Method 2 follows a similar structure to Method 1, but instead of passing the whole aggregation, it constructs the aggregation from individual components (metrics, segment, group, split). The main difference lies in the setVegaAggs reducer and the buildVegaSpec utility function:

// In the visualization slice
setVegaAggs: (state, action: PayloadAction<{metrics: any[], segment: any[], group: any[], split: any[]}>) => {
  const { metrics, segment, group, split } = action.payload;
  state.vega.specBuilder.metrics = metrics;
  state.vega.specBuilder.segment = segment;
  state.vega.specBuilder.group = group;
  state.vega.specBuilder.split = split;
  // Construct aggs from these components
  state.vega.specBuilder.aggs = constructAggs(metrics, segment, group, split);
},

// In vegaUtils.ts
export const constructAggs = (metrics: any[], segment: any[], group: any[], split: any[]) => {
  // Logic to construct aggs from individual components
};
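
One way the constructAggs stub could work is sketched below. The AggDef shape and the single ordered bucket list are assumptions for this example: the proposal passes segment, group, and split separately and does not state their nesting order, so the caller here decides the order (outermost first).

```typescript
// Hypothetical input shape; the real agg configs carry more information.
interface AggDef { id: string; type: string; params: Record<string, unknown> }
type AggsDsl = Record<string, Record<string, unknown>>;

function sketchConstructAggs(metrics: AggDef[], buckets: AggDef[]): AggsDsl {
  // Metrics sit at the deepest level of the tree.
  let current: AggsDsl = {};
  for (const metric of metrics) {
    current[metric.id] = { [metric.type]: metric.params };
  }
  // Wrap from the innermost bucket outward so buckets[0] ends up outermost.
  for (let i = buckets.length - 1; i >= 0; i--) {
    const bucket = buckets[i];
    current = { [bucket.id]: { [bucket.type]: bucket.params, aggs: current } };
  }
  return current;
}

const dsl = sketchConstructAggs(
  [{ id: '1', type: 'avg', params: { field: 'bytes' } }],
  [
    { id: '2', type: 'date_histogram', params: { field: 'timestamp', fixed_interval: '12h' } },
    { id: '3', type: 'terms', params: { field: 'geo.dest', size: 5 } },
  ]
);
console.log(JSON.stringify(dsl, null, 2));
```

With these inputs the result has the same nesting as the example aggregation in Method 1: the date histogram outermost, the terms agg inside it, and the avg metric at the deepest level.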

Method 3: Passing Formatted Data to Vega Spec

This method involves passing pre-formatted data directly to the Vega spec. This method requires modifications to the buildVegaSpec function:

// In vegaUtils.ts
export const buildVegaSpec = (state: VisualizationState, formattedData: any[]) => {
  return {
    $schema: "https://vega.github.io/schema/vega-lite/v5.json",
    data: { values: formattedData },
    // ... other spec properties
  };
};

// In the component where the Vega spec is created
const formattedData = await getFormattedDataFromOpensearchaggs(/* params */);
const vegaSpec = buildVegaSpec(state, formattedData);
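
The formatting step itself is not shown above. As a sketch under stated assumptions, the helper below converts a simplified tabular result (columns plus rows keyed by column id) into the array of records that data.values expects. The Datatable shape and the toVegaValues name are hypothetical and only loosely modeled on a tabified response.

```typescript
// Hypothetical, simplified table shape used only for this example.
interface Column { id: string; name: string }
interface Datatable { columns: Column[]; rows: Array<Record<string, unknown>> }

function toVegaValues(table: Datatable): Array<Record<string, unknown>> {
  return table.rows.map((row) => {
    const record: Record<string, unknown> = {};
    // Re-key each cell from the internal column id to the display name.
    for (const column of table.columns) {
      record[column.name] = row[column.id];
    }
    return record;
  });
}

const table: Datatable = {
  columns: [
    { id: 'col-0-2', name: 'timestamp' },
    { id: 'col-1-1', name: 'avg_bytes' },
  ],
  rows: [{ 'col-0-2': 1719774000000, 'col-1-1': 5069.333333333333 }],
};

console.log(toVegaValues(table));
```

Because every row is inlined into the spec, the spec grows with the dataset, which is the scalability concern raised for Method 3.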

3. Pros and Cons

Method 1: Passing Whole Aggregation

Method 2: Construct Aggs

Method 3: Passing Formatted Data

Conclusion

After considering all three methods, we decided to proceed with Method 1: Passing Whole Aggregation. This approach offers the best balance between maintaining consistency with existing OpenSearch Dashboards structures and providing efficient handling of complex aggregations. It avoids the potential scalability and performance issues of Method 3 while being less complex to implement and maintain than Method 2. Method 1 aligns well with the current OpenSearch Dashboards architecture and will likely provide the smoothest integration path for Vega visualizations within the existing framework. It also leaves room for future optimizations and extensions if needed.

How to Test / How to Make the Transfer Robust

To ensure the robustness and accuracy of the VegaSpecBuilder implementation, we should create a series of test cases that cover various combinations of metrics and buckets. These test cases will help verify that the VegaSpecBuilder can correctly handle different visualization configurations.

Test Cases

Future Extension Discussion

Supporting Multiple Query Languages (DQL, PPL, SQL)

Extend the VegaSpecBuilder to handle different query languages:

buildPPLQuery() {
   this.pplQuery = ...
}

buildPPLQuerySpec(pplQuery) {
  return {
    data: {
      url: {
        index: this.indexPattern.title,
        body: {
          query: {
            source: {
              query: pplQuery,
            },
          },
          size: 0,
        },
      },
      format: this.format
    },
  };
}

buildSQLQuerySpec(sqlQuery) {
  ...
}

buildWithQuerySpec(queryType = 'dql', query = '') {
  let dataUrl;
  if (queryType === 'dql') {
    dataUrl = this.buildDataUrl();
  } else if (queryType === 'ppl') {
    dataUrl = this.buildPPLQuerySpec(query);
  } else if (queryType === 'sql') {
    dataUrl = this.buildSQLQuerySpec(query);
  }
  // Pass the data source selected above on to the spec builder
  return build(dataUrl);
}

Handling Multiple Queries and Data Sources

Handle multiple queries and data sources by extending the buildVegaSpec method:

buildMultiQuerySpec(queries) {
  this.dataWithMultipleQuery = queries.map((query, index) => ({
    name: `data${index + 1}`,
    url: {
      index: this.indexPattern.title,
      body: {
        query: query.format === 'ppl' ? {
          source: {
            query: this.buildPPLQuery(),
          },
        } : {
          sql: {
            query: this.buildSQLQuery(),
          },
        },
        size: 0,
      },
    },
    format: this.format
  }));

  return build(this.dataWithMultipleQuery);
}

2.16 Timeline and Task BreakDowns

FAQ

YANG-DB commented 2 weeks ago

@ananzh very nice! Another important capability I would add is allowing the community to contribute generic vis tools as part of the out-of-the-box vis tools catalog.

YANG-DB commented 2 weeks ago

I strongly recommend reviewing the vega-altair engine, which does this same transformation from a high-level language (Python) into the vega spec (JSON).

YANG-DB commented 2 weeks ago

Another suggestion is to integrate the existing open source vega-editor to replace our existing vega JSON editor, which would simplify vega editing for advanced vis builders.

ashwin-pc commented 2 weeks ago

zooming in and out

This exists in the tool today.

Toggle in VisBuilder to allow user to display vislib vis or vega vis in VisBuilder, to save as vislib vis or vega vis and to embed either vislib vis or vega vis in Dashboard .

We should not have a toggle in the UI, since for most users Vega is an implementation detail. Only advanced users would care about it. If we want to maintain the experience for users, we should either try to match the experience or keep an advanced settings toggle that lets the user go back to the older experience.

A new vega type vis directly in VisBuilder

Why do we need this as opposed to just redirecting the user to the vega editor? If we do it this way, we should allow the user to switch back and carry context from vega back to the other chart types. Right now if I switch between line and bar and go back to line, the line chart carries over the changes that it can from the bar chart. Can we do that with this vega type?

VegaSpecBuilder Class

In this class you are also constructing the query, but it's very specific to DSL. How would this work with PPL and SQL? They each support a limited subset of aggregations and do not support all the agg types.

Supporting Multiple Query Languages (DQL, PPL, SQL)

If we aren't integrating VisBuilder into Discover, we might not need this. I would like to hear from the others about this, but my reasoning is that the user never has to enter the query that is used to fetch the data from the backend. If that's the case, the language we use under the hood does not matter. The only exception is datasources that don't support visualizations in other languages. In that scenario I'd like this to be a little more modular, so that when other languages are added, it's not on the VisType to manually update itself to support all the new languages.

One approach here could be to allow the VisType to specify which languages it supports, so that they all have to support DQL by default but can optionally specify which other languages they support. But what would be even nicer is if the VisType did not have to know anything about the language used under the hood and only worried about the dataframe that came back and mapped it to the Vis, leaving the query language part to the framework. But this might be trickier.

virajsanghvi commented 2 weeks ago
ananzh commented 2 weeks ago

A hard-coded mapping for demo purposes:

export const createVegaSpec = (styleState, dimensions, valueAxes, aggConfigs, indexPattern, searchContext) => {
  const { addLegend, addTooltip, type } = styleState;
  const { x, y } = dimensions;
  const index = indexPattern.title;
  const timeField = searchContext.timeRange ? searchContext.timeRange.field : "@timestamp"; // Use the time range field or default to "@timestamp"

  const dateHistogram = aggConfigs.aggs.find(agg => agg.schema === 'segment');
  const metric = aggConfigs.aggs.find(agg => agg.schema === 'metric');
  const metricType = metric.type.name;

  const dataUrl = {
    context: true,
    timefield: timeField,
    index: index,
    body: {
      aggs: {
        1: {
          date_histogram: {
            field: dateHistogram.params.field.displayName,
            fixed_interval: "3h", // hard coded for now
            time_zone: "America/Los_Angeles", // can be dynamic if required
            min_doc_count: dateHistogram.params.min_doc_count,
            extended_bounds: dateHistogram.params.extended_bounds,
          },
          aggs: {
            2: {
              [metricType]: {
                field: metric.params.field.displayName
              }
            }
          }
        }
      },
      size: 0
    }
  };

  const vegaSpec = {
    $schema: "https://vega.github.io/schema/vega-lite/v5.json",
    data: {
      url: dataUrl,
      format: {
        property: "aggregations.1.buckets"
      }
    },
    transform: [
      {
        calculate: "datum.key",
        as: "timestamp"
      },
      {
        calculate: `datum[2].value`,
        as: metric.params.field.displayName
      }
    ],
    layer: [
      {
        mark: {
          type: "line" // or dynamic type if needed
        }
      },
      {
        mark: {
          type: "circle",
          tooltip: addTooltip
        }
      }
    ],
    encoding: {
      x: {
        field: "timestamp",
        type: "temporal",
        axis: {
          title: timeField
        }
      },
      y: {
        field: metric.params.field.displayName,
        type: "quantitative",
        axis: {
          title: metric.params.field.displayName
        }
      },
      color: {
        datum: metric.params.field.displayName,
        type: "nominal"
      }
    }
  };

  if (addLegend) {
    vegaSpec.encoding.color.legend = {
      title: metric.params.field.displayName
    };
  }

  return vegaSpec;
};

virajsanghvi commented 1 week ago

Can you speak to the difference of the options? I'm not really sure from reading

From method 1: cons

which might not be flexible for dynamic changes.

Are there specific cases you're worried about?

we should create a series of test cases that cover various combinations of metrics and buckets

Just to be clear, we should have test cases for all known combinations, right? And can we prevent unknown combos from being used in the product in some way?

Also, do we clearly understand the expected input/output of these cases?

VegaSpecBuilder

Should we be storing unserializable state in redux?

Also, building the spec is calculated state, is this the right thing to store?

ashwin-pc commented 1 week ago

Create a vega slice

Why do we need a slice? Slices are for state that needs to be stored globally and accessed across the app. The Vega spec is only needed by the Visualization, right? Can't we just create the spec there?

Send modular API to update VegaBuilder Class

Do we need to update both the slice and the aggconfig, or can we update just the aggconfig? My assumption was that the spec could be constructed whenever we want using the style state and the agg config.

Separate buckets Both methods need to separate bucket aggregations into distinct categories: group, split, and segment. This separation is necessary because each type of aggregation serves a different purpose in the visualization:

Can you give a little more detail about this? I'm not sure I fully understood why we need this.

VegaSpecBuilder

How does this work for different VisTypes? Don't the encodings and specs change between VisTypes? E.g. pie and bar charts will encode the chart differently, right?

const vegaSpecBuilder = useTypedSelector(state => state.vega.specBuilder);

State should not be used to retrieve a function. Why can't vegaSpecBuilder be a simple function?

The Difference

In this section I didn't understand the difference between the two methods. What is method 2? I didn't understand the pros and cons of each approach well enough to know which one is better. An example might help.

Overall, the approach here could benefit from a block diagram explaining how the flow works as the information is passed across the various components.

anirudha commented 3 days ago

if we aren't integrating VisBuilder into Discover

How will SQL/PPL users build visualizations?

How will the Discover IA for visualizations be handled with multiple language support?

How will we achieve the cohesion tenet without SQL/PPL support for visualizations?