opensearch-project / OpenSearch

🔎 Open source distributed and RESTful search engine.
https://opensearch.org/docs/latest/opensearch/index/
Apache License 2.0
9.83k stars 1.83k forks

Add support for ingesting and returning data in the Apache Arrow format #16664

Open jayaskren opened 3 days ago

jayaskren commented 3 days ago

Is your feature request related to a problem? Please describe

At work, I build dashboards that show what all of our servers are processing at any given time. I am severely limited by how many data points I can add to a given dashboard: I have to either filter down the number of servers, limit the time window to days or even hours, or aggregate the data so heavily that we lose all context. I have a Vega prototype that does what I want, but our dashboard cannot handle that much data, even though Vega itself has no problem with it and the dashboard supports Vega.

Describe the solution you'd like

I want to be able to scale up my visualizations. If OpenSearch offered the ability to return data in the Apache Arrow format, we could handle a lot more data on the frontend via Vega. Other visualization technologies on the frontend could also potentially take advantage of Apache Arrow. Here is a discussion of using Vega with Apache Arrow: https://observablehq.com/@theneuralbit/introduction-to-apache-arrow

While we are at it, it shouldn't be difficult to add the ability to ingest Apache Arrow data as well. I am happy to work on the implementation, especially if someone can point me to where the code would go architecturally and what interfaces it would need to implement.

Related component

Search:Performance

Describe alternatives you've considered

I have built my own custom visualizations, but it would be nice if OpenSearch could handle this out of the box rather than me needing to go to another tool.

Additional context

Although it uses different technology, here is a prototype of the idea that I built several years ago: https://d2xis0feu0l7hz.cloudfront.net/index.html

It uses D3 on the frontend and my own columnar format as the data format. As a POC, I was able to display a table of 10 million rows of finance data. I also have a chart in which D3 aggregates all 43 million rows of data. For comparison, Excel has a limit of about 1 million rows, and Google Sheets has a limit of 10 million cells. Apache Arrow should be a good replacement for my columnar format to make the solution more standard.
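
To give a rough sense of why a columnar layout helps here, below is a minimal sketch using only the Python standard library. It is neither Apache Arrow nor my own format, just an illustration of the general idea: the same records are stored once as row-oriented JSON (similar in shape to search hits) and once as one typed array per field. The field names (`timestamp`, `server_id`, `cpu_pct`) and the helper functions are hypothetical and made up for this example.

```python
import json
import random
from array import array


def make_hits(n, seed=0):
    """Build n fake row-oriented records (hypothetical fields, synthetic values)."""
    rng = random.Random(seed)
    return [
        {
            "timestamp": 1_700_000_000 + i,
            "server_id": rng.randrange(500),
            "cpu_pct": rng.random() * 100,
        }
        for i in range(n)
    ]


def to_columns(hits):
    """Pivot the rows into one contiguous, typed array per field."""
    return {
        "timestamp": array("q", (h["timestamp"] for h in hits)),  # 64-bit signed ints
        "server_id": array("i", (h["server_id"] for h in hits)),  # platform C ints
        "cpu_pct": array("d", (h["cpu_pct"] for h in hits)),      # 64-bit floats
    }


def columnar_size(columns):
    """Total payload size in bytes of the packed column arrays."""
    return sum(len(col) * col.itemsize for col in columns.values())


hits = make_hits(100_000)
row_bytes = len(json.dumps(hits).encode("utf-8"))
columns = to_columns(hits)
col_bytes = columnar_size(columns)

print(f"row-oriented JSON: {row_bytes:,} bytes")
print(f"columnar arrays:   {col_bytes:,} bytes")
```

The JSON form repeats every field name in every record and encodes numbers as text, while the columnar form stores each field name once and each value in a fixed-width binary slot. Arrow goes further than this sketch (schema metadata, null handling, zero-copy reads), so the numbers here should only be read as a rough indication of the effect.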