opensearch-project / OpenSearch

🔎 Open source distributed and RESTful search engine.
https://opensearch.org/docs/latest/opensearch/index/
Apache License 2.0

[RFC] gRPC-based API for Search #15190

Open amberzsy opened 1 month ago

amberzsy commented 1 month ago

Is your feature request related to a problem? Please describe

Inspiration

Building on https://github.com/opensearch-project/OpenSearch/issues/6844 and its benchmarking result (https://github.com/opensearch-project/OpenSearch/issues/10684#issuecomment-1876077885) (~20%), we can go a step further and add support for a gRPC-based API that uses protobuf for serialization and deserialization. To validate our assumption that protobuf, which should be more efficient and compact than JSON, yields a performance gain, we ran a PoC of client <> server protobuf on the Search API with specific query types, and we see promising results in https://github.com/opensearch-project/opensearch-clients/issues/69.
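
To make the compactness claim concrete, here is a minimal sketch that hand-encodes a tiny search request in the protobuf wire format and compares its size with compact JSON. The message layout (field 1 = index, 2 = from, 3 = size) is an assumption made for illustration; it is not taken from opensearch-api-specification, and a real client would use generated protobuf classes rather than hand-written encoders.

```python
import json


def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a protobuf base-128 varint."""
    if value < 0:
        raise ValueError("this sketch only handles non-negative integers")
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)  # set the continuation bit: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def encode_uint_field(field_number: int, value: int) -> bytes:
    """Encode an unsigned integer field (wire type 0, varint)."""
    return encode_varint((field_number << 3) | 0) + encode_varint(value)


def encode_string_field(field_number: int, text: str) -> bytes:
    """Encode a string field (wire type 2, length-delimited)."""
    data = text.encode("utf-8")
    return encode_varint((field_number << 3) | 2) + encode_varint(len(data)) + data


# Hypothetical message layout: 1 = index (string), 2 = from (uint32), 3 = size (uint32).
request = {"index": "logs", "from": 20, "size": 10}

binary = (
    encode_string_field(1, request["index"])
    + encode_uint_field(2, request["from"])
    + encode_uint_field(3, request["size"])
)
as_json = json.dumps(request, separators=(",", ":")).encode("utf-8")

print(f"protobuf: {len(binary)} bytes, JSON: {len(as_json)} bytes")
```

The binary form is smaller mainly because field names are replaced by one-byte tags. This only illustrates payload size; it says nothing about parsing cost, which is what the linked benchmarks measure.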

Proposal

The ongoing node-to-node effort focuses on the transport layer, implementing StreamInput and StreamOutput with protobuf serializers/deserializers. We can expand that effort and add client <> server protobuf support in parallel to achieve a more significant performance gain.

The proto definitions for the Search API, and the parts that overlap with the transport layer, should follow opensearch-api-specification, which is widely adopted by clients.

For the server-side change there are two options:

  1. Introduce a new content-type and let end users send and receive protobuf binary payloads.
    Pros: a faster development cycle to begin with, as it is potentially an extension of the existing SearchRequest/SearchResponse and XContent builders.
    Cons: potentially requires significant code refactoring, which adds complexity during development.

  2. Implement a new streaming-style search API (gRPC) using protobuf and expose a new gRPC endpoint for the Search API.
    Pros:
    a) gRPC natively supports client-side, server-side, and bidirectional streaming, allowing for real-time communication. This is more efficient than the HTTP/1.1 used by REST.
    b) It generates client and server code in multiple programming languages from the proto files. This reduces boilerplate code and ensures consistency across languages and platforms.
    c) Less code refactoring.
    Cons:
    a) The development cycle might not be as fast as with approach 1.
    b) Although bringing up a new gRPC service and hooking it into the internal transport layer might not be too complicated, there will be unknowns in the overall integration with the existing ecosystem, e.g. related plugins (security, k-NN, SQL, monitoring, etc.).
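
Option 1 above essentially amounts to dispatching on the request's Content-Type before the body is parsed. A minimal sketch of that idea follows; the media type `application/x-protobuf`, the function names, and the decoder registry are all hypothetical, and the protobuf branch is left as a placeholder because the RFC does not define the message types yet.

```python
import json
from typing import Callable, Dict

JSON_CONTENT_TYPE = "application/json"
# Hypothetical media type; the RFC does not name one.
PROTOBUF_CONTENT_TYPE = "application/x-protobuf"


def decode_json(body: bytes) -> dict:
    """Existing path: parse a JSON search body."""
    return json.loads(body.decode("utf-8"))


def decode_protobuf(body: bytes) -> dict:
    """Placeholder for the new path; generated protobuf classes would parse here."""
    raise NotImplementedError("protobuf decoding is not part of this sketch")


# Registry mapping a media type to the function that decodes the request body.
DECODERS: Dict[str, Callable[[bytes], dict]] = {
    JSON_CONTENT_TYPE: decode_json,
    PROTOBUF_CONTENT_TYPE: decode_protobuf,
}


def decode_search_request(content_type: str, body: bytes) -> dict:
    """Pick a decoder from the Content-Type header, ignoring parameters such as charset."""
    media_type = content_type.split(";", 1)[0].strip().lower()
    decoder = DECODERS.get(media_type)
    if decoder is None:
        raise ValueError(f"unsupported content type: {media_type!r}")
    return decoder(body)


parsed = decode_search_request(
    "application/json; charset=utf-8", b'{"query": {"match_all": {}}}'
)
print(parsed)
```

Option 2 avoids this branching on the REST path entirely, since the gRPC endpoint would only ever receive protobuf messages.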

Clients (Java, Go, Python, etc.) would optionally be able to use the new protobuf-based server API with minimal changes (i.e. no need to rewrite an application already using the client).

Next Steps

  1. Generate proto from opensearch-api-specification (refer: https://github.com/nytimes/openapi2proto)
  2. Bootstrap/create the gRPC SearchService (SearchGRPCService) and hook it into the internal layer (ClusterService, ActionListener, etc.)
  3. grpcHandlers for searchAction: add grpc/action/search and register in ActionModule
  4. There are ~ 40+ queryBuilder/types, need to target on knn related as . (? CorrelationQuery)
  5. ?? integrate with transport layer protobuf implementation (node-to-node)

Timeline

2.17 release: (09/03/2024 ~ 09/17/2024) [Experimental Feature]

  1. protobuf definitions
  2. simple matchAll query for an E2E PoC.
  3. feature will be marked as experimental.

Related

Transport layer Protobuf support: https://github.com/opensearch-project/OpenSearch/issues/6844

getsaurabh02 commented 1 month ago

Thanks @amberzsy for the proposal. Should we also highlight that the new 'gRPC SearchService' abstraction will sit behind an 'Experimental Flag' for the proposed timeline of this feature?

dblock commented 1 month ago

Clients (Java, Go, Python, etc.) would optionally be able to use the new protobuf-based server API with minimal changes (i.e. no need to rewrite an application already using the client).

I really like this. Do I understand correctly that the stated goal of this implementation is that a user can switch from REST/HTTP/application/(nd)json to HTTP2/grpc/protobuf via a configuration option on the client (and then it just works(TM) for all APIs)?

amberzsy commented 1 month ago

Clients (Java, Go, Python, etc.) would optionally be able to use the new protobuf-based server API with minimal changes (i.e. no need to rewrite an application already using the client).

I really like this. Do I understand correctly that the stated goal of this implementation is that a user can switch from REST/HTTP/application/(nd)json to HTTP2/grpc/protobuf via a configuration option on the client (and then it just works(TM) for all APIs)?

Correct. Some lightweight translator/adaptor would be needed.
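
As a rough sketch of what such a configuration switch could look like on the client side: every class and parameter name below is hypothetical, and both transports are stand-ins that do not perform any network I/O. The point is only that application code calls the same `search` method regardless of which transport is configured.

```python
import json
from typing import Protocol


class Transport(Protocol):
    """Interface both transports implement (hypothetical)."""

    def search(self, index: str, body: dict) -> dict: ...


class RestJsonTransport:
    """Stand-in for the existing REST/JSON path."""

    def search(self, index: str, body: dict) -> dict:
        payload = json.dumps(body)  # would be sent as the HTTP request body
        return {"transport": "rest", "index": index, "request_bytes": len(payload)}


class GrpcProtobufTransport:
    """Stand-in for the gRPC path; the translator/adaptor would map the dict to protobuf here."""

    def search(self, index: str, body: dict) -> dict:
        return {"transport": "grpc", "index": index}


class SearchClient:
    """Client whose public API stays the same regardless of the configured transport."""

    _TRANSPORTS = {"rest": RestJsonTransport, "grpc": GrpcProtobufTransport}

    def __init__(self, transport: str = "rest") -> None:
        if transport not in self._TRANSPORTS:
            raise ValueError(f"unknown transport: {transport!r}")
        self._transport: Transport = self._TRANSPORTS[transport]()

    def search(self, index: str, body: dict) -> dict:
        return self._transport.search(index, body)


query = {"query": {"match_all": {}}}
rest_result = SearchClient().search("logs", query)
grpc_result = SearchClient(transport="grpc").search("logs", query)
print(rest_result["transport"], grpc_result["transport"])
```

Under this assumption, switching an existing application would be a one-argument change at client construction.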

reta commented 3 weeks ago

@amberzsy @dblock I have two questions please:

  1. 3.x comes with HTTP/2 support (clients + server) out of the box, what are the tangible benefits of using gRPC here?
  2. 2.x does not support HTTP/2 (server side) nor have any client libraries that could handle that (AHC 4.x does not support HTTP/2), what is our plan here?

dblock commented 3 weeks ago

Re: benefits I expect grpc + protobuf to improve both performance and throughput over HTTP/2 JSON. You're right to call this out though, @amberzsy were your benchmarks using HTTP/2?

andrross commented 3 weeks ago

@dblock The previous benchmarks for the REST API were just sending binary protobuf blobs over the HTTP/1.1 protocol. It essentially showed that parsing protobuf was more performant than XContent parsing JSON (no surprise there). I expect any solution that is able to replace XContent parsing with protobuf to show performance improvements. I don't know if gRPC would show better performance when compared to any other HTTP/2-based solution that sent protobuf blobs but I think it is worth experimenting with some prototypes.

reta commented 3 weeks ago

I don't know if gRPC would show better performance when compared to any other HTTP/2-based solution that sent protobuf blobs but I think it is worth experimenting with some prototypes.

Thanks @andrross , this is exactly what we need to figure out: tangible benefits of using gRPC vs HTTP/2 + JSON (since this RFC specifically focuses on gRPC and not HTTP/1.1 + Protobuf). Thank you.

amberzsy commented 3 weeks ago

Re: benefits I expect grpc + protobuf to improve both performance and throughput over HTTP/2 JSON. You're right to call this out though, @amberzsy were your benchmarks using HTTP/2?

With HTTP/1.1.

@amberzsy @dblock I have two questions please:

  1. 3.x comes with HTTP/2 support (clients + server) out of the box, what are the tangible benefits of using gRPC here?
  2. 2.x does not support HTTP/2 (server side) nor have any client libraries that could handle that (AHC 4.x does not support HTTP/2), what is our plan here?

gRPC uses HTTP/2 as its transfer protocol and has built-in protobuf support as its default serialization format. In the benchmarks of both client-server and node-to-node, we've seen a perf gain from adopting protobuf and replacing the XContent parser logic. I think HTTP/2 alone (HTTP/2 + JSON) might not achieve a similar improvement. HTTP/2 + proto possibly could, though I'm not sure it's commonly adopted, since I guess we would need to manually handle serialization, deserialization, and method invocation across different languages, which adds complexity. Beyond that, gRPC simplifies development: it abstracts the underlying communication details, and clients are generated for free in multiple programming languages, which reduces boilerplate. I guess we would need to write and maintain those manually with HTTP/2. It also provides other benefits such as built-in support for load balancing, distributed tracing, and authentication.

prudhvigodithi commented 2 weeks ago

Thanks for the proposal @amberzsy, I just went through some of the OpenSearch issue links that talk about the Protobuf implementation. https://github.com/opensearch-project/OpenSearch/issues/6844 https://github.com/opensearch-project/OpenSearch/issues/15190

Regarding this RFC proposal, the input and output will be in Protobuf binary format (including the streaming-style search API with a gRPC endpoint). For OpenSearch users, to ensure that the API behavior remains unchanged, is there a plan to implement a generic interface that converts Protobuf messages back to a JSON-friendly format for output? Additionally, could this interface be used to read user input as JSON and convert it back to Protocol Buffers? Thank you

andrross commented 2 weeks ago

@prudhvigodithi

For OpenSearch users, to ensure that the API behavior remains unchanged, is there a plan to implement a generic interface that converts Protobuf messages back to a JSON-friendly format for output?

There are no plans to remove the existing JSON APIs.