opensearch-project / OpenSearch

🔎 Open source distributed and RESTful search engine.
https://opensearch.org/docs/latest/opensearch/index/
Apache License 2.0

[Streaming Indexing] Introduce bulk Protobuf API streaming flavour #15447

Open reta opened 3 months ago

reta commented 3 months ago

Is your feature request related to a problem? Please describe

The bulk HTTP API does not support streaming (neither HTTP/2 nor chunked transfer).

Describe the solution you'd like

Introduce a streaming flavour of the bulk Protobuf API (please see https://github.com/opensearch-project/OpenSearch/issues/9070#issuecomment-2307452157), based on the new experimental transport (https://github.com/opensearch-project/OpenSearch/issues/9067).

Describe alternatives you've considered

N/A

Additional context

Please see https://github.com/opensearch-project/OpenSearch/issues/9067

Introduce efficient (binary?) format for streaming ingestion

An alternative option (to https://github.com/opensearch-project/OpenSearch/issues/9070) is to introduce a new, efficient (binary?) format for streaming ingestion (for example, based on Protocol Buffers).

Protocol Buffers are great for handling individual messages within a large data set. Usually, large data sets are a collection of small pieces, where each small piece is structured data. - https://protobuf.dev/programming-guides/techniques/

An example message schema might look like this:

syntax = "proto3";
import "google/protobuf/any.proto";

message Index {
  optional string index = 1;
  optional string _id = 2;
  optional bool require_alias = 3;
  map<string, google.protobuf.Any> fields = 4;
}

message Create {
  optional string index = 1;
  optional string _id = 2;
  optional bool require_alias = 3;
  map<string, google.protobuf.Any> fields = 4;
}

message Delete {
  optional string index = 1;
  string _id = 2;
  optional bool require_alias = 3;
}

message Update {
  optional string index = 1;
  string _id = 2;
  optional bool require_alias = 3;
  optional google.protobuf.Any doc = 4;
}

message Action {
  oneof action {
    Index index = 1;
    Create create = 2;
    Delete delete = 3;
    Update update = 4;
  }
}

The schema actively relies on google.protobuf.Any to pass freestyle JSON-like structures around (for example, documents or scripts):

The Any message type lets you use messages as embedded types without having their .proto definition. An Any contains an arbitrary serialized message as bytes, along with a URL that acts as a globally unique identifier for and resolves to that message’s type. - https://protobuf.dev/programming-guides/proto3/#any
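For reference, here is a sketch of the shape of the well-known Any type, abridged from google/protobuf/any.proto (the comments are paraphrased, not the upstream ones). Because the value holds the bytes of a serialized message, each entry in the fields map would have to be wrapped in some message type and would carry its own type URL. google.protobuf.Value is named below only as one possible wrapper; that choice is an assumption, not something the proposal specifies.

```protobuf
syntax = "proto3";
package google.protobuf;

message Any {
  // URL identifying the type of the serialized message, for example
  // "type.googleapis.com/google.protobuf.Value" (assumed wrapper type).
  string type_url = 1;
  // The serialized bytes of a message of that type.
  bytes value = 2;
}
```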

Risks to consider:

Related component

Indexing

Describe alternatives you've considered

Stay on HTTP APIs only (https://github.com/opensearch-project/OpenSearch/issues/9070)

Additional context

Please see https://github.com/opensearch-project/OpenSearch/issues/9067

msfroh commented 3 months ago

The schema actively relies on google.protobuf.Any to pass freestyle JSON-like structures around (for example, documents or scripts):

I've seen two other options used to pass around documents when using Protobuf in search use-cases:

  1. Fields are a list of key-value pairs.
    1. Keys are strings (since they're the field names).
    2. The values may either be strings (which get parsed to other primitive types based on the mapping) or a union type that supports passing numbers as numbers (which is still trickier than JSON, since you potentially need to support multiple number types). The union type could support lists, or you could just represent lists as k-v pairs with a repeated key.
    3. You can support object nesting either by allowing a field value to be a Document (where Document is the type with the fields), or by having a separate k-v list for nested documents. (To be fair, I think I've only seen the separate list when nested objects were added later.)
  2. A document is an opaque (to Protobuf) byte array, which would probably just be a JSON string encoded as UTF-8.
    1. Doing "JSON over Protobuf" probably loses a lot of the advantage of Protobuf, but it's very easy.
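To make the two options easier to compare, here is a rough proto3 sketch of each. Every message and field name below (Document, Field, FieldValue, ValueList, OpaqueDocument) is hypothetical and chosen only for illustration. The set of number types in the union is also an assumption; a real schema would need to decide which ones to support.

```protobuf
syntax = "proto3";

// Hypothetical names throughout; none of these messages exist in OpenSearch.

// Option 1: a document is a list of key-value pairs.
message Document {
  repeated Field fields = 1;
}

message Field {
  // Field name, as it appears in the mapping.
  string name = 1;
  FieldValue value = 2;
}

// Union type so that numbers can travel as numbers rather than as strings.
message FieldValue {
  oneof kind {
    string string_value = 1;
    sint64 long_value = 2;
    double double_value = 3;
    bool bool_value = 4;
    Document object_value = 5;  // nested object
    ValueList list_value = 6;   // or repeat the key in Document.fields instead
  }
}

// A oneof cannot hold a repeated field directly, hence the wrapper message.
message ValueList {
  repeated FieldValue values = 1;
}

// Option 2: the document is opaque to Protobuf.
message OpaqueDocument {
  // For example, a JSON string encoded as UTF-8.
  bytes source = 1;
}
```

With the first option the server can read typed values without a JSON parser, at the cost of a wider schema. With the second, the Protobuf layer only handles framing and the existing JSON parsing path stays in use.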

I'm still not sure what solution I'd like to see, but wanted to document those options.