[RFC] JSON-to-JSON Transformer

jackiehanyang commented 8 months ago

Is your feature request related to a problem? Please describe

Flow-Framework aims to make OpenSearch the easiest destination for building AI/ML applications on vector databases, differentiating OpenSearch on ease of use and flexibility in the emerging and highly competitive vector database landscape. To do this, it aims to revamp how we build AI/ML flows so that they can support any AI use case suited for OpenSearch. This requires a paradigm that lets users compose AI-augmented workflows from modular, reusable search and ingest processors that can represent any relevant AI use case. We will give users the flexibility to configure these processors by introducing a JSON-to-JSON Transformer. This tool will allow customers to transform processor inputs and outputs so that processors can be chained together seamlessly.

The JSON-to-JSON transformer functions as a standalone utility within the Core package. It enables users to configure transformations from one or more JSON formats to another, such as converting input JSON objects (e.g., search results from a previous flow step) into a different JSON format like a prompt template. It offers three approaches for data transformation: Painless script (P0 item), the string manipulation function JSONPath (P0 item), and automated transformation based on specified inputs and outputs (P1 item). This utility should be standalone so that it can be integrated into any processor as a data transformation step, either before or after the processor execution flow.

[Image: j-j-1 drawio]

Describe the solution you'd like

Provide a public utility method in the core package that can be used by any processor. Depending on future requirements, we can expose this utility method through a REST API, or even as a processor.

public static JsonNode JsonDataTransformation(List<JsonNode> dataset,
                                              DataTransformApproach approach,
                                              List<String> source) {
   ...
}

List<JsonNode>: the dataset to transform. Usually it's a list of SearchHits objects.
DataTransformApproach approach: an enum value, either PAINLESS or JSONPATH, selecting the approach the customer would like to use to transform the dataset.
List<String> source: the Painless script source, or the JSONPath field mapping instructions.

Supported Transform Approach 1. Painless Script

Painless is a performant, secure scripting language that provides numerous capabilities. Writing Painless scripts can be challenging for customers, and we aim to eliminate that difficulty. However, we still want to maintain this method as the default approach, allowing customers to achieve their objectives when the JSONPath string manipulation functions are not enough.

Supported Transform Approach 2. String Manipulation (JSONPath)

JSONPath is a query language designed for navigating and extracting parts of a JSON document. With JSONPath, you can specify and navigate to different parts of a JSON structure, making it easier to retrieve specific data elements without needing to process the entire structure manually in code.

JSONPath has been cleared by AppSec for use in ml-commons since 2.12. We will initiate another AppSec review for this use case.

2.1. N-1 Transform: Merge multiple JSONs into one JSON or other format of data

In some cases, the transform has to be applied in a “many-to-one” mode, transforming multiple objects such as search results into a single JSON output. For instance, a re-ranker model may require the incoming search results (hits.fields) to be collapsed into a single array of strings as input to the re-ranker (e.g., Cohere ReRank).

For example, when the customer has the following input:

[
    {
        "hits": [
            {
                "_index": "media_library",
                "_id": "63MhYY0BFJSF4M0W0eUG",
                "_score": 1,
                "_source": {
                    "books": {
                        "name": "To Kill a Mockingbird",
                        "author": "Harper Lee",
                        "genres": "fiction",
                        "price": 15.99
                    },
                    "songs": {
                        "name": "Pocketful of Sunshine"
                    }
                }
            }
        ]
    },
    {
        "hits": [
            {
                "_index": "books_songs",
                "_id": "5nMhYY0BFJSF4M0W0eUG",
                "_score": 1,
                "_source": {
                    "books": {
                        "name": "Where the Crawdads Sing",
                        "author": "Delia Owens",
                        "genres": "fiction",
                        "cost": 12.99,
                        "year": 2018
                    },
                    "songs": {
                        "name": "If"
                    }
                }
            }
        ]
    }
]

The customer will need to provide the following JSONPath transform instruction:

{
    "book_name": "$[*].hits[*]._source.books.name",
    "song_name": "$[*].hits[*]._source.songs.name"
}

The output would be

{
    "book_name": ["To Kill a Mockingbird", "Where the Crawdads Sing"],
    "song_name": ["Pocketful of Sunshine", "If"]
}
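
As a rough illustration of the N-1 example above, the following is a minimal sketch that evaluates the two mapping paths over plain Java Map and List objects. It is not the proposed implementation: the class name NToOneTransformSketch and the helpers select, transform, and searchResponse are hypothetical, the evaluator handles only the "$", ".name", and "[*]" steps used in this example, and a real utility would presumably delegate to a JSONPath library and operate on JsonNode.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: evaluates a tiny JSONPath subset over plain Java objects. */
final class NToOneTransformSketch {

    /** Supports only "$", ".name" child steps, and "[*]" wildcards. */
    static List<Object> select(Object root, String path) {
        if (!path.startsWith("$")) {
            throw new IllegalArgumentException("path must start with '$': " + path);
        }
        List<Object> current = new ArrayList<>();
        current.add(root);
        // "$[*].hits[*].x" becomes the steps: [*], hits, [*], x
        String rest = path.substring(1).replace("[*]", ".[*]");
        for (String step : rest.split("\\.")) {
            if (step.isEmpty()) {
                continue;
            }
            List<Object> next = new ArrayList<>();
            for (Object node : current) {
                if (step.equals("[*]")) {
                    // Wildcard: fan out over every element of a list.
                    if (node instanceof List<?>) {
                        next.addAll((List<?>) node);
                    }
                } else if (node instanceof Map<?, ?>) {
                    // Child step: follow the key if it exists.
                    Object child = ((Map<?, ?>) node).get(step);
                    if (child != null) {
                        next.add(child);
                    }
                }
            }
            current = next;
        }
        return current;
    }

    /** Applies each "output key -> path" instruction to the whole input list. */
    static Map<String, List<Object>> transform(List<?> inputs, Map<String, String> mapping) {
        Map<String, List<Object>> output = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : mapping.entrySet()) {
            output.put(entry.getKey(), select(inputs, entry.getValue()));
        }
        return output;
    }

    /** Builds a trimmed-down search response with one hit. */
    static Map<String, Object> searchResponse(String book, String song) {
        return Map.of("hits", List.of(
            Map.of("_source", Map.of(
                "books", Map.of("name", book),
                "songs", Map.of("name", song)))));
    }

    public static void main(String[] args) {
        List<Object> inputs = List.of(
            searchResponse("To Kill a Mockingbird", "Pocketful of Sunshine"),
            searchResponse("Where the Crawdads Sing", "If"));
        Map<String, String> mapping = new LinkedHashMap<>();
        mapping.put("book_name", "$[*].hits[*]._source.books.name");
        mapping.put("song_name", "$[*].hits[*]._source.songs.name");
        System.out.println(transform(inputs, mapping));
    }
}
```

Because the sketch walks in-memory objects, no JSON text is parsed or serialized along the way. The 1-1 case is the same call with a single-element input list.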

2.2. 1-1 Transform: Map a specific field in one JSON to another JSON

A 1-1 Transform is essentially the same as an N-1 Transform, with N equal to 1. Therefore, we don't need a separate DataTransformApproach enum value to differentiate between 1-1 and N-1 Transforms. However, for a 1-N Transform scenario, customers would need to use a Painless script, as JSONPath may not be sufficient for such transformations.

Related component

Other

Describe alternatives you've considered

No response

Additional context

No response

navneet1v commented 8 months ago

@jackiehanyang do we know the impact on the JVM and on latency of running a JSON-to-JSON transform (using Painless and JSONPath) on a search response containing, let's say, 100 to 1000 results? I'm taking a search response as the reference because it tends to be quite big.

It would be good if we can have some micro-benchmarks done on this to understand the impact of this transform.

jackiehanyang commented 8 months ago

> @jackiehanyang do we know the impact on the JVM and on latency of running a JSON-to-JSON transform (using Painless and JSONPath) on a search response containing, let's say, 100 to 1000 results? I'm taking a search response as the reference because it tends to be quite big.
>
> It would be good if we can have some micro-benchmarks done on this to understand the impact of this transform.

Will share the result once I have it

smacrakis commented 8 months ago

Would it make sense to use a binary format (e.g., Protobuf, Thrift, CBOR?) for communication between processors? It seems perversely inefficient to serialize/deserialize JSON multiple times in a pipeline. Does the ongoing work on Protobuf in OpenSearch help here?

arjunkumargiri commented 8 months ago

Thanks for building this functionality. A couple of follow-up questions:

1. Is this transformer expected to perform data manipulation only? Painless supports multiple scripting functionalities in addition to data manipulation, so by adding Painless support, users could use the transformer to perform non-data-manipulation operations.
2. Why does the output need to include the document ID? Default JSONPath does not include the ID. Also, this approach would not work for non-search document input.

jackiehanyang commented 8 months ago

> Would it make sense to use a binary format (e.g., Protobuf, Thrift, CBOR?) for communication between processors? It seems perversely inefficient to serialize/deserialize JSON multiple times in a pipeline. Does the ongoing work on Protobuf in OpenSearch help here?

I do agree that using a binary format for communication between processors is more efficient than serializing/deserializing JSON. I'm checking with Dylan to see if Protobuf is something we should consider.

jackiehanyang commented 8 months ago

> Thanks for building this functionality. A couple of follow-up questions:
>
> 1. Is this transformer expected to perform data manipulation only? Painless supports multiple scripting functionalities in addition to data manipulation, so by adding Painless support, users could use the transformer to perform non-data-manipulation operations.
>
> 2. Why does the output need to include the document ID? Default JSONPath does not include the ID. Also, this approach would not work for non-search document input.

1. Yes, this transformer only performs data manipulation. It won't modify any data values.
2. Because when merging multiple documents into one document, we need a way to differentiate JSON key names. If it's a non-search document, we will need to append a GUID to the key name to make it unique.

msfroh commented 8 months ago

Does this belong in https://github.com/opensearch-project/flow-framework ?

It doesn't seem to be related to OpenSearch core.

jackiehanyang commented 7 months ago

> However, for an N-1 Transform scenario, customers would need to use a painless script, as JSONPath may not be sufficient for such transformations.

We're planning to develop this as a standalone utility function within the core repository. This will allow each processor that requires pre/post data transformation to call this function instead of integrating it as a processor or workflow step within the flow-framework. This approach aims to reduce dependency coupling and limitations when transitioning to serverless.

arjunkumargiri commented 7 months ago

Did you explore other JSON-to-JSON transformer libraries, such as Jolt (https://github.com/bazaarvoice/jolt)?

dylan-tong-aws commented 7 months ago

> Would it make sense to use a binary format (e.g., Protobuf, Thrift, CBOR?) for communication between processors? It seems perversely inefficient to serialize/deserialize JSON multiple times in a pipeline. Does the ongoing work on Protobuf in OpenSearch help here?

Performance is definitely important, and there are many possible solutions. I was under the impression that the transformations are performed between Java (JSON) objects, and that deserialization/serialization is not required.

@jackiehanyang is serialization/deserialization required?

dylan-tong-aws commented 7 months ago

The JSONPath option will need to be complemented with some helper String manipulation functions for it to be useful for a broader range of use cases.

I recommend reviewing the pre/post processors that were implemented for the AI connectors and identifying the transform logic that can't be translated into JSONPath. I believe there are string manipulation cases like escaping strings.

@ylwu-amzn, @zane-neo, and @Zhangxunmt should be able to help identify this gap.
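
To make the string-escaping case mentioned above concrete, here is a minimal sketch of one such helper: it escapes a raw string so it can be embedded in a JSON string literal, which JSONPath extraction alone does not do. The class and method names (JsonStringEscapeSketch, escapeJson) are hypothetical and are not taken from the AI connector pre/post processors.

```java
/** Hypothetical helper: escapes text for embedding inside a JSON string literal. */
final class JsonStringEscapeSketch {

    static String escapeJson(String raw) {
        StringBuilder sb = new StringBuilder(raw.length() + 16);
        for (int i = 0; i < raw.length(); i++) {
            char c = raw.charAt(i);
            switch (c) {
                case '"':  sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n");  break;
                case '\r': sb.append("\\r");  break;
                case '\t': sb.append("\\t");  break;
                case '\b': sb.append("\\b");  break;
                case '\f': sb.append("\\f");  break;
                default:
                    if (c < 0x20) {
                        // Remaining control characters must use the \\uXXXX form.
                        sb.append(String.format("\\u%04x", (int) c));
                    } else {
                        sb.append(c);
                    }
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // A value pulled out of a document may contain quotes and newlines.
        String passage = "She said \"hi\"\nand left.";
        String prompt = "{\"prompt\": \"" + escapeJson(passage) + "\"}";
        System.out.println(prompt);
    }
}
```

A helper like this would run after a JSONPath extraction and before the value is substituted into a prompt template or connector request body.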

dylan-tong-aws commented 7 months ago

@jackiehanyang, can you provide an example of how the interface for this functionality might look within a processor?

@mingshl, @ylwu-amzn, this is the data transform functionality that can replace the pre/post processing functionality that currently exists in the AI connectors. It would be good to see a proposal of how this functionality is interfaced through the ML inference processor (search pipelines).

peternied commented 7 months ago

[Triage - attendees 1 2 3 4 5 6] @jackiehanyang Thanks for creating this RFC. Looking forward to seeing how this lands.

msfroh commented 7 months ago

I would just like to point out that no (ingest or search) processors in OpenSearch operate on JSON. They operate on (Java) objects. JSON is just a notation to represent objects (originally for JavaScript).

If the goal is to support JSONPath as a language to manipulate objects, one option could be to add JSONPath as a scripting language supported by the OpenSearch scripting engine, then use the script processor with JSONPath as a script.

mingshl commented 7 months ago

> @jackiehanyang, can you provide an example of how the interface for this functionality might look within a processor?
>
> @mingshl, @ylwu-amzn, this is the data transform functionality that can replace the pre/post processing functionality that currently exists in the AI connectors. It would be good to see a proposal of how this functionality is interfaced through the ML inference processor (search pipelines).

The ML inference processor now supports JSONPath. If this JSON-to-JSON transform is to use a Painless script, then chaining a script processor with the ML inference processor would serve the same purpose.

b4sjoo commented 6 months ago

Overall this looks good to me. The only caution is the sanitization of the customer's input data: if it is not done properly, it could enable a DoS attack.
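
One way to picture the sanitization concern above is a pre-check on user-supplied JSONPath mapping instructions before they are evaluated. The sketch below is only an assumption about what such a guard could look like: the class name TransformInputGuardSketch, the validate method, and all three limits are hypothetical, and real limits would need to come from benchmarking.

```java
import java.util.Map;

/** Hypothetical pre-check for user-supplied JSONPath mapping instructions. */
final class TransformInputGuardSketch {

    // Illustrative limits only; not values proposed in this RFC.
    static final int MAX_MAPPINGS = 50;
    static final int MAX_PATH_LENGTH = 256;
    static final int MAX_WILDCARDS = 4;

    static void validate(Map<String, String> mapping) {
        if (mapping.size() > MAX_MAPPINGS) {
            throw new IllegalArgumentException("too many mappings: " + mapping.size());
        }
        for (Map.Entry<String, String> entry : mapping.entrySet()) {
            String path = entry.getValue();
            if (path == null || !path.startsWith("$")) {
                throw new IllegalArgumentException(
                    "path for '" + entry.getKey() + "' must start with '$'");
            }
            if (path.length() > MAX_PATH_LENGTH) {
                throw new IllegalArgumentException(
                    "path for '" + entry.getKey() + "' is too long");
            }
            // Recursive descent ("..") can visit every node of a large response.
            if (path.contains("..")) {
                throw new IllegalArgumentException(
                    "recursive descent is not allowed in '" + entry.getKey() + "'");
            }
            // Each wildcard multiplies the number of nodes that may be visited.
            long wildcards = path.chars().filter(c -> c == '*').count();
            if (wildcards > MAX_WILDCARDS) {
                throw new IllegalArgumentException(
                    "too many wildcards in '" + entry.getKey() + "'");
            }
        }
    }

    public static void main(String[] args) {
        // The mapping from the RFC example passes the check.
        validate(Map.of("book_name", "$[*].hits[*]._source.books.name"));
        try {
            validate(Map.of("everything", "$..*"));
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

Checks like these bound only the instruction itself. The size of the dataset being transformed and the cost of Painless scripts would need separate limits.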

andrross commented 6 months ago

> The JSON-to-JSON transformer functions as a standalone utility within the Core package. It enables users to configure transformations from one or more JSON formats to another...

Is there any coupling to classes or concepts within this repository or is it truly a completely standalone Java utility that operates on arbitrary JSON? When you say "users" here, do you mean other Java developers that would take a dependency on whatever artifact defines this utility? You give an example of a "customer" providing two different JSON objects and getting a third one as output, but what is the interface? Is it just Java utility functions or is there some feature to be implemented within this repository that will provide that experience?

jackiehanyang commented 6 months ago

> The JSON-to-JSON transformer functions as a standalone utility within the Core package. It enables users to configure transformations from one or more JSON formats to another...
>
> Is there any coupling to classes or concepts within this repository or is it truly a completely standalone Java utility that operates on arbitrary JSON? When you say "users" here, do you mean other Java developers that would take a dependency on whatever artifact defines this utility? You give an example of a "customer" providing two different JSON objects and getting a third one as output, but what is the interface? Is it just Java utility functions or is there some feature to be implemented within this repository that will provide that experience?

It is just a Java utility function, and the users of this JSON-JSON transformer utility method are the processor owners. Every processor that needs to perform data transformation could leverage this utility method. Currently, in the 2.15 release cycle, we want to support the example I provided in this RFC by leveraging JSONPath. We will identify the gaps and limitations of JSONPath and aim to support more complicated data manipulation in future releases.

andrross commented 6 months ago

Thanks @jackiehanyang! My first instinct is that I'm hesitant for the OpenSearch repo to become the owner of utility code that is not used within the repo itself, because we certainly have enough code here as it is :) However, I wouldn't stand in the way of adding useful functionality in something like libs/common if it makes sense and is useful for other consumers of that library.

But it sounds like you've got a path forward in the short term and we can revisit this once you get more information on using JSONPath.

opensearch-project / OpenSearch