opensearch-project / neural-search

Plugin that adds dense neural retrieval into the OpenSearch ecosystem
Apache License 2.0

[FEATURE] Allow user-defined functions for score normalization and combination in hybrid queries #994

Open martin-gaievski opened 1 week ago

martin-gaievski commented 1 week ago

Is your feature request related to a problem?

Currently, OpenSearch provides internal functionality for score normalization and combination in hybrid queries, as outlined in the Normalization Processor documentation. However, there is a need to allow users to define their own custom functions for these operations instead of solely relying on OpenSearch's internal mechanisms.
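For context, the built-in behavior can be approximated as below. This is a minimal sketch, assuming min-max normalization and a weighted arithmetic mean (two of the techniques the normalization processor supports). The function names `min_max_normalize` and `arithmetic_mean_combine` are hypothetical and this is not the plugin's actual code.

```python
def min_max_normalize(scores):
    """Rescale the scores of one sub-query into the range [0, 1]."""
    if not scores:
        return []
    lo, hi = min(scores), max(scores)
    if hi == lo:
        # Assumption: when all scores are equal, treat every hit as a top hit.
        return [1.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def arithmetic_mean_combine(normalized, weights):
    """Weighted mean of one document's normalized scores across sub-queries."""
    total_weight = sum(weights)
    return sum(s * w for s, w in zip(normalized, weights)) / total_weight
```

The request is to let users replace either of these two steps with their own logic.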

The ability to define user-specific functions will provide more flexibility and control over how scores are normalized and combined, especially for advanced use cases where the built-in functionality may not suffice.

What solution would you like?

Introduce the ability for users to define custom functions for score normalization and combination in hybrid queries. These functions could be implemented using:

We could go further and support invoking external scripts (e.g., Python or SQL) for more sophisticated logic in cases where the internal scripting options are not sufficient. This would allow users to execute pre-defined models or complex scoring algorithms that are managed externally.

Benefits:

Use Case:

For example, a user may want to combine results from multiple models (e.g., semantic search and traditional keyword search) and apply a custom score normalization function. Using a Painless script, the user could adjust the score of each result based on a combination of the model score and some external business logic (e.g., boosting certain results based on document metadata or user preferences).
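To make this use case concrete, here is a rough sketch of what pluggable normalization and combination functions might look like. It is written in Python purely for illustration (not Painless), and every name in it (`rank_normalize`, `boosted_mean`, `hybrid_rank`, the `featured` metadata field, the 1.5 boost factor) is a hypothetical assumption, not an existing OpenSearch API.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical signatures for the user-supplied functions.
Normalizer = Callable[[List[float]], List[float]]
Combiner = Callable[[List[float], Dict], float]


def rank_normalize(scores: List[float]) -> List[float]:
    """Custom normalization: score each hit by 1 / rank within its sub-query."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    out = [0.0] * len(scores)
    for rank, idx in enumerate(order):
        out[idx] = 1.0 / (rank + 1)
    return out


def boosted_mean(scores: List[float], doc: Dict) -> float:
    """Custom combination: plain mean, boosted by document metadata."""
    base = sum(scores) / len(scores)
    # Assumed business rule: documents flagged "featured" get a 1.5x boost.
    return base * (1.5 if doc.get("featured") else 1.0)


def hybrid_rank(docs: List[Dict],
                per_query_scores: List[List[float]],
                normalize: Normalizer,
                combine: Combiner) -> List[Tuple[str, float]]:
    """Apply the user functions; each score list is aligned with `docs`."""
    normalized = [normalize(scores) for scores in per_query_scores]
    results = []
    for i, doc in enumerate(docs):
        doc_scores = [column[i] for column in normalized]
        results.append((doc["id"], combine(doc_scores, doc)))
    return sorted(results, key=lambda r: r[1], reverse=True)


docs = [{"id": "a"}, {"id": "b", "featured": True}]
# One score list per sub-query, e.g. neural scores then keyword scores.
ranked = hybrid_rank(docs, [[0.9, 0.8], [3.0, 5.0]], rank_normalize, boosted_mean)
```

Passing the functions in as arguments keeps the normalization and combination steps independent, so a user could override one and keep the built-in behavior for the other.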

Alternatively, a user might prefer to use a Python script to implement a more complex machine learning model for score normalization, offering them the flexibility to include custom ranking logic, external data, or ML-based techniques.

navneet1v commented 21 hours ago

@martin-gaievski I remember ML Commons has pre- and post-processing functions that can run on the embeddings. We should see how they are doing it.