opensearch-project / neural-search

Plugin that adds dense neural retrieval into the OpenSearch ecosystem
Apache License 2.0

[BUG] Nested fields in field_map cause pipeline to fail. #109

Closed dmille closed 1 year ago

dmille commented 1 year ago

What is the bug?

When defining a field_map containing nested fields, the pipeline fails to compute embeddings.

How can one reproduce the bug?

With the following configuration, which uses a non-nested source field, embeddings are computed:

PUT /_ingest/pipeline/neural_pipeline
{
  "description": "Neural Search Pipeline for message content",
  "processors": [
    {
      "text_embedding": {
        "model_id": "SXXx8YUBR2ZWhVQIkghB",
        "field_map": {
          "message": "message_embedding"
        }
      }
    }
  ]
}
PUT /neural-test-index
{
    "settings": {
        "index.knn": true,
        "default_pipeline": "neural_pipeline"
    },
    "mappings": {
        "properties": {
            "message_embedding": {
                "type": "knn_vector",
                "dimension": 384,
                "method": {
                    "name": "hnsw",
                    "engine": "lucene"
                }
            },
            "message": { 
                "type": "text"            
            },
            "color": {
                "type": "text"
            }
        }
    }
}

POST /_bulk
{"create":{"_index":"neural-test-index","_id":"0"}}
{"message":"Text 1","color":"red"}
{"create":{"_index":"neural-test-index","_id":"1"}}
{"message":"Text 2","color":"black"}

GET /neural-test-index/_search
DELETE /neural-test-index

With the following configuration using a nested source field, embeddings are not computed:

PUT /_ingest/pipeline/neural_pipeline_nested
{
  "description": "Neural Search Pipeline for message content",
  "processors": [
    {
      "text_embedding": {
        "model_id": "SXXx8YUBR2ZWhVQIkghB",
        "field_map": {
          "message.text": "message_embedding"
        }
      }
    }
  ]
}

PUT /neural-test-index-nested
{
    "settings": {
        "index.knn": true,
        "default_pipeline": "neural_pipeline_nested"
    },
    "mappings": {
        "properties": {
            "message_embedding": {
                "type": "knn_vector",
                "dimension": 384,
                "method": {
                    "name": "hnsw",
                    "engine": "lucene"
                }
            },
            "message.text": { 
                "type": "text"            
            },
            "color": {
                "type": "text"
            }
        }
    }
}

POST /_bulk
{"create":{"_index":"neural-test-index-nested","_id":"0"}}
{"message":{"text":"Text 1"},"color":"red"}
{"create":{"_index":"neural-test-index-nested","_id":"1"}}
{"message":{"text":"Text 2"}, "color": "black"}

GET /neural-test-index-nested/_search
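
One way to confirm which documents came back without an embedding is to scan the `_source` of each hit in the search response. The sketch below is only illustrative: the helper names `has_path` and `ids_missing_field` are hypothetical, and `sample_response` is a hand-written stand-in for a real `_search` response, not output captured from the cluster.

```python
from typing import Any, Dict, List


def has_path(source: Dict[str, Any], path: str) -> bool:
    """Return True if a dotted path such as "message.text" exists in source."""
    node: Any = source
    for part in path.split("."):
        if not isinstance(node, dict) or part not in node:
            return False
        node = node[part]
    return True


def ids_missing_field(search_response: Dict[str, Any], path: str) -> List[str]:
    """List the _id of every hit whose _source lacks the given field path."""
    hits = search_response.get("hits", {}).get("hits", [])
    return [hit["_id"] for hit in hits if not has_path(hit.get("_source", {}), path)]


# Hypothetical response: document "0" has no embedding, document "1" does.
sample_response = {
    "hits": {
        "hits": [
            {"_id": "0", "_source": {"message": {"text": "Text 1"}, "color": "red"}},
            {
                "_id": "1",
                "_source": {
                    "message": {"text": "Text 2"},
                    "color": "black",
                    "message_embedding": [0.1, 0.2],
                },
            },
        ]
    }
}

print(ids_missing_field(sample_response, "message_embedding"))
```

The path argument accepts dot notation, so the same check can be pointed at a field inside an object if that is where the embedding ends up.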

What is the expected behavior?

The neural ingestion pipeline should be able to handle nested fields.

What is your host/environment?

docker image: opensearchproject/opensearch:2.5.0

Do you have any additional context?

The models referenced above were uploaded with the following configuration:

{
  "name": "all-MiniLM-L6-v2",
  "version": "1.0.0",
  "description": "sentence transformers model",
  "model_format": "TORCH_SCRIPT",
  "model_config": {
    "model_type": "bert",
    "embedding_dimension": 384,
    "framework_type": "sentence_transformers"
  },
  "url": "https://github.com/opensearch-project/ml-commons/raw/2.x/ml-algorithms/src/test/resources/org/opensearch/ml/engine/algorithms/text_embedding/all-MiniLM-L6-v2_torchscript_sentence-transformer.zip?raw=true"
}

navneet1v commented 1 year ago

Hi @dmille, thanks for reaching out. I ran the experiment, and yes, defining the nested field with dot notation in the pipeline won't work. The pipeline does support nested fields, though. To use one, create the pipeline like this:

PUT /_ingest/pipeline/neural_pipeline_nested
{
    "description": "Neural Search Pipeline for message content",
    "processors": [
        {
            "text_embedding": {
                "model_id": "SXXx8YUBR2ZWhVQIkghB",
                "field_map": {
                    "message": {
                        "text": "message_embedding"
                    }
                }
            }
        }
    ]
}

Right now the TextEmbedding processor doesn't interpret the "." operator as a nested-field separator. I tested this on my side, and defining the field_map as a nested object, as in the request above, works and handles nested fields.

I think this is something the plugin could support. I will create a feature request for it.
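
Until dot notation is supported, a dotted field_map can be rewritten into the nested form on the client side before the pipeline is created. The following is a minimal sketch of that conversion, not part of the plugin: `nest_field_map` is a hypothetical helper, and it assumes every dot in a key is meant as a nesting separator.

```python
from typing import Any, Dict


def nest_field_map(field_map: Dict[str, Any], sep: str = ".") -> Dict[str, Any]:
    """Expand dotted keys such as "message.text" into nested maps.

    Values are copied through unchanged; only the keys are split.
    """
    nested: Dict[str, Any] = {}
    for key, target in field_map.items():
        parts = key.split(sep)
        node = nested
        # Walk (or create) the intermediate objects for every part but the last.
        for part in parts[:-1]:
            child = node.setdefault(part, {})
            if not isinstance(child, dict):
                raise ValueError(f"{key!r} conflicts with an existing leaf mapping")
            node = child
        leaf = parts[-1]
        if leaf in node:
            raise ValueError(f"duplicate or conflicting mapping for {key!r}")
        node[leaf] = target
    return nested


print(nest_field_map({"message.text": "message_embedding"}))
```

For the field_map from the bug report this yields the nested structure used in the pipeline above. Keys without a dot pass through as they are, and mappings that would overwrite each other raise a `ValueError` instead of being merged silently.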

dmille commented 1 year ago

@navneet1v Thanks for the prompt reply! This fixed my problem.

navneet1v commented 1 year ago

I am closing this issue. I have created a new GitHub issue to track the feature request: https://github.com/opensearch-project/neural-search/issues/110