opensearch-project / neural-search

Plugin that adds dense neural retrieval into the OpenSearch ecosystem
Apache License 2.0

[BUG] _bulk update request failing when using text chunking processor pipeline #798

Closed janmederly closed 1 month ago

janmederly commented 3 months ago

Describe the bug

When performing a _bulk update request on an index that uses the text chunking processor, I get {"took":0,"ingest_took":1,"errors":true,"items":[{"index":{"_index":null,"_id":null,"status":500,"error":{"type":"null_pointer_exception","reason":"Cannot invoke \"Object.toString()\" because the return value of \"java.util.Map.get(Object)\" is null"}}}]}. There is no error when I am not using the text chunking processor, or when I use the regular update API.

Example request:

curl -H "Content-Type: application/json" -X POST "https://localhost:9200/_bulk" -u "admin:xxxxx" --insecure -d '
{ "update": { "_id": "test", "_index": "docs-chunks"} }
{"doc": {"text": "testing testing"}, "doc_as_upsert": true}
'

Example response:

{"took":0,"ingest_took":1,"errors":true,"items":[{"index":{"_index":null,"_id":null,"status":500,"error":{"type":"null_pointer_exception","reason":"Cannot invoke \"Object.toString()\" because the return value of \"java.util.Map.get(Object)\" is null"}}}]}

Related component

Indexing

To Reproduce

  1. Deploy text model
  2. Create text chunking pipeline
  3. Create index with the text chunking pipeline as default pipeline
  4. Try to post bulk update request
  5. Error should appear

Expected behavior

Successfully update OpenSearch documents.

Additional Details

Plugins

[opensearch@opensearch-cluster-master-0 ~]$ bin/opensearch-plugin list
opensearch-alerting
opensearch-anomaly-detection
opensearch-asynchronous-search
opensearch-cross-cluster-replication
opensearch-custom-codecs
opensearch-flow-framework
opensearch-geospatial
opensearch-index-management
opensearch-job-scheduler
opensearch-knn
opensearch-ml
opensearch-neural-search
opensearch-notifications
opensearch-notifications-core
opensearch-observability
opensearch-performance-analyzer
opensearch-reports-scheduler
opensearch-security
opensearch-security-analytics
opensearch-skills
opensearch-sql

Host/Environment (please complete the following information):

Additional context

ML model used: https://huggingface.co/sentence-transformers/multi-qa-MiniLM-L6-cos-v1

Text chunking pipeline:

{ "description": "A text chunking and embedding ingest pipeline", "processors": [ { "text_chunking": { "algorithm": { "fixed_token_length": { "token_limit": 350, "overlap_rate": 0.2, "tokenizer": "standard" } }, "field_map": { "text": "passage_chunk" } } }, { "text_embedding": { "model_id": "ueVVfo4Bvd-X9jaivNwl", "field_map": { "passage_chunk": "passage_embedding" } } } ] }

Index settings and mappings:

{ "settings": { "index": { "number_of_shards": 2, "number_of_replicas": 2, "knn": true, "default_pipeline": "text-chunking-embedding-ingest-pipeline", "analyze": { "max_token_count": 1000000 } } }, "mappings": { "properties": { "text": { "type": "text", "fields": { "keyword": { "type": "keyword", "ignore_above": 256 } } }, "passage_embedding": { "type": "nested", "properties": { "knn": { "type": "knn_vector", "dimension": 384 } } } } } }

peternied commented 3 months ago

[Triage - attendees 1 2 3 4 5] @janmederly Thanks for creating this issue. Would you like to create a pull request to address it?

peternied commented 3 months ago

[Triage - attendees 1 2 3 4 5] @opensearch-project/admin Could you transfer this to neural-search? This looks related to the processor defined in that codebase. (Thanks @andrross)

janmederly commented 3 months ago

No, I have no idea how to fix it :).

chishui commented 3 months ago

@yuye-aws could you take a look?

yuye-aws commented 2 months ago

Hi @janmederly! Can you share your bulk request details? I have tested both bulk index and bulk update, but did not receive the error with the text chunking processor.

yuye-aws commented 2 months ago

Here are my requests:

PUT /_ingest/pipeline/text-chunking-ingest-pipeline
{
  "description": "A text chunking ingest pipeline",
  "processors": [
    {
      "text_chunking": {
        "algorithm": {
          "fixed_token_length": {
            "token_limit": 350,
            "overlap_rate": 0.2,
            "tokenizer": "standard"
          }
        },
        "field_map": {
          "text": "passage_chunk"
        }
      }
    }
  ]
}

PUT testindex
{
  "settings": {
    "index": {
      "number_of_shards": 2,
      "number_of_replicas": 2,
      "knn": true,
      "default_pipeline": "text-chunking-ingest-pipeline",
      "analyze": {
        "max_token_count": 1000000
      }
    }
  },
  "mappings": {
    "properties": {
      "text": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        }
      }
    }
  }
}

POST _bulk
{"index":{"_index":"testindex","_id":"0"}}
{"text":"first document"}
{"index":{"_index":"testindex","_id":"1"}}
{"text":"second document"}
{"index":{"_index":"testindex","_id":"2"}}
{"text":"third document"}

POST _bulk
{"update":{"_index":"testindex","_id":"0"}}
{"doc": {"text":"first document"}}
{"update":{"_index":"testindex","_id":"1"}}
{"doc": {"text":"second document"}}
{"update":{"_index":"testindex","_id":"2"}}
{"doc": {"text":"third document"}}

janmederly commented 2 months ago

Hi @yuye-aws, I tried it again with these requests:

PUT _ingest/pipeline/text-chunking-embedding-ingest-pipeline-test
{
  "description": "A text chunking and embedding ingest pipeline",
  "processors": [
    {
      "text_chunking": {
        "algorithm": {
          "fixed_token_length": {
            "token_limit": 10,
            "overlap_rate": 0.2,
            "tokenizer": "standard"
          }
        },
        "field_map": {
          "passage_text": "passage_chunk"
        }
      }
    },
    {
      "text_embedding": {
        "model_id": "ueVVfo4Bvd-X9jaivNwl",
        "field_map": {
          "passage_chunk": "passage_chunk_embedding"
        }
      }
    }
  ]
}

PUT testindex5
{
  "settings": {
    "index": {
      "knn": true,
      "default_pipeline": "text-chunking-embedding-ingest-pipeline-test"
    }
  },
  "mappings": {
    "properties": {
      "text": {
        "type": "text"
      },
      "passage_chunk_embedding": {
        "type": "nested",
        "properties": {
          "knn": {
            "type": "knn_vector",
            "dimension": 384
          }
        }
      }
    }
  }
}

POST _bulk
{"index":{"_index":"testindex5","_id":"0"}}
{"text":"first document"}
{"index":{"_index":"testindex5","_id":"1"}}
{"text":"second document"}
{"index":{"_index":"testindex5","_id":"2"}}
{"text":"third document"}

POST _bulk
{"update":{"_index":"testindex5","_id":"0"}}
{"doc": {"text":"first document"}}
{"update":{"_index":"testindex5","_id":"1"}}
{"doc": {"text":"second document"}}
{"update":{"_index":"testindex5","_id":"2"}}
{"doc": {"text":"third document"}}

POST _bulk
{"update":{"_index":"testindex5","_id":"0"}}
{"doc": {"text":"first document"}, "doc_as_upsert": true}
{"update":{"_index":"testindex5","_id":"1"}}
{"doc": {"text":"second document"}, "doc_as_upsert": true}
{"update":{"_index":"testindex5","_id":"2"}}
{"doc": {"text":"third document"}, "doc_as_upsert": true}

The first two bulk requests ran without any problem, but when I added the doc_as_upsert parameter, I got the following response:

{
  "took": 0,
  "ingest_took": 1,
  "errors": true,
  "items": [
    {
      "index": {
        "_index": null,
        "_id": null,
        "status": 500,
        "error": {
          "type": "null_pointer_exception",
          "reason": """Cannot invoke "Object.toString()" because the return value of "java.util.Map.get(Object)" is null"""
        }
      }
    },
    {
      "index": {
        "_index": null,
        "_id": null,
        "status": 500,
        "error": {
          "type": "null_pointer_exception",
          "reason": """Cannot invoke "Object.toString()" because the return value of "java.util.Map.get(Object)" is null"""
        }
      }
    },
    {
      "index": {
        "_index": null,
        "_id": null,
        "status": 500,
        "error": {
          "type": "null_pointer_exception",
          "reason": """Cannot invoke "Object.toString()" because the return value of "java.util.Map.get(Object)" is null"""
        }
      }
    }
  ]
}

yuye-aws commented 2 months ago

I am looking into the bug this week. Can you share your model configuration? I suspect the Object.toString() bug is not caused by the text chunking processor itself.

janmederly commented 2 months ago

If you meant the embedding model configuration, I used these requests:

POST /_plugins/_ml/model_groups/_register
{
  "name": "NLP_model_group",
  "description": "A model group for NLP models",
  "access_mode": "public"
}

POST /_plugins/_ml/models/_register
{
  "name": "huggingface/sentence-transformers/multi-qa-MiniLM-L6-cos-v1",
  "version": "1.0.1",
  "model_group_id": "uQpTfo4BaIPJj8PM5N4T",
  "model_format": "TORCH_SCRIPT"
}

POST /_plugins/_ml/models/ueVVfo4Bvd-X9jaivNwl/_deploy

yuye-aws commented 2 months ago

I am looking into the bug. I found that updating a single document at a time works fine. As a temporary workaround, you can continue with a request like this:

POST /testindex5/_update/0
{
  "doc": {
    "text": "first updated document"
  },
  "doc_as_upsert": true
}

yuye-aws commented 2 months ago

If you need to use the bulk API, you can use the index action instead of update.
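
For example, a bulk request that uses the index action goes through the default ingest pipeline as expected. Keep in mind that index writes the whole document rather than merging fields the way update does, so the full document has to be resent. The index and field names below just follow the earlier examples:

POST _bulk
{"index":{"_index":"testindex5","_id":"0"}}
{"text":"first updated document"}
{"index":{"_index":"testindex5","_id":"1"}}
{"text":"second updated document"}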

yuye-aws commented 2 months ago

There is currently a bug in OpenSearch where a bulk update without doc_as_upsert does not go through the ingest pipeline: https://github.com/opensearch-project/OpenSearch/issues/10864. That's why this error only happens when doc_as_upsert is set to true.
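
One way to observe this (assuming the index and field names from the earlier examples) is to fetch a document after each kind of bulk update:

GET testindex5/_doc/0

After a plain bulk update, passage_chunk and passage_chunk_embedding in _source still reflect whatever was last indexed through the pipeline (or are absent if the document never went through it), because the update skipped the pipeline. With doc_as_upsert set to true, the request does reach the pipeline, which is where the NullPointerException above is thrown.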

yuye-aws commented 2 months ago

Raised a PR in core to fix this issue: https://github.com/opensearch-project/OpenSearch/pull/14721. Feel free to provide any opinions, thanks!

janmederly commented 2 months ago

Thank you very much

yuye-aws commented 2 months ago

Hi @janmederly! Thanks to Binlong, this PR has been merged into the OpenSearch repo: https://github.com/opensearch-project/OpenSearch/pull/12891/. This bug should be resolved as of OpenSearch 2.16. Can we close this issue?