Closed: janmederly closed this issue 1 month ago
No, I don't have an idea how to fix it :).
@yuye-aws could you take a look?
Hi @janmederly ! Can you share your bulk request details? I have tested both bulk index and bulk update, but did not receive the error with the text chunking processor.
Here are my requests:
PUT /_ingest/pipeline/text-chunking-ingest-pipeline
{
"description": "A text chunking ingest pipeline",
"processors": [
{
"text_chunking": {
"algorithm": {
"fixed_token_length": {
"token_limit": 350,
"overlap_rate": 0.2,
"tokenizer": "standard"
}
},
"field_map": {
"text": "passage_chunk"
}
}
}
]
}
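(An aside, not from the thread: before indexing anything, a pipeline like this can be exercised directly with the standard simulate API, which runs the processors against an inline document; the sample document below is illustrative.)

POST _ingest/pipeline/text-chunking-ingest-pipeline/_simulate
{
"docs": [
{ "_source": { "text": "first document" } }
]
}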
PUT testindex
{
"settings": {
"index": {
"number_of_shards": 2,
"number_of_replicas": 2,
"knn": true,
"default_pipeline": "text-chunking-ingest-pipeline",
"analyze": {
"max_token_count": 1000000
}
}
},
"mappings": {
"properties": {
"text": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
}
}
}
}
POST _bulk
{"index":{"_index":"testindex","_id":"0"}}
{"text":"first document"}
{"index":{"_index":"testindex","_id":"1"}}
{"text":"second document"}
{"index":{"_index":"testindex","_id":"2"}}
{"text":"third document"}
POST _bulk
{"update":{"_index":"testindex","_id":"0"}}
{"doc": {"text":"first document"}}
{"update":{"_index":"testindex","_id":"1"}}
{"doc": {"text":"second document"}}
{"update":{"_index":"testindex","_id":"2"}}
{"doc": {"text":"third document"}}
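(To confirm the pipeline actually ran during the bulk requests above, one of the documents can be fetched and checked for the generated passage_chunk field; this uses the index and id from the requests above.)

GET testindex/_doc/0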
Hi @yuye-aws , I tried it again with these requests:
PUT _ingest/pipeline/text-chunking-embedding-ingest-pipeline-test
{
"description": "A text chunking and embedding ingest pipeline",
"processors": [
{
"text_chunking": {
"algorithm": {
"fixed_token_length": {
"token_limit": 10,
"overlap_rate": 0.2,
"tokenizer": "standard"
}
},
"field_map": {
"passage_text": "passage_chunk"
}
}
},
{
"text_embedding": {
"model_id": "ueVVfo4Bvd-X9jaivNwl",
"field_map": {
"passage_chunk": "passage_chunk_embedding"
}
}
}
]
}
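Roughly, fixed_token_length chunking with an overlap rate behaves like the following simplified Python sketch. This is only an approximation for intuition: it uses whitespace tokenization in place of the standard tokenizer, and the processor's real internals may differ.

```python
def chunk_fixed_token_length(text, token_limit=10, overlap_rate=0.2):
    """Split text into chunks of at most token_limit tokens, where
    consecutive chunks share token_limit * overlap_rate tokens.
    Whitespace split stands in for the 'standard' tokenizer."""
    tokens = text.split()
    overlap = int(token_limit * overlap_rate)  # 2 tokens for 10 * 0.2
    stride = token_limit - overlap             # advance 8 tokens per chunk
    chunks = []
    start = 0
    while start < len(tokens):
        chunks.append(" ".join(tokens[start:start + token_limit]))
        if start + token_limit >= len(tokens):
            break
        start += stride
    return chunks
```

With token_limit 10 and overlap_rate 0.2, each chunk after the first repeats the last 2 tokens of the previous chunk, which is why a short passage can still produce several overlapping chunks.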
PUT testindex5
{
"settings": {
"index": {
"knn": true,
"default_pipeline": "text-chunking-embedding-ingest-pipeline-test"
}
},
"mappings": {
"properties": {
"text": {
"type": "text"
},
"passage_chunk_embedding": {
"type": "nested",
"properties": {
"knn": {
"type": "knn_vector",
"dimension": 384
}
}
}
}
}
}
POST _bulk
{"index":{"_index":"testindex5","_id":"0"}}
{"text":"first document"}
{"index":{"_index":"testindex5","_id":"1"}}
{"text":"second document"}
{"index":{"_index":"testindex5","_id":"2"}}
{"text":"third document"}
POST _bulk
{"update":{"_index":"testindex5","_id":"0"}}
{"doc": {"text":"first document"}}
{"update":{"_index":"testindex5","_id":"1"}}
{"doc": {"text":"second document"}}
{"update":{"_index":"testindex5","_id":"2"}}
{"doc": {"text":"third document"}}
POST _bulk
{"update":{"_index":"testindex5","_id":"0"}}
{"doc": {"text":"first document"}, "doc_as_upsert": true}
{"update":{"_index":"testindex5","_id":"1"}}
{"doc": {"text":"second document"}, "doc_as_upsert": true}
{"update":{"_index":"testindex5","_id":"2"}}
{"doc": {"text":"third document"}, "doc_as_upsert": true}
The first two bulk requests ran without any problem, but when I added the doc_as_upsert
parameter, I got the following response:
{
"took": 0,
"ingest_took": 1,
"errors": true,
"items": [
{
"index": {
"_index": null,
"_id": null,
"status": 500,
"error": {
"type": "null_pointer_exception",
"reason": """Cannot invoke "Object.toString()" because the return value of "java.util.Map.get(Object)" is null"""
}
}
},
{
"index": {
"_index": null,
"_id": null,
"status": 500,
"error": {
"type": "null_pointer_exception",
"reason": """Cannot invoke "Object.toString()" because the return value of "java.util.Map.get(Object)" is null"""
}
}
},
{
"index": {
"_index": null,
"_id": null,
"status": 500,
"error": {
"type": "null_pointer_exception",
"reason": """Cannot invoke "Object.toString()" because the return value of "java.util.Map.get(Object)" is null"""
}
}
}
]
}
I am taking a look into the bug this week. Can you share your model configuration? I suspect the Object.toString() bug is not caused by the text chunking processor.
If you mean the embedding model configuration, I used these requests:
POST /_plugins/_ml/model_groups/_register
{
"name": "NLP_model_group",
"description": "A model group for NLP models",
"access_mode": "public"
}
POST /_plugins/_ml/models/_register
{
"name": "huggingface/sentence-transformers/multi-qa-MiniLM-L6-cos-v1",
"version": "1.0.1",
"model_group_id": "uQpTfo4BaIPJj8PM5N4T",
"model_format": "TORCH_SCRIPT"
}
POST /_plugins/_ml/models/ueVVfo4Bvd-X9jaivNwl/_deploy
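(An aside, not from the thread: the _register call returns a task_id, and registration progress can be polled through the ML Commons task API before deploying; <task_id> below is a placeholder, not a value from this thread.)

GET /_plugins/_ml/tasks/<task_id>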
I am looking into the bug. I found that updating a single document at a time is bug-free, so for now you can work around the issue with requests like this:
POST /testindex5/_update/0
{
"doc": {
"text": "first updated document"
},
"doc_as_upsert": true
}
If you need to use the bulk API, you can use index actions instead of update actions.
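As a sketch of that "index instead of update" workaround (a hypothetical helper, not part of the thread): each update action line in the NDJSON bulk body is followed by a {"doc": ...} payload, so the pair can be rewritten as an index action with the raw doc as the source, dropping doc_as_upsert.

```python
import json

def updates_to_index_actions(bulk_body: str) -> str:
    """Rewrite a _bulk body of update actions into index actions.
    Assumes each action line is immediately followed by its payload line,
    as in the bulk requests shown earlier in this thread."""
    lines = [line for line in bulk_body.splitlines() if line.strip()]
    out = []
    for action_line, payload_line in zip(lines[0::2], lines[1::2]):
        action = json.loads(action_line)
        payload = json.loads(payload_line)
        if "update" in action:
            # Reuse the same _index/_id; the doc becomes the full source.
            out.append(json.dumps({"index": action["update"]}))
            out.append(json.dumps(payload["doc"]))
        else:
            # Pass non-update pairs through unchanged.
            out.append(action_line)
            out.append(payload_line)
    return "\n".join(out) + "\n"
```

Note this changes semantics: an index action replaces the whole document rather than merging fields, so it is only equivalent when the doc carries the complete source.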
There is currently a bug in OpenSearch where a bulk update without doc_as_upsert does not go through the ingest pipeline: https://github.com/opensearch-project/OpenSearch/issues/10864. That's why this bug only occurs when doc_as_upsert is set to true.
Raised a PR in core to fix this issue: https://github.com/opensearch-project/OpenSearch/pull/14721. Feel free to provide any opinions, thanks!
Thank you very much
Hi @janmederly ! Thanks to Binlong, this PR has been merged into the OpenSearch repo: https://github.com/opensearch-project/OpenSearch/pull/12891/. The bug should be resolved as of OpenSearch 2.16. Can we close this issue?
Describe the bug
When performing a _bulk update request while using the text chunking processor, I am getting:
{"took":0,"ingest_took":1,"errors":true,"items":[{"index":{"_index":null,"_id":null,"status":500,"error":{"type":"null_pointer_exception","reason":"Cannot invoke \"Object.toString()\" because the return value of \"java.util.Map.get(Object)\" is null"}}}]}
There is no error when I am not using the text chunking processor, or when I use the regular update API.
Example request:
curl -H "Content-Type: application/json" -X POST "https://localhost:9200/_bulk" -u "admin:xxxxx" --insecure -d '
{ "update": { "_id": "test", "_index": "docs-chunks"} }
{"doc": {"text": "testing testing"}, "doc_as_upsert": true}
'
Example response:
{"took":0,"ingest_took":1,"errors":true,"items":[{"index":{"_index":null,"_id":null,"status":500,"error":{"type":"null_pointer_exception","reason":"Cannot invoke \"Object.toString()\" because the return value of \"java.util.Map.get(Object)\" is null"}}}]}
Related component
Indexing
To Reproduce
Expected behavior
Successfully update OpenSearch documents.
Additional Details
Plugins
[opensearch@opensearch-cluster-master-0 ~]$ bin/opensearch-plugin list
opensearch-alerting
opensearch-anomaly-detection
opensearch-asynchronous-search
opensearch-cross-cluster-replication
opensearch-custom-codecs
opensearch-flow-framework
opensearch-geospatial
opensearch-index-management
opensearch-job-scheduler
opensearch-knn
opensearch-ml
opensearch-neural-search
opensearch-notifications
opensearch-notifications-core
opensearch-observability
opensearch-performance-analyzer
opensearch-reports-scheduler
opensearch-security
opensearch-security-analytics
opensearch-skills
opensearch-sql
Host/Environment (please complete the following information):
Additional context
ML model used: https://huggingface.co/sentence-transformers/multi-qa-MiniLM-L6-cos-v1
Text chunking pipeline:
{
"description": "A text chunking and embedding ingest pipeline",
"processors": [
{
"text_chunking": {
"algorithm": {
"fixed_token_length": {
"token_limit": 350,
"overlap_rate": 0.2,
"tokenizer": "standard"
}
},
"field_map": {
"text": "passage_chunk"
}
}
},
{
"text_embedding": {
"model_id": "ueVVfo4Bvd-X9jaivNwl",
"field_map": {
"passage_chunk": "passage_embedding"
}
}
}
]
}
Index settings and mappings:
{
"settings": {
"index": {
"number_of_shards": 2,
"number_of_replicas": 2,
"knn": true,
"default_pipeline": "text-chunking-embedding-ingest-pipeline",
"analyze": {
"max_token_count": 1000000
}
}
},
"mappings": {
"properties": {
"text": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
"passage_embedding": {
"type": "nested",
"properties": {
"knn": {
"type": "knn_vector",
"dimension": 384
}
}
}
}
}
}