opensearch-project / OpenSearch

🔎 Open source distributed and RESTful search engine.
https://opensearch.org/docs/latest/opensearch/index/
Apache License 2.0

[BUG] NPE related to existing shards allocation #14570

Open ngc4579 opened 1 week ago

ngc4579 commented 1 week ago

Describe the bug

We've just updated OpenSearch to version 2.15.0 and enabled batching for the ExistingShardsAllocator:

cluster.allocator.existing_shards_allocator.batch_enabled: true
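(For reference: whether the setting actually took effect can be double-checked against the cluster settings API - with include_defaults the key should show up in the output, and flat_settings makes it easy to grep for:)

GET /_cluster/settings?include_defaults=true&flat_settings=true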

After that, one of our clients feeding documents into OpenSearch started to intermittently throw exceptions with the following root cause:

OpenSearchException[OpenSearch exception [type=null_pointer_exception, reason=Cannot invoke "org.opensearch.cluster.routing.AllocationId.getId()" because the return value of "org.opensearch.cluster.routing.ShardRouting.allocationId()" is null]]

Unfortunately, this is the only log line I have - OpenSearch itself does not log anything. The line above is reported by an OpenSearch Connector pushing data from Kafka topics into OpenSearch.

The entire stack trace from the client (OpenSearch Connector for Kafka) looks like this:

"org.apache.kafka.connect.errors.ConnectException: Exiting WorkerSinkTask due to unrecoverable exception.
    at org.apache.kafka.connect.runtime.WorkerSinkTask.deliverMessages(WorkerSinkTask.java:632)
    at org.apache.kafka.connect.runtime.WorkerSinkTask.poll(WorkerSinkTask.java:350)
    at org.apache.kafka.connect.runtime.WorkerSinkTask.iteration(WorkerSinkTask.java:250)
    at org.apache.kafka.connect.runtime.WorkerSinkTask.execute(WorkerSinkTask.java:219)
    at org.apache.kafka.connect.runtime.WorkerTask.doRun(WorkerTask.java:204)
    at org.apache.kafka.connect.runtime.WorkerTask.run(WorkerTask.java:259)
    at org.apache.kafka.connect.runtime.isolation.Plugins.lambda$withClassLoader$1(Plugins.java:237)
    at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539)
    at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
    at java.base/java.lang.Thread.run(Thread.java:840)
Caused by: org.apache.kafka.connect.errors.ConnectException: Failed to bulk processing after total of 6 attempt(s)
    at io.aiven.kafka.connect.opensearch.RetryUtil.callWithRetry(RetryUtil.java:144)
    at io.aiven.kafka.connect.opensearch.BulkProcessor$BulkTask.execute(BulkProcessor.java:398)
    at io.aiven.kafka.connect.opensearch.BulkProcessor$BulkTask.call(BulkProcessor.java:370)
    at io.aiven.kafka.connect.opensearch.BulkProcessor$BulkTask.call(BulkProcessor.java:351)
    ... 4 more
Caused by: io.aiven.kafka.connect.opensearch.BulkProcessor$BulkTask$1RetriableError: One of the item in the bulk response failed. Reason: OpenSearchException[OpenSearch exception [type=null_pointer_exception, reason=Cannot invoke "org.opensearch.cluster.routing.AllocationId.getId()" because the return value of "org.opensearch.cluster.routing.ShardRouting.allocationId()" is null]]
    at io.aiven.kafka.connect.opensearch.BulkProcessor$BulkTask.lambda$execute$0(BulkProcessor.java:420)
    at io.aiven.kafka.connect.opensearch.RetryUtil.callWithRetry(RetryUtil.java:119)
    ... 7 more

Related component

Other

To Reproduce

I'm actually unsure how to reproduce this - the issue appeared after enabling cluster.allocator.existing_shards_allocator.batch_enabled.

Index configuration:

{
  "index.codec": "zstd_no_dict",
  "index.codec.compression_level": "3",
  "index.creation_date": "1715975376188",
  "index.number_of_replicas": "1",
  "index.number_of_shards": "3",
  "index.provided_name": "<redacted>.folder.document",
  "index.refresh_interval": "30s",
  "index.replication.type": "SEGMENT",
  "index.uuid": "WuNzIa2ZQKefXT5e2AGI9w",
  "index.version.created": "136357827"
}
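(The settings above are the flattened form returned by the index settings API; the same dump can be pulled for any index, here with a placeholder index name:)

GET /<index>/_settings?pretty&flat_settings=true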

Cluster health:

{
  "cluster_name": "opensearch",
  "status": "green",
  "timed_out": false,
  "number_of_nodes": 3,
  "number_of_data_nodes": 3,
  "discovered_master": true,
  "discovered_cluster_manager": true,
  "active_primary_shards": 54,
  "active_shards": 107,
  "relocating_shards": 0,
  "initializing_shards": 0,
  "unassigned_shards": 0,
  "delayed_unassigned_shards": 0,
  "number_of_pending_tasks": 0,
  "number_of_in_flight_fetch": 0,
  "task_max_waiting_in_queue_millis": 0,
  "active_shards_percent_as_number": 100
}

Expected behavior

No exception thrown.

Additional Details

No response

rajiv-kv commented 6 days ago

@ngc4579 - Would it be possible to provide additional details in the To Reproduce section?

ngc4579 commented 6 days ago

@rajiv-kv Thanks for getting back to me. I've added the index configuration. As for the cluster configuration - what exactly do you need? The output of GET /_cluster/settings?pretty?

Regarding a sample bulk payload - I'm not sure I can provide one, as that is handled by the connector feeding data into OpenSearch.

rajiv-kv commented 6 days ago

Thanks @ngc4579 for the reply.

The output of GET /_cluster/health would suffice.

If you cannot share the bulk payload, could you share the field mappings of the index so that we can create a test payload? The output of GET /<index>?pretty would work.

ngc4579 commented 6 days ago

@rajiv-kv Output of GET /_cluster/health:

{
  "cluster_name": "opensearch",
  "status": "green",
  "timed_out": false,
  "number_of_nodes": 3,
  "number_of_data_nodes": 3,
  "discovered_master": true,
  "discovered_cluster_manager": true,
  "active_primary_shards": 54,
  "active_shards": 107,
  "relocating_shards": 0,
  "initializing_shards": 0,
  "unassigned_shards": 0,
  "delayed_unassigned_shards": 0,
  "number_of_pending_tasks": 0,
  "number_of_in_flight_fetch": 0,
  "task_max_waiting_in_queue_millis": 0,
  "active_shards_percent_as_number": 100
}

The index field mappings span almost 1000 lines - is that even necessary? I'd rather attach a shortened version, if that's of any help...
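(If it helps, I could also limit what I attach to just the mappings - as far as I know, GET /<index>/_mapping?pretty returns only the mapping section, and filter_path can trim it further, e.g.:)

GET /<index>/_mapping?pretty&filter_path=*.mappings.properties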

shwetathareja commented 5 days ago

@ngc4579 Do you have a consistent repro, i.e. everything runs fine when you disable this setting, and you start to see the NPE once you enable the setting again?

cluster.allocator.existing_shards_allocator.batch_enabled: true

shwetathareja commented 5 days ago

Also, please enable debug logs for the OpenSearch process.
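For example, logger levels can be raised dynamically via the cluster settings API - a sketch, assuming the allocation code paths of interest live under org.opensearch.gateway and org.opensearch.cluster.routing.allocation:

PUT /_cluster/settings
{
  "persistent": {
    "logger.org.opensearch.gateway": "DEBUG",
    "logger.org.opensearch.cluster.routing.allocation": "DEBUG"
  }
}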