Open lihuimingxs opened 7 months ago
I'm not sure whether this is a client issue, since I'm using OpenSearch Java client version 2.11.1, but I suspect it isn't.
Does this not happen if you don't use a GPU? It doesn't seem related to GPU usage.
Yes, I have not encountered this exception while using the CPU.
@lihuimingxs Can you provide more details? What GPU are you using? What does your cluster setup look like: how many data nodes and ML nodes do you have, and are you using GPU instances as data nodes?
@ylwu-amzn Sure! I have 6 data nodes (including 1 master node) and 2 ML nodes (GPUs). The ML nodes are solely utilized for vector computations and do not store any data.
Note: Later, due to business requirements, I expanded the cluster to 8 data nodes and 4 ML nodes, yet the issue persisted. So I suspect the issue is not closely related to the number of nodes. I hope this additional detail helps.
ML nodes environment:
index:
PUT irp_cre_vec_20240425
{
  "aliases": {
    "irp_cre_vec": {}
  },
  "mappings": {
    "_source": {
      "excludes": [
        "embeddingCnContent1",
        "embeddingCnContent2"
      ]
    },
    "properties": {
      "embeddingCnContent1": {
        "type": "text",
        "index": false
      },
      "embeddingCnContent2": {
        "type": "text",
        "index": false
      },
      "embeddingCnVector1": {
        "type": "knn_vector",
        "dimension": 1024,
        "method": {
          "engine": "faiss",
          "space_type": "l2",
          "name": "hnsw",
          "parameters": {}
        }
      },
      "embeddingCnVector2": {
        "type": "knn_vector",
        "dimension": 1024,
        "method": {
          "engine": "faiss",
          "space_type": "l2",
          "name": "hnsw",
          "parameters": {}
        }
      },
      "id": {
        "type": "keyword"
      }
    }
  },
  "settings": {
    "index": {
      "replication": {
        "type": "DOCUMENT"
      },
      "mapping": {
        "total_fields": {
          "limit": "1000"
        }
      },
      "search": {
        "default_pipeline": "cre_v2_search_model_pipeline"
      },
      "number_of_shards": "1",
      "max_result_window": "10000",
      "default_pipeline": "cre_v2_convert_pipeline",
      "knn": "true",
      "number_of_replicas": "0"
    }
  }
}
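For reference, a search body targeting one of the `knn_vector` fields in this mapping might be built as in the sketch below. It assumes the standard `neural` query clause from the neural-search plugin. The helper name `build_neural_query`, the query text, and the `k` value are hypothetical. `model_id` is left out by default on the assumption that the `cre_v2_search_model_pipeline` search pipeline supplies it; the pipeline definition is not shown in this report, so pass `model_id` explicitly if that does not hold.

```python
import json
from typing import Optional


def build_neural_query(field: str, query_text: str, k: int = 10,
                       model_id: Optional[str] = None) -> dict:
    """Build a search body that uses the `neural` query clause.

    `field` must be a knn_vector field, e.g. "embeddingCnVector1".
    `model_id` is optional here only under the assumption that a
    search pipeline fills in a default model.
    """
    clause = {"query_text": query_text, "k": k}
    if model_id is not None:
        clause["model_id"] = model_id
    return {"size": k, "query": {"neural": {field: clause}}}


if __name__ == "__main__":
    # Hypothetical query text; print the JSON body that would be sent
    # to the search endpoint of the `irp_cre_vec` alias.
    body = build_neural_query("embeddingCnVector1", "example query text", k=5)
    print(json.dumps(body, indent=2))
```

The resulting dictionary can be serialized and sent as the body of a search request against the `irp_cre_vec` alias with whichever client is in use.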
opensearch.yml:
Note: The OpenSearch version is 2.12.0. The /home/opensearch/opensearch-2.9.0 path in the configuration file is only the OpenSearch installation path and does not indicate the OpenSearch version.
# ======================== OpenSearch Configuration =========================
#
# NOTE: OpenSearch comes with reasonable defaults for most settings.
# Before you set out to tweak and tune the configuration, make sure you
# understand what are you trying to accomplish and the consequences.
#
# The primary way of configuring a node is via this file. This template lists
# the most important settings you may want to configure for a production cluster.
#
# Please consult the documentation for further information on configuration options:
# https://www.opensearch.org
#
# ---------------------------------- Cluster -----------------------------------
#
# Use a descriptive name for your cluster:
#
#cluster.name: my-application
cluster.name: opensearch-cluster
#
# ------------------------------------ Node ------------------------------------
#
# Use a descriptive name for the node:
#
#node.name: node-1
node.name: opensearch-cluster_manager
node.roles: [ cluster_manager, data, ingest ]
#
# Add custom attributes to the node:
#
#node.attr.rack: r1
node.attr.crmTag: "vecPosition"
node.attr.ctsvec: "ctsvec"
#
# ----------------------------------- Paths ------------------------------------
#
# Path to directory where to store the data (separate multiple locations by comma):
#
#path.data: /path/to/data
#
# Path to log files:
#
#path.logs: /path/to/logs
#
# Path to snapshot files:
path.repo: ["/mnt/sfs_turbo"]
# ----------------------------------- Memory -----------------------------------
#
# Lock the memory on startup:
#
#bootstrap.memory_lock: true
#
# Make sure that the heap size is set to about half the memory available
# on the system and that the owner of the process is allowed to use this
# limit.
#
# OpenSearch performs poorly when the system is swapping the memory.
#
# ---------------------------------- Network -----------------------------------
#
# Set the bind address to a specific IP (IPv4 or IPv6):
#
#network.host: 192.168.0.1
network.host: 0.0.0.0
#
# Set a custom port for HTTP:
#
http.port: 9200
#
# For more information, consult the network module documentation.
#
# --------------------------------- Discovery ----------------------------------
#
# Pass an initial list of hosts to perform discovery when this node is started:
# The default list of hosts is ["127.0.0.1", "[::1]"]
#
#discovery.seed_hosts: ["host1", "host2"]
discovery.seed_hosts: ["192.168.1.6", "192.168.1.7", "192.168.1.8", "192.168.1.9", "192.168.1.10", "192.168.1.11"]
#
# Bootstrap the cluster using an initial set of cluster-manager-eligible nodes:
#
#cluster.initial_cluster_manager_nodes: ["node-1", "node-2"]
cluster.initial_cluster_manager_nodes: ["opensearch-cluster_manager"]
#
# For more information, consult the discovery and cluster formation module documentation.
#
# ---------------------------------- Gateway -----------------------------------
#
# Block initial recovery after a full cluster restart until N nodes are started:
#
#gateway.recover_after_nodes: 3
#
# For more information, consult the gateway module documentation.
#
# ---------------------------------- Various -----------------------------------
#
# Require explicit names when deleting indices:
#
#action.destructive_requires_name: true
#
# ---------------------------------- Remote Store -----------------------------------
# Controls whether cluster imposes index creation only with remote store enabled
# cluster.remote_store.enabled: true
#
# Repository to use for segment upload while enforcing remote store for an index
# cluster.remote_store.repository: my-repo-1
#
# Controls whether cluster imposes index creation only with translog remote store enabled
# cluster.remote_store.translog.enabled: true
#
# Repository to use for translog upload while enforcing remote store for an index
# cluster.remote_store.translog.repository: my-repo-1
#
# ---------------------------------- Experimental Features -----------------------------------
#
# Gates the visibility of the experimental segment replication features until they are production ready.
#
#opensearch.experimental.feature.segment_replication_experimental.enabled: false
#
#
# Gates the visibility of the index setting that allows persisting data to remote store along with local disk.
# Once the feature is ready for production release, this feature flag can be removed.
#
#opensearch.experimental.feature.remote_store.enabled: false
#
#
# Gates the functionality of a new parameter to the snapshot restore API
# that allows for creation of a new index type that searches a snapshot
# directly in a remote repository without restoring all index data to disk
# ahead of time.
#
#opensearch.experimental.feature.searchable_snapshot.enabled: false
#
#
# Gates the functionality of enabling extensions to work with OpenSearch.
# This feature enables applications to extend features of OpenSearch outside of
# the core.
#
#opensearch.experimental.feature.extensions.enabled: false
#
#
# Gates the concurrent segment search feature. This feature enables concurrent segment search in a separate
# index searcher threadpool.
#
#opensearch.experimental.feature.concurrent_segment_search.enabled: false
######## Start OpenSearch Security Demo Configuration ########
# WARNING: revise all the lines below before you go into production
#plugins.security.ssl.transport.pemcert_filepath: esnode.pem
#plugins.security.ssl.transport.pemkey_filepath: esnode-key.pem
#plugins.security.ssl.transport.pemtrustedcas_filepath: root-ca.pem
#plugins.security.ssl.transport.enforce_hostname_verification: false
#plugins.security.ssl.http.enabled: false
#plugins.security.ssl.http.pemcert_filepath: esnode.pem
#plugins.security.ssl.http.pemkey_filepath: esnode-key.pem
#plugins.security.ssl.http.pemtrustedcas_filepath: root-ca.pem
#plugins.security.allow_unsafe_democertificates: true
#plugins.security.allow_default_init_securityindex: true
#plugins.security.authcz.admin_dn:
# - CN=,OU=client,O=client,L=test, C=de
plugins.security.disabled: false
plugins.security.ssl.transport.pemcert_filepath: /home/opensearch/opensearch-2.9.0/config/node1.pem
plugins.security.ssl.transport.pemkey_filepath: /home/opensearch/opensearch-2.9.0/config/node1-key.pem
plugins.security.ssl.transport.pemtrustedcas_filepath: /home/opensearch/opensearch-2.9.0/config/root-ca.pem
plugins.security.ssl.transport.enforce_hostname_verification: false
#plugins.security.ssl.http.enabled: true
plugins.security.ssl.http.enabled: false
plugins.security.ssl.http.pemcert_filepath: /home/opensearch/opensearch-2.9.0/config/node1.pem
plugins.security.ssl.http.pemkey_filepath: /home/opensearch/opensearch-2.9.0/config/node1-key.pem
plugins.security.ssl.http.pemtrustedcas_filepath: /home/opensearch/opensearch-2.9.0/config/root-ca.pem
plugins.security.allow_unsafe_democertificates: true
plugins.security.allow_default_init_securityindex: true
plugins.security.authcz.admin_dn:
- CN=admin,OU=Taa,O=Carrer,L=Beijing,ST=Beijing,C=CN
plugins.security.nodes_dn:
- CN=100.125.1.250,OU=Taa,O=Carrer,L=Beijing,ST=Beijing,C=CN
plugins.security.audit.type: internal_opensearch
plugins.security.enable_snapshot_restore_privilege: true
plugins.security.check_snapshot_restore_write_privileges: true
plugins.security.restapi.roles_enabled: ["all_access", "security_rest_api_access"]
plugins.security.system_indices.enabled: true
plugins.security.system_indices.indices: [".plugins-ml-config", ".plugins-ml-connector", ".plugins-ml-model-group", ".plugins-ml-model", ".plugins-ml-task", ".opendistro-alerting-config", ".opendistro-alerting-alert*", ".opendistro-anomaly-results*", ".opendistro-anomaly-detector*", ".opendistro-anomaly-checkpoints", ".opendistro-anomaly-detection-state", ".opendistro-reports-*", ".opensearch-notifications-*", ".opensearch-notebooks", ".opensearch-observability", ".ql-datasources", ".opendistro-asynchronous-search-response*", ".replication-metadata-store", ".opensearch-knn-models"]
node.max_local_storage_nodes: 3
######## End OpenSearch Security Demo Configuration ########
What is the bug? A clear and concise description of the bug.
In OpenSearch 2.12.0:
When the ML nodes use a GPU and concurrent neural search requests are sent, an ArrayIndexOutOfBoundsException is thrown.
I'm not sure whether this is a concurrency issue, but what I do know is that a single request succeeds and the exception occurs only with concurrent requests.
Number of concurrent requests: more than 5
Request:
Exception:
How can one reproduce the bug? Steps to reproduce the behavior:
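The description above says the exception appears only with more than 5 concurrent requests. A minimal sketch of a harness for that condition follows. It is not the reporter's actual reproduction: `run_concurrently` and `fake_send` are hypothetical names, and `fake_send` is a stand-in that simulates failures locally so the harness can be run without a cluster.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Tuple


def run_concurrently(send: Callable[[int], object], n_requests: int = 8,
                     max_workers: int = 8) -> Tuple[List[object], List[Exception]]:
    """Call send(i) for i in range(n_requests) on parallel threads.

    Returns (results of successful calls, exceptions raised by failed calls),
    both in submission order.
    """
    results: List[object] = []
    errors: List[Exception] = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(send, i) for i in range(n_requests)]
        for future in futures:
            try:
                results.append(future.result())
            except Exception as exc:  # collect failures instead of aborting
                errors.append(exc)
    return results, errors


def fake_send(i: int) -> dict:
    # Hypothetical stand-in for a real neural search call; it fails for
    # every third request purely to exercise the error-collection path.
    if i % 3 == 0:
        raise RuntimeError(f"simulated failure for request {i}")
    return {"request": i, "hits": []}


if __name__ == "__main__":
    ok, failed = run_concurrently(fake_send, n_requests=6, max_workers=6)
    print(f"{len(ok)} succeeded, {len(failed)} failed")
    for exc in failed:
        print("error:", exc)
```

To test against a real cluster, replace `fake_send` with a function that sends a neural search request to the `irp_cre_vec` alias, and keep `n_requests` above 5 to match the reported condition.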
What is the expected behavior? A clear and concise description of what you expected to happen.
Neural search should work correctly when a GPU is used, including under concurrent requests.
What is your host/environment?
Do you have any screenshots? If applicable, add screenshots to help explain your problem.
Do you have any additional context? Add any other context about the problem.