opensearch-project / ml-commons

ml-commons provides a set of common machine learning algorithms, e.g. k-means, or linear regression, to help developers build ML related features within OpenSearch.
Apache License 2.0

[BUG] unable to run knn search with neural query on OS 2.16.0 #2838

Open IanMenendez opened 3 months ago

IanMenendez commented 3 months ago

**What is the bug?**
Searching with a neural query brings down OS 2.16.0.

This happens with OS 2.16.0 (image `FROM opensearchproject/opensearch:2.16.0`) but not with 2.15.0 or lower.

**How to reproduce the bug?**

1. Upload and deploy an ML model.

We have a custom transformers model, but you can upload another one with:

POST /_plugins/_ml/models/_register
{
  "name": "huggingface/sentence-transformers/msmarco-distilbert-base-tas-b",
  "version": "1.0.2",
  "model_group_id": "Z1eQf4oB5Vm0Tdw8EIP2",
  "model_format": "TORCH_SCRIPT"
}

POST /_plugins/_ml/models/dxyObJEBGnTvwYNln7p8/_deploy

2. Create an index with the setting knn = true and an ingest pipeline for the model.

PUT _ingest/pipeline/test
{
  "description": "",
  "processors": [
    {
      "text_embedding": {
        "model_id": "dxyObJEBGnTvwYNln7p8",
        "field_map": {
          "text": "text_embedding"
        }
      }
    }
  ]
}

PUT /testing
{
  "settings": {
    "index.knn": true,
    "index.default_pipeline": "test"
  },
  "mappings": {
    "properties": {
      "text": {
        "type": "text"
      },
      "text_embedding": {
        "type": "knn_vector",
        "dimension": 2
      }
    }
  }
}

3. Ingest some docs.

POST /testing/_doc
{
  "text": "testing knn"
}

4. Search with a neural query.

POST /testing/_search
{
  "query": {
    "neural": {
      "text_embedding": {
        "model_id": "dxyObJEBGnTvwYNln7p8",
        "query_text": "testing_neural"
      }
    }
  }
}

5. The cluster goes down with:

opensearch-node2 | fatal error in thread [opensearch[opensearch-node2][refresh][T#3]], exiting
opensearch-node2 | java.lang.UnsatisfiedLinkError: /usr/share/opensearch/plugins/opensearch-knn/lib/libopensearchknn_common.so: /usr/share/opensearch/data/ml_cache/pytorch/1.13.1-cpu-precxx11-linux-x86_64/libstdc++.so.6: version `GLIBCXX_3.4.21' not found (required by /usr/share/opensearch/plugins/opensearch-knn/lib/libopensearchknn_util.so)
opensearch-node2 |     at java.base/jdk.internal.loader.NativeLibraries.load(Native Method)
opensearch-node2 |     at java.base/jdk.internal.loader.NativeLibraries$NativeLibraryImpl.open(NativeLibraries.java:331)
opensearch-node2 |     at java.base/jdk.internal.loader.NativeLibraries.loadLibrary(NativeLibraries.java:197)
opensearch-node2 |     at java.base/jdk.internal.loader.NativeLibraries.loadLibrary(NativeLibraries.java:139)
opensearch-node2 |     at java.base/jdk.internal.loader.NativeLibraries.findFromPaths(NativeLibraries.java:259)
opensearch-node2 |     at java.base/jdk.internal.loader.NativeLibraries.loadLibrary(NativeLibraries.java:251)
opensearch-node2 |     at java.base/java.lang.ClassLoader.loadLibrary(ClassLoader.java:2451)
opensearch-node2 |     at java.base/java.lang.Runtime.loadLibrary0(Runtime.java:916)
opensearch-node2 |     at java.base/java.lang.System.loadLibrary(System.java:2063)
opensearch-node2 |     at org.opensearch.knn.jni.JNICommons.lambda$static$0(JNICommons.java:26)
opensearch-node2 |     at java.base/java.security.AccessController.doPrivileged(AccessController.java:319)
opensearch-node2 |     at org.opensearch.knn.jni.JNICommons.<clinit>(JNICommons.java:25)
opensearch-node2 |     at org.opensearch.knn.index.codec.transfer.VectorTransferFloat.transfer(VectorTransferFloat.java:68)
opensearch-node2 |     at org.opensearch.knn.index.codec.transfer.VectorTransferFloat.close(VectorTransferFloat.java:59)
opensearch-node2 |     at org.opensearch.knn.index.codec.util.KNNCodecUtil.getPair(KNNCodecUtil.java:61)
opensearch-node2 |     at org.opensearch.knn.index.codec.KNN80Codec.KNN80DocValuesConsumer.addKNNBinaryField(KNN80DocValuesConsumer.java:147)
opensearch-node2 |     at org.opensearch.knn.index.codec.KNN80Codec.KNN80DocValuesConsumer.addBinaryField(KNN80DocValuesConsumer.java:87)
opensearch-node2 |     at org.apache.lucene.index.BinaryDocValuesWriter.flush(BinaryDocValuesWriter.java:132)
opensearch-node2 |     at org.apache.lucene.index.IndexingChain.writeDocValues(IndexingChain.java:424)
opensearch-node2 |     at org.apache.lucene.index.IndexingChain.flush(IndexingChain.java:282)
opensearch-node2 |     at org.apache.lucene.index.DocumentsWriterPerThread.flush(DocumentsWriterPerThread.java:445)
opensearch-node2 |     at org.apache.lucene.index.DocumentsWriter.doFlush(DocumentsWriter.java:496)
opensearch-node2 |     at org.apache.lucene.index.DocumentsWriter.maybeFlush(DocumentsWriter.java:450)
opensearch-node2 |     at org.apache.lucene.index.DocumentsWriter.flushAllThreads(DocumentsWriter.java:643)
opensearch-node2 |     at org.apache.lucene.index.IndexWriter.getReader(IndexWriter.java:578)
opensearch-node2 |     at org.apache.lucene.index.StandardDirectoryReader.doOpenFromWriter(StandardDirectoryReader.java:381)
opensearch-node2 |     at org.apache.lucene.index.StandardDirectoryReader.doOpenIfChanged(StandardDirectoryReader.java:355)
opensearch-node2 |     at org.apache.lucene.index.StandardDirectoryReader.doOpenIfChanged(StandardDirectoryReader.java:345)
opensearch-node2 |     at org.apache.lucene.index.FilterDirectoryReader.doOpenIfChanged(FilterDirectoryReader.java:112)
opensearch-node2 |     at org.apache.lucene.index.DirectoryReader.openIfChanged(DirectoryReader.java:170)
opensearch-node2 |     at org.opensearch.index.engine.OpenSearchReaderManager.refreshIfNeeded(OpenSearchReaderManager.java:72)
opensearch-node2 |     at org.opensearch.index.engine.OpenSearchReaderManager.refreshIfNeeded(OpenSearchReaderManager.java:52)
opensearch-node2 |     at org.apache.lucene.search.ReferenceManager.doMaybeRefresh(ReferenceManager.java:167)
opensearch-node2 |     at org.apache.lucene.search.ReferenceManager.maybeRefreshBlocking(ReferenceManager.java:240)
opensearch-node2 |     at org.opensearch.index.engine.InternalEngine$ExternalReaderManager.refreshIfNeeded(InternalEngine.java:433)
opensearch-node2 |     at org.opensearch.index.engine.InternalEngine$ExternalReaderManager.refreshIfNeeded(InternalEngine.java:413)
opensearch-node2 |     at org.apache.lucene.search.ReferenceManager.doMaybeRefresh(ReferenceManager.java:167)
opensearch-node2 |     at org.apache.lucene.search.ReferenceManager.maybeRefresh(ReferenceManager.java:213)
opensearch-node2 |     at org.opensearch.index.engine.InternalEngine.refresh(InternalEngine.java:1865)
opensearch-node2 |     at org.opensearch.index.engine.InternalEngine.maybeRefresh(InternalEngine.java:1844)
opensearch-node2 |     at org.opensearch.index.shard.IndexShard.scheduledRefresh(IndexShard.java:4648)
opensearch-node2 |     at org.opensearch.index.IndexService.maybeRefreshEngine(IndexService.java:1157)
opensearch-node2 |     at org.opensearch.index.IndexService$AsyncRefreshTask.runInternal(IndexService.java:1301)
opensearch-node2 |     at org.opensearch.common.util.concurrent.AbstractAsyncTask.run(AbstractAsyncTask.java:159)
opensearch-node2 |     at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:882)
opensearch-node2 |     at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
opensearch-node2 |     at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
opensearch-node2 |     at java.base/java.lang.Thread.run(Thread.java:1583)



**What is the expected behavior?**
The cluster should not crash, and the search should run.

**What is your host/environment?**
Docker with `FROM opensearchproject/opensearch:2.16.0` Image

NOTE: This issue only happens the first time you run a neural query. When a neural query is run again some time after the first one, the cluster seems to find the libraries. I do not know what changed between OS 2.15 and 2.16.

heemin32 commented 3 months ago

Does it happen when you run a knn query directly as well, or only when you query the knn field through the neural plugin?

naveentatikonda commented 3 months ago

This issue is coming from ml-commons through pytorch/djl. Similar issue - https://github.com/opensearch-project/ml-commons/issues/2563

@IanMenendez Can you please share the configuration and mapping details of the models and indices to replicate the issue.

@ylwu-amzn Can you please take a look into the above issue and confirm if it is coming from ml-commons. Thanks!

IanMenendez commented 3 months ago

> Does it happen when you run a knn query directly as well, or only when you query the knn field through the neural plugin?

I tested this, and it only happens with the neural query, so I think it's better to move this to the neural-search or ml-commons repo. Can you do this?

IanMenendez commented 3 months ago

> This issue is coming from ml-commons through pytorch/djl. Similar issue - opensearch-project/ml-commons#2563
>
> @IanMenendez Can you please share the configuration and mapping details of the models and indices to replicate the issue.
>
> @ylwu-amzn Can you please take a look into the above issue and confirm if it is coming from ml-commons. Thanks!

Yes, I updated the issue description.

IanMenendez commented 3 months ago

I figured out that the issue resolves itself if you restart the cluster. For some reason, the libraries then seem to be there.

The problem is that we use neural query in our testing pipeline and we cannot just restart a cluster during the testing pipeline.

navneet1v commented 3 months ago

> This issue is coming from ml-commons through pytorch/djl. Similar issue - opensearch-project/ml-commons#2563
>
> @IanMenendez Can you please share the configuration and mapping details of the models and indices to replicate the issue.
>
> @ylwu-amzn Can you please take a look into the above issue and confirm if it is coming from ml-commons. Thanks!

@IanMenendez as pointed out by @naveentatikonda, the issue seems to be coming from ML Commons, so ideally it should be moved there. But we don't have permission to transfer the issue to ML Commons. @opensearch-project/admin can you move this issue to the ML Commons repo?

Zhangxunmt commented 3 months ago

What operating system did you use to produce this error? @IanMenendez

IanMenendez commented 3 months ago

> What operating system did you use to produce this error? @IanMenendez

@Zhangxunmt I replicated the issue in the OS 2.16 docker container `FROM opensearchproject/opensearch:2.16.0`.

But it's also happening on my local machine with Linux Mint 21.1.

ylwu-amzn commented 3 months ago

From the error ``opensearch-node2 | java.lang.UnsatisfiedLinkError: /usr/share/opensearch/plugins/opensearch-knn/lib/libopensearchknn_common.so: /usr/share/opensearch/data/ml_cache/pytorch/1.13.1-cpu-precxx11-linux-x86_64/libstdc++.so.6: version `GLIBCXX_3.4.21' not found (required by /usr/share/opensearch/plugins/opensearch-knn/lib/libopensearchknn_util.so)``

It's the KNN plugin that can't load `libopensearchknn_common.so`.

@IanMenendez, can you run the predict API directly?

POST _plugins/_ml/models/<your_model_id>/_predict
{
  "text_docs": ["hello"]
}

If this works, then the issue is not from ml-commons, and the KNN team can help take a look.
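For triage, it can help to pull apart the loader message quoted above: it names the library that was actually resolved (here the `libstdc++.so.6` under `ml_cache/pytorch`), the symbol version that library lacks, and the library that needs it. The following is only a minimal sketch of that parsing; the class `LinkErrorParser` and its field names are hypothetical and not part of OpenSearch, ml-commons, or k-NN, and the regular expression assumes the message shape shown in this issue.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical triage helper: splits a "version ... not found" loader message into its parts. */
final class LinkErrorParser {

    // Assumed message shape, taken from the log in this issue:
    //   <resolved lib>: version `<symbol version>' not found (required by <requiring lib>)
    private static final Pattern MISSING_VERSION = Pattern.compile(
            "(\\S+): version [`']([^`']+)' not found \\(required by ([^)]+)\\)");

    /** The three pieces of information carried by the loader message. */
    static final class MissingVersion {
        final String resolvedLibrary;   // the shared library the loader picked up
        final String symbolVersion;     // the version tag that library does not provide
        final String requiringLibrary;  // the library that asked for that version

        MissingVersion(String resolvedLibrary, String symbolVersion, String requiringLibrary) {
            this.resolvedLibrary = resolvedLibrary;
            this.symbolVersion = symbolVersion;
            this.requiringLibrary = requiringLibrary;
        }
    }

    /** Returns the parsed parts, or an empty Optional if the message has a different shape. */
    static Optional<MissingVersion> parse(String message) {
        Matcher m = MISSING_VERSION.matcher(message);
        if (!m.find()) {
            return Optional.empty();
        }
        return Optional.of(new MissingVersion(m.group(1), m.group(2), m.group(3)));
    }

    public static void main(String[] args) {
        // Message copied from the stack trace in this issue.
        String message = "/usr/share/opensearch/plugins/opensearch-knn/lib/libopensearchknn_common.so: "
                + "/usr/share/opensearch/data/ml_cache/pytorch/1.13.1-cpu-precxx11-linux-x86_64/libstdc++.so.6: "
                + "version `GLIBCXX_3.4.21' not found "
                + "(required by /usr/share/opensearch/plugins/opensearch-knn/lib/libopensearchknn_util.so)";
        parse(message).ifPresent(mv -> {
            System.out.println("resolved library : " + mv.resolvedLibrary);
            System.out.println("missing version  : " + mv.symbolVersion);
            System.out.println("required by      : " + mv.requiringLibrary);
        });
    }
}
```

Read this way, the message says that a k-NN library requires `GLIBCXX_3.4.21`, while the `libstdc++.so.6` that was resolved lives in the PyTorch cache directory and does not provide it.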

IanMenendez commented 2 months ago

@ylwu-amzn I can confirm that using the predict API directly does not crash the cluster.

Using neural query crashes the cluster

ylwu-amzn commented 2 months ago

@navneet1v, as the predict API works correctly, I think the issue is from KNN. The log also shows it's related to KNN: `java.lang.UnsatisfiedLinkError: /usr/share/opensearch/plugins/opensearch-knn/lib/libopensearchknn_common.so`. Can the KNN team help take a look?

I have no permission on the k-NN plugin repo, so I can't transfer it there.

yuye-aws commented 2 months ago

I'm hitting the same issue as well. Can you take a look, @navneet1v @martin-gaievski?

yuye-aws commented 2 months ago

@IanMenendez Maybe you can try this workaround: https://forum.opensearch.org/t/issue-with-opensearch-knn/12633/3

yuye-aws commented 2 months ago

Here is my error log

=== Standard error of node `node{::integTest-0}` ===
»   ↓ last 40 non error or warning messages from /Users/yuyezhu/Desktop/Code/neural-search/build/testclusters/integTest-0/logs/opensearch.stderr.log ↓
» WARNING: Using incubator modules: jdk.incubator.vector
»  WARNING: A terminally deprecated method in java.lang.System has been called
»  WARNING: System::setSecurityManager has been called by org.opensearch.bootstrap.OpenSearch (file:/Users/yuyezhu/Desktop/Code/neural-search/build/testclusters/integTest-0/distro/3.0.0-ARCHIVE/lib/opensearch-3.0.0-SNAPSHOT.jar)
»  WARNING: Please consider reporting this to the maintainers of org.opensearch.bootstrap.OpenSearch
»  WARNING: System::setSecurityManager will be removed in a future release
»  WARNING: A terminally deprecated method in java.lang.System has been called
»  WARNING: System::setSecurityManager has been called by org.opensearch.bootstrap.Security (file:/Users/yuyezhu/Desktop/Code/neural-search/build/testclusters/integTest-0/distro/3.0.0-ARCHIVE/lib/opensearch-3.0.0-SNAPSHOT.jar)
»  WARNING: Please consider reporting this to the maintainers of org.opensearch.bootstrap.Security
»  WARNING: System::setSecurityManager will be removed in a future release
»  fatal error in thread [opensearch[integTest-0][refresh][T#3]], exiting
»  java.lang.UnsatisfiedLinkError: no opensearchknn_common in java.library.path: /Users/yuyezhu/Library/Java/Extensions:/Library/Java/Extensions:/Network/Library/Java/Extensions:/System/Library/Java/Extensions:/usr/lib/java:.
»       at java.base/java.lang.ClassLoader.loadLibrary(ClassLoader.java:2458)
»       at java.base/java.lang.Runtime.loadLibrary0(Runtime.java:916)
»       at java.base/java.lang.System.loadLibrary(System.java:2063)
»       at org.opensearch.knn.jni.JNICommons.lambda$static$0(JNICommons.java:26)
»       at java.base/java.security.AccessController.doPrivileged(AccessController.java:319)
»       at org.opensearch.knn.jni.JNICommons.<clinit>(JNICommons.java:25)
»       at org.opensearch.knn.index.codec.transfer.OffHeapFloatVectorTransfer.transfer(OffHeapFloatVectorTransfer.java:24)
»       at org.opensearch.knn.index.codec.transfer.OffHeapVectorTransfer.transfer(OffHeapVectorTransfer.java:57)
»       at org.opensearch.knn.index.codec.nativeindex.DefaultIndexBuildStrategy.buildAndWriteIndex(DefaultIndexBuildStrategy.java:70)
»       at org.opensearch.knn.index.codec.nativeindex.NativeIndexWriter.buildAndWriteIndex(NativeIndexWriter.java:154)
»       at org.opensearch.knn.index.codec.nativeindex.NativeIndexWriter.flushIndex(NativeIndexWriter.java:111)
»       at org.opensearch.knn.index.codec.KNN990Codec.NativeEngines990KnnVectorsWriter.trainAndIndex(NativeEngines990KnnVectorsWriter.java:265)
»       at org.opensearch.knn.index.codec.KNN990Codec.NativeEngines990KnnVectorsWriter.flush(NativeEngines990KnnVectorsWriter.java:87)
»       at org.apache.lucene.codecs.perfield.PerFieldKnnVectorsFormat$FieldsWriter.flush(PerFieldKnnVectorsFormat.java:115)
»       at org.apache.lucene.index.VectorValuesConsumer.flush(VectorValuesConsumer.java:76)
»       at org.apache.lucene.index.IndexingChain.flush(IndexingChain.java:296)
»       at org.apache.lucene.index.DocumentsWriterPerThread.flush(DocumentsWriterPerThread.java:445)
»       at org.apache.lucene.index.DocumentsWriter.doFlush(DocumentsWriter.java:496)
»       at org.apache.lucene.index.DocumentsWriter.maybeFlush(DocumentsWriter.java:450)
»       at org.apache.lucene.index.DocumentsWriter.flushAllThreads(DocumentsWriter.java:643)
»       at org.apache.lucene.index.IndexWriter.getReader(IndexWriter.java:578)
»       at org.apache.lucene.index.StandardDirectoryReader.doOpenFromWriter(StandardDirectoryReader.java:381)
»       at org.apache.lucene.index.StandardDirectoryReader.doOpenIfChanged(StandardDirectoryReader.java:355)
»       at org.apache.lucene.index.StandardDirectoryReader.doOpenIfChanged(StandardDirectoryReader.java:345)
»       at org.apache.lucene.index.FilterDirectoryReader.doOpenIfChanged(FilterDirectoryReader.java:112)
»       at org.apache.lucene.index.DirectoryReader.openIfChanged(DirectoryReader.java:170)
»       at org.opensearch.index.engine.OpenSearchReaderManager.refreshIfNeeded(OpenSearchReaderManager.java:72)
»       at org.opensearch.index.engine.OpenSearchReaderManager.refreshIfNeeded(OpenSearchReaderManager.java:52)
»       at org.apache.lucene.search.ReferenceManager.doMaybeRefresh(ReferenceManager.java:167)
»       at org.apache.lucene.search.ReferenceManager.maybeRefreshBlocking(ReferenceManager.java:240)
»       at org.opensearch.index.engine.InternalEngine$ExternalReaderManager.refreshIfNeeded(InternalEngine.java:433)
»       at org.opensearch.index.engine.InternalEngine$ExternalReaderManager.refreshIfNeeded(InternalEngine.java:413)
»       at org.apache.lucene.search.ReferenceManager.doMaybeRefresh(ReferenceManager.java:167)
»       at org.apache.lucene.search.ReferenceManager.maybeRefresh(ReferenceManager.java:213)
»       at org.opensearch.index.engine.InternalEngine.refresh(InternalEngine.java:1774)
»       at org.opensearch.index.engine.InternalEngine.maybeRefresh(InternalEngine.java:1753)
»       at org.opensearch.index.shard.IndexShard.scheduledRefresh(IndexShard.java:4633)
»       at org.opensearch.index.IndexService.maybeRefreshEngine(IndexService.java:1179)
»       at org.opensearch.index.IndexService$AsyncRefreshTask.runInternal(IndexService.java:1323)
»       at org.opensearch.common.util.concurrent.AbstractAsyncTask.run(AbstractAsyncTask.java:159)
»       at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:923)
»       at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
»       at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
»       at java.base/java.lang.Thread.run(Thread.java:1583)

martin-gaievski commented 2 months ago

@yuye-aws this error looks specific to the native knn engines. Can you check which engine you're using? If it's one of the native ones (faiss or nmslib), can you try defining lucene as your knn engine?
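For anyone debugging the `no opensearchknn_common in java.library.path` message in the log above, a quick check is whether the platform-specific library file exists in any directory on that path. The sketch below is a hypothetical standalone helper (`LibraryPathProbe` is not part of any OpenSearch plugin). It relies only on the standard `System.mapLibraryName` and `java.library.path` behavior, and it only confirms that a file is present, not that it links successfully.

```java
import java.io.File;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical diagnostic: looks for a native library file on a library search path. */
final class LibraryPathProbe {

    /**
     * Returns every file on searchPath whose name matches the platform-specific
     * file name for libName (for example "libfoo.so" for "foo" on Linux).
     */
    static List<Path> locate(String libName, String searchPath) {
        String fileName = System.mapLibraryName(libName);
        List<Path> hits = new ArrayList<>();
        for (String dir : searchPath.split(File.pathSeparator)) {
            if (dir.isEmpty()) {
                continue; // skip empty entries such as a trailing separator
            }
            Path candidate = Paths.get(dir, fileName);
            if (Files.isRegularFile(candidate)) {
                hits.add(candidate);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Library name taken from the error message; pass another name as the first argument.
        String libName = args.length > 0 ? args[0] : "opensearchknn_common";
        String searchPath = System.getProperty("java.library.path", "");
        List<Path> hits = locate(libName, searchPath);
        if (hits.isEmpty()) {
            System.out.println(System.mapLibraryName(libName) + " not found on: " + searchPath);
        } else {
            hits.forEach(p -> System.out.println("found: " + p));
        }
    }
}
```

Running it with the same JVM options as the failing node shows whether the native k-NN library is reachable at all, which helps separate a missing file from a library that is present but fails to link.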

ylwu-amzn commented 2 months ago

@martin-gaievski, can you transfer this issue to the k-NN plugin repo? I think changing to lucene can work, but we should also support the other two native engines. I suggest the K-NN team try this; it seems very easy to reproduce.

yuye-aws commented 2 months ago

> @yuye-aws this error looks specific to the native knn engines. Can you check which engine you're using? If it's one of the native ones (faiss or nmslib), can you try defining lucene as your knn engine?

My problem was resolved with the index mapping below. Thank you @martin-gaievski! Do you think the original problem in this issue can be resolved in a similar manner?

  "mappings": {
    "properties": {
      "text_embedding": {
        "type": "nested",
        "properties": {
          "knn": {
            "type": "knn_vector",
            "dimension": 768,
            "method": {
              "name": "hnsw",
              "engine": "lucene"
            }
          }
        }
      }
    }
  }

IanMenendez commented 2 months ago

Changing the index mapping to use lucene, as suggested, worked. But I still think OS should not crash the cluster if libstdc++.so.6 is not found. It should catch the exception and throw an error.
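As a rough illustration of the "catch the exception and throw an error" idea, the sketch below wraps native library loading so that an `UnsatisfiedLinkError` becomes a checked exception the caller has to handle. This is not the k-NN plugin's code: `GuardedNativeLoader` and `NativeLibraryUnavailableException` are hypothetical names. Note that in the traces above the load happens inside the static initializer of `JNICommons` on a refresh thread, so a real fix would also need to decide where such an exception is surfaced.

```java
import java.util.function.Consumer;

/** Hypothetical wrapper: turns a native-library load failure into a checked exception. */
final class GuardedNativeLoader {

    /** Checked, so callers must decide how to report a missing or incompatible library. */
    static final class NativeLibraryUnavailableException extends Exception {
        NativeLibraryUnavailableException(String libName, Throwable cause) {
            super("Native library '" + libName + "' could not be loaded: " + cause.getMessage(), cause);
        }
    }

    /** Loads libName with the given loader; the loader parameter exists so tests can inject a fake. */
    static void load(String libName, Consumer<String> loader) throws NativeLibraryUnavailableException {
        try {
            loader.accept(libName);
        } catch (UnsatisfiedLinkError e) {
            // Covers both "library not found" and "library found but failed to link".
            throw new NativeLibraryUnavailableException(libName, e);
        }
    }

    /** Convenience overload that uses the standard System.loadLibrary. */
    static void load(String libName) throws NativeLibraryUnavailableException {
        load(libName, System::loadLibrary);
    }

    public static void main(String[] args) {
        try {
            // Deliberately bogus name, used only to trigger the failure path.
            load("definitely_not_a_real_library_xyz");
            System.out.println("library loaded");
        } catch (NativeLibraryUnavailableException e) {
            // A request handler could map this to an error response instead of exiting the JVM.
            System.out.println("request rejected: " + e.getMessage());
        }
    }
}
```

The design choice here is to fail the individual operation with a descriptive error while leaving the process running; whether that is safe for a background refresh, as opposed to a user request, is a separate question for the plugin maintainers.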