0ctopus13prime closed this 1 month ago.
From a memory standpoint, no major changes could be observed in the benchmark.
The numbers below were measured with time curl -X GET http://localhost:9200/_plugins/_knn/warmup/target_index.
I ran two experiments loading a FAISS vector index with different buffer sizes.
sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'
Unlike FAISS, it took almost 81% more time when loading a system-cached file. Of course, this case will be rare, as KNN is expected to load a vector index whenever a new segment file is baked, and the newly baked segment file likely is not system cached. Increasing the buffer size didn't help. We need to find a better way to transfer data from JNI to Java.
Index size : 30G
Machine : c5ad.12xlarge
JVM Args : -Xms63G -Xmx63G
Data : random-s-128-10m-euclidean.hdf5
Metric | Task | Baseline-Value | Candidate-Value | Change | Unit |
---|---|---|---|---|---|
Cumulative indexing time of primary shards | | 33.7147 | 34.2115 | 1.47% | min |
Min cumulative indexing time across primary shards | | 0.000133333 | 0.00015 | 12.50% | min |
Median cumulative indexing time across primary shards | | 16.8573 | 17.1057 | 1.47% | min |
Max cumulative indexing time across primary shards | | 33.7146 | 34.2113 | 1.47% | min |
Cumulative indexing throttle time of primary shards | | 0 | 0 | 0.00% | min |
Min cumulative indexing throttle time across primary shards | | 0 | 0 | 0.00% | min |
Median cumulative indexing throttle time across primary shards | | 0 | 0 | 0.00% | min |
Max cumulative indexing throttle time across primary shards | | 0 | 0 | 0.00% | min |
Cumulative merge time of primary shards | | 282.601 | 282.996 | 0.14% | min |
Cumulative merge count of primary shards | | 125 | 122 | 2.40% | |
Min cumulative merge time across primary shards | | 0 | 0 | 0.00% | min |
Median cumulative merge time across primary shards | | 141.3 | 141.498 | 0.14% | min |
Max cumulative merge time across primary shards | | 282.601 | 282.996 | 0.14% | min |
Cumulative merge throttle time of primary shards | | 1.04818 | 1.61307 | 53.89% | min |
Min cumulative merge throttle time across primary shards | | 0 | 0 | 0.00% | min |
Median cumulative merge throttle time across primary shards | | 0.524092 | 0.806533 | 53.89% | min |
Max cumulative merge throttle time across primary shards | | 1.04818 | 1.61307 | 53.89% | min |
Cumulative refresh time of primary shards | | 1.1042 | 1.14667 | 3.85% | min |
Cumulative refresh count of primary shards | | 88 | 85 | 3.41% | |
Min cumulative refresh time across primary shards | | 0.000333333 | 0.000383333 | 15.00% | min |
Median cumulative refresh time across primary shards | | 0.5521 | 0.573333 | 3.85% | min |
Max cumulative refresh time across primary shards | | 1.10387 | 1.14628 | 3.84% | min |
Cumulative flush time of primary shards | | 11.5126 | 10.9446 | 4.93% | min |
Cumulative flush count of primary shards | | 56 | 53 | 5.36% | |
Min cumulative flush time across primary shards | | 0 | 0 | 0.00% | min |
Median cumulative flush time across primary shards | | 5.75628 | 5.47229 | 4.93% | min |
Max cumulative flush time across primary shards | | 11.5126 | 10.9446 | 4.93% | min |
Total Young Gen GC time | | 0.338 | 0.342 | 1.18% | s |
Total Young Gen GC count | | 19 | 19 | 0.00% | |
Total Old Gen GC time | | 0 | 0 | 0.00% | s |
Total Old Gen GC count | | 0 | 0 | 0.00% | |
Store size | | 29.8586 | 29.8584 | 0.00% | GB |
Translog size | | 5.83E-07 | 5.83E-07 | 0.00% | GB |
Heap used for segments | | 0 | 0 | 0.00% | MB |
Heap used for doc values | | 0 | 0 | 0.00% | MB |
Heap used for terms | | 0 | 0 | 0.00% | MB |
Heap used for norms | | 0 | 0 | 0.00% | MB |
Heap used for points | | 0 | 0 | 0.00% | MB |
Heap used for stored fields | | 0 | 0 | 0.00% | MB |
Segment count | | 2 | 2 | 0.00% | |
Min Throughput | custom-vector-bulk | 5390.31 | 5466.23 | 1.41% | docs/s |
Mean Throughput | custom-vector-bulk | 11041.1 | 10866 | 1.59% | docs/s |
Median Throughput | custom-vector-bulk | 10377.9 | 10065.6 | 3.01% | docs/s |
Max Throughput | custom-vector-bulk | 20105.1 | 19337.8 | 3.82% | docs/s |
50th percentile latency | custom-vector-bulk | 78.4349 | 75.8132 | 3.34% | ms |
90th percentile latency | custom-vector-bulk | 165.667 | 158.129 | 4.55% | ms |
99th percentile latency | custom-vector-bulk | 331.043 | 318.269 | 3.86% | ms |
99.9th percentile latency | custom-vector-bulk | 1486.47 | 1487.08 | 0.04% | ms |
99.99th percentile latency | custom-vector-bulk | 2300.48 | 2598.04 | 12.93% | ms |
100th percentile latency | custom-vector-bulk | 5049.72 | 4535.08 | 10.19% | ms |
50th percentile service time | custom-vector-bulk | 78.4349 | 75.8132 | 3.34% | ms |
90th percentile service time | custom-vector-bulk | 165.667 | 158.129 | 4.55% | ms |
99th percentile service time | custom-vector-bulk | 331.043 | 318.269 | 3.86% | ms |
99.9th percentile service time | custom-vector-bulk | 1486.47 | 1487.08 | 0.04% | ms |
99.99th percentile service time | custom-vector-bulk | 2300.48 | 2598.04 | 12.93% | ms |
100th percentile service time | custom-vector-bulk | 5049.72 | 4535.08 | 10.19% | ms |
error rate | custom-vector-bulk | 0 | 0 | 0.00% | % |
Min Throughput | force-merge-segments | 0 | 0 | 0.00% | ops/s |
Mean Throughput | force-merge-segments | 0 | 0 | 0.00% | ops/s |
Median Throughput | force-merge-segments | 0 | 0 | 0.00% | ops/s |
Max Throughput | force-merge-segments | 0 | 0 | 0.00% | ops/s |
100th percentile latency | force-merge-segments | 1.16E+07 | 1.13E+07 | 2.84% | ms |
100th percentile service time | force-merge-segments | 1.16E+07 | 1.13E+07 | 2.84% | ms |
error rate | force-merge-segments | 0 | 0 | 0.00% | % |
Min Throughput | warmup-indices | 0.24 | 0.14 | 41.67% | ops/s |
Mean Throughput | warmup-indices | 0.24 | 0.14 | 41.67% | ops/s |
Median Throughput | warmup-indices | 0.24 | 0.14 | 41.67% | ops/s |
Max Throughput | warmup-indices | 0.24 | 0.14 | 41.67% | ops/s |
100th percentile latency | warmup-indices | 4162.87 | 7127.78 | 71.22% | ms |
100th percentile service time | warmup-indices | 4162.87 | 7127.78 | 71.22% | ms |
error rate | warmup-indices | 0 | 0 | 0.00% | % |
Min Throughput | prod-queries | 0.66 | 0.64 | 3.03% | ops/s |
Mean Throughput | prod-queries | 0.66 | 0.64 | 3.03% | ops/s |
Median Throughput | prod-queries | 0.66 | 0.64 | 3.03% | ops/s |
Max Throughput | prod-queries | 0.66 | 0.64 | 3.03% | ops/s |
50th percentile latency | prod-queries | 3.5832 | 3.83349 | 6.99% | ms |
90th percentile latency | prod-queries | 4.75317 | 4.64172 | 2.34% | ms |
99th percentile latency | prod-queries | 22.1628 | 23.8439 | 7.59% | ms |
100th percentile latency | prod-queries | 1508.36 | 1571.86 | 4.21% | ms |
50th percentile service time | prod-queries | 3.5832 | 3.83349 | 6.99% | ms |
90th percentile service time | prod-queries | 4.75317 | 4.64172 | 2.34% | ms |
99th percentile service time | prod-queries | 22.1628 | 23.8439 | 7.59% | ms |
100th percentile service time | prod-queries | 1508.36 | 1571.86 | 4.21% | ms |
error rate | prod-queries | 0 | 0 | 0.00% | % |
Mean recall@k | prod-queries | 0.42 | 0.43 | 2.38% | |
Mean recall@1 | prod-queries | 0.6 | 0.63 | 5.00% | |
This PR is the first commit making the loading layer in native engines available.
You might want to update it to say "loading layer for nmslib".
Will hold the merging until we root-cause the big gap in warmup time. Compared to FAISS, an 84% increase is a bit worrisome.
Planning to continue with the two tuning plans below.
Expecting to reduce latency by 23.16%.
I hardly think we can make the other parts (e.g. JNIEnv_::CallIntMethod and IndexInput) much faster.
Also, it would be worth trying a bigger buffer size and seeing how it goes.
__memmove_avx_unaligned_erms : This indicates that the buffer memory is not properly aligned for the internal memcpy. We can allocate a 64-byte-aligned buffer and retry. 64-byte alignment works for both AVX2 and AVX-512.
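A minimal sketch of such an allocation, assuming C++17's std::aligned_alloc is available (the helper name is illustrative, not the plugin's actual code):

#include <cstdlib>

// 64-byte alignment satisfies both AVX2 (32-byte) and AVX-512 (64-byte) loads/stores.
constexpr std::size_t kBufferAlignment = 64;

// Allocate the read buffer on a 64-byte boundary so the internal memcpy can take
// the aligned move path instead of __memmove_avx_unaligned_erms.
char* allocateAlignedBuffer(std::size_t size) {
    // std::aligned_alloc requires the size to be a multiple of the alignment.
    const std::size_t rounded =
        (size + kBufferAlignment - 1) / kBufferAlignment * kBufferAlignment;
    return static_cast<char*>(std::aligned_alloc(kBufferAlignment, rounded));
}
// The caller releases the buffer with std::free().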
Remove critical JNI calls (JNIEnv_::CallIntMethod, jni_ReleasePrimitiveArrayCritical) entirely.
We can make the Java side own native memory via a direct ByteBuffer, then acquire the pointer in JNI just once via GetDirectBufferAddress.
TODO : Can we allocate an aligned memory layout?
In Java
ByteBuffer nativeBuffer = ByteBuffer.allocateDirect(size);
In C++
// Get the pointer to the native memory from the buffer once, at the beginning.
void* nativePtr = env->GetDirectBufferAddress(buffer);
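Note that ByteBuffer.allocateDirect gives no 64-byte alignment guarantee, so on the C++ side we could acquire the pointer once and check its alignment before choosing the aligned copy path. A rough sketch under that assumption (the function name is illustrative, not actual plugin code):

#include <cstdint>
#include <jni.h>

// Acquire the direct buffer's native address a single time and record whether it
// happens to land on a 64-byte boundary. If it does not, fall back to the unaligned
// path, or have the Java side over-allocate and start writes at an aligned offset.
void* acquireDirectBufferAddress(JNIEnv* env, jobject directBuffer, bool* isAligned64) {
    void* nativePtr = env->GetDirectBufferAddress(directBuffer);
    *isAligned64 = (reinterpret_cast<std::uintptr_t>(nativePtr) % 64) == 0;
    return nativePtr;
}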
And the newly baked segment file likely is not system cached.
Won't the page cache typically be write-through? In which case, if the graph is created and written on the same node it is searched on, won't it be cached?
After switching from direct file API usage to an abstract IO loading layer, additional overhead was introduced due to JNI calls and buffer copying via std::memcpy. This change resulted in a 30% increase in loading time compared to the baseline in FAISS. The baseline took 3.584 seconds to load a 6GB vector index, while the modified version increased the load time to 4.664 seconds.
In NMSLIB, we expected a similar level of performance regression as seen in FAISS. However, we're observing a 70% increase in load time when loading a 6GB vector index (baseline = 4.144 sec, modified = 7.503 sec). Why is the performance impact in NMSLIB more than twice as severe as in FAISS?
The key performance difference in index loading between FAISS and NMSLIB stems from their file formats. In NMSLIB, this difference results in JNI calls being made O(N) times, where N is the number of vectors, whereas in FAISS, the number of JNI calls is O(1).
FAISS stores chunks of the neighbor list in a single location and loads them all at once. See the code below:
static void read_HNSW(HNSW* hnsw, IOReader* f) {
READVECTOR(hnsw->assign_probas);
READVECTOR(hnsw->cum_nneighbor_per_level);
READVECTOR(hnsw->levels);
READVECTOR(hnsw->offsets);
READVECTOR(hnsw->neighbors);
READ1(hnsw->entry_point);
READ1(hnsw->max_level);
READ1(hnsw->efConstruction);
READ1(hnsw->efSearch);
READ1(hnsw->upper_beam);
}
In NMSLIB, each neighbor list is stored individually, requiring O(N) reads, where N is the total number of vectors.
As shown in the code below, we need totalElementsStored_ read operations.
Note that input.read() ultimately calls JNI to delegate reading bytes to Lucene's IndexInput, thanks to the introduced loading layer. As a result, the number of input.read() calls directly corresponds to the number of JNI calls.
for (size_t i = 0; i < totalElementsStored_; i++) {
...
} else {
linkLists_[i] = (char *)malloc(linkListSize);
CHECK(linkLists_[i]);
input.read(linkLists_[i], linkListSize); <--------- THIS!
}
data_rearranged_[i] = new Object(data_level0_memory_ + (i)*memoryPerObject_ + offsetData_);
}
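To make the per-read cost concrete, here is a hypothetical sketch of how such a reader could delegate each read() to Lucene's IndexInput over JNI (the wrapper class and method names are assumptions, not the plugin's actual implementation). Every read() pays one JNI round trip plus an array copy:

#include <cstddef>
#include <jni.h>

// Hypothetical reader: each read() makes one JNI call into a Java wrapper around
// Lucene's IndexInput and then copies the returned bytes into the native buffer.
class JniIndexInputReader {
 public:
  JniIndexInputReader(JNIEnv* env, jobject indexInputWrapper, jmethodID readBytesMethod)
      : env_(env), wrapper_(indexInputWrapper), readBytes_(readBytesMethod) {}

  void read(char* dst, size_t nbytes) {
    // One JNI call per read(): ask the Java side to fill a temporary byte array.
    jbyteArray tmp = env_->NewByteArray(static_cast<jsize>(nbytes));
    env_->CallVoidMethod(wrapper_, readBytes_, tmp, static_cast<jint>(nbytes));
    // Copy the filled bytes into the native destination, then drop the local reference.
    env_->GetByteArrayRegion(tmp, 0, static_cast<jsize>(nbytes),
                             reinterpret_cast<jbyte*>(dst));
    env_->DeleteLocalRef(tmp);
  }

 private:
  JNIEnv* env_;
  jobject wrapper_;
  jmethodID readBytes_;
};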
We can patch NMSLIB to avoid making JNI calls for each vector element. The idea is to load data in bulk, then parse the neighbor lists from that buffer, rather than reading bytes individually. This approach would reduce the number of JNI calls to O(Index size / Buffer size).
For example, with a 6GB vector index containing 1 million vectors and a 64KB buffer size, the required JNI calls would be reduced to O(6GB / 64KB) = 98,304, which is a significant improvement over 1 million calls, achieving nearly a 90% reduction in operations.
Result: Surprisingly, it is 8% faster than the baseline. (Note: I reindexed on a new single node, which is why the loading time differs from the one mentioned earlier in the issue.)
template <typename dist_t>
void Hnsw<dist_t>::LoadOptimizedIndex(NmslibIOReader& input) {
...
const size_t bufferSize = 64 * 1024; // 64KB
std::unique_ptr<char[]> buffer (new char[bufferSize]);
uint32_t end = 0;
uint32_t pos = 0;
const bool isLTE = _isLittleEndian();
for (size_t i = 0, remainingBytes = input.remaining(); i < totalElementsStored_; i++) {
// Read linkList size integer.
if ((pos + sizeof(SIZEMASS_TYPE)) >= end) {
// Underflow, load bytes in bulk.
const auto firstPartLen = end - pos;
if (firstPartLen > 0) {
std::memcpy(buffer.get(), buffer.get() + pos, firstPartLen);
}
const auto copyBytes = std::min(remainingBytes, bufferSize - firstPartLen);
input.read(buffer.get() + firstPartLen, copyBytes);
remainingBytes -= copyBytes;
end = copyBytes + firstPartLen;
pos = 0;
}
// Read data size. SIZEMASS_TYPE -> uint32_t
SIZEMASS_TYPE linkListSize = 0;
if (isLTE) {
linkListSize = _readIntLittleEndian(buffer[pos], buffer[pos + 1], buffer[pos + 2], buffer[pos + 3]);
} else {
linkListSize = _readIntBigEndian(buffer[pos], buffer[pos + 1], buffer[pos + 2], buffer[pos + 3]);
}
pos += 4;
if (linkListSize == 0) {
linkLists_[i] = nullptr;
} else {
// Now we load neighbor list.
linkLists_[i] = (char *) malloc(linkListSize);
CHECK(linkLists_[i]);
SIZEMASS_TYPE leftLinkListData = linkListSize;
auto dataPtr = linkLists_[i];
while (leftLinkListData > 0) {
if (pos >= end) {
// Underflow, load bytes in bulk.
const auto copyBytes = std::min(remainingBytes, bufferSize);
input.read(buffer.get(), copyBytes);
remainingBytes -= copyBytes;
end = copyBytes;
pos = 0;
}
const auto copyBytes = std::min(leftLinkListData, end - pos);
std::memcpy(dataPtr, buffer.get() + pos, copyBytes);
dataPtr += copyBytes;
leftLinkListData -= copyBytes;
pos += copyBytes;
} // End while
} // End if
data_rearranged_[i] = new Object(data_level0_memory_ + (i)*memoryPerObject_ + offsetData_);
} // End for
...
Since we're deprecating NMSLIB in version 3.x, we can disable the loading layer in NMSLIB until then. Or, we can selectively allow streaming in NMSLIB depending on whether the given Directory is an FSDirectory implementation.
if (directory instanceof FSDirectory) {
    loadIndexByFilePath(...);
} else {
    loadIndexByStreaming(...);
}
Since we're deprecating NMSLIB in version 3.x, we can tolerate this issue in the short term. However, I personally don't favor this approach, as it impacts the p99 latency metrics, which are rare but could still affect overall cluster performance in the worst case.
@navneet1v @jmazanec15 Could you share your thoughts on the above analysis? Thanks
And the newly baked segment file likely is not system cached.
Won't the page cache typically be write-through? In which case, if the graph is created and written on the same node it is searched on, won't it be cached?
Sorry, I just saw it.
Yes, it is configured by default in pretty much every general file system. But the default dirty ratio is 10% of memory, which means the write-back cache size is bounded by 10% of the physical memory.
The reasons I assumed the write-back cached data will 'likely' no longer exist are twofold:
Please share your thoughts on it! You can call me an aggressive dreamer. 😛
@0ctopus13prime the approach of patching nmslib looks good to me. I think if it is providing a good latency we should do that, since the pros of having a patch are improvements in load time and also getting away from the FSDirectory dependency.
Merging the PR, as we have 2 approvals and the author is requesting the merge.
Description
This PR is the first commit introducing the loading layer in NMSLIB. Please refer to this issue for more details. - https://github.com/opensearch-project/k-NN/issues/2033
FYI : FAISS Loading Layer PR - https://github.com/opensearch-project/k-NN/pull/2139
Related Issues
Resolves https://github.com/opensearch-project/k-NN/issues/2033
Check List
Commits are signed per the DCO using --signoff.
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license. For more information on following Developer Certificate of Origin and signing off your commits, please check here.