opensearch-project / k-NN


Introduce a loading layer in NMSLIB. #2185

Closed 0ctopus13prime closed 1 month ago

0ctopus13prime commented 1 month ago

Description

This PR is the first commit introducing the loading layer in NMSLIB. Please refer to this issue for more details. - https://github.com/opensearch-project/k-NN/issues/2033

FYI : FAISS Loading Layer PR - https://github.com/opensearch-project/k-NN/pull/2139

Related Issues

Resolves #2033 (https://github.com/opensearch-project/k-NN/issues/2033)

Check List

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license. For more information on following Developer Certificate of Origin and signing off your commits, please check here.

0ctopus13prime commented 1 month ago

Memory monitoring results comparison

From a memory standpoint, I did not observe any major changes in the benchmark.

Baseline

nmslib_baseline

Candidate

nmslib-candidate-1
0ctopus13prime commented 1 month ago

Loading Time Comparison

The numbers below were measured with time curl -X GET http://localhost:9200/_plugins/_knn/warmup/target_index. I ran two experiments loading an NMSLIB vector index, varying the buffer size.

  1. After dropping all file caches from memory. (sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches')
  2. With the file cache in memory.

Observation

Unlike FAISS, loading a system-cached file took almost 81% more time. This case should be rare, since k-NN is expected to load a vector index whenever a new segment file is baked, and a newly baked segment file is unlikely to be in the system cache. Increasing the buffer size didn't help. We need to find a better way to transfer data from JNI to Java.

Experiment

Index size : 30G

1. Baseline (using fread)

  1. After dropping caches : 50.448 seconds
  2. With cache : 4.144 seconds

2. Using Stream (4KB buffer)

  1. After dropping caches : 52.779 seconds
  2. With cache : 7.503 seconds
  3. With cache, 64KB buffer : 6.919 seconds
  4. With cache, 1MB buffer : 6.99 seconds 🤔
0ctopus13prime commented 1 month ago

Performance Benchmark

Machine : c5ad.12xlarge
JVM Args : -Xms63g -Xmx63g
Data : random-s-128-10m-euclidean.hdf5

| Metric | Task | Baseline-Value | Candidate-Value | Change | Unit |
| --- | --- | --- | --- | --- | --- |
| Cumulative indexing time of primary shards | | 33.7147 | 34.2115 | 1.47% | min |
| Min cumulative indexing time across primary shards | | 0.000133333 | 0.00015 | 12.50% | min |
| Median cumulative indexing time across primary shards | | 16.8573 | 17.1057 | 1.47% | min |
| Max cumulative indexing time across primary shards | | 33.7146 | 34.2113 | 1.47% | min |
| Cumulative indexing throttle time of primary shards | | 0 | 0 | 0.00% | min |
| Min cumulative indexing throttle time across primary shards | | 0 | 0 | 0.00% | min |
| Median cumulative indexing throttle time across primary shards | | 0 | 0 | 0.00% | min |
| Max cumulative indexing throttle time across primary shards | | 0 | 0 | 0.00% | min |
| Cumulative merge time of primary shards | | 282.601 | 282.996 | 0.14% | min |
| Cumulative merge count of primary shards | | 125 | 122 | 2.40% | |
| Min cumulative merge time across primary shards | | 0 | 0 | 0.00% | min |
| Median cumulative merge time across primary shards | | 141.3 | 141.498 | 0.14% | min |
| Max cumulative merge time across primary shards | | 282.601 | 282.996 | 0.14% | min |
| Cumulative merge throttle time of primary shards | | 1.04818 | 1.61307 | 53.89% | min |
| Min cumulative merge throttle time across primary shards | | 0 | 0 | 0.00% | min |
| Median cumulative merge throttle time across primary shards | | 0.524092 | 0.806533 | 53.89% | min |
| Max cumulative merge throttle time across primary shards | | 1.04818 | 1.61307 | 53.89% | min |
| Cumulative refresh time of primary shards | | 1.1042 | 1.14667 | 3.85% | min |
| Cumulative refresh count of primary shards | | 88 | 85 | 3.41% | |
| Min cumulative refresh time across primary shards | | 0.000333333 | 0.000383333 | 15.00% | min |
| Median cumulative refresh time across primary shards | | 0.5521 | 0.573333 | 3.85% | min |
| Max cumulative refresh time across primary shards | | 1.10387 | 1.14628 | 3.84% | min |
| Cumulative flush time of primary shards | | 11.5126 | 10.9446 | 4.93% | min |
| Cumulative flush count of primary shards | | 56 | 53 | 5.36% | |
| Min cumulative flush time across primary shards | | 0 | 0 | 0.00% | min |
| Median cumulative flush time across primary shards | | 5.75628 | 5.47229 | 4.93% | min |
| Max cumulative flush time across primary shards | | 11.5126 | 10.9446 | 4.93% | min |
| Total Young Gen GC time | | 0.338 | 0.342 | 1.18% | s |
| Total Young Gen GC count | | 19 | 19 | 0.00% | |
| Total Old Gen GC time | | 0 | 0 | 0.00% | s |
| Total Old Gen GC count | | 0 | 0 | 0.00% | |
| Store size | | 29.8586 | 29.8584 | 0.00% | GB |
| Translog size | | 5.83E-07 | 5.83E-07 | 0.00% | GB |
| Heap used for segments | | 0 | 0 | 0.00% | MB |
| Heap used for doc values | | 0 | 0 | 0.00% | MB |
| Heap used for terms | | 0 | 0 | 0.00% | MB |
| Heap used for norms | | 0 | 0 | 0.00% | MB |
| Heap used for points | | 0 | 0 | 0.00% | MB |
| Heap used for stored fields | | 0 | 0 | 0.00% | MB |
| Segment count | | 2 | 2 | 0.00% | |
| Min Throughput | custom-vector-bulk | 5390.31 | 5466.23 | 1.41% | docs/s |
| Mean Throughput | custom-vector-bulk | 11041.1 | 10866 | 1.59% | docs/s |
| Median Throughput | custom-vector-bulk | 10377.9 | 10065.6 | 3.01% | docs/s |
| Max Throughput | custom-vector-bulk | 20105.1 | 19337.8 | 3.82% | docs/s |
| 50th percentile latency | custom-vector-bulk | 78.4349 | 75.8132 | 3.34% | ms |
| 90th percentile latency | custom-vector-bulk | 165.667 | 158.129 | 4.55% | ms |
| 99th percentile latency | custom-vector-bulk | 331.043 | 318.269 | 3.86% | ms |
| 99.9th percentile latency | custom-vector-bulk | 1486.47 | 1487.08 | 0.04% | ms |
| 99.99th percentile latency | custom-vector-bulk | 2300.48 | 2598.04 | 12.93% | ms |
| 100th percentile latency | custom-vector-bulk | 5049.72 | 4535.08 | 10.19% | ms |
| 50th percentile service time | custom-vector-bulk | 78.4349 | 75.8132 | 3.34% | ms |
| 90th percentile service time | custom-vector-bulk | 165.667 | 158.129 | 4.55% | ms |
| 99th percentile service time | custom-vector-bulk | 331.043 | 318.269 | 3.86% | ms |
| 99.9th percentile service time | custom-vector-bulk | 1486.47 | 1487.08 | 0.04% | ms |
| 99.99th percentile service time | custom-vector-bulk | 2300.48 | 2598.04 | 12.93% | ms |
| 100th percentile service time | custom-vector-bulk | 5049.72 | 4535.08 | 10.19% | ms |
| error rate | custom-vector-bulk | 0 | 0 | 0.00% | % |
| Min Throughput | force-merge-segments | 0 | 0 | 0.00% | ops/s |
| Mean Throughput | force-merge-segments | 0 | 0 | 0.00% | ops/s |
| Median Throughput | force-merge-segments | 0 | 0 | 0.00% | ops/s |
| Max Throughput | force-merge-segments | 0 | 0 | 0.00% | ops/s |
| 100th percentile latency | force-merge-segments | 1.16E+07 | 1.13E+07 | 2.84% | ms |
| 100th percentile service time | force-merge-segments | 1.16E+07 | 1.13E+07 | 2.84% | ms |
| error rate | force-merge-segments | 0 | 0 | 0.00% | % |
| Min Throughput | warmup-indices | 0.24 | 0.14 | 41.67% | ops/s |
| Mean Throughput | warmup-indices | 0.24 | 0.14 | 41.67% | ops/s |
| Median Throughput | warmup-indices | 0.24 | 0.14 | 41.67% | ops/s |
| Max Throughput | warmup-indices | 0.24 | 0.14 | 41.67% | ops/s |
| 100th percentile latency | warmup-indices | 4162.87 | 7127.78 | 71.22% | ms |
| 100th percentile service time | warmup-indices | 4162.87 | 7127.78 | 71.22% | ms |
| error rate | warmup-indices | 0 | 0 | 0.00% | % |
| Min Throughput | prod-queries | 0.66 | 0.64 | 3.03% | ops/s |
| Mean Throughput | prod-queries | 0.66 | 0.64 | 3.03% | ops/s |
| Median Throughput | prod-queries | 0.66 | 0.64 | 3.03% | ops/s |
| Max Throughput | prod-queries | 0.66 | 0.64 | 3.03% | ops/s |
| 50th percentile latency | prod-queries | 3.5832 | 3.83349 | 6.99% | ms |
| 90th percentile latency | prod-queries | 4.75317 | 4.64172 | 2.34% | ms |
| 99th percentile latency | prod-queries | 22.1628 | 23.8439 | 7.59% | ms |
| 100th percentile latency | prod-queries | 1508.36 | 1571.86 | 4.21% | ms |
| 50th percentile service time | prod-queries | 3.5832 | 3.83349 | 6.99% | ms |
| 90th percentile service time | prod-queries | 4.75317 | 4.64172 | 2.34% | ms |
| 99th percentile service time | prod-queries | 22.1628 | 23.8439 | 7.59% | ms |
| 100th percentile service time | prod-queries | 1508.36 | 1571.86 | 4.21% | ms |
| error rate | prod-queries | 0 | 0 | 0.00% | % |
| Mean recall@k | prod-queries | 0.42 | 0.43 | 2.38% | |
| Mean recall@1 | prod-queries | 0.6 | 0.63 | 5.00% | |
navneet1v commented 1 month ago

> This PR is the first commit making the loading layer in native engines available.

You might want to update it to say loading layer for NMSLIB.

0ctopus13prime commented 1 month ago

Will hold off merging until we root-cause the big gap in warm-up time. Compared to FAISS, an 84% increase is a bit worrisome.

0ctopus13prime commented 1 month ago

Streaming Flamegraph

Screenshot 2024-10-08 at 5 44 14 PM
0ctopus13prime commented 1 month ago

Performance tuning plan

Planning to pursue the two tuning ideas below, which I expect to reduce latency by 23.16%. I don't think we can make the other parts (e.g. JNIEnv_::CallIntMethod and IndexInput) much faster. It would also be worth trying a bigger buffer size to see how it goes.

  1. __memmove_avx_unaligned_erms : This indicates that the buffer memory is not properly aligned for the internal memcpy. We can allocate a 64-byte-aligned buffer and retry. 64-byte alignment works for both AVX2 and AVX-512. (See the sketch after the code snippet below.)

  2. Remove the critical JNI calls (JNIEnv_::CallIntMethod, jni_ReleasePrimitiveArrayCritical) entirely. We can have the Java side own native memory via a direct ByteBuffer, then acquire the pointer once in JNI via GetDirectBufferAddress.

TODO : Can we allocate an aligned memory layout?

In Java

ByteBuffer nativeBuffer = ByteBuffer.allocateDirect(size);
In C++

// Get the pointer to the native memory from the buffer once, at the beginning.
void *nativePtr = env->GetDirectBufferAddress(buffer);
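
Regarding item 1 above and the TODO: on the C++ side, a 64-byte-aligned scratch buffer can be obtained with std::aligned_alloc (C++17). The snippet below is only a minimal sketch of that idea, not the plugin's actual code; makeAlignedBuffer is a hypothetical helper name.

// Sketch (assumption, not the actual k-NN plugin code): allocate a 64-byte-aligned
// scratch buffer so the internal memcpy can use aligned AVX2/AVX-512 moves instead
// of __memmove_avx_unaligned_erms.
#include <cstdlib>
#include <memory>

struct FreeDeleter {
    void operator()(char* p) const { std::free(p); }
};

inline std::unique_ptr<char[], FreeDeleter> makeAlignedBuffer(std::size_t size) {
    constexpr std::size_t kAlignment = 64;  // satisfies both AVX2 and AVX-512
    // std::aligned_alloc requires the size to be a multiple of the alignment.
    const std::size_t paddedSize = ((size + kAlignment - 1) / kAlignment) * kAlignment;
    return std::unique_ptr<char[], FreeDeleter>(
        static_cast<char*>(std::aligned_alloc(kAlignment, paddedSize)));
}

// Usage: a 64KB read buffer, aligned to 64 bytes.
// auto buffer = makeAlignedBuffer(64 * 1024);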
jmazanec15 commented 1 month ago

> And the newly baked segment file likely is not system cached.

Won't the page cache typically be write-through? In which case, if the graph is created and written on the same node it is searched on, won't it be cached?

0ctopus13prime commented 1 month ago

Baseline flamegraph

Screenshot 2024-10-08 at 7 54 06 PM

0ctopus13prime commented 1 month ago

1. NMSLIB Loading Perf Issue Analysis

2. Performance Degradation In FAISS

After switching from direct file API usage to an abstract IO loading layer, additional overhead was introduced due to JNI calls and buffer copying via std::memcpy. This change resulted in a 30% increase in loading time compared to the baseline in FAISS. The baseline took 3.584 seconds to load a 6GB vector index, while the modified version increased the load time to 4.664 seconds.

In NMSLIB, we expected a similar level of performance regression as seen in FAISS. However, we're observing an 81% increase in load time when loading a 6GB vector index (baseline = 4.144 sec, modified = 7.503 sec). Why is the performance impact in NMSLIB more than twice as severe as in FAISS?

3. Why is it more than twice as severe as in FAISS?

The key performance difference in index loading between FAISS and NMSLIB stems from their file formats. In NMSLIB, this difference results in JNI calls being made O(N) times, where N is the number of vectors, whereas in FAISS, the number of JNI calls is O(1).

FAISS stores chunks of the neighbor list in a single location and loads them all at once. See the code below:

static void read_HNSW(HNSW* hnsw, IOReader* f) {
    READVECTOR(hnsw->assign_probas);
    READVECTOR(hnsw->cum_nneighbor_per_level);
    READVECTOR(hnsw->levels);
    READVECTOR(hnsw->offsets);
    READVECTOR(hnsw->neighbors);

    READ1(hnsw->entry_point);
    READ1(hnsw->max_level);
    READ1(hnsw->efConstruction);
    READ1(hnsw->efSearch);
    READ1(hnsw->upper_beam);
}

In NMSLIB, each neighbor list is stored individually, requiring O(N) reads, where N is the total number of vectors. As shown in the code below, we need totalElementsStored_ read operations. Note that, with the introduced loading layer, input.read() ultimately goes through JNI to delegate the byte reads to Lucene's IndexInput. As a result, the number of input.read() calls directly corresponds to the number of JNI calls.

for (size_t i = 0; i < totalElementsStored_; i++) {
   ...
    } else {
        linkLists_[i] = (char *)malloc(linkListSize);
        CHECK(linkLists_[i]);
        input.read(linkLists_[i], linkListSize); <--------- THIS!
    }
    data_rearranged_[i] = new Object(data_level0_memory_ + (i)*memoryPerObject_ + offsetData_);
}
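
For illustration only, here is a simplified sketch of what a JNI-backed read path could look like; the class and method names are hypothetical, not the plugin's actual implementation. It shows why every input.read() in the loop above translates into a JNI method call plus a copy out of a Java byte[].

// Hypothetical sketch (assumption): each native read() crosses the JNI boundary
// once to ask the Java side (Lucene's IndexInput) for bytes, then copies them
// out of the shared Java byte[] into the native destination.
#include <jni.h>
#include <cstddef>
#include <cstring>

class JniBackedReader {
public:
    JniBackedReader(JNIEnv* env, jobject javaIndexInput, jmethodID readMethod, jbyteArray javaBuffer)
        : env_(env), javaIndexInput_(javaIndexInput), readMethod_(readMethod), javaBuffer_(javaBuffer) {}

    void read(char* dst, std::size_t len) {
        // JNI call: delegate the actual read to the Java IndexInput wrapper.
        jint bytesRead = env_->CallIntMethod(javaIndexInput_, readMethod_, javaBuffer_, (jint) len);

        // Pin the Java byte[] and copy the bytes into the native buffer.
        void* src = env_->GetPrimitiveArrayCritical(javaBuffer_, nullptr);
        std::memcpy(dst, src, (std::size_t) bytesRead);
        env_->ReleasePrimitiveArrayCritical(javaBuffer_, src, JNI_ABORT);  // read-only, discard changes
    }

private:
    JNIEnv* env_;
    jobject javaIndexInput_;
    jmethodID readMethod_;
    jbyteArray javaBuffer_;
};

// With NMSLIB calling read() once per neighbor list, the calls above run O(N)
// times, which is the JNI overhead discussed in this analysis.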

4. Solution 1. Patch in NMSLIB

We can patch NMSLIB to avoid making JNI calls for each vector element. The idea is to load data in bulk, then parse the neighbor lists from that buffer, rather than reading bytes individually. This approach would reduce the number of JNI calls to O(Index size / Buffer size).

For example, with a 6GB vector index containing 1 million vectors and a 64KB buffer size, the required JNI calls would be reduced to O(6GB / 64KB) = 98,304, which is a significant improvement over 1 million calls, achieving nearly a 90% reduction in operations.

Result: Surprisingly, it is 8% faster than the baseline. (Note: I reindexed on a new single node, which is why the loading time differs from the one mentioned earlier in the issue.)

  1. Baseline : 4.538 sec
  2. Modified version with 64KB buffer : 4.19 sec

4.1 Pros

  1. No performance degradation. If anything, it is even faster than the baseline.
  2. We can maintain a unified set of loading APIs for both NMSLIB and FAISS.

4.2 Cons

  1. A medium-sized patch is required in NMSLIB, which may increase the code maintenance burden.

4.3. Patch in hnsw.cc

template <typename dist_t>
void Hnsw<dist_t>::LoadOptimizedIndex(NmslibIOReader& input) {
    ...

    const size_t bufferSize = 64 * 1024;  // 64KB
    std::unique_ptr<char[]> buffer (new char[bufferSize]);
    uint32_t end = 0;
    uint32_t pos = 0;
    const bool isLTE = _isLittleEndian();

    for (size_t i = 0, remainingBytes = input.remaining(); i < totalElementsStored_; i++) {
        // Read linkList size integer.
        if ((pos + sizeof(SIZEMASS_TYPE)) >= end) {
            // Underflow, load bytes in bulk.
            const auto firstPartLen = end - pos;
            if (firstPartLen > 0) {
                std::memcpy(buffer.get(), buffer.get() + pos, firstPartLen);
            }
            const auto copyBytes = std::min(remainingBytes, bufferSize - firstPartLen);
            input.read(buffer.get() + firstPartLen, copyBytes);
            remainingBytes -= copyBytes;
            end = copyBytes + firstPartLen;
            pos = 0;
        }

        // Read data size. SIZEMASS_TYPE -> uint32_t
        SIZEMASS_TYPE linkListSize = 0;
        if (isLTE) {
            linkListSize = _readIntLittleEndian(buffer[pos], buffer[pos + 1], buffer[pos + 2], buffer[pos + 3]);
        } else {
            linkListSize = _readIntBigEndian(buffer[pos], buffer[pos + 1], buffer[pos + 2], buffer[pos + 3]);
        }
        pos += 4;

        if (linkListSize == 0) {
            linkLists_[i] = nullptr;
        } else {
            // Now we load neighbor list.
            linkLists_[i] = (char *) malloc(linkListSize);
            CHECK(linkLists_[i]);

            SIZEMASS_TYPE leftLinkListData = linkListSize;
            auto dataPtr = linkLists_[i];
            while (leftLinkListData > 0) {
                if (pos >= end) {
                    // Underflow, load bytes in bulk.
                    const auto copyBytes = std::min(remainingBytes, bufferSize);
                    input.read(buffer.get(), copyBytes);
                    remainingBytes -= copyBytes;
                    end = copyBytes;
                    pos = 0;
                }

                const auto copyBytes = std::min(leftLinkListData, end - pos);
                std::memcpy(dataPtr, buffer.get() + pos, copyBytes);
                dataPtr += copyBytes;
                leftLinkListData -= copyBytes;
                pos += copyBytes;
            }  // End while
        }  // End if

        data_rearranged_[i] = new Object(data_level0_memory_ + (i)*memoryPerObject_ + offsetData_);
    }  // End for

...            
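
The patch above relies on _isLittleEndian, _readIntLittleEndian, and _readIntBigEndian, which are not shown in the excerpt. Below is a minimal sketch of what such helpers could look like; the implementations are my assumption, not necessarily the actual patch.

// Sketch of the endianness helpers assumed by the patch above (illustrative only).
#include <cstdint>

inline bool _isLittleEndian() {
    const uint32_t probe = 1;
    // On a little-endian machine the least significant byte is stored first.
    return *reinterpret_cast<const unsigned char*>(&probe) == 1;
}

inline uint32_t _readIntLittleEndian(char b0, char b1, char b2, char b3) {
    // b0 is the least significant byte.
    return  static_cast<uint32_t>(static_cast<unsigned char>(b0))
          | (static_cast<uint32_t>(static_cast<unsigned char>(b1)) << 8)
          | (static_cast<uint32_t>(static_cast<unsigned char>(b2)) << 16)
          | (static_cast<uint32_t>(static_cast<unsigned char>(b3)) << 24);
}

inline uint32_t _readIntBigEndian(char b0, char b1, char b2, char b3) {
    // b0 is the most significant byte.
    return  (static_cast<uint32_t>(static_cast<unsigned char>(b0)) << 24)
          | (static_cast<uint32_t>(static_cast<unsigned char>(b1)) << 16)
          | (static_cast<uint32_t>(static_cast<unsigned char>(b2)) << 8)
          |  static_cast<uint32_t>(static_cast<unsigned char>(b3));
}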

5. Solution 2. Disable Streaming When FSDirectory

Since we're deprecating NMSLIB in version 3.x, we can disable the loading layer in NMSLIB until then. Alternatively, we can selectively allow streaming in NMSLIB depending on whether the given Directory is an FSDirectory implementation.

if (directory instanceof FSDirectory) {
  loadIndexByFilePath(...);
} else {
  loadIndexByStreaming(...);
}

5.1. Pros :

  1. Simple.

5.2. Cons :

  1. Until 3.x, we need to maintain duplicated, nearly identical versions of the APIs in both Java and JNI.

6. Solution 3. Live with it :)

Since we're deprecating NMSLIB in version 3.x, we could tolerate this issue in the short term. However, I personally don't favor this approach, as it impacts the p99 latency metrics; such loads are rare but could still affect overall cluster performance in the worst case.

7. Micro Tuning Results

  1. CallNonvirtualIntMethod → No impact.
  2. AVX2 intrinsic copy → No impact.
  3. Native ByteBuffer + one additional byte copy → Made it worse.
  4. Increasing buffer size → Going from 4KB to 64KB did at least reduce the warm-up time by 0.8 seconds in NMSLIB.
0ctopus13prime commented 1 month ago

@navneet1v @jmazanec15 Could you share your thoughts on the above analysis? Thanks

0ctopus13prime commented 1 month ago

> And the newly baked segment file likely is not system cached.

> Won't the page cache typically be write-through? In which case, if the graph is created and written on the same node it is searched on, won't it be cached?

Sorry, I just saw it.

Yes, write-back caching is the default in pretty much every general-purpose file system. But the default dirty ratio is 10% of memory, which means the write-back cache size is bounded by 10% of physical memory.

The reasons I assumed the write-back cached pages will 'likely' no longer exist are twofold:

  1. In the normal case, writing and reading happen simultaneously. Since Linux uses LRU page eviction, write-back cached pages are very likely to be evicted soon.
  2. On top of point 1, NRT in Lucene periodically exposes a newly baked segment. This typically takes a few seconds, so I assumed most write-back cached pages are evicted in the meantime.

Let me know your thoughts on it! You can call me an aggressive dreamer. 😛

navneet1v commented 1 month ago

@0ctopus13prime the approach of patching nmslib looks good to me. I think if it provides good latency, we should do that. The pros of having the patch are improved load time and also getting away from the FSDirectory dependency.

navneet1v commented 1 month ago

Merging the PR, as we have 2 approvals and the author is requesting the merge.