opensearch-project / k-NN


Introduce a loading layer in NMSLIB. #2185

Closed 0ctopus13prime closed 1 month ago

0ctopus13prime commented 1 month ago

Description

This PR is the first commit introducing the loading layer in NMSLIB. Please refer to this issue for more details. - https://github.com/opensearch-project/k-NN/issues/2033

FYI : FAISS Loading Layer PR - https://github.com/opensearch-project/k-NN/pull/2139

Related Issues

Resolves #2033 (https://github.com/opensearch-project/k-NN/issues/2033)

Check List

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license. For more information on following Developer Certificate of Origin and signing off your commits, please check here.

0ctopus13prime commented 1 month ago

Memory monitoring results comparison

From a memory standpoint, I did not observe any major changes in the benchmark.

Baseline

nmslib_baseline

Candidate

nmslib-candidate-1
0ctopus13prime commented 1 month ago

Loading Time Comparison

The numbers below were measured with time curl -X GET http://localhost:9200/_plugins/_knn/warmup/target_index. I ran two experiments loading an NMSLIB vector index, varying the buffer size.

  1. After dropping all file caches from memory. (sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches')
  2. With the file cache in memory.

Observation

Unlike FAISS, loading a system-cached file took almost 81% more time. This case should be rare, since k-NN is expected to load a vector index whenever a new segment file is baked, and a newly baked segment file is unlikely to be in the system cache. Increasing the buffer size didn't help. We need to find a better way to transfer data from JNI to Java.

Experiment

Index size : 30G

1. Baseline (using fread)

  1. After dropping caches : 50.448 seconds
  2. With cache : 4.144 seconds

2. Using Stream (4KB buffer)

  1. After dropping caches : 52.779 seconds
  2. With cache : 7.503 seconds
  3. With cache, 64KB buffer : 6.919 seconds
  4. With cache, 1MB buffer : 6.99 seconds 🤔
0ctopus13prime commented 1 month ago

Performance Benchmark

Machine : c5ad.12xlarge
JVM Args : -Xms63g -Xmx63g
Data : random-s-128-10m-euclidean.hdf5

| Metric | Task | Baseline-Value | Candidate-Value | Change | Unit |
| --- | --- | --- | --- | --- | --- |
| Cumulative indexing time of primary shards | | 33.7147 | 34.2115 | 1.47% | min |
| Min cumulative indexing time across primary shards | | 0.000133333 | 0.00015 | 12.50% | min |
| Median cumulative indexing time across primary shards | | 16.8573 | 17.1057 | 1.47% | min |
| Max cumulative indexing time across primary shards | | 33.7146 | 34.2113 | 1.47% | min |
| Cumulative indexing throttle time of primary shards | | 0 | 0 | 0.00% | min |
| Min cumulative indexing throttle time across primary shards | | 0 | 0 | 0.00% | min |
| Median cumulative indexing throttle time across primary shards | | 0 | 0 | 0.00% | min |
| Max cumulative indexing throttle time across primary shards | | 0 | 0 | 0.00% | min |
| Cumulative merge time of primary shards | | 282.601 | 282.996 | 0.14% | min |
| Cumulative merge count of primary shards | | 125 | 122 | 2.40% | |
| Min cumulative merge time across primary shards | | 0 | 0 | 0.00% | min |
| Median cumulative merge time across primary shards | | 141.3 | 141.498 | 0.14% | min |
| Max cumulative merge time across primary shards | | 282.601 | 282.996 | 0.14% | min |
| Cumulative merge throttle time of primary shards | | 1.04818 | 1.61307 | 53.89% | min |
| Min cumulative merge throttle time across primary shards | | 0 | 0 | 0.00% | min |
| Median cumulative merge throttle time across primary shards | | 0.524092 | 0.806533 | 53.89% | min |
| Max cumulative merge throttle time across primary shards | | 1.04818 | 1.61307 | 53.89% | min |
| Cumulative refresh time of primary shards | | 1.1042 | 1.14667 | 3.85% | min |
| Cumulative refresh count of primary shards | | 88 | 85 | 3.41% | |
| Min cumulative refresh time across primary shards | | 0.000333333 | 0.000383333 | 15.00% | min |
| Median cumulative refresh time across primary shards | | 0.5521 | 0.573333 | 3.85% | min |
| Max cumulative refresh time across primary shards | | 1.10387 | 1.14628 | 3.84% | min |
| Cumulative flush time of primary shards | | 11.5126 | 10.9446 | 4.93% | min |
| Cumulative flush count of primary shards | | 56 | 53 | 5.36% | |
| Min cumulative flush time across primary shards | | 0 | 0 | 0.00% | min |
| Median cumulative flush time across primary shards | | 5.75628 | 5.47229 | 4.93% | min |
| Max cumulative flush time across primary shards | | 11.5126 | 10.9446 | 4.93% | min |
| Total Young Gen GC time | | 0.338 | 0.342 | 1.18% | s |
| Total Young Gen GC count | | 19 | 19 | 0.00% | |
| Total Old Gen GC time | | 0 | 0 | 0.00% | s |
| Total Old Gen GC count | | 0 | 0 | 0.00% | |
| Store size | | 29.8586 | 29.8584 | 0.00% | GB |
| Translog size | | 5.83E-07 | 5.83E-07 | 0.00% | GB |
| Heap used for segments | | 0 | 0 | 0.00% | MB |
| Heap used for doc values | | 0 | 0 | 0.00% | MB |
| Heap used for terms | | 0 | 0 | 0.00% | MB |
| Heap used for norms | | 0 | 0 | 0.00% | MB |
| Heap used for points | | 0 | 0 | 0.00% | MB |
| Heap used for stored fields | | 0 | 0 | 0.00% | MB |
| Segment count | | 2 | 2 | 0.00% | |
| Min Throughput | custom-vector-bulk | 5390.31 | 5466.23 | 1.41% | docs/s |
| Mean Throughput | custom-vector-bulk | 11041.1 | 10866 | 1.59% | docs/s |
| Median Throughput | custom-vector-bulk | 10377.9 | 10065.6 | 3.01% | docs/s |
| Max Throughput | custom-vector-bulk | 20105.1 | 19337.8 | 3.82% | docs/s |
| 50th percentile latency | custom-vector-bulk | 78.4349 | 75.8132 | 3.34% | ms |
| 90th percentile latency | custom-vector-bulk | 165.667 | 158.129 | 4.55% | ms |
| 99th percentile latency | custom-vector-bulk | 331.043 | 318.269 | 3.86% | ms |
| 99.9th percentile latency | custom-vector-bulk | 1486.47 | 1487.08 | 0.04% | ms |
| 99.99th percentile latency | custom-vector-bulk | 2300.48 | 2598.04 | 12.93% | ms |
| 100th percentile latency | custom-vector-bulk | 5049.72 | 4535.08 | 10.19% | ms |
| 50th percentile service time | custom-vector-bulk | 78.4349 | 75.8132 | 3.34% | ms |
| 90th percentile service time | custom-vector-bulk | 165.667 | 158.129 | 4.55% | ms |
| 99th percentile service time | custom-vector-bulk | 331.043 | 318.269 | 3.86% | ms |
| 99.9th percentile service time | custom-vector-bulk | 1486.47 | 1487.08 | 0.04% | ms |
| 99.99th percentile service time | custom-vector-bulk | 2300.48 | 2598.04 | 12.93% | ms |
| 100th percentile service time | custom-vector-bulk | 5049.72 | 4535.08 | 10.19% | ms |
| error rate | custom-vector-bulk | 0 | 0 | 0.00% | % |
| Min Throughput | force-merge-segments | 0 | 0 | 0.00% | ops/s |
| Mean Throughput | force-merge-segments | 0 | 0 | 0.00% | ops/s |
| Median Throughput | force-merge-segments | 0 | 0 | 0.00% | ops/s |
| Max Throughput | force-merge-segments | 0 | 0 | 0.00% | ops/s |
| 100th percentile latency | force-merge-segments | 1.16E+07 | 1.13E+07 | 2.84% | ms |
| 100th percentile service time | force-merge-segments | 1.16E+07 | 1.13E+07 | 2.84% | ms |
| error rate | force-merge-segments | 0 | 0 | 0.00% | % |
| Min Throughput | warmup-indices | 0.24 | 0.14 | 41.67% | ops/s |
| Mean Throughput | warmup-indices | 0.24 | 0.14 | 41.67% | ops/s |
| Median Throughput | warmup-indices | 0.24 | 0.14 | 41.67% | ops/s |
| Max Throughput | warmup-indices | 0.24 | 0.14 | 41.67% | ops/s |
| 100th percentile latency | warmup-indices | 4162.87 | 7127.78 | 71.22% | ms |
| 100th percentile service time | warmup-indices | 4162.87 | 7127.78 | 71.22% | ms |
| error rate | warmup-indices | 0 | 0 | 0.00% | % |
| Min Throughput | prod-queries | 0.66 | 0.64 | 3.03% | ops/s |
| Mean Throughput | prod-queries | 0.66 | 0.64 | 3.03% | ops/s |
| Median Throughput | prod-queries | 0.66 | 0.64 | 3.03% | ops/s |
| Max Throughput | prod-queries | 0.66 | 0.64 | 3.03% | ops/s |
| 50th percentile latency | prod-queries | 3.5832 | 3.83349 | 6.99% | ms |
| 90th percentile latency | prod-queries | 4.75317 | 4.64172 | 2.34% | ms |
| 99th percentile latency | prod-queries | 22.1628 | 23.8439 | 7.59% | ms |
| 100th percentile latency | prod-queries | 1508.36 | 1571.86 | 4.21% | ms |
| 50th percentile service time | prod-queries | 3.5832 | 3.83349 | 6.99% | ms |
| 90th percentile service time | prod-queries | 4.75317 | 4.64172 | 2.34% | ms |
| 99th percentile service time | prod-queries | 22.1628 | 23.8439 | 7.59% | ms |
| 100th percentile service time | prod-queries | 1508.36 | 1571.86 | 4.21% | ms |
| error rate | prod-queries | 0 | 0 | 0.00% | % |
| Mean recall@k | prod-queries | 0.42 | 0.43 | 2.38% | |
| Mean recall@1 | prod-queries | 0.6 | 0.63 | 5.00% | |
navneet1v commented 1 month ago

> This PR is the first commit making the loading layer in native engines available.

You might want to update it to say loading layer for NMSLIB.

0ctopus13prime commented 1 month ago

Will hold off merging until we root-cause the big gap in warm-up time. Compared to FAISS, an 84% increase is a bit worrisome.

0ctopus13prime commented 1 month ago

Streaming Flamegraph

Screenshot 2024-10-08 at 5 44 14 PM
0ctopus13prime commented 1 month ago

Performance tuning plan

Planning to pursue the two tuning ideas below, which I expect to reduce latency by 23.16%. I don't think we can make the other parts (e.g. JNIEnv_::CallIntMethod and IndexInput) much faster. It would also be worth trying a bigger buffer size to see how it goes.

  1. __memmove_avx_unaligned_erms : This indicates that the buffer memory is not properly aligned for the internal memcpy. We can allocate a 64-byte-aligned buffer and retry. 64-byte alignment works for both AVX2 and AVX-512. (See the sketch after the code snippet below.)

  2. Remove the critical JNI calls (JNIEnv_::CallIntMethod, jni_ReleasePrimitiveArrayCritical) entirely. We can have the Java side own native memory via a direct ByteBuffer, then acquire the pointer once in JNI via GetDirectBufferAddress.

TODO : Can we allocate an aligned memory layout?

In Java

ByteBuffer nativeBuffer = ByteBuffer.allocateDirect(size);
In C++

// Get the pointer to the native memory from the buffer once, at the beginning.
void *nativePtr = env->GetDirectBufferAddress(buffer);
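
Regarding item 1 above and the TODO: on the C++ side, a 64-byte-aligned scratch buffer can be obtained with std::aligned_alloc (C++17). The snippet below is only a minimal sketch of that idea, not the plugin's actual code; makeAlignedBuffer is a hypothetical helper name.

// Sketch (assumption, not the actual k-NN plugin code): allocate a 64-byte-aligned
// scratch buffer so the internal memcpy can use aligned AVX2/AVX-512 moves instead
// of __memmove_avx_unaligned_erms.
#include <cstdlib>
#include <memory>

struct FreeDeleter {
    void operator()(char* p) const { std::free(p); }
};

inline std::unique_ptr<char[], FreeDeleter> makeAlignedBuffer(std::size_t size) {
    constexpr std::size_t kAlignment = 64;  // satisfies both AVX2 and AVX-512
    // std::aligned_alloc requires the size to be a multiple of the alignment.
    const std::size_t paddedSize = ((size + kAlignment - 1) / kAlignment) * kAlignment;
    return std::unique_ptr<char[], FreeDeleter>(
        static_cast<char*>(std::aligned_alloc(kAlignment, paddedSize)));
}

// Usage: a 64KB read buffer, aligned to 64 bytes.
// auto buffer = makeAlignedBuffer(64 * 1024);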
jmazanec15 commented 1 month ago

> And the newly baked segment file likely is not system cached.

Won't the page cache typically be write-through? In which case, if the graph is created and written on the same node it is searched on, won't it be cached?

0ctopus13prime commented 1 month ago

Baseline flamegraph

Screenshot 2024-10-08 at 7 54 06 PM

0ctopus13prime commented 1 month ago

1. NMSLIB Loading Perf Issue Analysis

2. Performance Degradation In FAISS

After switching from direct file API usage to an abstract IO loading layer, additional overhead was introduced due to JNI calls and buffer copying via std::memcpy. This change resulted in a 30% increase in loading time compared to the baseline in FAISS. The baseline took 3.584 seconds to load a 6GB vector index, while the modified version increased the load time to 4.664 seconds.

In NMSLIB, we expected a similar level of performance regression as seen in FAISS. However, we're observing an 81% increase in load time when loading a 6GB vector index (baseline = 4.144 sec, modified = 7.503 sec). Why is the performance impact in NMSLIB more than twice as severe as in FAISS?

3. Why is it more than twice as severe as in FAISS?

The key performance difference in index loading between FAISS and NMSLIB stems from their file formats. In NMSLIB, this difference results in JNI calls being made O(N) times, where N is the number of vectors, whereas in FAISS, the number of JNI calls is O(1).

FAISS stores chunks of the neighbor list in a single location and loads them all at once. See the code below:

static void read_HNSW(HNSW* hnsw, IOReader* f) {
    READVECTOR(hnsw->assign_probas);
    READVECTOR(hnsw->cum_nneighbor_per_level);
    READVECTOR(hnsw->levels);
    READVECTOR(hnsw->offsets);
    READVECTOR(hnsw->neighbors);

    READ1(hnsw->entry_point);
    READ1(hnsw->max_level);
    READ1(hnsw->efConstruction);
    READ1(hnsw->efSearch);
    READ1(hnsw->upper_beam);
}

In NMSLIB, each neighbor list is stored individually, requiring O(N) reads, where N is the total number of vectors. As shown in the code below, we need totalElementsStored_ read operations. Note that, with the introduced loading layer, input.read() ultimately goes through JNI to delegate the byte reads to Lucene's IndexInput. As a result, the number of input.read() calls directly corresponds to the number of JNI calls.

for (size_t i = 0; i < totalElementsStored_; i++) {
   ...
    } else {
        linkLists_[i] = (char *)malloc(linkListSize);
        CHECK(linkLists_[i]);
        input.read(linkLists_[i], linkListSize); <--------- THIS!
    }
    data_rearranged_[i] = new Object(data_level0_memory_ + (i)*memoryPerObject_ + offsetData_);
}
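
For illustration only, here is a simplified sketch of what a JNI-backed read path could look like; the class and method names are hypothetical, not the plugin's actual implementation. It shows why every input.read() in the loop above translates into a JNI method call plus a copy out of a Java byte[].

// Hypothetical sketch (assumption): each native read() crosses the JNI boundary
// once to ask the Java side (Lucene's IndexInput) for bytes, then copies them
// out of the shared Java byte[] into the native destination.
#include <jni.h>
#include <cstddef>
#include <cstring>

class JniBackedReader {
public:
    JniBackedReader(JNIEnv* env, jobject javaIndexInput, jmethodID readMethod, jbyteArray javaBuffer)
        : env_(env), javaIndexInput_(javaIndexInput), readMethod_(readMethod), javaBuffer_(javaBuffer) {}

    void read(char* dst, std::size_t len) {
        // JNI call: delegate the actual read to the Java IndexInput wrapper.
        jint bytesRead = env_->CallIntMethod(javaIndexInput_, readMethod_, javaBuffer_, (jint) len);

        // Pin the Java byte[] and copy the bytes into the native buffer.
        void* src = env_->GetPrimitiveArrayCritical(javaBuffer_, nullptr);
        std::memcpy(dst, src, (std::size_t) bytesRead);
        env_->ReleasePrimitiveArrayCritical(javaBuffer_, src, JNI_ABORT);  // read-only, discard changes
    }

private:
    JNIEnv* env_;
    jobject javaIndexInput_;
    jmethodID readMethod_;
    jbyteArray javaBuffer_;
};

// With NMSLIB calling read() once per neighbor list, the calls above run O(N)
// times, which is the JNI overhead discussed in this analysis.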

4. Solution 1. Patch in NMSLIB

We can patch NMSLIB to avoid making JNI calls for each vector element. The idea is to load data in bulk, then parse the neighbor lists from that buffer, rather than reading bytes individually. This approach would reduce the number of JNI calls to O(Index size / Buffer size).

For example, with a 6GB vector index containing 1 million vectors and a 64KB buffer size, the required JNI calls would be reduced to O(6GB / 64KB) = 98,304, which is a significant improvement over 1 million calls, achieving nearly a 90% reduction in operations.

Result: Surprisingly, it is 8% faster than the baseline. (Note: I reindexed on a new single node, which is why the loading time differs from the one mentioned earlier in the issue.)

  1. Baseline : 4.538 sec
  2. Modified version with 64KB buffer : 4.19 sec

4.1 Pros

  1. No performance degradation. If anything, it is even faster than the baseline.
  2. We can maintain a unified set of loading APIs for both NMSLIB and FAISS.

4.2 Cons

  1. A medium-sized patch is required in NMSLIB, which may increase the code maintenance burden.

4.3. Patch in hnsw.cc

template <typename dist_t>
void Hnsw<dist_t>::LoadOptimizedIndex(NmslibIOReader& input) {
    ...

    const size_t bufferSize = 64 * 1024;  // 64KB
    std::unique_ptr<char[]> buffer (new char[bufferSize]);
    uint32_t end = 0;
    uint32_t pos = 0;
    const bool isLTE = _isLittleEndian();

    for (size_t i = 0, remainingBytes = input.remaining(); i < totalElementsStored_; i++) {
        // Read linkList size integer.
        if ((pos + sizeof(SIZEMASS_TYPE)) >= end) {
            // Underflow, load bytes in bulk.
            const auto firstPartLen = end - pos;
            if (firstPartLen > 0) {
                std::memcpy(buffer.get(), buffer.get() + pos, firstPartLen);
            }
            const auto copyBytes = std::min(remainingBytes, bufferSize - firstPartLen);
            input.read(buffer.get() + firstPartLen, copyBytes);
            remainingBytes -= copyBytes;
            end = copyBytes + firstPartLen;
            pos = 0;
        }

        // Read data size. SIZEMASS_TYPE -> uint32_t
        SIZEMASS_TYPE linkListSize = 0;
        if (isLTE) {
            linkListSize = _readIntLittleEndian(buffer[pos], buffer[pos + 1], buffer[pos + 2], buffer[pos + 3]);
        } else {
            linkListSize = _readIntBigEndian(buffer[pos], buffer[pos + 1], buffer[pos + 2], buffer[pos + 3]);
        }
        pos += 4;

        if (linkListSize == 0) {
            linkLists_[i] = nullptr;
        } else {
            // Now we load neighbor list.
            linkLists_[i] = (char *) malloc(linkListSize);
            CHECK(linkLists_[i]);

            SIZEMASS_TYPE leftLinkListData = linkListSize;
            auto dataPtr = linkLists_[i];
            while (leftLinkListData > 0) {
                if (pos >= end) {
                    // Underflow, load bytes in bulk.
                    const auto copyBytes = std::min(remainingBytes, bufferSize);
                    input.read(buffer.get(), copyBytes);
                    remainingBytes -= copyBytes;
                    end = copyBytes;
                    pos = 0;
                }

                const auto copyBytes = std::min(leftLinkListData, end - pos);
                std::memcpy(dataPtr, buffer.get() + pos, copyBytes);
                dataPtr += copyBytes;
                leftLinkListData -= copyBytes;
                pos += copyBytes;
            }  // End while
        }  // End if

        data_rearranged_[i] = new Object(data_level0_memory_ + (i)*memoryPerObject_ + offsetData_);
    }  // End for

...            
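
The patch above relies on _isLittleEndian, _readIntLittleEndian, and _readIntBigEndian, which are not shown in the excerpt. Below is a minimal sketch of what such helpers could look like; the implementations are my assumption, not necessarily the actual patch.

// Sketch of the endianness helpers assumed by the patch above (illustrative only).
#include <cstdint>

inline bool _isLittleEndian() {
    const uint32_t probe = 1;
    // On a little-endian machine the least significant byte is stored first.
    return *reinterpret_cast<const unsigned char*>(&probe) == 1;
}

inline uint32_t _readIntLittleEndian(char b0, char b1, char b2, char b3) {
    // b0 is the least significant byte.
    return  static_cast<uint32_t>(static_cast<unsigned char>(b0))
          | (static_cast<uint32_t>(static_cast<unsigned char>(b1)) << 8)
          | (static_cast<uint32_t>(static_cast<unsigned char>(b2)) << 16)
          | (static_cast<uint32_t>(static_cast<unsigned char>(b3)) << 24);
}

inline uint32_t _readIntBigEndian(char b0, char b1, char b2, char b3) {
    // b0 is the most significant byte.
    return  (static_cast<uint32_t>(static_cast<unsigned char>(b0)) << 24)
          | (static_cast<uint32_t>(static_cast<unsigned char>(b1)) << 16)
          | (static_cast<uint32_t>(static_cast<unsigned char>(b2)) << 8)
          |  static_cast<uint32_t>(static_cast<unsigned char>(b3));
}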

5. Solution 2. Disable Streaming When FSDirectory

Since we're deprecating NMSLIB in version 3.x, we can disable the loading layer in NMSLIB until then. Alternatively, we can selectively allow streaming in NMSLIB depending on whether the given Directory is an FSDirectory implementation.

if (directory instanceof FSDirectory) {
  loadIndexByFilePath(...);
} else {
  loadIndexByStreaming(...);
}

5.1. Pros :

  1. Simple.

5.2. Cons :

  1. Until 3.x, we need to maintain duplicated, nearly identical versions of the APIs in both Java and JNI.

6. Solution 3. Live with it :)

Since we're deprecating NMSLIB in version 3.x, we could tolerate this issue in the short term. However, I personally don't favor this approach, as it impacts the p99 latency metrics; such loads are rare but could still affect overall cluster performance in the worst case.

7. Micro Tuning Results

  1. CallNonvirtualIntMethod → No impact.
  2. AVX2 intrinsic copy → No impact.
  3. Native ByteBuffer + one additional byte copy → Made it worse.
  4. Increasing buffer size → Going from 4KB to 64KB did at least reduce the warm-up time by 0.8 seconds in NMSLIB.
0ctopus13prime commented 1 month ago

@navneet1v @jmazanec15 Could you share your thoughts on the above analysis? Thanks

0ctopus13prime commented 1 month ago

> And the newly baked segment file likely is not system cached.

> Won't the page cache typically be write-through? In which case, if the graph is created and written on the same node it is searched on, won't it be cached?

Sorry, I just saw it.

Yes, write-back caching is the default in pretty much every general-purpose file system. But the default dirty ratio is 10% of memory, which means the write-back cache size is bounded by 10% of physical memory.

The reasons I assumed the write-back cached pages will 'likely' no longer exist are twofold:

  1. In the normal case, writing and reading happen simultaneously. Since Linux uses LRU page eviction, write-back cached pages are very likely to be evicted soon.
  2. On top of point 1, NRT in Lucene periodically exposes a newly baked segment. This typically takes a few seconds, so I assumed most write-back cached pages are evicted in the meantime.

Let me know your thoughts on it! You can call me an aggressive dreamer. 😛

navneet1v commented 1 month ago

@0ctopus13prime the approach of patching nmslib looks good to me. I think if it provides good latency, we should do that. The pros of having the patch are improved load time and also getting away from the FSDirectory dependency.

navneet1v commented 1 month ago

Merging the PR, as we have 2 approvals and the author is requesting the merge.