nmslib / hnswlib

Header-only C++/python library for fast approximate nearest neighbors
https://github.com/nmslib/hnswlib
Apache License 2.0
4.41k stars 655 forks source link

Trying to understand the code #499

Open siddhsql opened 1 year ago

siddhsql commented 1 year ago

Hi Yury,

I am trying to understand hnswlib code. w.r.t. this:

char **linkLists_{nullptr}; 

I understand this is storing adjacency list for l > 0.

Why is this a char** instead of uint**? I would expect an element id to be a 4 byte (unsigned) int. char will only be able to store 256 values. Sorry I am not a c++ developer so bear with me.

Over here:

linklistsizeint *get_linklist(tableint internal_id, int level) const {
        return (linklistsizeint *) (linkLists_[internal_id] + (level - 1) * size_links_per_element_);
    }

Could you explain me what is the purpose of (level - 1) * size_links_per_element_?

and over here:

linkLists_[cur_c] = (char *) malloc(size_links_per_element_ * curlevel + 1);

I see that:

size_links_per_element_ = maxM_ * sizeof(tableint) + sizeof(linklistsizeint);

so this stores the # of bytes for the adjacency list. why are you multiplying this by curlevel? and adding 1 to it?

Is there any README that explains the data layout of hnswlib? thanks.

yurymalkov commented 1 year ago

In C++ the pointers can be casted between different types. char * can be casted to int *. char ** because it is array of pointers to arrays of different sizes (elements have different number of levels, so different array sizes) char is because char is a byte, the minimal usable quant to control the memory.

size_links_per_element_ * curlevel is the size of link list of the upper levels of an element (size_links_per_element_ is size per level). +1 I do not remember, I think that was a homeophatic counter to memory prefetches getting slightly out of bound during search.

(level - 1) * size_links_per_element_ computes the portion of the memory where links reside. -1 is because the links for level 0 are stored in a different array, while this memory has only links from 1 to level.

siddhsql commented 1 year ago

thanks Yury.

On Tue, Aug 22, 2023 at 9:26 PM Yury Malkov @.***> wrote:

In C++ the pointers can be casted between different types. char can be casted to int . char ** because it is array of pointers to arrays of different sizes (elements have different number of levels, so different array sizes) char is because char is a byte, the minimal usable quant to control the memory.

size_links_perelement * curlevel is the size of link list of the upper levels of an element (size_links_perelement is size per level). +1 I do not remember, I think that was a homeophatic counter to memory prefetches getting slightly out of bound during search.

(level - 1) * size_links_perelement computes the portion of the memory where links reside. -1 is because the links for level 0 are stored in a different array, while this memory has only links from 1 to level.

— Reply to this email directly, view it on GitHub https://github.com/nmslib/hnswlib/issues/499#issuecomment-1689261805, or unsubscribe https://github.com/notifications/unsubscribe-auth/A6NWEK2SVRF7L4TIEHCCRCDXWWA77ANCNFSM6AAAAAA3WCHYX4 . You are receiving this because you authored the thread.Message ID: @.***>

siddhsql commented 1 year ago

one more question if I may. Looking at the code you have SIMD optimized code but when the code is compiled the flags necessary to take advantage of SIMD are never enabled - is that correct? because I don't see anywhere turning on the SIMD flags. is the SIMD code still experimental that's why you don't enable the compiler flags?

On Wed, Aug 23, 2023 at 8:43 AM Siddharth Jain @.***> wrote:

thanks Yury.

On Tue, Aug 22, 2023 at 9:26 PM Yury Malkov @.***> wrote:

In C++ the pointers can be casted between different types. char can be casted to int . char ** because it is array of pointers to arrays of different sizes (elements have different number of levels, so different array sizes) char is because char is a byte, the minimal usable quant to control the memory.

size_links_perelement * curlevel is the size of link list of the upper levels of an element (size_links_perelement is size per level). +1 I do not remember, I think that was a homeophatic counter to memory prefetches getting slightly out of bound during search.

(level - 1) * size_links_perelement computes the portion of the memory where links reside. -1 is because the links for level 0 are stored in a different array, while this memory has only links from 1 to level.

— Reply to this email directly, view it on GitHub https://github.com/nmslib/hnswlib/issues/499#issuecomment-1689261805, or unsubscribe https://github.com/notifications/unsubscribe-auth/A6NWEK2SVRF7L4TIEHCCRCDXWWA77ANCNFSM6AAAAAA3WCHYX4 . You are receiving this because you authored the thread.Message ID: @.***>

siddhsql commented 1 year ago

more importantly looking at this code:

// finding the "weakest" element to replace it with the new one
dist_t d_max = fstdistfunc_(getDataByInternalId(cur_c), getDataByInternalId(
selectedNeighbors[idx]),
dist_func_param_);
// Heuristic:
std::priority_queue<std::pair<dist_t, tableint>, std::vector<std::pair<
dist_t, tableint>>, CompareByFirst> candidates;
candidates.emplace(d_max, cur_c);

for (size_t j = 0; j < sz_link_list_other; j++) {
candidates.emplace(
fstdistfunc_(getDataByInternalId(data[j]), getDataByInternalId(
selectedNeighbors[idx]),
dist_func_param_), data[j]);
}

getNeighborsByHeuristic2(candidates, Mcurmax);

int indx = 0;
while (candidates.size() > 0) {
data[indx] = candidates.top().second;
candidates.pop();
indx++;
}

setListCount(ll_other, indx);

This is lines 14-16 in algorithm 1 in your famous paper. IIUC it will delete the connection from neighborhood(e) to the weakest element but since this is a bidirected graph don't you also need to delete the connection from the weakest element to e? i don't see that happening in above code.

On Wed, Aug 23, 2023 at 8:46 AM Siddharth Jain @.***> wrote:

one more question if I may. Looking at the code you have SIMD optimized code but when the code is compiled the flags necessary to take advantage of SIMD are never enabled - is that correct? because I don't see anywhere turning on the SIMD flags. is the SIMD code still experimental that's why you don't enable the compiler flags?

On Wed, Aug 23, 2023 at 8:43 AM Siddharth Jain @.***> wrote:

thanks Yury.

On Tue, Aug 22, 2023 at 9:26 PM Yury Malkov @.***> wrote:

In C++ the pointers can be casted between different types. char can be casted to int . char ** because it is array of pointers to arrays of different sizes (elements have different number of levels, so different array sizes) char is because char is a byte, the minimal usable quant to control the memory.

size_links_perelement * curlevel is the size of link list of the upper levels of an element (size_links_perelement is size per level). +1 I do not remember, I think that was a homeophatic counter to memory prefetches getting slightly out of bound during search.

(level - 1) * size_links_perelement computes the portion of the memory where links reside. -1 is because the links for level 0 are stored in a different array, while this memory has only links from 1 to level.

— Reply to this email directly, view it on GitHub https://github.com/nmslib/hnswlib/issues/499#issuecomment-1689261805, or unsubscribe https://github.com/notifications/unsubscribe-auth/A6NWEK2SVRF7L4TIEHCCRCDXWWA77ANCNFSM6AAAAAA3WCHYX4 . You are receiving this because you authored the thread.Message ID: @.***>

yurymalkov commented 1 year ago

SIMD are never enabled - is that correct?

The flags are set by the compiler at compile time https://stackoverflow.com/questions/28939652/how-to-detect-sse-sse2-avx-avx2-avx-512-avx-128-fma-kcvi-availability-at-compile

this is a bidirected graph don't you also need to delete the connection from the weakest element to e? i don't see that happening in above code.

The graph links are not always bidirectional in HNSW, they are created by bidirectionally connecting elements, but may not stay this way. The pruning might be removed, that would make it always bidirectional, but it will lead to high-degree tail.

siddhsql commented 1 year ago

thanks for that link about AVX flags. I got similar output. but when I look at the code, the flags it uses are USE_AVX and USE_AVX512. and it looks like these flags are not set by:

-Ofast -lrt -DNDEBUG -std=c++20 -DHAVE_CXX0X -march=native -fpic -w -fopenmp -ftree-vectorize -ftree-vectorizer-verbose=0"

On Wed, Aug 23, 2023 at 10:10 PM Yury Malkov @.***> wrote:

SIMD are never enabled - is that correct?

The flags are set by the compiler at compile time https://stackoverflow.com/questions/28939652/how-to-detect-sse-sse2-avx-avx2-avx-512-avx-128-fma-kcvi-availability-at-compile

this is a bidirected graph don't you also need to delete the connection from the weakest element to e? i don't see that happening in above code.

The graph links are not always bidirectional in HNSW, they are created by bidirectionally connecting elements, but may not stay this way. The pruning might be removed, that would make it always bidirectional, but it will lead to high-degree tail.

— Reply to this email directly, view it on GitHub https://github.com/nmslib/hnswlib/issues/499#issuecomment-1691011489, or unsubscribe https://github.com/notifications/unsubscribe-auth/A6NWEK6JWYIGBO2644RJUDLXW3O3LANCNFSM6AAAAAA3WCHYX4 . You are receiving this because you authored the thread.Message ID: @.***>