nmslib / hnswlib

Header-only C++/python library for fast approximate nearest neighbors
https://github.com/nmslib/hnswlib
Apache License 2.0

Questions about the code in detail #176

Open chasingegg opened 4 years ago

chasingegg commented 4 years ago

I have read the code in detail, and I have some questions about a few corner cases. I hope you can help me. Thanks!

yurymalkov commented 4 years ago

Hi @chasingegg ,

Thank you for looking through the code. And sorry for the slow response (I was moving with my family to another country). As far as I remember, the +1 is supposed to handle the problem with prefetching inside the loop that you've described. It is definitely not essential.

Prefetching outside of valid memory should not affect the outcome of the function. It can potentially hurt performance (https://stackoverflow.com/questions/40128151/what-happens-if-an-invalid-address-is-prefetched), but I haven't observed any effect in practice.

The methods searchBaseLayerST and searchBaseLayer should indeed be the same (I still plan to unite them into one templated method), so this is more of a bug.

Actually, most (but not all) of the prefetches have only a minor effect on performance, which is not easy to measure.

chasingegg commented 4 years ago

Hi, do you think intra-query parallelism is feasible? It seems like it would be a wonderful feature, but none of the graph-based algorithms appear to have implemented it.

yurymalkov commented 4 years ago

Hi @chasingegg

I think it is feasible, but it comes at some QPS loss compared to data-parallelism. The main problem with intra-query parallelism is that most graph algorithms use lists and queues associated with each query.

And if multiple threads access those shared structures, every read/write needs a lock, which costs performance compared to data-parallelism. However, I never measured the actual performance loss in practice.

Another way to do this is to have independent parallel searches for the same element and then unite the results. We did that in our work on NSW with a simpler search method (SISAP 2012) - it scaled ok with the number of cores, but the simpler search method was less efficient. It also was easy to implement because we used random enter points in the graph. For HNSW you can do the same by adding several versions of upper level graphs (with little overhead), or, probably, you can some stochasticity (e.g. like relaxed dropout in neural networks) to have different results for search trials for the same query.