vespa-engine / vespa

AI + Data, online. https://vespa.ai
Apache License 2.0

For HNSW search, load vectors on demand #14930

Closed: deepestthought42 closed this issue 3 years ago

deepestthought42 commented 3 years ago

Hi!

While evaluating Vespa for our product we ran into a limitation regarding the size of the HNSW index. Currently, with a rather small (for us) dataset and 512-dimensional vectors to search, the index has a size of 73 GB in memory. Would it be possible to replace the full vectors in the index with pointers into the backing file(s) and load the vectors from disk on demand (perhaps with some caching)? While this would certainly reduce performance, using fast disks and appropriate file chunking might still make it more than fast enough for us, with a much better performance-to-cost ratio.

Thanks!

jobergum commented 3 years ago

@deepestthought42 Thank you for the detailed request. We think we should be able to come up with a solution, no timeline yet though.

arnej27959 commented 3 years ago

I hope you are already using floats (to cut memory usage in half); if not, you should do that.

geirst commented 3 years ago

Example of how to specify a float tensor type (https://docs.vespa.ai/documentation/reference/tensor.html#tensor-type-spec):

field my_tensor type tensor<float>(x[512]) {
    indexing: attribute | summary
}
deepestthought42 commented 3 years ago

Thanks for taking this on! @arnej27959, yes, we are using floats exactly as specified above.

bratseth commented 3 years ago

We believe it will take 2-4 weeks to have something ready for trying out (from when we decide to start working on it). @geirst is planning to run some experiments to get an idea of how this would impact performance in practice.

peakji commented 3 years ago

+1 for on-disk ANN indices. AFAIK, there's no open-source solution with out-of-the-box support for this right now, and it could be a really unique feature for Vespa.

Microsoft Research released the DiskANN paper last year, which showed that on-disk ANN search is feasible. The paper achieved its performance by introducing a new graph-building algorithm that reduces disk seeks, but I guess we could still stick with HNSW plus on-demand loading; it might be sufficient for various real-world use cases that are not that time-sensitive, e.g. offline de-duplication and recommendation.

geirst commented 3 years ago

One possible solution is to store the document tensors (vector data) on disk, while the HNSW index is still kept in memory. The tensor field would still be represented as an attribute vector in the search backend, but with a new allocator used when allocating space for the tensor data. This new allocator would use memory-mapped files as backing for the allocations, effectively storing all tensors on disk. The benefit of such a solution is that very little of the existing code needs to change; both the feed and search pipelines would stay the same.
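A minimal Python sketch of the idea, purely illustrative and not Vespa's actual C++ allocator: the document vectors live in a file-backed mmap region, so the OS page cache decides which vectors stay resident in memory. The file name, dimensionality and document count below are made up for the example.

```python
import mmap
import numpy as np

DIM = 512                 # vector dimensionality (illustrative)
NUM_DOCS = 100_000        # number of document vectors (illustrative)
VEC_BYTES = DIM * 4       # float32 cells

# Back the tensor store with a file instead of anonymous heap memory.
# The OS page cache keeps hot vectors in RAM and evicts cold ones.
with open("tensor_store.bin", "wb") as f:
    f.truncate(NUM_DOCS * VEC_BYTES)

backing = open("tensor_store.bin", "r+b")
store = mmap.mmap(backing.fileno(), NUM_DOCS * VEC_BYTES)

def write_vector(docid: int, vec: np.ndarray) -> None:
    """Store a document vector at its fixed slot in the mmapped file."""
    off = docid * VEC_BYTES
    store[off:off + VEC_BYTES] = vec.astype(np.float32).tobytes()

def read_vector(docid: int) -> np.ndarray:
    """Read a document vector; a cold read triggers a page fault and a disk read."""
    off = docid * VEC_BYTES
    return np.frombuffer(store[off:off + VEC_BYTES], dtype=np.float32)

write_vector(42, np.random.rand(DIM))
print(read_vector(42)[:4])
```

The HNSW graph itself (node ids and links) would stay in ordinary memory; only the distance calculations would touch the file-backed vectors.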

The question is how good the search performance can be if the tensors are stored on disk. The main cost of the HNSW search algorithm is calculating distances between the query vector and document vectors. I have done some experiments with the SIFT (128 dim) and GIST (960 dim) data sets (http://corpus-texmex.irisa.fr/) to see how many distance calculations we perform on average per query. For SIFT, all 1M documents were fed. For GIST, 300k of the 1M documents were fed. The following table shows the average number of distance calculations per query; around 1000 nearestNeighbor queries were executed per case.

| Dataset | targetHits=10 | targetHits=100 |
|---------|---------------|----------------|
| SIFT    | 384           | 1805           |
| GIST    | 407           | 1982           |

If the tensors are stored on disk, we need to read the document tensor before doing the distance calculation. Assume the disk supports 400 MB of random reads per second with a block size of 4 kB, which means we can read 100k blocks per second. A SIFT tensor is 512 bytes and a GIST tensor is 3840 bytes, so both fit in a 4 kB block. For simplicity, assume 400 distance calculations per query when targetHits=10, and 2000 when targetHits=100.

This gives a theoretical maximum of 250 QPS when targetHits=10, and 50 QPS when targetHits=100, on a single node. For some applications this might be a trade-off worth taking. For comparison, we currently achieve 9500 QPS (end-to-end) when searching the SIFT data set with targetHits=100 on our performance test nodes (24 cores at 2.50 GHz).
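A few lines of Python reproduce the back-of-envelope numbers above; the 400 MB/s of random reads, the 4 kB block size and the rounded distance-calculation counts are the assumptions stated in the previous paragraphs.

```python
# Back-of-envelope QPS estimate when every distance calculation
# costs one random 4 kB disk read (assumptions from the text above).
random_read_kb_per_sec = 400 * 1000            # 400 MB/s of random reads
blocks_per_sec = random_read_kb_per_sec // 4   # 4 kB blocks -> 100k reads/sec

for target_hits, dist_calcs in [(10, 400), (100, 2000)]:
    qps = blocks_per_sec / dist_calcs
    print(f"targetHits={target_hits}: ~{qps:.0f} QPS per node")
# targetHits=10: ~250 QPS per node
# targetHits=100: ~50 QPS per node
```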

geirst commented 3 years ago

I have performed the SIFT (128 dim) experiment again with more detailed sampling of distance calculations. In this case the sampling is done per level in the HNSW graph. This is summarised in the following table:

| Level | Vector points | Avg dist calc (targetHits=10) | Avg dist calc (targetHits=100) |
|-------|---------------|-------------------------------|--------------------------------|
| 0     | 1000000       | 281                           | 1708                           |
| 1     | 62471         | 32                            | 32                             |
| 2     | 3994          | 32                            | 32                             |
| 3     | 257           | 28                            | 28                             |
| 4     | 16            | 13                            | 13                             |
| 5     | 1             | 1                             | 1                              |
| Total | -             | 387                           | 1814                           |

As seen, the majority of calculations are performed at level 0, where all vector points exist, and the number of calculations depends on targetHits. Access to the tensor data is random at all levels, but at the higher levels the tensor data will likely end up in the buffer cache (if it is stored on disk), as those vectors are frequently accessed.
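A quick calculation based on the vector counts in the table (128-dimensional float vectors, 512 bytes each) shows why the upper levels are likely to stay in the buffer cache, while level 0 is the part that really hits the disk:

```python
# Tensor data touched by levels 1 and up vs level 0
# (vector counts from the table above, 512 bytes per vector).
vector_bytes = 128 * 4
upper_levels = 62471 + 3994 + 257 + 16 + 1     # 66739 points
level0 = 1_000_000

print(f"levels >= 1: {upper_levels * vector_bytes / 1024**2:.1f} MiB")  # ~32.6 MiB
print(f"level 0:     {level0 * vector_bytes / 1024**2:.1f} MiB")        # ~488.3 MiB
```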

geirst commented 3 years ago

We have implemented a prototype of an allocator that uses an mmapped file as backing for allocations. This is used to allocate space for the tensor data in a tensor attribute field used for nearest neighbor search over an HNSW index. We have tested the query and feed performance of this using a Dense Passage Retriever dataset from https://github.com/vespa-engine/sample-apps/tree/master/dense-passage-retrieval-with-ann. The tensor field is specified as:

field text_embedding type tensor<float>(x[769]) {
    indexing: attribute | index
    attribute {
        distance-metric: euclidean
    }
    index {
        hnsw {
            max-links-per-node: 32
            neighbors-to-explore-at-insert: 500
        }
    }
}

The dataset consists of 21,015,300 documents and the tensor data size is 60.2 GiB. We used around 3000 raw nearestNeighbor queries searching text_embedding when benchmarking with vespa-fbench (https://docs.vespa.ai/en/performance/vespa-benchmarking.html).

We used a very powerful machine with 4 SSD disks in RAID 0 when testing this, achieving up to 250000 disk reads per second and 1 GiB of data per second. We used two Docker containers, one for the content node and one for the rest of the services.

We wanted to measure query and feed performance as a function of how much of the tensor data is in memory versus on disk. We started with a Docker container with 80 GiB of memory, fed the entire dataset with all tensor data in memory, and performed query benchmarking. Then we re-deployed the application using the mmap file allocator and restarted the content node to reload the tensor data with that allocation strategy. We then gradually reduced the available memory of the Docker container, restarting the content node and performing query benchmarking at each step.

The table below summarises the query performance when using 32 clients (which maxed out the QPS). The Shared mem column shows how much of the mmap file the OS actually kept in memory. The Mem vs disk factor indicates the fraction of the tensor data that is in memory versus on disk; at 1.00 everything is in memory.

| Allocation strategy | Container disk (GiB) | Container mem (GiB) | Shared mem (GiB) | Mem vs disk factor | Avg latency (ms) | QPS | Disk reads (per sec) | Disk reads (KiB/sec) | I/O stat (% busy) |
|---|---|---|---|---|---|---|---|---|---|
| memory | 195 | 80 | | 1.00 | 16.62 | 1920.83 | | | |
| mmap file | 270 | 80 | 60.8 | 1.00 | 18.42 | 1733.79 | 0 | 0 | 0 |
| mmap file | 270 | 72 | 60.8 | 1.00 | 82.82 | 386.19 | 59784 | 239136 | 34.31 |
| mmap file | 270 | 64 | 54.3 | 0.90 | 138.61 | 230.81 | 204746 | 821890 | 85.95 |
| mmap file | 270 | 56 | 46.4 | 0.77 | 363.47 | 88.03 | 185956 | 765672 | 76.325 |
| mmap file | 270 | 48 | 38.4 | 0.64 | 440.2 | 72.69 | 152455 | 625430 | 62.4725 |
| mmap file | 270 | 40 | 30.4 | 0.50 | 623.75 | 51.3 | 164253 | 668870 | 67.058 |
| mmap file | 270 | 32 | 22.4 | 0.37 | 898.89 | 35.6 | 171588 | 695656 | 69.754 |
| mmap file | 270 | 24 | 14.4 | 0.24 | 1112.13 | 28.77 | 206354 | 833175 | 77.86 |
| mmap file | 270 | 20 | 10.4 | 0.17 | 1279.73 | 25 | 210739 | 849488 | 78.888 |
| mmap file | 270 | 16 | 7.3 | 0.12 | 1359.97 | 23.53 | 249281 | 1003448 | 90.274 |
| mmap file | 270 | 12 | 2.5 | 0.04 | 1596.76 | 20.04 | 257131 | 1033940 | 92.262 |

As seen, the QPS deteriorates quickly as more of the tensor data ends up on disk. With half the tensor data in memory the QPS is 51, and with only 4% in memory (96% on disk) the QPS is just 20. Also note how busy the disk is at that point. Looking at QPS in isolation, these numbers are probably still fine for some use cases. When looking at feed performance, however, it collapses.
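For reference, the Mem vs disk factor column appears to be simply the resident (shared) mmap size divided by the total tensor data size of 60.2 GiB; a small Python check of that reading of the numbers:

```python
# "Mem vs disk factor" read as: share of the 60.2 GiB of tensor data
# that is resident in memory (shared mem / total tensor data size).
total_tensor_gib = 60.2
for shared_gib in [60.8, 54.3, 46.4, 38.4, 30.4, 22.4, 14.4, 10.4, 7.3, 2.5]:
    factor = min(shared_gib / total_tensor_gib, 1.0)
    print(f"shared mem {shared_gib:>4} GiB -> factor {factor:.2f}")
# Prints 1.00, 0.90, 0.77, 0.64, 0.50, 0.37, 0.24, 0.17, 0.12, 0.04,
# matching the table above.
```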

During feeding, we insert points into the HNSW index. The algorithm that locates the neighborhood of a new point in the graph is the same one used when searching the graph. We used targetHits=10 and exploreAdditionalHits=490 when benchmarking queries, which is comparable to the 500 neighbors explored when inserting into the HNSW index. This means that feed throughput can be estimated as the max QPS. The following table shows the estimated feed time for some sample dataset sizes: 21M, 100M and 600M documents. As seen, this will not work out in real life; 58 days to feed a dataset of 100M documents is way too much.

| Allocation strategy | Container disk (GiB) | Container mem (GiB) | Est feed time 21M docs (hours) | Est feed time 100M docs (days) | Est feed time 600M docs (days) | Disk and mem cost per day ($) | Total cost per day ($, VCPU=32) | Savings (%, VCPU=32) | Total cost per day ($, VCPU=8) | Savings (%, VCPU=8) |
|---|---|---|---|---|---|---|---|---|---|---|
| memory | 195 | 80 | 3.04 | 0.60 | 3.62 | 17.40 | 86.52 | | 34.68 | |
| mmap file | 270 | 80 | 3.37 | 0.67 | 4.01 | 17.90 | 87.02 | -0.58 | 35.18 | -1.45 |
| mmap file | 270 | 72 | 15.12 | 3.00 | 17.98 | 16.29 | 85.41 | 1.28 | 33.57 | 3.19 |
| mmap file | 270 | 64 | 25.29 | 5.01 | 30.09 | 14.69 | 83.81 | 3.14 | 31.97 | 7.83 |
| mmap file | 270 | 56 | 66.31 | 13.15 | 78.89 | 13.08 | 82.20 | 5.00 | 30.36 | 12.47 |
| mmap file | 270 | 48 | 80.31 | 15.92 | 95.54 | 11.47 | 80.59 | 6.86 | 28.75 | 17.11 |
| mmap file | 270 | 40 | 113.79 | 22.56 | 135.37 | 9.86 | 78.98 | 8.72 | 27.14 | 21.75 |
| mmap file | 270 | 32 | 163.98 | 32.51 | 195.07 | 8.25 | 77.37 | 10.58 | 25.53 | 26.39 |
| mmap file | 270 | 24 | 202.91 | 40.23 | 241.38 | 6.64 | 75.76 | 12.44 | 23.92 | 31.03 |
| mmap file | 270 | 20 | 233.50 | 46.30 | 277.78 | 5.83 | 74.95 | 13.37 | 23.11 | 33.35 |
| mmap file | 270 | 16 | 248.09 | 49.19 | 295.13 | 5.03 | 74.15 | 14.30 | 22.31 | 35.67 |
| mmap file | 270 | 12 | 291.30 | 57.75 | 346.53 | 4.22 | 73.34 | 15.23 | 21.50 | 37.99 |
| memory (bfloat16) | 135 | 40 | | | | 8.95 | 78.07 | 9.77 | 26.23 | 24.36 |

The table above also shows the total cost per day (in $) for a machine on Vespa Cloud (https://cloud.vespa.ai/pricing), and what we could expect to save if the tensor data is stored on disk instead of in memory. Notice the last row, where we estimate the memory usage and savings if we could use a 16-bit float (bfloat16) instead of a regular 32-bit float to store the tensor data in memory.
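As a sanity check of the feed-time columns, a short Python sketch, assuming (as argued above) that insert throughput into the HNSW index is roughly equal to the measured max QPS:

```python
# Estimated feed time when HNSW insert throughput ~= measured max QPS.
doc_counts = {"21M": 21_015_300, "100M": 100_000_000, "600M": 600_000_000}

def feed_times(qps: float) -> str:
    return ", ".join(f"{name}: {docs / qps / 3600:.1f} h" for name, docs in doc_counts.items())

print(feed_times(1733.79))  # all tensor data in memory (mmap file, 80 GiB container)
print(feed_times(20.04))    # ~96% of tensor data on disk (mmap file, 12 GiB container)
# At ~20 QPS, 100M documents take ~1386 hours, i.e. the ~58 days quoted above.
```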

To summarise:

Alternatives:

geirst commented 3 years ago

As the conclusion above shows, using smaller tensor types (e.g. bfloat16) will be the preferred solution instead of storing tensor data on disk. Support for smaller tensor types is tracked in https://github.com/vespa-engine/vespa/issues/17118.
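For illustration only (this is not Vespa code), a numpy sketch of why bfloat16 halves the memory footprint of the tensor data: bfloat16 is just the upper 16 bits of a float32, so each cell shrinks from 4 to 2 bytes at the cost of some precision in the distance calculations.

```python
import numpy as np

# bfloat16 keeps the sign, the 8 exponent bits and the top 7 mantissa bits
# of a float32, i.e. the upper half of each 32-bit value.
vecs = np.random.rand(10_000, 512).astype(np.float32)
bf16 = (vecs.view(np.uint32) >> 16).astype(np.uint16)       # truncate to bfloat16
restored = (bf16.astype(np.uint32) << 16).view(np.float32)  # widen back for distance calcs

print(f"float32: {vecs.nbytes / 1024**2:.1f} MiB, bfloat16: {bf16.nbytes / 1024**2:.1f} MiB")
print(f"max truncation error: {np.max(np.abs(vecs - restored)):.5f}")
```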