vespa-engine / vespa

AI + Data, online. https://vespa.ai
Apache License 2.0

For HNSW search, load vectors on demand #14930

Closed: deepestthought42 closed this issue 3 years ago

deepestthought42 commented 3 years ago

Hi!

While evaluating Vespa for our product we ran into a limitation regarding the size of the HNSW index. Currently, with a rather small (for us) dataset and 512-dimensional vectors to search, the index has a size of 73 GB in memory. Would it be possible to replace the full vectors in the index with pointers into the backing file(s) and load the vectors from disk on demand (perhaps with some caching)? While this would certainly reduce performance, using fast disks and appropriate file chunking might still make it more than fast enough for us, with a much better performance-to-cost ratio.

Thanks!

jobergum commented 3 years ago

@deepestthought42 Thank you for the detailed request. We think we should be able to come up with a solution, no timeline yet though.

arnej27959 commented 3 years ago

I hope you are already using floats (to cut memory usage in half); if not, you should do that.

geirst commented 3 years ago

Example of how to specify a float tensor type (https://docs.vespa.ai/documentation/reference/tensor.html#tensor-type-spec):

field my_tensor type tensor<float>(x[512]) {
    indexing: attribute | summary
}
deepestthought42 commented 3 years ago

Thanks for taking this on! @arnej27959, yes, we are using floats exactly as specified above.

bratseth commented 3 years ago

We believe it will take 2-4 weeks to have something ready for trying out (from when we decide to start working on it). @geirst is planning to run some experiments to get an idea of how this would impact performance in practice.

peakji commented 3 years ago

+1 for on-disk ANN indices. AFAIK, there's no open-source solution with out-of-the-box support for this right now, and it could be a really unique feature for Vespa.

Microsoft Research released the DiskANN paper last year, which showed that on-disk ANN search is feasible. The paper achieved its performance by introducing a new graph-building algorithm that reduces disk seeks, but I guess we could still stick with HNSW plus on-demand loading; it might be sufficient for various real-world use cases that are not that time-sensitive, e.g. offline de-duplication and recommendation.

geirst commented 3 years ago

One possible solution is to store the document tensors (vector data) on disk, while the HNSW index is still kept in memory. The tensor field would still be represented as an attribute vector in the search backend, but with a new allocator used when allocating space for the tensor data. This new allocator would use memory-mapped files as backing for the allocations, effectively storing all tensors on disk. The benefit of such a solution is that very little of the existing code needs to change; both the feed and search pipelines would stay the same.
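A minimal Python sketch of the idea, purely illustrative and not Vespa's actual C++ allocator: the document vectors live in a file-backed mmap region, so the OS page cache decides which vectors stay resident in memory. The file name, dimensionality and document count below are made up for the example.

```python
import mmap
import numpy as np

DIM = 512                 # vector dimensionality (illustrative)
NUM_DOCS = 100_000        # number of document vectors (illustrative)
VEC_BYTES = DIM * 4       # float32 cells

# Back the tensor store with a file instead of anonymous heap memory.
# The OS page cache keeps hot vectors in RAM and evicts cold ones.
with open("tensor_store.bin", "wb") as f:
    f.truncate(NUM_DOCS * VEC_BYTES)

backing = open("tensor_store.bin", "r+b")
store = mmap.mmap(backing.fileno(), NUM_DOCS * VEC_BYTES)

def write_vector(docid: int, vec: np.ndarray) -> None:
    """Store a document vector at its fixed slot in the mmapped file."""
    off = docid * VEC_BYTES
    store[off:off + VEC_BYTES] = vec.astype(np.float32).tobytes()

def read_vector(docid: int) -> np.ndarray:
    """Read a document vector; a cold read triggers a page fault and a disk read."""
    off = docid * VEC_BYTES
    return np.frombuffer(store[off:off + VEC_BYTES], dtype=np.float32)

write_vector(42, np.random.rand(DIM))
print(read_vector(42)[:4])
```

The HNSW graph itself (node ids and links) would stay in ordinary memory; only the distance calculations would touch the file-backed vectors.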

The question is how good the search performance can be if the tensors are stored on disk. The main cost of the HNSW search algorithm is calculating distances between the query vector and document vectors. I have done some experiments with the SIFT (128 dim) and GIST (960 dim) data sets (http://corpus-texmex.irisa.fr/) to see how many distance calculations we perform on average per query. For SIFT, all 1M documents were fed. For GIST, 300k of the 1M documents were fed. The following table shows the average number of distance calculations per query; around 1000 nearestNeighbor queries were executed per case.

| Dataset | targetHits=10 | targetHits=100 |
|---------|---------------|----------------|
| SIFT    | 384           | 1805           |
| GIST    | 407           | 1982           |

If the tensors are stored on disk, we need to read the document tensor before doing the distance calculation. Assume the disk supports 400 MB of random reads per second with a block size of 4 kB, which means we can read 100k blocks per second. A SIFT tensor is 512 bytes and a GIST tensor is 3840 bytes, so both fit in a 4 kB block. For simplicity, assume 400 distance calculations per query when targetHits=10, and 2000 when targetHits=100.

This gives a theoretical maximum of 250 QPS when targetHits=10, and 50 QPS when targetHits=100, on a single node. For some applications this might be a trade-off worth taking. For comparison, we currently achieve 9500 QPS (end-to-end) when searching the SIFT data set with targetHits=100 on our performance test nodes (24 cores at 2.50 GHz).
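A few lines of Python reproduce the back-of-envelope numbers above; the 400 MB/s of random reads, the 4 kB block size and the rounded distance-calculation counts are the assumptions stated in the previous paragraphs.

```python
# Back-of-envelope QPS estimate when every distance calculation
# costs one random 4 kB disk read (assumptions from the text above).
random_read_kb_per_sec = 400 * 1000            # 400 MB/s of random reads
blocks_per_sec = random_read_kb_per_sec // 4   # 4 kB blocks -> 100k reads/sec

for target_hits, dist_calcs in [(10, 400), (100, 2000)]:
    qps = blocks_per_sec / dist_calcs
    print(f"targetHits={target_hits}: ~{qps:.0f} QPS per node")
# targetHits=10: ~250 QPS per node
# targetHits=100: ~50 QPS per node
```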

geirst commented 3 years ago

I have performed the SIFT (128 dim) experiment again with more detailed sampling of distance calculations. In this case the sampling is done per level in the HNSW graph. This is summarised in the following table:

| Level | Vector points | Avg dist calc (targetHits=10) | Avg dist calc (targetHits=100) |
|-------|---------------|-------------------------------|--------------------------------|
| 0     | 1000000       | 281                           | 1708                           |
| 1     | 62471         | 32                            | 32                             |
| 2     | 3994          | 32                            | 32                             |
| 3     | 257           | 28                            | 28                             |
| 4     | 16            | 13                            | 13                             |
| 5     | 1             | 1                             | 1                              |
| Total | -             | 387                           | 1814                           |

As seen, the majority of calculations are performed at level 0, where all vector points exist, and the number of calculations depends on targetHits. Access to the tensor data is random at all levels, but at the higher levels the tensor data will likely end up in the buffer cache (if it is stored on disk), as those vectors are frequently accessed.
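A quick calculation based on the vector counts in the table (128-dimensional float vectors, 512 bytes each) shows why the upper levels are likely to stay in the buffer cache, while level 0 is the part that really hits the disk:

```python
# Tensor data touched by levels 1 and up vs level 0
# (vector counts from the table above, 512 bytes per vector).
vector_bytes = 128 * 4
upper_levels = 62471 + 3994 + 257 + 16 + 1     # 66739 points
level0 = 1_000_000

print(f"levels >= 1: {upper_levels * vector_bytes / 1024**2:.1f} MiB")  # ~32.6 MiB
print(f"level 0:     {level0 * vector_bytes / 1024**2:.1f} MiB")        # ~488.3 MiB
```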

geirst commented 3 years ago

We have implemented a prototype of an allocator that uses an mmapped file as backing for allocations. This is used to allocate space for the tensor data in a tensor attribute field used for nearest neighbor search over an HNSW index. We have tested the query and feed performance of this using a Dense Passage Retriever dataset from https://github.com/vespa-engine/sample-apps/tree/master/dense-passage-retrieval-with-ann. The tensor field is specified as:

field text_embedding type tensor<float>(x[769]) {
    indexing: attribute | index
    attribute {
        distance-metric: euclidean
    }
    index {
        hnsw {
            max-links-per-node: 32
            neighbors-to-explore-at-insert: 500
        }
    }
}

The dataset consists of 21,015,300 documents and the tensor data size is 60.2 GiB. We used around 3000 raw nearestNeighbor queries searching text_embedding when benchmarking with vespa-fbench (https://docs.vespa.ai/en/performance/vespa-benchmarking.html).

We used a very powerful machine with 4 SSD disks in RAID 0 when testing this, achieving up to 250000 disk reads per second and 1 GiB of data per second. We used two Docker containers, one for the content node and one for the rest of the services.

We wanted to measure query and feed performance as a function of how much of the tensor data is in memory versus on disk. We started with a Docker container with 80 GiB of memory, fed the entire dataset with all tensor data in memory, and performed query benchmarking. Then we re-deployed the application using the mmap file allocator and restarted the content node to reload the tensor data with that allocation strategy. We then gradually reduced the available memory of the Docker container, restarting the content node and performing query benchmarking at each step.

The table below summarises the query performance when using 32 clients (which maxed out the QPS). The Shared mem column shows how much of the mmap file the OS actually kept in memory. The Mem vs disk factor indicates the fraction of the tensor data that is in memory versus on disk; at 1.00 everything is in memory.

| Allocation strategy | Container disk (GiB) | Container mem (GiB) | Shared mem (GiB) | Mem vs disk factor | Avg latency (ms) | QPS | Disk reads (per sec) | Disk reads (KiB/sec) | I/O stat (% busy) |
|---|---|---|---|---|---|---|---|---|---|
| memory | 195 | 80 | | 1.00 | 16.62 | 1920.83 | | | |
| mmap file | 270 | 80 | 60.8 | 1.00 | 18.42 | 1733.79 | 0 | 0 | 0 |
| mmap file | 270 | 72 | 60.8 | 1.00 | 82.82 | 386.19 | 59784 | 239136 | 34.31 |
| mmap file | 270 | 64 | 54.3 | 0.90 | 138.61 | 230.81 | 204746 | 821890 | 85.95 |
| mmap file | 270 | 56 | 46.4 | 0.77 | 363.47 | 88.03 | 185956 | 765672 | 76.325 |
| mmap file | 270 | 48 | 38.4 | 0.64 | 440.2 | 72.69 | 152455 | 625430 | 62.4725 |
| mmap file | 270 | 40 | 30.4 | 0.50 | 623.75 | 51.3 | 164253 | 668870 | 67.058 |
| mmap file | 270 | 32 | 22.4 | 0.37 | 898.89 | 35.6 | 171588 | 695656 | 69.754 |
| mmap file | 270 | 24 | 14.4 | 0.24 | 1112.13 | 28.77 | 206354 | 833175 | 77.86 |
| mmap file | 270 | 20 | 10.4 | 0.17 | 1279.73 | 25 | 210739 | 849488 | 78.888 |
| mmap file | 270 | 16 | 7.3 | 0.12 | 1359.97 | 23.53 | 249281 | 1003448 | 90.274 |
| mmap file | 270 | 12 | 2.5 | 0.04 | 1596.76 | 20.04 | 257131 | 1033940 | 92.262 |

As seen, the QPS deteriorates quickly as more of the tensor data ends up on disk. With half the tensor data in memory the QPS is 51, and with only 4% in memory (96% on disk) the QPS is just 20. Also note how busy the disk is at that point. Looking at QPS in isolation, these numbers are probably still fine for some use cases. When looking at feed performance, however, it collapses.
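For reference, the Mem vs disk factor column appears to be simply the resident (shared) mmap size divided by the total tensor data size of 60.2 GiB; a small Python check of that reading of the numbers:

```python
# "Mem vs disk factor" read as: share of the 60.2 GiB of tensor data
# that is resident in memory (shared mem / total tensor data size).
total_tensor_gib = 60.2
for shared_gib in [60.8, 54.3, 46.4, 38.4, 30.4, 22.4, 14.4, 10.4, 7.3, 2.5]:
    factor = min(shared_gib / total_tensor_gib, 1.0)
    print(f"shared mem {shared_gib:>4} GiB -> factor {factor:.2f}")
# Prints 1.00, 0.90, 0.77, 0.64, 0.50, 0.37, 0.24, 0.17, 0.12, 0.04,
# matching the table above.
```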

During feeding, we insert points into the HNSW index. The algorithm that locates the neighborhood of a new point in the graph is the same one used when searching the graph. We used targetHits=10 and exploreAdditionalHits=490 when benchmarking queries, which is comparable to the 500 neighbors explored when inserting into the HNSW index. This means that feed throughput can be estimated as the max QPS. The following table shows the estimated feed time for some sample dataset sizes: 21M, 100M and 600M documents. As seen, this will not work out in real life; 58 days to feed a dataset of 100M documents is way too much.

| Allocation strategy | Container disk (GiB) | Container mem (GiB) | Est feed time 21M docs (hours) | Est feed time 100M docs (days) | Est feed time 600M docs (days) | Disk and mem cost per day ($) | Total cost per day ($, VCPU=32) | Savings (%, VCPU=32) | Total cost per day ($, VCPU=8) | Savings (%, VCPU=8) |
|---|---|---|---|---|---|---|---|---|---|---|
| memory | 195 | 80 | 3.04 | 0.60 | 3.62 | 17.40 | 86.52 | | 34.68 | |
| mmap file | 270 | 80 | 3.37 | 0.67 | 4.01 | 17.90 | 87.02 | -0.58 | 35.18 | -1.45 |
| mmap file | 270 | 72 | 15.12 | 3.00 | 17.98 | 16.29 | 85.41 | 1.28 | 33.57 | 3.19 |
| mmap file | 270 | 64 | 25.29 | 5.01 | 30.09 | 14.69 | 83.81 | 3.14 | 31.97 | 7.83 |
| mmap file | 270 | 56 | 66.31 | 13.15 | 78.89 | 13.08 | 82.20 | 5.00 | 30.36 | 12.47 |
| mmap file | 270 | 48 | 80.31 | 15.92 | 95.54 | 11.47 | 80.59 | 6.86 | 28.75 | 17.11 |
| mmap file | 270 | 40 | 113.79 | 22.56 | 135.37 | 9.86 | 78.98 | 8.72 | 27.14 | 21.75 |
| mmap file | 270 | 32 | 163.98 | 32.51 | 195.07 | 8.25 | 77.37 | 10.58 | 25.53 | 26.39 |
| mmap file | 270 | 24 | 202.91 | 40.23 | 241.38 | 6.64 | 75.76 | 12.44 | 23.92 | 31.03 |
| mmap file | 270 | 20 | 233.50 | 46.30 | 277.78 | 5.83 | 74.95 | 13.37 | 23.11 | 33.35 |
| mmap file | 270 | 16 | 248.09 | 49.19 | 295.13 | 5.03 | 74.15 | 14.30 | 22.31 | 35.67 |
| mmap file | 270 | 12 | 291.30 | 57.75 | 346.53 | 4.22 | 73.34 | 15.23 | 21.50 | 37.99 |
| memory (bfloat16) | 135 | 40 | | | | 8.95 | 78.07 | 9.77 | 26.23 | 24.36 |

The table above also shows the total cost per day (in $) for a machine on Vespa Cloud (https://cloud.vespa.ai/pricing), and what we could expect to save if the tensor data is stored on disk instead of in memory. Notice the last row, where we estimate the memory usage and savings if we could use a 16-bit float (bfloat16) instead of a regular 32-bit float to store the tensor data in memory.
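As a sanity check of the feed-time columns, a short Python sketch, assuming (as argued above) that insert throughput into the HNSW index is roughly equal to the measured max QPS:

```python
# Estimated feed time when HNSW insert throughput ~= measured max QPS.
doc_counts = {"21M": 21_015_300, "100M": 100_000_000, "600M": 600_000_000}

def feed_times(qps: float) -> str:
    return ", ".join(f"{name}: {docs / qps / 3600:.1f} h" for name, docs in doc_counts.items())

print(feed_times(1733.79))  # all tensor data in memory (mmap file, 80 GiB container)
print(feed_times(20.04))    # ~96% of tensor data on disk (mmap file, 12 GiB container)
# At ~20 QPS, 100M documents take ~1386 hours, i.e. the ~58 days quoted above.
```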

To summarise:

Alternatives:

geirst commented 3 years ago

As the conclusion above shows, using smaller tensor types (e.g. bfloat16) will be the preferred solution instead of storing tensor data on disk. Support for smaller tensor types is tracked in https://github.com/vespa-engine/vespa/issues/17118.
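For illustration only (this is not Vespa code), a numpy sketch of why bfloat16 halves the memory footprint of the tensor data: bfloat16 is just the upper 16 bits of a float32, so each cell shrinks from 4 to 2 bytes at the cost of some precision in the distance calculations.

```python
import numpy as np

# bfloat16 keeps the sign, the 8 exponent bits and the top 7 mantissa bits
# of a float32, i.e. the upper half of each 32-bit value.
vecs = np.random.rand(10_000, 512).astype(np.float32)
bf16 = (vecs.view(np.uint32) >> 16).astype(np.uint16)       # truncate to bfloat16
restored = (bf16.astype(np.uint32) << 16).view(np.float32)  # widen back for distance calcs

print(f"float32: {vecs.nbytes / 1024**2:.1f} MiB, bfloat16: {bf16.nbytes / 1024**2:.1f} MiB")
print(f"max truncation error: {np.max(np.abs(vecs - restored)):.5f}")
```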