opensearch-project / k-NN


[RFC] Reduce Disk Usage By Reusing NativeEngine Files #2266

Open · luyuncheng opened this issue 1 week ago

luyuncheng commented 1 week ago

Description

Earlier, in issue #1572 and PR #1571, we found that we can reuse the docValues field (KNNVectorFieldData) and synthesize the `_source` field from it, which saves about 1/3 of disk usage. @jmazanec15 also mentioned a great method to implement stored fields, as https://github.com/opensearch-project/k-NN/pull/1571#issuecomment-2438764949 says.

In this RFC, I propose a new method to reduce disk usage: we can read the native engine's files directly to create DocValues, so we save the disk space spent on the flatFieldVectorsWriter output or BinaryDocValues.

I read the Faiss code (faiss/impl/index_write.cpp); it shows that the Faiss `HNSW32,Flat` file structure looks like the following:


```
|-typeIDMap   -|-id_header-|
|-typeHnsw    -|-hnsw_header-|-hnswGraph-|
|-typeStorage -|-storage_Header-|-storageVector-|
|-idmap_vector-|-FOOTER_MAGIC+CHECKSUM-|
```
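For illustration, here is a minimal Java sketch (not the plugin's actual code) of reading the leading header fields, assuming the field order written by write_index_header in faiss/impl/index_write.cpp: a 4-byte fourcc, then d (int32), ntotal (int64), two dummy int64s, is_trained (1 byte), and metric_type (int32), all little-endian. The class and method names are hypothetical.

```java
// Hypothetical sketch: read the leading Faiss header fields.
// Field order follows write_index_header in faiss/impl/index_write.cpp:
// fourcc (4 bytes), d (int32), ntotal (int64), 2 dummy int64s,
// is_trained (1 byte), metric_type (int32) -- all little-endian.
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;

final class FaissHeader {
    final String fourcc; // e.g. "IxMp" for an IndexIDMap wrapper
    final int d;         // vector dimension
    final long ntotal;   // number of stored vectors

    private FaissHeader(String fourcc, int d, long ntotal) {
        this.fourcc = fourcc;
        this.d = d;
        this.ntotal = ntotal;
    }

    static FaissHeader read(FileChannel channel) throws IOException {
        // 4 (fourcc) + 4 (d) + 8 (ntotal) is all this sketch needs.
        ByteBuffer buf = ByteBuffer.allocate(16).order(ByteOrder.LITTLE_ENDIAN);
        while (buf.hasRemaining() && channel.read(buf, buf.position()) >= 0) {
            // keep reading until the first 16 header bytes are buffered
        }
        buf.flip();
        byte[] magic = new byte[4];
        buf.get(magic);
        return new FaissHeader(
                new String(magic, StandardCharsets.US_ASCII),
                buf.getInt(),
                buf.getLong());
    }
}
```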

I implemented a FaissEngineFlatVectorValues that reads the `_0_2011_target_field.faissc` file directly and wraps it in a DocIdSetIterator, instead of using FlatVectorsReader. The POC code shows that we can cut almost 50% of disk usage by skipping the flat-vector write, and skipping that write also slightly improves write performance.
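The POC class itself is not shown in this issue, but conceptually the wrapper can look like the sketch below. Everything here is illustrative: the class name only mirrors FaissEngineFlatVectorValues, `vectorDataOffset` is assumed to be computed while parsing the header, and for simplicity ordinals are treated as doc IDs (the real reader would map them through the idmap_vector).

```java
// Hypothetical sketch of the POC idea: iterate vectors straight out of the
// Faiss flat storage section instead of Lucene's .vec file.
import java.io.IOException;
import org.apache.lucene.search.DocIdSetIterator;
import org.apache.lucene.store.IndexInput;

final class FaissFlatVectorIterator extends DocIdSetIterator {
    private final IndexInput data;        // the .faiss file (or a slice of it)
    private final long vectorDataOffset;  // start of the storageVector section (assumed known)
    private final int dimension;
    private final int size;               // ntotal from the header
    private final float[] value;
    private int doc = -1;

    FaissFlatVectorIterator(IndexInput data, long vectorDataOffset, int dimension, int size) {
        this.data = data;
        this.vectorDataOffset = vectorDataOffset;
        this.dimension = dimension;
        this.size = size;
        this.value = new float[dimension];
    }

    float[] vectorValue() throws IOException {
        // Flat storage is a contiguous row-major float array: seek to the row
        // for the current ordinal and read `dimension` little-endian floats.
        data.seek(vectorDataOffset + (long) doc * dimension * Float.BYTES);
        data.readFloats(value, 0, dimension);
        return value;
    }

    @Override public int docID() { return doc; }
    @Override public int nextDoc() { return advance(doc + 1); }
    @Override public int advance(int target) { return doc = target >= size ? NO_MORE_DOCS : target; }
    @Override public long cost() { return size; }
}
```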

Next steps:

navneet1v commented 1 week ago

> The POC code shows that we can cut almost 50% of disk usage by skipping the flat-vector write, and skipping that write also slightly improves write performance.

This is an interesting gain. When you say a 50% gain in disk space, I am wondering whether it happens only in the case when `_source` is not enabled for vectors. Cutting down the flat vectors and just reading them via the Faiss index has been discussed a couple of times. My only concern with this is: will reading flat vectors from the Faiss file be as efficient as reading the flat vectors from the `.vec` file?

Also, did we explore the option where we don't store/serialize flat vectors in Faiss and use the `.vec` file instead? It could also help this feature: https://github.com/opensearch-project/k-NN/issues/1693

@0ctopus13prime , @jmazanec15

jmazanec15 commented 1 week ago

This would be a good savings! Like @navneet1v, I'm wondering if it'll be easier to leverage `.vec` in Faiss as opposed to simulating `.vec` with Faiss.

Also, for this plan, how will quantized vectors be handled, where we don't store the full-precision vectors in Faiss files?

luyuncheng commented 1 week ago

> Also, did we explore the option where we don't store/serialize flat vectors in Faiss and use the `.vec` file instead? It could also help this feature: https://github.com/opensearch-project/k-NN/issues/1693

After #1693, we talked about the goal of merging vectors into one storage. As @jmazanec15 and @navneet1v discussed, we took the following two options into consideration:

- Option 1: do not store/serialize the flat vectors in Faiss, and let the native engine use Lucene's `.vec` file instead.
- Option 2: skip writing Lucene's flat vectors, and read the vectors directly from the native engine's (`.faiss`) file.

I think with all these options the native engine's ANN search latency would be the same, because the vectors are all in memory; the only impacts on query latency are exact search and merge.

And I chose option 2 because:

  1. I think we will introduce new engines in the future, and we cannot hack into every native engine's storage format and IO chain.
  2. With option 2, we only have to know the file format and read it directly; it is decoupled from the native engines' code.
luyuncheng commented 1 week ago

> Also, for this plan, how will quantized vectors be handled, where we don't store the full-precision vectors in Faiss files?

@jmazanec15 As a first step, I skipped using the Faiss file as docValues when the index is quantized, because we cannot get the full-precision vectors for exact search. But I think we can still use it for merge, saving the Faiss computation in sa_encode and sa_decode.
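To make the merge idea concrete, here is a rough sketch under stated assumptions: FaissCodec is an imaginary wrapper introduced only for illustration (Faiss's real C++ APIs are Index::sa_encode / Index::sa_decode; no such Java binding is implied), and the point is just that copying the already-encoded codes skips the decode/encode round trip.

```java
// Hypothetical sketch of the merge-time saving for quantized indexes.
// FaissCodec is an imaginary wrapper, not an existing k-NN plugin class.
interface FaissCodec {
    byte[] getCodes(long ordinal);            // raw quantized codes for one vector
    float[] saDecode(byte[] codes);           // full-precision reconstruction (cf. Index::sa_decode)
    void addWithCodes(long id, byte[] codes); // append pre-encoded codes to the merged index
}

final class QuantizedMerge {
    // Instead of saDecode(...) followed by re-encoding into the destination,
    // copy the codes as-is and skip sa_decode/sa_encode entirely.
    static void mergeSegment(FaissCodec source, FaissCodec dest, long[] idMap, int count) {
        for (int ord = 0; ord < count; ord++) {
            dest.addWithCodes(idMap[ord], source.getCodes(ord));
        }
    }
}
```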

luyuncheng commented 1 week ago

> This is an interesting gain. When you say a 50% gain in disk space, I am wondering whether it happens only in the case when `_source` is not enabled for vectors. Cutting down the flat vectors and just reading them via the Faiss index has been discussed a couple of times. My only concern with this is: will reading flat vectors from the Faiss file be as efficient as reading the flat vectors from the `.vec` file?

@navneet1v Good question, I will do some benchmarks for the different types.

luyuncheng commented 22 hours ago

> This is an interesting gain. When you say a 50% gain in disk space, I am wondering whether it happens only in the case when `_source` is not enabled for vectors. Cutting down the flat vectors and just reading them via the Faiss index has been discussed a couple of times. My only concern with this is: will reading flat vectors from the Faiss file be as efficient as reading the flat vectors from the `.vec` file?

> @navneet1v Good question, I will do some benchmarks for the different types.

I did a mini benchmark of file size and of iterating all docs, shown in the following table:

| Test Case | Engine | File Size | Latency of Iterating All Docs | Percentage |
|---|---|---|---|---|
| 100000 docs, 128 dims | Lucene (`.vec` + `.vemf`) | 50000 KB | 53 ms | |
| 100000 docs, 128 dims | Faiss (`.faiss`) | 50000 KB (excluding graph) | 33 ms | -37% |
| 100000 docs, 768 dims | Lucene (`.vec` + `.vemf`) | 300000 KB | 165 ms | |
| 100000 docs, 768 dims | Faiss (`.faiss`) | 300000 KB | 82 ms | -50% |

@navneet1v @jmazanec15 Because dense vectors in Lucene99HnswVectorsFormat are stored as flat vectors, the file size equals Faiss's flat storage. With either option (use Lucene's `.vec` in Faiss, OR use Faiss's `.faiss` in Lucene) we can save ~50% of the disk usage for the vector files.

I also traced the iterator latency. I think the Faiss file format is simple enough that we can iterate faster than Lucene; and because I put the IDMap in memory, FaissEngineFlatKnnVectorsReader can read with sequential IO in the iterator and with fewer IOPS.
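As a sketch of that access pattern (hypothetical names; the idmap_vector is written as a plain array of int64 ids in faiss/impl/index_write.cpp, and `idMapOffset` is assumed to point at the start of that id array), loading the IDMap once and streaming the flat rows front-to-back keeps the reads sequential:

```java
// Hypothetical sketch of why the iteration is sequential-IO friendly:
// load the small idmap_vector into memory once, then stream the flat
// vectors front-to-back in a single pass.
import java.io.IOException;
import org.apache.lucene.store.IndexInput;

final class FaissSequentialScan {
    static void scan(IndexInput data, long idMapOffset, long vectorDataOffset,
                     int dimension, int size) throws IOException {
        // 1. Read the IDMap (Faiss ordinal -> external id) fully into the heap.
        long[] idMap = new long[size];
        data.seek(idMapOffset);
        for (int i = 0; i < size; i++) {
            idMap[i] = data.readLong();
        }

        // 2. One forward pass over the contiguous float rows: the OS read-ahead
        //    sees purely sequential reads, so far fewer IOPS than random seeks.
        float[] row = new float[dimension];
        data.seek(vectorDataOffset);
        for (int ord = 0; ord < size; ord++) {
            data.readFloats(row, 0, dimension);
            long externalId = idMap[ord]; // consume (externalId, row) here
        }
    }
}
```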