microsoft / MSVBASE

MSVBASE is a system that efficiently supports complex queries combining approximate similarity search with relational operators. It integrates high-dimensional vector indices into PostgreSQL, a relational database, to facilitate complex approximate similarity queries.
MIT License

Spann Index doesn't work. #12

Closed JackTan25 closed 8 months ago

JackTan25 commented 9 months ago

select * from t1 order by a <-> ARRAY[0,0,0,1,8,7,3,2,5,0,0,3,5,7,11,31,13,0,0,0,0,29,106,107,13,0,0,0,1,61,70,42,0,0,0,0,1,23,28,16,63,4,0,0,0,6,83,81,117,86,25,15,17,50,84,117,31,23,18,35,97,117,49,24,68,27,0,0,0,4,29,71,81,47,13,10,32,87,117,117,45,76,40,22,60,70,41,9,7,21,29,39,53,21,4,1,55,72,3,0,0,0,0,9,65,117,73,37,28,23,17,34,11,11,27,61,64,25,4,0,42,13,1,1,1,14,10,6] limit 5;
server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.
The connection to the server was lost. Attempting reset: Failed.

What’s wrong with that? @zqxjjj

JackTan25 commented 9 months ago

The log is here.

2024-02-27 10:22:39.430 UTC [45] LOG: server process (PID 101) was terminated by signal 11: Segmentation fault
2024-02-27 10:22:39.430 UTC [45] DETAIL: Failed process was running: select * from t1 order by a <-> ARRAY[0,0,0,1,8,7,3,2,5,0,0,3,5,7,11,31,13,0,0,0,0,29,106,107,13,0,0,0,1,61,70,42,0,0,0,0,1,23,28,16,63,4,0,0,0,6,83,81,117,86,25,15,17,50,84,117,31,23,18,35,97,117,49,24,68,27,0,0,0,4,29,71,81,47,13,10,32,87,117,117,45,76,40,22,60,70,41,9,7,21,29,39,53,21,4,1,55,72,3,0,0,0,0,9,65,117,73,37,28,23,17,34,11,11,27,61,64,25,4,0,42,13,1,1,1,14,10,6] limit 5;
2024-02-27 10:22:39.430 UTC [45] LOG: terminating any other active server processes
2024-02-27 10:22:39.430 UTC [72] WARNING: terminating connection because of crash of another server process
2024-02-27 10:22:39.430 UTC [72] DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2024-02-27 10:22:39.430 UTC [72] HINT: In a moment you should be able to reconnect to the database and repeat your command.
2024-02-27 10:22:39.432 UTC [103] FATAL: the database system is in recovery mode
2024-02-27 10:22:39.434 UTC [45] LOG: all server processes terminated; reinitializing
2024-02-27 10:22:39.478 UTC [104] LOG: database system was interrupted; last known up at 2024-02-27 10:12:27 UTC
2024-02-27 10:22:39.551 UTC [104] LOG: database system was not properly shut down; automatic recovery in progress
2024-02-27 10:22:39.553 UTC [104] LOG: redo starts at 0/40E00168
2024-02-27 10:22:39.553 UTC [104] LOG: invalid record length at 0/40E001A0: wanted 24, got 0
2024-02-27 10:22:39.553 UTC [104] LOG: redo done at 0/40E00168
2024-02-27 10:22:39.561 UTC [45] LOG: database system is ready to accept connections

zqxjjj commented 9 months ago

Thanks for your report. @JackTan25

Yes. As described in the README, SPANN is not fully integrated yet: it depends on SPFresh for index insert and update, mainly because the current SPANN code makes inserting data challenging. That work is in this PR: https://github.com/microsoft/SPTAG/pull/406

SPANN can still be used today, but it takes the more intricate steps we followed in our evaluation:

  1. Create SPANN index files using SPANN out of VBase.
  2. Create the SPANN index with the SQL command in VBase. At this point only the meta data and a placeholder (fake) index directory exist.
  3. Copy SPANN index files into VBase index data directory.
  4. Modify the SPANN index configure file to add the meta data path.

     [MetaData]
     MetaDataFilePath=
     MetaDataIndexPath=

     [Base]
     ValueType=Float
     DistCalcMethod=L2
     IndexAlgoType=BKT
     Dim=128
     VectorPath=/tmp/sift/sift_base.fvecs
     VectorType=XVEC
     VectorSize=1000000
     VectorDelimiter=
     QueryPath=/tmp/sift/sift_query.fvecs
     QueryType=XVEC
     QuerySize=100
     QueryDelimiter=
     WarmupPath=
     WarmupType=DEFAULT
     WarmupSize=10000
     WarmupDelimiter=
     TruthPath=/groundtruth
     TruthType=DEFAULT
     GenerateTruth=false
     HeadVectorIDs=head_vectors_ID_Int8_L2_base_DEFUALT.bin
     HeadVectors=head_vectors_Int8_L2_base_DEFUALT.bin
     IndexDirectory=/tmp/spann_index
     HeadIndexFolder=head_index

     [SelectHead]
     isExecute=false
     TreeNumber=1
     BKTKmeansK=32
     BKTLeafSize=8
     SamplesNumber=10000
     SaveBKT=false
     SelectThreshold=10
     SplitFactor=6
     SplitThreshold=25
     Ratio=0.18
     NumberOfThreads=160
     BKTLambdaFactor=1.0

     [BuildHead]
     isExecute=false
     NeighborhoodSize=32
     TPTNumber=64
     TPTLeafSize=2000
     MaxCheck=16324
     MaxCheckForRefineGraph=16324
     RefineIterations=3
     NumberOfThreads=160
     BKTLambdaFactor=-1.0

     [BuildSSDIndex]
     isExecute=false
     BuildSsdIndex=false
     NumberOfThreads=160
     InternalResultNum=256
     ReplicaCount=8
     PostingPageLimit=120
     OutputEmptyReplicaID=1

     [SearchSSDIndex]
     isExecute=true
     BuildSsdIndex=false
     InternalResultNum=256
     SearchInternalResultNum=256
     NumberOfThreads=16
     SearchResult=/data/result.bin
     QpsLimit=0
     ResultNum=50
     TruthResultNum=50
     MaxCheck=8192
     SearchPostingPageLimit=120
     MaxDistRatio=10000
     Rerank=100
     EnableADC=false
     RecallAnalysis=true
     DebugBuildInternalResultNum=256

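
Step 4 could be scripted rather than edited by hand. The following is a minimal Python sketch, not part of VBase or SPTAG: it fills in `MetaDataFilePath` and `MetaDataIndexPath` in a config like the one above. The function name `add_metadata_paths` and both example paths are hypothetical, and whether the paths should be absolute or relative to the index directory is an assumption to check against your setup. If the config has no `[MetaData]` section, the sketch appends one at the end of the file.

```python
import configparser
import io


def add_metadata_paths(config_text, meta_path, meta_index_path):
    """Return config_text with the [MetaData] paths filled in (hypothetical helper)."""
    parser = configparser.ConfigParser(interpolation=None)
    # Keep key case as written (MetaDataFilePath, not metadatafilepath).
    parser.optionxform = str
    parser.read_string(config_text)
    if not parser.has_section("MetaData"):
        parser.add_section("MetaData")
    parser["MetaData"]["MetaDataFilePath"] = meta_path
    parser["MetaData"]["MetaDataIndexPath"] = meta_index_path
    out = io.StringIO()
    # Write "key=value" without spaces, matching the style of the config above.
    parser.write(out, space_around_delimiters=False)
    return out.getvalue()


# Shortened sample config; a real one has all the sections listed above.
sample = """[MetaData]
MetaDataFilePath=
MetaDataIndexPath=
[Base]
ValueType=Float
Dim=128
VectorDelimiter=
[SearchSSDIndex]
isExecute=true
"""

# Both paths below are placeholders, not values confirmed by the maintainers.
patched = add_metadata_paths(
    sample,
    "/indexdata/image_spann_index/meta.bin",
    "/indexdata/image_spann_index/metaindex.bin",
)
print(patched)
```

Setting `optionxform = str` matters here because `configparser` lowercases keys by default, which would rewrite every key in the file.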
JackTan25 commented 9 months ago

Thanks for your reply, but I'm still confused.

  1. How do I create the SPANN index files outside of VBase? I can't find the method in the README.
  2. For step 4, I don't know where the configure file is. Do you mean I should copy the content you gave above directly into the configure file? Is that how SPANN was tested for the VBase paper?

@zqxjjj

zqxjjj commented 9 months ago

  1. "How do I create the SPANN index files outside of VBase? I can't find the method in the README."
     "Create SPANN index files using SPANN out of VBase" means creating the index files with the SPANN repo: https://github.com/microsoft/SPTAG
     This is unrelated to VBase. Just use SPANN to build an index from the dataset.
  2. "For step 4, I don't know where the configure file is. Do you mean I should copy the content you gave above directly into the configure file? Is that how SPANN was tested for the VBase paper?"
     This is also a SPANN matter: https://github.com/microsoft/SPTAG
     To search, SPANN needs a config file that depends on the dataset: https://github.com/microsoft/SPTAG/blob/main/AnnService/src/Core/VectorIndex.cpp#L626
     VBase queries a SPANN index the same way: https://github.com/microsoft/MSVBASE/blob/main/src/spannindex_scan.cpp#L9

JackTan25 commented 9 months ago

Well, I followed this document: https://github.com/microsoft/SPTAG/blob/main/docs/GettingStart.md

  1. I can't find the "[MetaData] MetaDataFilePath= MetaDataIndexPath= [Base] ValueType=Float DistCalcMethod=L2 ....." content you gave above, but I can find the configure file here:

    image
  2. What's the difference between meta.bin and metaindex.bin? I can see that metaindex.bin holds the offsets, but what are "vector 1 meta" and "vector 2 meta"? I can't find an explanation in the ReadMe.md.

    image
  3. Same here: what is the meaning of metadata?

    image
  4. Can I replace all of these bin files with my own txt-format files?

    image
  5. By the way, is there a WeChat user group or some other way to communicate?

@zqxjjj

zqxjjj commented 9 months ago

Thanks for your feedback. @JackTan25

I am not an expert on SPANN, but I can share what I know.

  1. Some build and search items differ in the config file. They depend on the dataset, not on VBase.
  2 & 3. Each item in the index carries an address pointing to its row in the table. That is how VBase uses the meta data, although it can serve other purposes too. In SPANN, every vector has one meta data item.
  4. It depends on the format of the txt file. SPANN supports several data formats: https://github.com/microsoft/SPTAG/tree/main/AnnService/src/Helper/VectorSetReaders
  5. Which way do you think would make communication more efficient? I am very open to exploring better options, though GitHub already provides an excellent platform for it.
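
To make the "one meta data item per vector, plus an offset index" idea concrete, here is a small in-memory Python sketch. It only illustrates the general pattern of a data blob paired with an offset table; it does not claim to reproduce the on-disk layout of meta.bin or metaindex.bin. The function names and the row-address strings are invented placeholders.

```python
def pack_metadata(items):
    """Concatenate per-vector metadata and record where each item starts and ends."""
    blob = b"".join(items)
    offsets = [0]
    for item in items:
        # offsets[i] is the start of item i; offsets[i + 1] is its end.
        offsets.append(offsets[-1] + len(item))
    return blob, offsets


def read_metadata(blob, offsets, vector_id):
    """Return the metadata bytes that belong to one vector id."""
    return blob[offsets[vector_id]:offsets[vector_id + 1]]


# Placeholder "row addresses", one per vector; the values are made up.
row_addresses = [b"(0,1)", b"(0,2)", b"(12,7)"]
blob, offsets = pack_metadata(row_addresses)

# Look up the table row for the vector with id 2, as an index scan would.
print(read_metadata(blob, offsets, 2))
```

Under this pattern the offset table lets variable-length items be fetched by vector id without scanning the blob, which is why an index result can be mapped back to a table row cheaply.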

JackTan25 commented 9 months ago

So the meta data is generated by SPANN and is not related to VBase, and I don't need to build it manually, right? @zqxjjj

Could you give me a way to reproduce the results in the VBase paper, ideally as detailed step-by-step instructions? The SPTAG README is too complex and has too many parameters, and as you said above, the parameters in the configure file differ from VBase's. I have gotten stuck trying to reproduce the results. @zqxjjj

zqxjjj commented 9 months ago

Yes. Creating an index in SPANN is a little complex. Let me figure out how to offer some tools to make it automated.

zqxjjj commented 8 months ago

-> SPTAG/Release/ssdserving buildIndex.ini

Example content in buildIndex.ini:

[Base]
ValueType=Float
DistCalcMethod=L2
IndexAlgoType=BKT
Dim=1025
VectorPath=/raw_data/collections/rec_embeds_collection_spann.bin
VectorType=DEFAULT
VectorSize=330922
VectorDelimiter=
QueryPath=/artifacts/scripts/data_prepare/new_image_embedding_query.bin
QueryType=DEFAULT
QuerySize=100
QueryDelimiter=
WarmupPath=
WarmupType=DEFAULT
WarmupSize=10000
WarmupDelimiter=
TruthPath=/groundtruth
TruthType=DEFAULT
GenerateTruth=false
HeadVectorIDs=head_vectors_ID_UInt8_L2_base_DEFUALT.bin
HeadVectors=head_vectors_UInt8_L2_base_DEFUALT.bin
IndexDirectory=/raw_data/data
HeadIndexFolder=head_index

[SelectHead]
isExecute=true
TreeNumber=1
BKTKmeansK=32
BKTLeafSize=8
SamplesNumber=10000
SaveBKT=false
SelectThreshold=10
SplitFactor=6
SplitThreshold=25
Ratio=0.18
NumberOfThreads=160
BKTLambdaFactor=1.0

[BuildHead]
isExecute=true
NeighborhoodSize=32
TPTNumber=64
TPTLeafSize=2000
MaxCheck=16324
MaxCheckForRefineGraph=16324
RefineIterations=3
NumberOfThreads=160
BKTLambdaFactor=-1.0

[BuildSSDIndex]
isExecute=true
BuildSsdIndex=true
NumberOfThreads=160
InternalResultNum=256
ReplicaCount=8
PostingPageLimit=120
OutputEmptyReplicaID=1

[SearchSSDIndex]
isExecute=false
BuildSsdIndex=true
InternalResultNum=256
SearchInternalResultNum=256
NumberOfThreads=16
SearchResult=/data/result.bin
QpsLimit=0
ResultNum=50
TruthResultNum=50
MaxCheck=8192
SearchPostingPageLimit=120
MaxDistRatio=10000
Rerank=100
EnableADC=false
RecallAnalysis=true
DebugBuildInternalResultNum=256

-> create index image_spann_index on recipe_table using spann(image_embedding spann_vector_l2_ops);

The meta data will be in the index directory.

-> cp -r /raw_data/data/ /indexdata/image_spann_index/
-> cp /u02/pgdata/13/base/16386/meta /indexdata/image_spann_index/
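
The two copy steps above could also be wrapped in a small script. This is a rough Python sketch under stated assumptions, not an official tool: `stage_spann_index` is a hypothetical name, and the sketch assumes the contents of the SPANN output directory should be merged directly into the VBase index directory, which may differ from what `cp -r` does when the target already exists. The demo runs on throwaway temporary directories rather than the real paths.

```python
import os
import shutil
import tempfile


def stage_spann_index(built_index_dir, vbase_meta_file, vbase_index_dir):
    """Copy SPANN-built files and the VBase meta file into one index directory."""
    # Merge the SPANN output into the target (dirs_exist_ok needs Python 3.8+).
    shutil.copytree(built_index_dir, vbase_index_dir, dirs_exist_ok=True)
    # Place the meta file written by "create index" next to the index files.
    return shutil.copy2(vbase_meta_file, vbase_index_dir)


# Demo on temporary directories so nothing real is touched.
with tempfile.TemporaryDirectory() as tmp:
    built = os.path.join(tmp, "data")  # stands in for /raw_data/data
    os.makedirs(os.path.join(built, "head_index"))
    meta = os.path.join(tmp, "meta")  # stands in for the meta file under pgdata
    with open(meta, "wb") as f:
        f.write(b"placeholder")
    target = os.path.join(tmp, "image_spann_index")
    stage_spann_index(built, meta, target)
    staged = sorted(os.listdir(target))
    print(staged)
```

For a real run, the three arguments would be the paths from the commands above; the files must still end up readable by the user that runs the PostgreSQL server.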

JackTan25 commented 8 months ago

It works when I follow the steps above. I made one mistake, though: I forgot to chmod /indexdata/xxxx/meta*.bin, which matters because I start the server as the postgres user. Without it, the query fails with "Failed to create file handle:/indexdata/image_spann_index/meta.bin" at AsyncFileReader.h.

postgres=# select * from t3 order by a <-> '{0.3,0.4,0.5}' limit 1;
INFO:  try begin scan,path: /image_spann_index/
INFO:  try begin scan successfully.
INFO:  finished spann search
                 a                  
------------------------------------
 {0.95990366,0.95319396,0.99043304}
(1 row)

Time: 22.892 ms

So if you start the server as a different user and create the database folder yourself, watch out for the error I hit above. The SPANN index now works, so let's close this issue.
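
A quick check for the permission problem described above might look like the following Python sketch. It lists `meta*.bin` files that lack the world-readable bit, which is one plausible cause of the "Failed to create file handle" error when the files belong to a different user than the server process. The function name is hypothetical, and the thread does not say which chmod mode was used, so `0o644` below is only an example; changing the owner to the postgres user would be an alternative.

```python
import glob
import os
import stat
import tempfile


def files_missing_world_read(index_dir, pattern="meta*.bin"):
    """Return matching files in index_dir that other users cannot read."""
    missing = []
    for path in sorted(glob.glob(os.path.join(index_dir, pattern))):
        mode = stat.S_IMODE(os.stat(path).st_mode)
        if not mode & stat.S_IROTH:
            missing.append(path)
    return missing


# Demo on a temporary directory standing in for the index directory.
with tempfile.TemporaryDirectory() as tmp:
    meta = os.path.join(tmp, "meta.bin")
    with open(meta, "wb") as f:
        f.write(b"placeholder")
    os.chmod(meta, 0o600)  # owner-only, as after copying files as another user
    before = files_missing_world_read(tmp)
    os.chmod(meta, 0o644)  # example fix: add read permission for group and others
    after = files_missing_world_read(tmp)
    print(len(before), len(after))
```

The check looks only at mode bits, so it will not detect a directory higher up the path that the server user cannot traverse.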