microsoft / SPTAG

A distributed approximate nearest neighborhood search (ANN) library which provides a high quality vector index build, search and distributed online serving toolkits for large scale vector search scenario.
MIT License
4.83k stars 581 forks source link

thread local context (#359) causes test issue on Linux: `1: [4] fid:0 channel 2, to submit:64, submitted:Operation not permitted` #375

Open pabs3 opened 1 year ago

pabs3 commented 1 year ago

On Linux, #359 by @PhilipBAdams causes the SSDServingTest TestSearchSSDIndexFloatL2BKTDEFAULTTXT test to continously print this message over and over again:

 1: [4] fid:0 channel 2, to submit:64, submitted:Operation not permitted

This message comes from the BatchReadFileAsync function in AnnService/src/Helper/AsyncFileReader.cpp and is printed when the io_submit syscall fails:

int s = syscall(__NR_io_submit, handler->GetIOCP(channel), iocbs[i].size() - submitted[i], iocbs[i].data() + submitted[i]);
if (s > 0) {
    submitted[i] += s;
    totalSubmitted += s;
}
else {
    SPTAGLIB_LOG(Helper::LogLevel::LL_Error, "fid:%d channel %d, to submit:%d, submitted:%s\n", i, channel, iocbs[i].size() - submitted[i], strerror(-s));
}

According to the io_submit manual page, this indicates incorrectly setting the RT IO priority class:

EPERM  The aio_reqprio field is set with the class IOPRIO_CLASS_RT, but the submitting context does not have the CAP_SYS_ADMIN capability.

However, the code does not set the IO priority class anywhere.

My conclusion is that maybe the iocb structure being passed to io_submit gets corrupted somehow, but I am not sure.

Thoughts and ideas for further debugging of this are welcome.

I also tried commenting out the contents of the Helper::SetThreadAffinity function but that didn't work.

pabs3 commented 1 year ago

Enabling use of libnuma doesn't fix this issue.

pabs3 commented 1 year ago

Reverting just the changes to AnnService/inc/SSDServing/SSDIndex.h in commit 5ffeabe4d0df57a6d4b870f53457ea5d20986061 didn't work either.

PhilipBAdams commented 1 year ago

Can you log the values of the iocbs at the time you get the error message? The value of aio_reqprio should be 0 from the memset, which maps to IOPRIO_CLASS_NONE.

What's the compiler/platform? In our linux unit test run (Ubuntu 20.04, GCC 9.4.0) we don't encounter this log.

pabs3 commented 1 year ago

Not sure if I got the code right (see the async-log-submissions branch in my fork), but I tried logging aio_reqprio and it was always zero, so probably my initial diagnosis might have been incorrect.

This is on Debian 12 bookworm and GCC 12.2.0.

I'll try some older versions of Debian today.

-- bye, pabs

https://bonedaddy.net/pabs3/

pabs3 commented 1 year ago

Same error with Debian 11 bullseye and GCC 10.2.1.

-- bye, pabs

https://bonedaddy.net/pabs3/

pabs3 commented 1 year ago

After modifying the code to log all io_submit requests, this is what the output looks like:

./Test/src/SSDServingTest.cpp(621): Entering test case "TestSearchSSDIndexFloatL2BKTDEFAULTTXT"
[Thread 0x7ffff5f9a6c0 (LWP 1940611) exited]
sddtest/test_search_ssd_Float_L2_BKT.builder.ini was generated. Skip generation.
[1] Set QuantizerFile = 
[1] Setting Dim with value 100
[1] Setting DistCalcMethod with value L2
[1] Setting GenerateTruth with value true
[1] Setting HeadIndexFolder with value head_Float_L2_BKT_DEFAULT.head_index
[1] Setting HeadVectorIDs with value head_ids_Float_L2_DEFAULT.bin
[1] Setting HeadVectors with value 
[1] Setting IndexAlgoType with value BKT
[1] Setting IndexDirectory with value zbtest
[1] Setting QueryDelimiter with value 
[1] Setting QueryPath with value sddtest/vectors_Float_DEFAULT.query
[1] Setting QuerySize with value 10
[1] Setting QueryType with value DEFAULT
[1] Setting SSDIndex with value ssd_Float_L2_BKT_DEFAULT.ssd_index
[1] Setting TruthPath with value sddtest/vectors_Float_L2_DEFAULT_TXT.truth
[1] Setting TruthType with value TXT
[1] Setting ValueType with value Float
[1] Setting VectorDelimiter with value 
[1] Setting VectorPath with value sddtest/vectors_Float_DEFAULT.bin
[1] Setting VectorSize with value 1000
[1] Setting VectorType with value DEFAULT
[1] Setting WarmupDelimiter with value 
[1] Setting WarmupPath with value sddtest/vectors_Float_DEFAULT.query
[1] Setting WarmupSize with value 10
[1] Setting WarmupType with value DEFAULT
[1] Setting SearchInternalResultNum with value 64
[1] Setting BuildSsdIndex with value false
[1] Setting isExecute with value true
[1] Setting LogFile with value 
[1] Setting MaxCheck with value 2048
[1] Setting NumberOfThreads with value 2
[1] Setting QpsLimit with value 0
[1] Setting QueryCountLimit with value 10000
[1] Setting ResultNum with value 32
[1] Setting SearchPostingPageLimit with value 2
[1] Setting SearchResult with value sddtest/test_search_ssd_Float_L2_BKT_DEFAULT_TXT.result
[1] Load Vector(1000,100)
[1] Begin Select Head...
[1] select head time: 0.00s
[1] Begin Build Head...
[1] select head time: 0.00s build head time: 0.00s
[1] Begin Build SSDIndex...
[1] Setting TreeFilePath with value tree.bin
[1] Setting GraphFilePath with value graph.bin
[1] Setting VectorFilePath with value vectors.bin
[1] Setting DeleteVectorFilePath with value deletes.bin
[1] Setting EnableBfs with value 0
[1] Setting BKTNumber with value 1
[1] Setting BKTKmeansK with value 32
[1] Setting BKTLeafSize with value 8
[1] Setting Samples with value 1000
[1] Setting BKTLambdaFactor with value 100.000000
[1] Setting TPTNumber with value 32
[1] Setting TPTLeafSize with value 2000
[1] Setting NumTopDimensionTpTreeSplit with value 5
[1] Setting NeighborhoodSize with value 32
[1] Setting GraphNeighborhoodScale with value 2.000000
[1] Setting GraphCEFScale with value 2.000000
[1] Setting RefineIterations with value 2
[1] Setting EnableRebuild with value 0
[1] Setting CEF with value 1000
[1] Setting AddCEF with value 500
[1] Setting MaxCheckForRefineGraph with value 8192
[1] Setting RNGFactor with value 1.000000
[1] Setting GPUGraphType with value 2
[1] Setting GPURefineSteps with value 0
[1] Setting GPURefineDepth with value 30
[1] Setting GPULeafSize with value 500
[1] Setting HeadNumGPUs with value 1
[1] Setting TPTBalanceFactor with value 2
[1] Setting NumberOfThreads with value 2
[1] Setting DistCalcMethod with value L2
[1] Setting DeletePercentageForRefine with value 0.400000
[1] Setting AddCountForRebuild with value 1000
[1] Setting MaxCheck with value 8192
[1] Setting ThresholdOfNumberOfContinuousNoBetterPropagation with value 3
[1] Setting NumberOfInitialDynamicPivots with value 50
[1] Setting NumberOfOtherDynamicPivots with value 4
[1] Setting HashTableExponent with value 2
[1] Setting DataBlockSize with value 1048576
[1] Setting DataCapacity with value 2147483647
[1] Setting MetaRecordSize with value 10
[1] Load Vector (202,100) Finish!
[1] Load BKT (1,204) Finish!
[1] Load RNG (202,32) Finish!
[1] Load DeleteID (202,1) Finish!
[New Thread 0x7ffff5f9a6c0 (LWP 1940614)]
[1] Setting NumberOfThreads with value 2
[1] Setting MaxCheck with value 2048
[1] Setting HashTableExponent with value 4
[1] Finish reading header info, list count 202, total doc count 1000, dimension 100, list page offset 1.
[1] Big page (>4K): list count 138, total element count 2232.
[1] Total Element Count: 2666
[1] Page Count Dist: 0 1
[1] Page Count Dist: 1 63
[1] Page Count Dist: 2 138
[1] select head time: 0.00s build head time: 0.00s build ssd time: 0.00s
[1] Start generating truth. It's maybe a long time.
[1] Load Vector(1000,100)
[1] Load Vector(10,100)
[1] Begin to generate truth for query(10,100) and doc(1000,100)...
[1] Start to write truth file...
[1] End generating truth.
[1] Start loading warmup query set...
[1] Load Vector(10,100)
[1] Start warmup...
[1] Searching: numThread: 2, numQueries: 10.
[New Thread 0x7ffff2f976c0 (LWP 1940615)]
[1] Sent 0.00%...
[New Thread 0x7ffff0f906c0 (LWP 1940616)]
[4] io_submit ctx_id=140737353871360 nr=64 data=140737091778224 key=0 rw_flags=0 lio_opcode=0 reqprio=0 fildes=3 buf=140737090953216 nbytes=8192 offset=368640 reserved2=0 flags=0 resfd=0 
[4] io_submit ctx_id=93825021677888 nr=64 data=140736554907312 key=0 rw_flags=0 lio_opcode=0 reqprio=0 fildes=3 buf=140736554082304 nbytes=8192 offset=221184 reserved2=0 flags=0 resfd=0 
[4] fid:0 channel 2, to submit:64, submitted:-1:Operation not permitted
[4] io_submit ctx_id=93825021677888 nr=64 data=140736554907312 key=0 rw_flags=0 lio_opcode=0 reqprio=0 fildes=3 buf=140736554082304 nbytes=8192 offset=221184 reserved2=0 flags=0 resfd=0 
[4] fid:0 channel 2, to submit:64, submitted:-1:Operation not permitted
[4] io_submit ctx_id=93825021677888 nr=64 data=140736554907312 key=0 rw_flags=0 lio_opcode=0 reqprio=0 fildes=3 buf=140736554082304 nbytes=8192 offset=221184 reserved2=0 flags=0 resfd=0 
[4] fid:0 channel 2, to submit:64, submitted:-1:Operation not permitted
[4] io_submit ctx_id=93825021677888 nr=64 data=140736554907312 key=0 rw_flags=0 lio_opcode=0 reqprio=0 fildes=3 buf=140736554082304 nbytes=8192 offset=221184 reserved2=0 flags=0 resfd=0 
[4] fid:0 channel 2, to submit:64, submitted:-1:Operation not permitted
[4] io_submit ctx_id=93825021677888 nr=64 data=140736554907312 key=0 rw_flags=0 lio_opcode=0 reqprio=0 fildes=3 buf=140736554082304 nbytes=8192 offset=221184 reserved2=0 flags=0 resfd=0 
[4] fid:0 channel 2, to submit:64, submitted:-1:Operation not permitted
[4] io_submit ctx_id=93825021677888 nr=64 data=140736554907312 key=0 rw_flags=0 lio_opcode=0 reqprio=0 fildes=3 buf=140736554082304 nbytes=8192 offset=221184 reserved2=0 flags=0 resfd=0 
[4] fid:0 channel 2, to submit:64, submitted:-1:Operation not permitted
[4] io_submit ctx_id=93825021677888 nr=64 data=140736554907312 key=0 rw_flags=0 lio_opcode=0 reqprio=0 fildes=3 buf=140736554082304 nbytes=8192 offset=221184 reserved2=0 flags=0 resfd=0 
[4] fid:0 channel 2, to submit:64, submitted:-1:Operation not permitted
[4] io_submit ctx_id=93825021677888 nr=64 data=140736554907312 key=0 rw_flags=0 lio_opcode=0 reqprio=0 fildes=3 buf=140736554082304 nbytes=8192 offset=221184 reserved2=0 flags=0 resfd=0 
[4] fid:0 channel 2, to submit:64, submitted:-1:Operation not permitted
pabs3 commented 1 year ago

I noticed that when I enable only the AlgoTest/SPANNTest and SSDServingTest/TestSearchSSDIndexFloatL2BKTDEFAULTTXT tests then the SSDServingTest/TestSearchSSDIndexFloatL2BKTDEFAULTTXT test fails:

BOOST_TEST_RUN_FILTERS=AlgoTest/SPANNTest:SSDServingTest/TestSearchSSDIndexFloatL2BKTDEFAULTTXT ../SPTAGTest

I also noticed that enabling all tests except the AlgoTest/SPANNTest test means all the tests pass, including SSDServingTest/TestSearchSSDIndexFloatL2BKTDEFAULTTXT:

BOOST_TEST_RUN_FILTERS='!AlgoTest/SPANNTest' ../SPTAGTest

So I think that AlgoTest/SPANNTest is somehow interfering with SSDServingTest/TestSearchSSDIndexFloatL2BKTDEFAULTTXT.

The successful SSDServingTest/TestSearchSSDIndexFloatL2BKTDEFAULTTXT test output looks like this:

./Test/src/SSDServingTest.cpp(621): Entering test case "TestSearchSSDIndexFloatL2BKTDEFAULTTXT"
sddtest/test_search_ssd_Float_L2_BKT.builder.ini was generated. Skip generation.
[1] Set QuantizerFile = 
[1] Setting Dim with value 100
[1] Setting DistCalcMethod with value L2
[1] Setting GenerateTruth with value true
[1] Setting HeadIndexFolder with value head_Float_L2_BKT_DEFAULT.head_index
[1] Setting HeadVectorIDs with value head_ids_Float_L2_DEFAULT.bin
[1] Setting HeadVectors with value 
[1] Setting IndexAlgoType with value BKT
[1] Setting IndexDirectory with value zbtest
[1] Setting QueryDelimiter with value 
[1] Setting QueryPath with value sddtest/vectors_Float_DEFAULT.query
[1] Setting QuerySize with value 10
[1] Setting QueryType with value DEFAULT
[1] Setting SSDIndex with value ssd_Float_L2_BKT_DEFAULT.ssd_index
[1] Setting TruthPath with value sddtest/vectors_Float_L2_DEFAULT_TXT.truth
[1] Setting TruthType with value TXT
[1] Setting ValueType with value Float
[1] Setting VectorDelimiter with value 
[1] Setting VectorPath with value sddtest/vectors_Float_DEFAULT.bin
[1] Setting VectorSize with value 1000
[1] Setting VectorType with value DEFAULT
[1] Setting WarmupDelimiter with value 
[1] Setting WarmupPath with value sddtest/vectors_Float_DEFAULT.query
[1] Setting WarmupSize with value 10
[1] Setting WarmupType with value DEFAULT
[1] Setting SearchInternalResultNum with value 64
[1] Setting BuildSsdIndex with value false
[1] Setting isExecute with value true
[1] Setting LogFile with value 
[1] Setting MaxCheck with value 2048
[1] Setting NumberOfThreads with value 2
[1] Setting QpsLimit with value 0
[1] Setting QueryCountLimit with value 10000
[1] Setting ResultNum with value 32
[1] Setting SearchPostingPageLimit with value 2
[1] Setting SearchResult with value sddtest/test_search_ssd_Float_L2_BKT_DEFAULT_TXT.result
[1] Load Vector(1000,100)
[1] Begin Select Head...
[1] select head time: 0.00s
[1] Begin Build Head...
[1] select head time: 0.00s build head time: 0.00s
[1] Begin Build SSDIndex...
[1] Setting TreeFilePath with value tree.bin
[1] Setting GraphFilePath with value graph.bin
[1] Setting VectorFilePath with value vectors.bin
[1] Setting DeleteVectorFilePath with value deletes.bin
[1] Setting EnableBfs with value 0
[1] Setting BKTNumber with value 1
[1] Setting BKTKmeansK with value 32
[1] Setting BKTLeafSize with value 8
[1] Setting Samples with value 1000
[1] Setting BKTLambdaFactor with value 100.000000
[1] Setting TPTNumber with value 32
[1] Setting TPTLeafSize with value 2000
[1] Setting NumTopDimensionTpTreeSplit with value 5
[1] Setting NeighborhoodSize with value 32
[1] Setting GraphNeighborhoodScale with value 2.000000
[1] Setting GraphCEFScale with value 2.000000
[1] Setting RefineIterations with value 2
[1] Setting EnableRebuild with value 0
[1] Setting CEF with value 1000
[1] Setting AddCEF with value 500
[1] Setting MaxCheckForRefineGraph with value 8192
[1] Setting RNGFactor with value 1.000000
[1] Setting GPUGraphType with value 2
[1] Setting GPURefineSteps with value 0
[1] Setting GPURefineDepth with value 30
[1] Setting GPULeafSize with value 500
[1] Setting HeadNumGPUs with value 1
[1] Setting TPTBalanceFactor with value 2
[1] Setting NumberOfThreads with value 2
[1] Setting DistCalcMethod with value L2
[1] Setting DeletePercentageForRefine with value 0.400000
[1] Setting AddCountForRebuild with value 1000
[1] Setting MaxCheck with value 8192
[1] Setting ThresholdOfNumberOfContinuousNoBetterPropagation with value 3
[1] Setting NumberOfInitialDynamicPivots with value 50
[1] Setting NumberOfOtherDynamicPivots with value 4
[1] Setting HashTableExponent with value 2
[1] Setting DataBlockSize with value 1048576
[1] Setting DataCapacity with value 2147483647
[1] Setting MetaRecordSize with value 10
[1] Load Vector (202,100) Finish!
[1] Load BKT (1,204) Finish!
[1] Load RNG (202,32) Finish!
[1] Load DeleteID (202,1) Finish!
[1] Setting NumberOfThreads with value 2
[1] Setting MaxCheck with value 2048
[1] Setting HashTableExponent with value 4
[1] Finish reading header info, list count 202, total doc count 1000, dimension 100, list page offset 1.
[1] Big page (>4K): list count 134, total element count 2223.
[1] Total Element Count: 2676
[1] Page Count Dist: 0 1
[1] Page Count Dist: 1 67
[1] Page Count Dist: 2 134
[1] select head time: 0.00s build head time: 0.00s build ssd time: 0.00s
[1] Start generating truth. It's maybe a long time.
[1] Load Vector(1000,100)
[1] Load Vector(10,100)
[1] Begin to generate truth for query(10,100) and doc(1000,100)...
[1] Start to write truth file...
[1] End generating truth.
[1] Start loading warmup query set...
[1] Load Vector(10,100)
[1] Start warmup...
[1] Searching: numThread: 2, numQueries: 10.
[1] Sent 0.00%...
[4] io_submit ctx_id=139723309264896 nr=64 data=139722338645680 key=0 rw_flags=0 lio_opcode=0 reqprio=0 fildes=3 buf=139722337820672 nbytes=8192 offset=245760 reserved2=0 flags=0 resfd=0 
[4] io_submit ctx_id=139723309256704 nr=63 data=139722942625456 key=0 rw_flags=0 lio_opcode=0 reqprio=0 fildes=3 buf=139722941800448 nbytes=8192 offset=393216 reserved2=0 flags=0 resfd=0 
[4] io_submit ctx_id=139723309264896 nr=64 data=139722338645680 key=0 rw_flags=0 lio_opcode=0 reqprio=0 fildes=3 buf=139722337820672 nbytes=8192 offset=753664 reserved2=0 flags=0 resfd=0 
[4] io_submit ctx_id=139723309256704 nr=64 data=139722942625456 key=0 rw_flags=0 lio_opcode=0 reqprio=0 fildes=3 buf=139722941800448 nbytes=8192 offset=139264 reserved2=0 flags=0 resfd=0 
[4] io_submit ctx_id=139723309264896 nr=64 data=139722338645680 key=0 rw_flags=0 lio_opcode=0 reqprio=0 fildes=3 buf=139722337820672 nbytes=8192 offset=425984 reserved2=0 flags=0 resfd=0 
[4] io_submit ctx_id=139723309256704 nr=64 data=139722942625456 key=0 rw_flags=0 lio_opcode=0 reqprio=0 fildes=3 buf=139722941800448 nbytes=8192 offset=163840 reserved2=0 flags=0 resfd=0 
[4] io_submit ctx_id=139723309264896 nr=64 data=139722338645680 key=0 rw_flags=0 lio_opcode=0 reqprio=0 fildes=3 buf=139722337820672 nbytes=8192 offset=303104 reserved2=0 flags=0 resfd=0 
[4] io_submit ctx_id=139723309256704 nr=64 data=139722942625456 key=0 rw_flags=0 lio_opcode=0 reqprio=0 fildes=3 buf=139722941800448 nbytes=4096 offset=770048 reserved2=0 flags=0 resfd=0 
[4] io_submit ctx_id=139723309264896 nr=64 data=139722338645680 key=0 rw_flags=0 lio_opcode=0 reqprio=0 fildes=3 buf=139722337820672 nbytes=8192 offset=81920 reserved2=0 flags=0 resfd=0 
[4] io_submit ctx_id=139723309264896 nr=64 data=139722338645680 key=0 rw_flags=0 lio_opcode=0 reqprio=0 fildes=3 buf=139722337820672 nbytes=8192 offset=376832 reserved2=0 flags=0 resfd=0 
[1] Finish sending in 0.104 seconds, actuallQPS is 96.15, query count 10.
[1] 
Finish warmup...
[1] Start loading QuerySet...
[1] Load Vector(10,100)
[1] Start ANN Search...
[1] Searching: numThread: 2, numQueries: 10.
[1] Sent 0.00%...
[4] io_submit ctx_id=139723309264896 nr=64 data=139722338645680 key=0 rw_flags=0 lio_opcode=0 reqprio=0 fildes=3 buf=139722337820672 nbytes=8192 offset=245760 reserved2=0 flags=0 resfd=0 
[4] io_submit ctx_id=139723309256704 nr=63 data=139722942625456 key=0 rw_flags=0 lio_opcode=0 reqprio=0 fildes=3 buf=139722941800448 nbytes=8192 offset=393216 reserved2=0 flags=0 resfd=0 
[4] io_submit ctx_id=139723309264896 nr=64 data=139722338645680 key=0 rw_flags=0 lio_opcode=0 reqprio=0 fildes=3 buf=139722337820672 nbytes=8192 offset=753664 reserved2=0 flags=0 resfd=0 
[4] io_submit ctx_id=139723309256704 nr=64 data=139722942625456 key=0 rw_flags=0 lio_opcode=0 reqprio=0 fildes=3 buf=139722941800448 nbytes=8192 offset=139264 reserved2=0 flags=0 resfd=0 
[4] io_submit ctx_id=139723309264896 nr=64 data=139722338645680 key=0 rw_flags=0 lio_opcode=0 reqprio=0 fildes=3 buf=139722337820672 nbytes=8192 offset=425984 reserved2=0 flags=0 resfd=0 
[4] io_submit ctx_id=139723309256704 nr=64 data=139722942625456 key=0 rw_flags=0 lio_opcode=0 reqprio=0 fildes=3 buf=139722941800448 nbytes=8192 offset=163840 reserved2=0 flags=0 resfd=0 
[4] io_submit ctx_id=139723309264896 nr=64 data=139722338645680 key=0 rw_flags=0 lio_opcode=0 reqprio=0 fildes=3 buf=139722337820672 nbytes=8192 offset=303104 reserved2=0 flags=0 resfd=0 
[4] io_submit ctx_id=139723309256704 nr=64 data=139722942625456 key=0 rw_flags=0 lio_opcode=0 reqprio=0 fildes=3 buf=139722941800448 nbytes=4096 offset=770048 reserved2=0 flags=0 resfd=0 
[4] io_submit ctx_id=139723309256704 nr=64 data=139722942625456 key=0 rw_flags=0 lio_opcode=0 reqprio=0 fildes=3 buf=139722941800448 nbytes=8192 offset=81920 reserved2=0 flags=0 resfd=0 
[4] io_submit ctx_id=139723309256704 nr=64 data=139722942625456 key=0 rw_flags=0 lio_opcode=0 reqprio=0 fildes=3 buf=139722941800448 nbytes=8192 offset=376832 reserved2=0 flags=0 resfd=0 
[1] Finish sending in 0.014 seconds, actuallQPS is 714.29, query count 10.
[1] 
Finish ANN Search...
[1] Load Vector(1000,100)
[1] 
Load VectorSet(1000,100).
[1] Start loading TruthFile...
[1] Recall32@32: 0.984375 MRR@32: 1.000000
[1] 
Ex Elements Count:
[1] Avg 50tiles 90tiles 95tiles 99tiles 99.9tiles   Max
[1] 974.900 965.000 1028.000    1028.000    1028.000    1028.000    1028.000
[1] 
Head Latency Distribution:
[1] Avg 50tiles 90tiles 95tiles 99tiles 99.9tiles   Max
[1] 0.420   0.203   1.259   1.259   1.259   1.259   1.259
[1] 
Ex Latency Distribution:
[1] Avg 50tiles 90tiles 95tiles 99tiles 99.9tiles   Max
[1] 2.474   2.388   5.651   5.651   5.651   5.651   5.651
[1] 
Total Latency Distribution:
[1] Avg 50tiles 90tiles 95tiles 99tiles 99.9tiles   Max
[1] 2.894   2.715   5.835   5.835   5.835   5.835   5.835
[1] 
Total Disk Page Access Distribution:
[1] Avg 50tiles 90tiles 95tiles 99tiles 99.9tiles   Max
[1] 113.700  113     119     119     119     119     119
[1] 
Total Disk IO Distribution:
[1] Avg 50tiles 90tiles 95tiles 99tiles 99.9tiles   Max
[1] 63.900    64      64      64      64      64      64
[1] 
[1] Start output to sddtest/test_search_ssd_Float_L2_BKT_DEFAULT_TXT.result
[1] Recall@32: 0.984375 MRR@32: 1.000000
[1] 
Test case SSDServingTest/TestSearchSSDIndexFloatL2BKTDEFAULTTXT did not check any assertions
./Test/src/SSDServingTest.cpp(621): Leaving test case "TestSearchSSDIndexFloatL2BKTDEFAULTTXT"; testing time: 183969us
pabs3 commented 1 year ago

I noticed that there are a lot of tests that fail, not just one.

All of the failing tests use the TestSearchSSDIndex function, which calls the SPTAG::SSDServing::BootProgram function.

This is the list of failing tests when AlgoTest/SPANNTest is enabled:

SSDServingTest/TestSearchSSDIndexFloatL2BKTDEFAULTTXT
SSDServingTest/TestSearchSSDIndexFloatL2KDTDEFAULTTXT
SSDServingTest/TestSearchSSDIndexFloatCosineBKTDEFAULTTXT
SSDServingTest/TestSearchSSDIndexFloatCosineKDTDEFAULTTXT
SSDServingTest/TestSearchSSDIndexInt8L2BKTDEFAULTTXT
SSDServingTest/TestSearchSSDIndexInt8L2KDTDEFAULTTXT
SSDServingTest/TestSearchSSDIndexInt8CosineBKTDEFAULTTXT
SSDServingTest/TestSearchSSDIndexInt8CosineKDTDEFAULTTXT
SSDServingTest/TestSearchSSDIndexUInt8L2BKTDEFAULTTXT
SSDServingTest/TestSearchSSDIndexUInt8L2KDTDEFAULTTXT
SSDServingTest/TestSearchSSDIndexUInt8CosineBKTDEFAULTTXT
SSDServingTest/TestSearchSSDIndexUInt8CosineKDTDEFAULTTXT
SSDServingTest/TestSearchSSDIndexInt16L2BKTDEFAULTTXT
SSDServingTest/TestSearchSSDIndexInt16L2KDTDEFAULTTXT
SSDServingTest/TestSearchSSDIndexInt16CosineBKTDEFAULTTXT
SSDServingTest/TestSearchSSDIndexInt16CosineKDTDEFAULTTXT
SSDServingTest/TestSearchSSDIndexFloatL2BKTXVECXVEC
SSDServingTest/TestSearchSSDIndexFloatL2KDTXVECXVEC
SSDServingTest/TestSearchSSDIndexFloatCosineBKTXVECXVEC
SSDServingTest/TestSearchSSDIndexFloatCosineKDTXVECXVEC
SSDServingTest/TestSearchSSDIndexInt8L2BKTXVECXVEC
SSDServingTest/TestSearchSSDIndexInt8L2KDTXVECXVEC
SSDServingTest/TestSearchSSDIndexInt8CosineBKTXVECXVEC
SSDServingTest/TestSearchSSDIndexInt8CosineKDTXVECXVEC
SSDServingTest/TestSearchSSDIndexUInt8L2BKTXVECXVEC
SSDServingTest/TestSearchSSDIndexUInt8L2KDTXVECXVEC
SSDServingTest/TestSearchSSDIndexUInt8CosineBKTXVECXVEC
SSDServingTest/TestSearchSSDIndexUInt8CosineKDTXVECXVEC
SSDServingTest/TestSearchSSDIndexInt16L2BKTXVECXVEC
SSDServingTest/TestSearchSSDIndexInt16L2KDTXVECXVEC
SSDServingTest/TestSearchSSDIndexInt16CosineBKTXVECXVEC
SSDServingTest/TestSearchSSDIndexInt16CosineKDTXVECXVEC
SSDServingTest/RUN_FROM_MAP
pabs3 commented 1 year ago

I note that when I rebuild SPTAG and its tests with the -fsanitize=thread option, then all of these tests have ThreadSanitizer data races reported:

$ grep ThreadSanitizer *.log  | sed -E 's/^([^:]*):(.*)/\2 \1/' | sort | uniq -c | sort -n | grep pid=
      1 WARNING: ThreadSanitizer: data race on vptr (ctor/dtor vs virtual call) (pid=781640) PerfTest.cpp.log
      2 WARNING: ThreadSanitizer: data race on vptr (ctor/dtor vs virtual call) (pid=780260) FilterTest.cpp.log
      3 WARNING: ThreadSanitizer: data race on vptr (ctor/dtor vs virtual call) (pid=779821) ConcurrentTest.cpp.log
     12 WARNING: ThreadSanitizer: data race on vptr (ctor/dtor vs virtual call) (pid=775481) AlgoTest.cpp.log
    122 WARNING: ThreadSanitizer: data race (pid=780260) FilterTest.cpp.log
    155 WARNING: ThreadSanitizer: data race (pid=779821) ConcurrentTest.cpp.log
    303 WARNING: ThreadSanitizer: data race (pid=971350) SSDServingTest.cpp.log
    324 WARNING: ThreadSanitizer: data race (pid=781640) PerfTest.cpp.log
    470 WARNING: ThreadSanitizer: data race (pid=782661) ReconstructIndexSimilarityTest.cpp.log
    872 WARNING: ThreadSanitizer: data race (pid=775481) AlgoTest.cpp.log
pabs3 commented 1 year ago

According to ThreadSanitizer these are the most common locations for data races in SPTAG:

$ grep ThreadSanitizer *.log  | sed -E 's/^([^:]*):(.*)/\2/' | sed 's/<[^>]*>/<>/' | sort | uniq -c | sort -n | grep -v /usr/ | grep -v /libsanitizer/ | grep -v pid= | tail
     28 SUMMARY: ThreadSanitizer: data race AnnService/inc/Core/Common/NeighborhoodGraph.h:467 in void SPTAG::COMMON::NeighborhoodGraph::RefineGraph<>(SPTAG::VectorIndex*, std::unordered_map<int, int, std::hash<int>, std::equal_to<int>, std::allocator<std::pair<int const, int> > > const*) [clone ._omp_fn.0]
     36 SUMMARY: ThreadSanitizer: data race AnnService/inc/Core/Common/NeighborhoodGraph.h:85 in SPTAG::COMMON::NeighborhoodGraph::GraphAccuracyEstimation(SPTAG::VectorIndex*, int, std::unordered_map<>, std::equal_to<int>, std::allocator<std::pair<int const, int> > > const*) [clone ._omp_fn.0]
     46 SUMMARY: ThreadSanitizer: data race AnnService/inc/Helper/ThreadPool.h:29 in SPTAG::Helper::ThreadPool::Abort::SetAbort(bool)
     54 SUMMARY: ThreadSanitizer: data race AnnService/inc/Core/Common/BKTree.h:247 in float SPTAG::COMMON::KmeansAssign<>(SPTAG::COMMON::Dataset<float> const&, std::vector<int, std::allocator<int> >&, int, int, SPTAG::COMMON::KmeansArgs<float>&, bool, float) [clone ._omp_fn.0]
     55 SUMMARY: ThreadSanitizer: data race AnnService/inc/Core/Common/Dataset.h:75 in SPTAG::COMMON::Dataset<>::R() const
     68 SUMMARY: ThreadSanitizer: data race AnnService/inc/Core/Common/BKTree.h:92 in SPTAG::COMMON::KmeansArgs<>::ClearCounts()
     69 SUMMARY: ThreadSanitizer: data race AnnService/inc/Core/Common/BKTree.h:93 in SPTAG::COMMON::KmeansArgs<>::ClearCounts()
     74 SUMMARY: ThreadSanitizer: data race AnnService/inc/Core/Common/KDTree.h:335 in void SPTAG::COMMON::KDTree::ChooseDivision<>(SPTAG::COMMON::Dataset<signed char> const&, SPTAG::COMMON::KDTNode&, std::vector<int, std::allocator<int> > const&, int, int)
     75 SUMMARY: ThreadSanitizer: data race AnnService/inc/Core/Common/BKTree.h:102 in SPTAG::COMMON::KmeansArgs<>::ClearDists(float)
     87 SUMMARY: ThreadSanitizer: data race AnnService/inc/Core/Common/BKTree.h:103 in SPTAG::COMMON::KmeansArgs<>::ClearDists(float)
ZhenNan2016 commented 1 year ago

@pabs3 I'm sorry to bother you. I would like to ask a few questions as follows: https://github.com/microsoft/SPTAG/issues/403 But I don't know what platform to contact you through. Looking forward to your reply. Thanks very much.

pabs3 commented 1 year ago

Sorry, I am not involved in the project, I am just packaging it for my employer. So I have no idea about the answers to your questions. You will need to contact the Microsoft employees who wrote it. Some of their email addresses are available in the git log. Clone the project locally and run the git log command and you will see the addresses.

ZhenNan2016 commented 1 year ago

Sorry, I am not involved in the project, I am just packaging it for my employer. So I have no idea about the answers to your questions. You will need to contact the Microsoft employees who wrote it. Some of their email addresses are available in the git log. Clone the project locally and run the git log command and you will see the addresses.

okay,thanks.

Warmchay commented 4 months ago

Hi, I have encountered the same question, have you found the root cause of this problem?

pabs3 commented 4 months ago

I wasn't able to figure out the cause for this bug so far and I don't plan to work on it any further at this point.

This workaround enables all the tests to be run successfully, first run the tests excluding SPANNTest by using this environment variable BOOST_TEST_RUN_FILTERS=!AlgoTest/SPANNTest then run the tests again but include only SPANNTest with this environment variable BOOST_TEST_RUN_FILTERS=AlgoTest/SPANNTest.

Warmchay commented 4 months ago

I wasn't able to figure out the cause for this bug so far and I don't plan to work on it any further at this point.

This workaround enables all the tests to be run successfully, first run the tests excluding SPANNTest by using this environment variable BOOST_TEST_RUN_FILTERS=!AlgoTest/SPANNTest then run the tests again but include only SPANNTest with this environment variable BOOST_TEST_RUN_FILTERS=AlgoTest/SPANNTest.

Really thanks! It works, but it's weird, I would like to follow your testing way to check it again, I suspect that maybe the executed order influence the indices in SSD, like threads competition or other causes. Anyway, thanks your help again!

pabs3 commented 4 months ago

If you find any additional details, please send them here. If you find a fix, please submit a pull request with the needed changes and put this as the last line in the commit message, so that your patch will close this issue:

Fixes: https://github.com/microsoft/SPTAG/issues/375
Warmchay commented 4 months ago

Okay, I hope I could~

If you find any additional details, please send them here. If you find a fix, please submit a pull request with the needed changes and put this as the last line in the commit message, so that your patch will close this issue:

Fixes: https://github.com/microsoft/SPTAG/issues/375

Okay, I hope I could ~

suppersam1 commented 1 month ago

It may be related to the number of CPU cores. My CPU has 8 cores, and when I use 7 threads to access the search interface, there are no issues. However, when I use more than 8 threads, errors start to occur, specifically from channel 8 onwards. For example, if I use 15 threads, channels 8 to 15 will report errors.

suppersam1 commented 1 month ago

I have reviewed a lot of information, but it seems that there is no apparent relationship between the io_submit call and the number of CPU cores. Therefore, I am still at a loss regarding this bug.

suppersam1 commented 1 month ago

I tested it again, and it is not related to the number of CPU cores; it is related to the parameter NumberOfThreads. When I set this parameter to 45, there were no issues with the query. Previously, I had this parameter set to 8.