Open pabs3 opened 1 year ago
Enabling use of libnuma doesn't fix this issue.
Reverting just the changes to AnnService/inc/SSDServing/SSDIndex.h
in commit 5ffeabe4d0df57a6d4b870f53457ea5d20986061 didn't work either.
Can you log the values of the iocbs
at the time you get the error message? The value of aio_reqprio
should be 0
from the memset
, which maps to IOPRIO_CLASS_NONE
.
What's the compiler/platform? In our linux unit test run (Ubuntu 20.04, GCC 9.4.0) we don't encounter this log.
Not sure if I got the code right (see the async-log-submissions branch in my fork), but I tried logging aio_reqprio and it was always zero, so probably my initial diagnosis might have been incorrect.
This is on Debian 12 bookworm and GCC 12.2.0.
I'll try some older versions of Debian today.
-- bye, pabs
After modifying the code to log all io_submit requests, this is what the output looks like:
./Test/src/SSDServingTest.cpp(621): Entering test case "TestSearchSSDIndexFloatL2BKTDEFAULTTXT"
[Thread 0x7ffff5f9a6c0 (LWP 1940611) exited]
sddtest/test_search_ssd_Float_L2_BKT.builder.ini was generated. Skip generation.
[1] Set QuantizerFile =
[1] Setting Dim with value 100
[1] Setting DistCalcMethod with value L2
[1] Setting GenerateTruth with value true
[1] Setting HeadIndexFolder with value head_Float_L2_BKT_DEFAULT.head_index
[1] Setting HeadVectorIDs with value head_ids_Float_L2_DEFAULT.bin
[1] Setting HeadVectors with value
[1] Setting IndexAlgoType with value BKT
[1] Setting IndexDirectory with value zbtest
[1] Setting QueryDelimiter with value
[1] Setting QueryPath with value sddtest/vectors_Float_DEFAULT.query
[1] Setting QuerySize with value 10
[1] Setting QueryType with value DEFAULT
[1] Setting SSDIndex with value ssd_Float_L2_BKT_DEFAULT.ssd_index
[1] Setting TruthPath with value sddtest/vectors_Float_L2_DEFAULT_TXT.truth
[1] Setting TruthType with value TXT
[1] Setting ValueType with value Float
[1] Setting VectorDelimiter with value
[1] Setting VectorPath with value sddtest/vectors_Float_DEFAULT.bin
[1] Setting VectorSize with value 1000
[1] Setting VectorType with value DEFAULT
[1] Setting WarmupDelimiter with value
[1] Setting WarmupPath with value sddtest/vectors_Float_DEFAULT.query
[1] Setting WarmupSize with value 10
[1] Setting WarmupType with value DEFAULT
[1] Setting SearchInternalResultNum with value 64
[1] Setting BuildSsdIndex with value false
[1] Setting isExecute with value true
[1] Setting LogFile with value
[1] Setting MaxCheck with value 2048
[1] Setting NumberOfThreads with value 2
[1] Setting QpsLimit with value 0
[1] Setting QueryCountLimit with value 10000
[1] Setting ResultNum with value 32
[1] Setting SearchPostingPageLimit with value 2
[1] Setting SearchResult with value sddtest/test_search_ssd_Float_L2_BKT_DEFAULT_TXT.result
[1] Load Vector(1000,100)
[1] Begin Select Head...
[1] select head time: 0.00s
[1] Begin Build Head...
[1] select head time: 0.00s build head time: 0.00s
[1] Begin Build SSDIndex...
[1] Setting TreeFilePath with value tree.bin
[1] Setting GraphFilePath with value graph.bin
[1] Setting VectorFilePath with value vectors.bin
[1] Setting DeleteVectorFilePath with value deletes.bin
[1] Setting EnableBfs with value 0
[1] Setting BKTNumber with value 1
[1] Setting BKTKmeansK with value 32
[1] Setting BKTLeafSize with value 8
[1] Setting Samples with value 1000
[1] Setting BKTLambdaFactor with value 100.000000
[1] Setting TPTNumber with value 32
[1] Setting TPTLeafSize with value 2000
[1] Setting NumTopDimensionTpTreeSplit with value 5
[1] Setting NeighborhoodSize with value 32
[1] Setting GraphNeighborhoodScale with value 2.000000
[1] Setting GraphCEFScale with value 2.000000
[1] Setting RefineIterations with value 2
[1] Setting EnableRebuild with value 0
[1] Setting CEF with value 1000
[1] Setting AddCEF with value 500
[1] Setting MaxCheckForRefineGraph with value 8192
[1] Setting RNGFactor with value 1.000000
[1] Setting GPUGraphType with value 2
[1] Setting GPURefineSteps with value 0
[1] Setting GPURefineDepth with value 30
[1] Setting GPULeafSize with value 500
[1] Setting HeadNumGPUs with value 1
[1] Setting TPTBalanceFactor with value 2
[1] Setting NumberOfThreads with value 2
[1] Setting DistCalcMethod with value L2
[1] Setting DeletePercentageForRefine with value 0.400000
[1] Setting AddCountForRebuild with value 1000
[1] Setting MaxCheck with value 8192
[1] Setting ThresholdOfNumberOfContinuousNoBetterPropagation with value 3
[1] Setting NumberOfInitialDynamicPivots with value 50
[1] Setting NumberOfOtherDynamicPivots with value 4
[1] Setting HashTableExponent with value 2
[1] Setting DataBlockSize with value 1048576
[1] Setting DataCapacity with value 2147483647
[1] Setting MetaRecordSize with value 10
[1] Load Vector (202,100) Finish!
[1] Load BKT (1,204) Finish!
[1] Load RNG (202,32) Finish!
[1] Load DeleteID (202,1) Finish!
[New Thread 0x7ffff5f9a6c0 (LWP 1940614)]
[1] Setting NumberOfThreads with value 2
[1] Setting MaxCheck with value 2048
[1] Setting HashTableExponent with value 4
[1] Finish reading header info, list count 202, total doc count 1000, dimension 100, list page offset 1.
[1] Big page (>4K): list count 138, total element count 2232.
[1] Total Element Count: 2666
[1] Page Count Dist: 0 1
[1] Page Count Dist: 1 63
[1] Page Count Dist: 2 138
[1] select head time: 0.00s build head time: 0.00s build ssd time: 0.00s
[1] Start generating truth. It's maybe a long time.
[1] Load Vector(1000,100)
[1] Load Vector(10,100)
[1] Begin to generate truth for query(10,100) and doc(1000,100)...
[1] Start to write truth file...
[1] End generating truth.
[1] Start loading warmup query set...
[1] Load Vector(10,100)
[1] Start warmup...
[1] Searching: numThread: 2, numQueries: 10.
[New Thread 0x7ffff2f976c0 (LWP 1940615)]
[1] Sent 0.00%...
[New Thread 0x7ffff0f906c0 (LWP 1940616)]
[4] io_submit ctx_id=140737353871360 nr=64 data=140737091778224 key=0 rw_flags=0 lio_opcode=0 reqprio=0 fildes=3 buf=140737090953216 nbytes=8192 offset=368640 reserved2=0 flags=0 resfd=0
[4] io_submit ctx_id=93825021677888 nr=64 data=140736554907312 key=0 rw_flags=0 lio_opcode=0 reqprio=0 fildes=3 buf=140736554082304 nbytes=8192 offset=221184 reserved2=0 flags=0 resfd=0
[4] fid:0 channel 2, to submit:64, submitted:-1:Operation not permitted
[4] io_submit ctx_id=93825021677888 nr=64 data=140736554907312 key=0 rw_flags=0 lio_opcode=0 reqprio=0 fildes=3 buf=140736554082304 nbytes=8192 offset=221184 reserved2=0 flags=0 resfd=0
[4] fid:0 channel 2, to submit:64, submitted:-1:Operation not permitted
[4] io_submit ctx_id=93825021677888 nr=64 data=140736554907312 key=0 rw_flags=0 lio_opcode=0 reqprio=0 fildes=3 buf=140736554082304 nbytes=8192 offset=221184 reserved2=0 flags=0 resfd=0
[4] fid:0 channel 2, to submit:64, submitted:-1:Operation not permitted
[4] io_submit ctx_id=93825021677888 nr=64 data=140736554907312 key=0 rw_flags=0 lio_opcode=0 reqprio=0 fildes=3 buf=140736554082304 nbytes=8192 offset=221184 reserved2=0 flags=0 resfd=0
[4] fid:0 channel 2, to submit:64, submitted:-1:Operation not permitted
[4] io_submit ctx_id=93825021677888 nr=64 data=140736554907312 key=0 rw_flags=0 lio_opcode=0 reqprio=0 fildes=3 buf=140736554082304 nbytes=8192 offset=221184 reserved2=0 flags=0 resfd=0
[4] fid:0 channel 2, to submit:64, submitted:-1:Operation not permitted
[4] io_submit ctx_id=93825021677888 nr=64 data=140736554907312 key=0 rw_flags=0 lio_opcode=0 reqprio=0 fildes=3 buf=140736554082304 nbytes=8192 offset=221184 reserved2=0 flags=0 resfd=0
[4] fid:0 channel 2, to submit:64, submitted:-1:Operation not permitted
[4] io_submit ctx_id=93825021677888 nr=64 data=140736554907312 key=0 rw_flags=0 lio_opcode=0 reqprio=0 fildes=3 buf=140736554082304 nbytes=8192 offset=221184 reserved2=0 flags=0 resfd=0
[4] fid:0 channel 2, to submit:64, submitted:-1:Operation not permitted
[4] io_submit ctx_id=93825021677888 nr=64 data=140736554907312 key=0 rw_flags=0 lio_opcode=0 reqprio=0 fildes=3 buf=140736554082304 nbytes=8192 offset=221184 reserved2=0 flags=0 resfd=0
[4] fid:0 channel 2, to submit:64, submitted:-1:Operation not permitted
I noticed that when I enable only the AlgoTest/SPANNTest
and SSDServingTest/TestSearchSSDIndexFloatL2BKTDEFAULTTXT
tests then the SSDServingTest/TestSearchSSDIndexFloatL2BKTDEFAULTTXT
test fails:
BOOST_TEST_RUN_FILTERS=AlgoTest/SPANNTest:SSDServingTest/TestSearchSSDIndexFloatL2BKTDEFAULTTXT ../SPTAGTest
I also noticed that enabling all tests except the AlgoTest/SPANNTest
test means all the tests pass, including SSDServingTest/TestSearchSSDIndexFloatL2BKTDEFAULTTXT
:
BOOST_TEST_RUN_FILTERS='!AlgoTest/SPANNTest' ../SPTAGTest
So I think that AlgoTest/SPANNTest is somehow interfering with SSDServingTest/TestSearchSSDIndexFloatL2BKTDEFAULTTXT
.
The successful SSDServingTest/TestSearchSSDIndexFloatL2BKTDEFAULTTXT
test output looks like this:
./Test/src/SSDServingTest.cpp(621): Entering test case "TestSearchSSDIndexFloatL2BKTDEFAULTTXT"
sddtest/test_search_ssd_Float_L2_BKT.builder.ini was generated. Skip generation.
[1] Set QuantizerFile =
[1] Setting Dim with value 100
[1] Setting DistCalcMethod with value L2
[1] Setting GenerateTruth with value true
[1] Setting HeadIndexFolder with value head_Float_L2_BKT_DEFAULT.head_index
[1] Setting HeadVectorIDs with value head_ids_Float_L2_DEFAULT.bin
[1] Setting HeadVectors with value
[1] Setting IndexAlgoType with value BKT
[1] Setting IndexDirectory with value zbtest
[1] Setting QueryDelimiter with value
[1] Setting QueryPath with value sddtest/vectors_Float_DEFAULT.query
[1] Setting QuerySize with value 10
[1] Setting QueryType with value DEFAULT
[1] Setting SSDIndex with value ssd_Float_L2_BKT_DEFAULT.ssd_index
[1] Setting TruthPath with value sddtest/vectors_Float_L2_DEFAULT_TXT.truth
[1] Setting TruthType with value TXT
[1] Setting ValueType with value Float
[1] Setting VectorDelimiter with value
[1] Setting VectorPath with value sddtest/vectors_Float_DEFAULT.bin
[1] Setting VectorSize with value 1000
[1] Setting VectorType with value DEFAULT
[1] Setting WarmupDelimiter with value
[1] Setting WarmupPath with value sddtest/vectors_Float_DEFAULT.query
[1] Setting WarmupSize with value 10
[1] Setting WarmupType with value DEFAULT
[1] Setting SearchInternalResultNum with value 64
[1] Setting BuildSsdIndex with value false
[1] Setting isExecute with value true
[1] Setting LogFile with value
[1] Setting MaxCheck with value 2048
[1] Setting NumberOfThreads with value 2
[1] Setting QpsLimit with value 0
[1] Setting QueryCountLimit with value 10000
[1] Setting ResultNum with value 32
[1] Setting SearchPostingPageLimit with value 2
[1] Setting SearchResult with value sddtest/test_search_ssd_Float_L2_BKT_DEFAULT_TXT.result
[1] Load Vector(1000,100)
[1] Begin Select Head...
[1] select head time: 0.00s
[1] Begin Build Head...
[1] select head time: 0.00s build head time: 0.00s
[1] Begin Build SSDIndex...
[1] Setting TreeFilePath with value tree.bin
[1] Setting GraphFilePath with value graph.bin
[1] Setting VectorFilePath with value vectors.bin
[1] Setting DeleteVectorFilePath with value deletes.bin
[1] Setting EnableBfs with value 0
[1] Setting BKTNumber with value 1
[1] Setting BKTKmeansK with value 32
[1] Setting BKTLeafSize with value 8
[1] Setting Samples with value 1000
[1] Setting BKTLambdaFactor with value 100.000000
[1] Setting TPTNumber with value 32
[1] Setting TPTLeafSize with value 2000
[1] Setting NumTopDimensionTpTreeSplit with value 5
[1] Setting NeighborhoodSize with value 32
[1] Setting GraphNeighborhoodScale with value 2.000000
[1] Setting GraphCEFScale with value 2.000000
[1] Setting RefineIterations with value 2
[1] Setting EnableRebuild with value 0
[1] Setting CEF with value 1000
[1] Setting AddCEF with value 500
[1] Setting MaxCheckForRefineGraph with value 8192
[1] Setting RNGFactor with value 1.000000
[1] Setting GPUGraphType with value 2
[1] Setting GPURefineSteps with value 0
[1] Setting GPURefineDepth with value 30
[1] Setting GPULeafSize with value 500
[1] Setting HeadNumGPUs with value 1
[1] Setting TPTBalanceFactor with value 2
[1] Setting NumberOfThreads with value 2
[1] Setting DistCalcMethod with value L2
[1] Setting DeletePercentageForRefine with value 0.400000
[1] Setting AddCountForRebuild with value 1000
[1] Setting MaxCheck with value 8192
[1] Setting ThresholdOfNumberOfContinuousNoBetterPropagation with value 3
[1] Setting NumberOfInitialDynamicPivots with value 50
[1] Setting NumberOfOtherDynamicPivots with value 4
[1] Setting HashTableExponent with value 2
[1] Setting DataBlockSize with value 1048576
[1] Setting DataCapacity with value 2147483647
[1] Setting MetaRecordSize with value 10
[1] Load Vector (202,100) Finish!
[1] Load BKT (1,204) Finish!
[1] Load RNG (202,32) Finish!
[1] Load DeleteID (202,1) Finish!
[1] Setting NumberOfThreads with value 2
[1] Setting MaxCheck with value 2048
[1] Setting HashTableExponent with value 4
[1] Finish reading header info, list count 202, total doc count 1000, dimension 100, list page offset 1.
[1] Big page (>4K): list count 134, total element count 2223.
[1] Total Element Count: 2676
[1] Page Count Dist: 0 1
[1] Page Count Dist: 1 67
[1] Page Count Dist: 2 134
[1] select head time: 0.00s build head time: 0.00s build ssd time: 0.00s
[1] Start generating truth. It's maybe a long time.
[1] Load Vector(1000,100)
[1] Load Vector(10,100)
[1] Begin to generate truth for query(10,100) and doc(1000,100)...
[1] Start to write truth file...
[1] End generating truth.
[1] Start loading warmup query set...
[1] Load Vector(10,100)
[1] Start warmup...
[1] Searching: numThread: 2, numQueries: 10.
[1] Sent 0.00%...
[4] io_submit ctx_id=139723309264896 nr=64 data=139722338645680 key=0 rw_flags=0 lio_opcode=0 reqprio=0 fildes=3 buf=139722337820672 nbytes=8192 offset=245760 reserved2=0 flags=0 resfd=0
[4] io_submit ctx_id=139723309256704 nr=63 data=139722942625456 key=0 rw_flags=0 lio_opcode=0 reqprio=0 fildes=3 buf=139722941800448 nbytes=8192 offset=393216 reserved2=0 flags=0 resfd=0
[4] io_submit ctx_id=139723309264896 nr=64 data=139722338645680 key=0 rw_flags=0 lio_opcode=0 reqprio=0 fildes=3 buf=139722337820672 nbytes=8192 offset=753664 reserved2=0 flags=0 resfd=0
[4] io_submit ctx_id=139723309256704 nr=64 data=139722942625456 key=0 rw_flags=0 lio_opcode=0 reqprio=0 fildes=3 buf=139722941800448 nbytes=8192 offset=139264 reserved2=0 flags=0 resfd=0
[4] io_submit ctx_id=139723309264896 nr=64 data=139722338645680 key=0 rw_flags=0 lio_opcode=0 reqprio=0 fildes=3 buf=139722337820672 nbytes=8192 offset=425984 reserved2=0 flags=0 resfd=0
[4] io_submit ctx_id=139723309256704 nr=64 data=139722942625456 key=0 rw_flags=0 lio_opcode=0 reqprio=0 fildes=3 buf=139722941800448 nbytes=8192 offset=163840 reserved2=0 flags=0 resfd=0
[4] io_submit ctx_id=139723309264896 nr=64 data=139722338645680 key=0 rw_flags=0 lio_opcode=0 reqprio=0 fildes=3 buf=139722337820672 nbytes=8192 offset=303104 reserved2=0 flags=0 resfd=0
[4] io_submit ctx_id=139723309256704 nr=64 data=139722942625456 key=0 rw_flags=0 lio_opcode=0 reqprio=0 fildes=3 buf=139722941800448 nbytes=4096 offset=770048 reserved2=0 flags=0 resfd=0
[4] io_submit ctx_id=139723309264896 nr=64 data=139722338645680 key=0 rw_flags=0 lio_opcode=0 reqprio=0 fildes=3 buf=139722337820672 nbytes=8192 offset=81920 reserved2=0 flags=0 resfd=0
[4] io_submit ctx_id=139723309264896 nr=64 data=139722338645680 key=0 rw_flags=0 lio_opcode=0 reqprio=0 fildes=3 buf=139722337820672 nbytes=8192 offset=376832 reserved2=0 flags=0 resfd=0
[1] Finish sending in 0.104 seconds, actuallQPS is 96.15, query count 10.
[1]
Finish warmup...
[1] Start loading QuerySet...
[1] Load Vector(10,100)
[1] Start ANN Search...
[1] Searching: numThread: 2, numQueries: 10.
[1] Sent 0.00%...
[4] io_submit ctx_id=139723309264896 nr=64 data=139722338645680 key=0 rw_flags=0 lio_opcode=0 reqprio=0 fildes=3 buf=139722337820672 nbytes=8192 offset=245760 reserved2=0 flags=0 resfd=0
[4] io_submit ctx_id=139723309256704 nr=63 data=139722942625456 key=0 rw_flags=0 lio_opcode=0 reqprio=0 fildes=3 buf=139722941800448 nbytes=8192 offset=393216 reserved2=0 flags=0 resfd=0
[4] io_submit ctx_id=139723309264896 nr=64 data=139722338645680 key=0 rw_flags=0 lio_opcode=0 reqprio=0 fildes=3 buf=139722337820672 nbytes=8192 offset=753664 reserved2=0 flags=0 resfd=0
[4] io_submit ctx_id=139723309256704 nr=64 data=139722942625456 key=0 rw_flags=0 lio_opcode=0 reqprio=0 fildes=3 buf=139722941800448 nbytes=8192 offset=139264 reserved2=0 flags=0 resfd=0
[4] io_submit ctx_id=139723309264896 nr=64 data=139722338645680 key=0 rw_flags=0 lio_opcode=0 reqprio=0 fildes=3 buf=139722337820672 nbytes=8192 offset=425984 reserved2=0 flags=0 resfd=0
[4] io_submit ctx_id=139723309256704 nr=64 data=139722942625456 key=0 rw_flags=0 lio_opcode=0 reqprio=0 fildes=3 buf=139722941800448 nbytes=8192 offset=163840 reserved2=0 flags=0 resfd=0
[4] io_submit ctx_id=139723309264896 nr=64 data=139722338645680 key=0 rw_flags=0 lio_opcode=0 reqprio=0 fildes=3 buf=139722337820672 nbytes=8192 offset=303104 reserved2=0 flags=0 resfd=0
[4] io_submit ctx_id=139723309256704 nr=64 data=139722942625456 key=0 rw_flags=0 lio_opcode=0 reqprio=0 fildes=3 buf=139722941800448 nbytes=4096 offset=770048 reserved2=0 flags=0 resfd=0
[4] io_submit ctx_id=139723309256704 nr=64 data=139722942625456 key=0 rw_flags=0 lio_opcode=0 reqprio=0 fildes=3 buf=139722941800448 nbytes=8192 offset=81920 reserved2=0 flags=0 resfd=0
[4] io_submit ctx_id=139723309256704 nr=64 data=139722942625456 key=0 rw_flags=0 lio_opcode=0 reqprio=0 fildes=3 buf=139722941800448 nbytes=8192 offset=376832 reserved2=0 flags=0 resfd=0
[1] Finish sending in 0.014 seconds, actuallQPS is 714.29, query count 10.
[1]
Finish ANN Search...
[1] Load Vector(1000,100)
[1]
Load VectorSet(1000,100).
[1] Start loading TruthFile...
[1] Recall32@32: 0.984375 MRR@32: 1.000000
[1]
Ex Elements Count:
[1] Avg 50tiles 90tiles 95tiles 99tiles 99.9tiles Max
[1] 974.900 965.000 1028.000 1028.000 1028.000 1028.000 1028.000
[1]
Head Latency Distribution:
[1] Avg 50tiles 90tiles 95tiles 99tiles 99.9tiles Max
[1] 0.420 0.203 1.259 1.259 1.259 1.259 1.259
[1]
Ex Latency Distribution:
[1] Avg 50tiles 90tiles 95tiles 99tiles 99.9tiles Max
[1] 2.474 2.388 5.651 5.651 5.651 5.651 5.651
[1]
Total Latency Distribution:
[1] Avg 50tiles 90tiles 95tiles 99tiles 99.9tiles Max
[1] 2.894 2.715 5.835 5.835 5.835 5.835 5.835
[1]
Total Disk Page Access Distribution:
[1] Avg 50tiles 90tiles 95tiles 99tiles 99.9tiles Max
[1] 113.700 113 119 119 119 119 119
[1]
Total Disk IO Distribution:
[1] Avg 50tiles 90tiles 95tiles 99tiles 99.9tiles Max
[1] 63.900 64 64 64 64 64 64
[1]
[1] Start output to sddtest/test_search_ssd_Float_L2_BKT_DEFAULT_TXT.result
[1] Recall@32: 0.984375 MRR@32: 1.000000
[1]
Test case SSDServingTest/TestSearchSSDIndexFloatL2BKTDEFAULTTXT did not check any assertions
./Test/src/SSDServingTest.cpp(621): Leaving test case "TestSearchSSDIndexFloatL2BKTDEFAULTTXT"; testing time: 183969us
I noticed that there are a lot of tests that fail, not just one.
All of the failing tests use the TestSearchSSDIndex
function, which calls the SPTAG::SSDServing::BootProgram
function.
This is the list of failing tests when AlgoTest/SPANNTest is enabled:
SSDServingTest/TestSearchSSDIndexFloatL2BKTDEFAULTTXT
SSDServingTest/TestSearchSSDIndexFloatL2KDTDEFAULTTXT
SSDServingTest/TestSearchSSDIndexFloatCosineBKTDEFAULTTXT
SSDServingTest/TestSearchSSDIndexFloatCosineKDTDEFAULTTXT
SSDServingTest/TestSearchSSDIndexInt8L2BKTDEFAULTTXT
SSDServingTest/TestSearchSSDIndexInt8L2KDTDEFAULTTXT
SSDServingTest/TestSearchSSDIndexInt8CosineBKTDEFAULTTXT
SSDServingTest/TestSearchSSDIndexInt8CosineKDTDEFAULTTXT
SSDServingTest/TestSearchSSDIndexUInt8L2BKTDEFAULTTXT
SSDServingTest/TestSearchSSDIndexUInt8L2KDTDEFAULTTXT
SSDServingTest/TestSearchSSDIndexUInt8CosineBKTDEFAULTTXT
SSDServingTest/TestSearchSSDIndexUInt8CosineKDTDEFAULTTXT
SSDServingTest/TestSearchSSDIndexInt16L2BKTDEFAULTTXT
SSDServingTest/TestSearchSSDIndexInt16L2KDTDEFAULTTXT
SSDServingTest/TestSearchSSDIndexInt16CosineBKTDEFAULTTXT
SSDServingTest/TestSearchSSDIndexInt16CosineKDTDEFAULTTXT
SSDServingTest/TestSearchSSDIndexFloatL2BKTXVECXVEC
SSDServingTest/TestSearchSSDIndexFloatL2KDTXVECXVEC
SSDServingTest/TestSearchSSDIndexFloatCosineBKTXVECXVEC
SSDServingTest/TestSearchSSDIndexFloatCosineKDTXVECXVEC
SSDServingTest/TestSearchSSDIndexInt8L2BKTXVECXVEC
SSDServingTest/TestSearchSSDIndexInt8L2KDTXVECXVEC
SSDServingTest/TestSearchSSDIndexInt8CosineBKTXVECXVEC
SSDServingTest/TestSearchSSDIndexInt8CosineKDTXVECXVEC
SSDServingTest/TestSearchSSDIndexUInt8L2BKTXVECXVEC
SSDServingTest/TestSearchSSDIndexUInt8L2KDTXVECXVEC
SSDServingTest/TestSearchSSDIndexUInt8CosineBKTXVECXVEC
SSDServingTest/TestSearchSSDIndexUInt8CosineKDTXVECXVEC
SSDServingTest/TestSearchSSDIndexInt16L2BKTXVECXVEC
SSDServingTest/TestSearchSSDIndexInt16L2KDTXVECXVEC
SSDServingTest/TestSearchSSDIndexInt16CosineBKTXVECXVEC
SSDServingTest/TestSearchSSDIndexInt16CosineKDTXVECXVEC
SSDServingTest/RUN_FROM_MAP
I note that when I rebuild SPTAG and its tests with the -fsanitize=thread
option, then all of these tests have ThreadSanitizer data races reported:
$ grep ThreadSanitizer *.log | sed -E 's/^([^:]*):(.*)/\2 \1/' | sort | uniq -c | sort -n | grep pid=
1 WARNING: ThreadSanitizer: data race on vptr (ctor/dtor vs virtual call) (pid=781640) PerfTest.cpp.log
2 WARNING: ThreadSanitizer: data race on vptr (ctor/dtor vs virtual call) (pid=780260) FilterTest.cpp.log
3 WARNING: ThreadSanitizer: data race on vptr (ctor/dtor vs virtual call) (pid=779821) ConcurrentTest.cpp.log
12 WARNING: ThreadSanitizer: data race on vptr (ctor/dtor vs virtual call) (pid=775481) AlgoTest.cpp.log
122 WARNING: ThreadSanitizer: data race (pid=780260) FilterTest.cpp.log
155 WARNING: ThreadSanitizer: data race (pid=779821) ConcurrentTest.cpp.log
303 WARNING: ThreadSanitizer: data race (pid=971350) SSDServingTest.cpp.log
324 WARNING: ThreadSanitizer: data race (pid=781640) PerfTest.cpp.log
470 WARNING: ThreadSanitizer: data race (pid=782661) ReconstructIndexSimilarityTest.cpp.log
872 WARNING: ThreadSanitizer: data race (pid=775481) AlgoTest.cpp.log
According to ThreadSanitizer these are the most common locations for data races in SPTAG:
$ grep ThreadSanitizer *.log | sed -E 's/^([^:]*):(.*)/\2/' | sed 's/<[^>]*>/<>/' | sort | uniq -c | sort -n | grep -v /usr/ | grep -v /libsanitizer/ | grep -v pid= | tail
28 SUMMARY: ThreadSanitizer: data race AnnService/inc/Core/Common/NeighborhoodGraph.h:467 in void SPTAG::COMMON::NeighborhoodGraph::RefineGraph<>(SPTAG::VectorIndex*, std::unordered_map<int, int, std::hash<int>, std::equal_to<int>, std::allocator<std::pair<int const, int> > > const*) [clone ._omp_fn.0]
36 SUMMARY: ThreadSanitizer: data race AnnService/inc/Core/Common/NeighborhoodGraph.h:85 in SPTAG::COMMON::NeighborhoodGraph::GraphAccuracyEstimation(SPTAG::VectorIndex*, int, std::unordered_map<>, std::equal_to<int>, std::allocator<std::pair<int const, int> > > const*) [clone ._omp_fn.0]
46 SUMMARY: ThreadSanitizer: data race AnnService/inc/Helper/ThreadPool.h:29 in SPTAG::Helper::ThreadPool::Abort::SetAbort(bool)
54 SUMMARY: ThreadSanitizer: data race AnnService/inc/Core/Common/BKTree.h:247 in float SPTAG::COMMON::KmeansAssign<>(SPTAG::COMMON::Dataset<float> const&, std::vector<int, std::allocator<int> >&, int, int, SPTAG::COMMON::KmeansArgs<float>&, bool, float) [clone ._omp_fn.0]
55 SUMMARY: ThreadSanitizer: data race AnnService/inc/Core/Common/Dataset.h:75 in SPTAG::COMMON::Dataset<>::R() const
68 SUMMARY: ThreadSanitizer: data race AnnService/inc/Core/Common/BKTree.h:92 in SPTAG::COMMON::KmeansArgs<>::ClearCounts()
69 SUMMARY: ThreadSanitizer: data race AnnService/inc/Core/Common/BKTree.h:93 in SPTAG::COMMON::KmeansArgs<>::ClearCounts()
74 SUMMARY: ThreadSanitizer: data race AnnService/inc/Core/Common/KDTree.h:335 in void SPTAG::COMMON::KDTree::ChooseDivision<>(SPTAG::COMMON::Dataset<signed char> const&, SPTAG::COMMON::KDTNode&, std::vector<int, std::allocator<int> > const&, int, int)
75 SUMMARY: ThreadSanitizer: data race AnnService/inc/Core/Common/BKTree.h:102 in SPTAG::COMMON::KmeansArgs<>::ClearDists(float)
87 SUMMARY: ThreadSanitizer: data race AnnService/inc/Core/Common/BKTree.h:103 in SPTAG::COMMON::KmeansArgs<>::ClearDists(float)
@pabs3 I'm sorry to bother you. I would like to ask a few questions as follows: https://github.com/microsoft/SPTAG/issues/403 But I don't know what platform to contact you through. Looking forward to your reply. Thanks very much.
Sorry, I am not involved in the project, I am just packaging it for my employer. So I have no idea about the answers to your questions. You will need to contact the Microsoft employees who wrote it. Some of their email addresses are available in the git log. Clone the project locally and run the git log command and you will see the addresses.
Sorry, I am not involved in the project, I am just packaging it for my employer. So I have no idea about the answers to your questions. You will need to contact the Microsoft employees who wrote it. Some of their email addresses are available in the git log. Clone the project locally and run the git log command and you will see the addresses.
okay,thanks.
Hi, I have encountered the same question, have you found the root cause of this problem?
I wasn't able to figure out the cause for this bug so far and I don't plan to work on it any further at this point.
This workaround enables all the tests to be run successfully, first run the tests excluding SPANNTest
by using this environment variable BOOST_TEST_RUN_FILTERS=!AlgoTest/SPANNTest
then run the tests again but include only SPANNTest
with this environment variable BOOST_TEST_RUN_FILTERS=AlgoTest/SPANNTest
.
I wasn't able to figure out the cause for this bug so far and I don't plan to work on it any further at this point.
This workaround enables all the tests to be run successfully, first run the tests excluding
SPANNTest
by using this environment variableBOOST_TEST_RUN_FILTERS=!AlgoTest/SPANNTest
then run the tests again but include onlySPANNTest
with this environment variableBOOST_TEST_RUN_FILTERS=AlgoTest/SPANNTest
.
Really thanks! It works, but it's weird, I would like to follow your testing way to check it again, I suspect that maybe the executed order influence the indices in SSD, like threads competition or other causes. Anyway, thanks your help again!
If you find any additional details, please send them here. If you find a fix, please submit a pull request with the needed changes and put this as the last line in the commit message, so that your patch will close this issue:
Fixes: https://github.com/microsoft/SPTAG/issues/375
Okay, I hope I could~
If you find any additional details, please send them here. If you find a fix, please submit a pull request with the needed changes and put this as the last line in the commit message, so that your patch will close this issue:
Fixes: https://github.com/microsoft/SPTAG/issues/375
Okay, I hope I could ~
It may be related to the number of CPU cores. My CPU has 8 cores, and when I use 7 threads to access the search interface, there are no issues. However, when I use more than 8 threads, errors start to occur, specifically from channel 8 onwards. For example, if I use 15 threads, channels 8 to 15 will report errors.
I have reviewed a lot of information, but it seems that there is no apparent relationship between the io_submit call and the number of CPU cores. Therefore, I am still at a loss regarding this bug.
I tested it again, and it is not related to the number of CPU cores; it is related to the parameter NumberOfThreads. When I set this parameter to 45, there were no issues with the query. Previously, I had this parameter set to 8.
On Linux, #359 by @PhilipBAdams causes the
SSDServingTest
TestSearchSSDIndexFloatL2BKTDEFAULTTXT
test to continously print this message over and over again:This message comes from the
BatchReadFileAsync
function inAnnService/src/Helper/AsyncFileReader.cpp
and is printed when theio_submit
syscall fails:According to the
io_submit
manual page, this indicates incorrectly setting the RT IO priority class:However, the code does not set the IO priority class anywhere.
My conclusion is that maybe the
iocb
structure being passed toio_submit
gets corrupted somehow, but I am not sure.Thoughts and ideas for further debugging of this are welcome.
I also tried commenting out the contents of the Helper::SetThreadAffinity function but that didn't work.