microsoft / SPTAG

A distributed approximate nearest neighbor search (ANN) library which provides high-quality vector index build, search, and distributed online serving toolkits for large-scale vector search scenarios.
MIT License

random test failures on older CPUs without SSE/AVX/AVX2/AVX512 #316

Open pabs3 opened 2 years ago

pabs3 commented 2 years ago

Describe the bug

There are strange random test failures on older CPUs that don't have SSE/AVX/AVX2/AVX512.

To Reproduce

Steps to reproduce the behavior:

  1. Build SPTAG on a machine without AVX/AVX2/AVX512.
  2. Run SPTAGTests
  3. See error

Expected behavior

The tests should pass.

Analysis

I think that because of the -mavx2 -mavx -msse -msse2 -mavx512f -mavx512bw -mavx512dq options in the DistanceUtils target_compile_options, the compiler generates newer instructions in the DistanceUtils library that older CPUs cannot execute. Removing the options the CPU does not support, and deleting the functions that use the instructions those options enable, fixes the issue, which confirms that the enabled options are the cause.
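To see which of those options a given machine cannot honor, something like the following might help. This is only a sketch: `unsupported_isa_flags` is a hypothetical helper, not part of SPTAG, and it assumes GCC or Clang on x86, where the `__builtin_cpu_supports` builtin is available. On other compilers or architectures it simply returns an empty list.

```cpp
#include <string>
#include <vector>

// Hypothetical helper (not SPTAG code): returns the compile options from the
// DistanceUtils list that the CPU running this program does not support.
std::vector<std::string> unsupported_isa_flags() {
    std::vector<std::string> missing;
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    // Populate the compiler runtime's CPU feature data before querying it.
    __builtin_cpu_init();
    // __builtin_cpu_supports requires a string literal, hence one line per flag.
    if (!__builtin_cpu_supports("sse"))      missing.push_back("-msse");
    if (!__builtin_cpu_supports("sse2"))     missing.push_back("-msse2");
    if (!__builtin_cpu_supports("avx"))      missing.push_back("-mavx");
    if (!__builtin_cpu_supports("avx2"))     missing.push_back("-mavx2");
    if (!__builtin_cpu_supports("avx512f"))  missing.push_back("-mavx512f");
    if (!__builtin_cpu_supports("avx512bw")) missing.push_back("-mavx512bw");
    if (!__builtin_cpu_supports("avx512dq")) missing.push_back("-mavx512dq");
#endif
    return missing;
}
```

Any flag this returns is one the compiler was allowed to use but the CPU cannot execute, which would be consistent with the "illegal operand" failures shown in this report.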

Suggestions

On Linux you can use GCC function multi-versioning to get the compiler to automatically check the CPU at runtime and dispatch to the right functions.
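As a rough sketch of what function multi-versioning could look like: the function `l2_distance_sq` and the macro `SKETCH_MULTIVERSION` below are made up for illustration and are not SPTAG code. The sketch assumes GCC (or a recent Clang) on x86-64 Linux, where the `target_clones` attribute and ifunc-based dispatch are available; elsewhere the macro expands to nothing and a single baseline version is built.

```cpp
#include <cstddef>

// Hypothetical macro: enable multi-versioning only where it is assumed to work.
#if defined(__GNUC__) && defined(__x86_64__) && defined(__linux__)
#define SKETCH_MULTIVERSION \
    __attribute__((target_clones("avx512f", "avx2", "avx", "default")))
#else
#define SKETCH_MULTIVERSION
#endif

// Hypothetical stand-in for a distance kernel. With target_clones the compiler
// emits one clone per listed target plus a resolver that selects the best
// clone for the running CPU, so the source is written only once.
SKETCH_MULTIVERSION
float l2_distance_sq(const float* a, const float* b, std::size_t n) {
    float sum = 0.0f;
    for (std::size_t i = 0; i < n; ++i) {
        const float d = a[i] - b[i];
        sum += d * d;  // accumulate the squared difference
    }
    return sum;
}
```

With this approach the library would no longer need the -mavx2, -mavx512f, etc. options applied to the whole target, since each clone is compiled for its own instruction set and the "default" clone still runs on older CPUs. Hand-written intrinsics would need a different treatment, for example per-function `target` attributes plus an explicit runtime check.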

Screenshots

Some examples of the random failures:

[1] Start invoking BuildTrees.
[1] BKTKmeansK: 3, BKTLeafSize: 6, Samples: 100, BKTLambdaFactor:-1.000000 TreeNumber: 1, ThreadNum: 2.
unknown location(0): fatal error: in "SSDServingTest/TestHeadUInt8L2DEFAULT": memory access violation at address: 0x00000000: no mapping at fault address
./Test/src/SSDServingTest.cpp(444): last checkpoint: "TestHeadUInt8L2DEFAULT" test entry

*** 1 failure is detected in the test module "Main"
[1] Parallel TpTree Partition done
[1] Build TPTree time (s): 4
[1] Processing Tree 0 0%
unknown location(0): fatal error: in "AlgoTest/KDTTest": signal: illegal operand; address of failing instruction: 0x559a5eccc130
./Test/src/AlgoTest.cpp(22): last checkpoint

*** 1 failure is detected in the test module "Main"


pabs3 commented 2 years ago

If someone were to fix up and merge #136, which addresses #134 (a request for 64-bit ARM support), this issue would be fixed too.