milvus-io / knowhere

Knowhere is an open-source vector search engine, integrating FAISS, HNSW, etc.
Apache License 2.0

Recall of gpu_ivf_pq is lower than ivf_pq with dataset glove-200 #332

Open yanliang567 opened 2 years ago

yanliang567 commented 2 years ago

Comparing the logs below, there are actually two issues:

  1. Recall differs between GPU and CPU: gpu_ivf_pq is consistently lower than ivf_pq.
  2. GPU is slower than CPU when nprobe=512.

Parameters:

    nlist=1024
    m=8
    nbits=8

Results:

    gpu_ivf_pq search on glove recalls: [0.1168, 0.1362, 0.1479, 0.1543, 0.1578, 0.1598, 0.1607, 0.161, 0.161, 0.161]
    gpu_ivf_pq search on glove latency: [0.0278, 0.0368, 0.0539, 0.093, 0.1608, 0.9298, 1.9743, 3.7446, 7.3059, 14.6296]
    ivf_pq search on glove recalls: [0.1357, 0.1549, 0.1657, 0.1709, 0.1736, 0.1752, 0.1758, 0.176, 0.176, 0.176]
    ivf_pq search on glove latency: [0.2152, 0.1308, 0.2232, 0.3057, 0.4902, 0.6978, 1.3009, 2.3552, 4.5747, 8.9149]
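
To see where the two latency curves cross, the four lists above can be compared entry by entry. This is only a minimal Python sketch, not part of the benchmark: it assumes the ten entries correspond to nprobe = 1, 2, 4, ..., 512 (the sweep used in the reproduction run later in this thread), and the helper name `first_slower_nprobe` is made up for illustration.

```python
# Values copied verbatim from the logs in this comment.
gpu_recall = [0.1168, 0.1362, 0.1479, 0.1543, 0.1578, 0.1598, 0.1607, 0.161, 0.161, 0.161]
gpu_latency = [0.0278, 0.0368, 0.0539, 0.093, 0.1608, 0.9298, 1.9743, 3.7446, 7.3059, 14.6296]
cpu_recall = [0.1357, 0.1549, 0.1657, 0.1709, 0.1736, 0.1752, 0.1758, 0.176, 0.176, 0.176]
cpu_latency = [0.2152, 0.1308, 0.2232, 0.3057, 0.4902, 0.6978, 1.3009, 2.3552, 4.5747, 8.9149]

# ASSUMPTION: the i-th entry was measured with nprobe = 2**i (1, 2, 4, ..., 512).
nprobes = [2 ** i for i in range(len(gpu_latency))]


def first_slower_nprobe(gpu, cpu, nprobe_values):
    """Return the first nprobe at which the GPU latency exceeds the CPU latency."""
    for nprobe, gpu_time, cpu_time in zip(nprobe_values, gpu, cpu):
        if gpu_time > cpu_time:
            return nprobe
    return None


crossover = first_slower_nprobe(gpu_latency, cpu_latency, nprobes)

# Per-entry recall gap (CPU minus GPU); positive means the CPU index recalls more.
recall_gaps = [round(cpu - gpu, 4) for gpu, cpu in zip(gpu_recall, cpu_recall)]

print("GPU first slower than CPU at nprobe =", crossover)
for nprobe, gap in zip(nprobes, recall_gaps):
    print(f"nprobe={nprobe:>3}  recall gap (cpu - gpu) = {gap:.4f}")
```

Under that assumption, the GPU latency exceeds the CPU latency from the sixth entry onward (nprobe=32), not only at nprobe=512, and the CPU recall is higher at every entry.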

cydrain commented 2 years ago

I can reproduce @yanliang567's issue:

CPU run

[58.885 s] glove-200-angular | IVF_PQ | nlist=1024
================================================================================
  nprobe =    1, nq = 10000, k =  100, elapse =  0.152s, R@ = 0.1351
  nprobe =    2, nq = 10000, k =  100, elapse =  0.156s, R@ = 0.1545
  nprobe =    4, nq = 10000, k =  100, elapse =  0.258s, R@ = 0.1654
  nprobe =    8, nq = 10000, k =  100, elapse =  0.239s, R@ = 0.1707
  nprobe =   16, nq = 10000, k =  100, elapse =  0.338s, R@ = 0.1733
  nprobe =   32, nq = 10000, k =  100, elapse =  0.576s, R@ = 0.1750
  nprobe =   64, nq = 10000, k =  100, elapse =  0.895s, R@ = 0.1757
  nprobe =  128, nq = 10000, k =  100, elapse =  1.545s, R@ = 0.1758
  nprobe =  256, nq = 10000, k =  100, elapse =  3.048s, R@ = 0.1758
  nprobe =  512, nq = 10000, k =  100, elapse =  6.047s, R@ = 0.1758
================================================================================
[72.803 s] Test 'glove-200-angular/IVF_PQ' done

GPU run

[4.379 s] glove-200-angular | IVF_PQ | nlist=1024
================================================================================
  nprobe =    1, nq = 10000, k =  100, elapse =  0.029s, R@ = 0.1170
  nprobe =    2, nq = 10000, k =  100, elapse =  0.045s, R@ = 0.1367
  nprobe =    4, nq = 10000, k =  100, elapse =  0.083s, R@ = 0.1483
  nprobe =    8, nq = 10000, k =  100, elapse =  0.134s, R@ = 0.1548
  nprobe =   16, nq = 10000, k =  100, elapse =  0.246s, R@ = 0.1582
  nprobe =   32, nq = 10000, k =  100, elapse =  1.246s, R@ = 0.1601
  nprobe =   64, nq = 10000, k =  100, elapse =  2.438s, R@ = 0.1610
  nprobe =  128, nq = 10000, k =  100, elapse =  4.862s, R@ = 0.1613
  nprobe =  256, nq = 10000, k =  100, elapse =  9.730s, R@ = 0.1613
  nprobe =  512, nq = 10000, k =  100, elapse = 19.627s, R@ = 0.1613
================================================================================
[43.516 s] Test 'glove-200-angular/IVF_PQ' done
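
For reference, the `R@` column is presumably recall at k=100: the fraction of the true nearest neighbours that appear in the returned top-k, averaged over the nq queries. How the benchmark computes it is not shown in this thread, so the following is only a sketch of the usual definition; `recall_at_k` and the toy ID lists are hypothetical.

```python
def recall_at_k(retrieved, ground_truth, k):
    """Mean fraction of the true top-k IDs that appear in the retrieved top-k.

    retrieved and ground_truth are lists with one list of IDs per query.
    """
    if len(retrieved) != len(ground_truth):
        raise ValueError("retrieved and ground_truth must cover the same queries")
    if not ground_truth or k <= 0:
        raise ValueError("need at least one query and k > 0")
    hits = 0
    for got, truth in zip(retrieved, ground_truth):
        # Count how many of the true neighbours were actually returned.
        hits += len(set(got[:k]) & set(truth[:k]))
    return hits / (len(ground_truth) * k)


# Toy data: two queries, k = 3 (hypothetical IDs, not taken from glove-200).
toy_retrieved = [[1, 2, 3], [7, 8, 9]]
toy_ground_truth = [[1, 2, 4], [9, 5, 6]]
toy_recall = recall_at_k(toy_retrieved, toy_ground_truth, k=3)
print(toy_recall)
```

Read this way, the gap between 0.1758 (CPU) and 0.1613 (GPU) at nprobe=512 would mean the GPU index returns about 1.45 fewer true neighbours per query out of 100.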

Presburger commented 2 years ago

==34089== NVPROF is profiling process 34089, command: ./test
==34089== Profiling application: ./test
==34089== Profiling result:
            Type  Time(%)      Time   Calls       Avg       Min       Max  Name
 GPU activities:   25.56%  2.32929s    1250  1.8634ms  1.5279ms  2.3547ms  void faiss::gpu::pqCodeDistances<float, float, int=32, bool=1>(faiss::gpu::Tensor<float, int=2, bool=1, int, faiss::gpu::traits::DefaultPtrTraits>, int, faiss::gpu::pqCodeDistances<float, float, int=32, bool=1, float, int=2, bool=1, int, faiss::gpu::traits>, faiss::gpu::pqCodeDistances<float, float, int=32, bool=1, float, int=3, bool=1, int, faiss::gpu::traits>, faiss::gpu::pqCodeDistances<float, float, int=32, bool=1, int, int=2, bool=1, int, faiss::gpu::traits>, faiss::gpu::pqCodeDistances<float, float, int=32, bool=1, float, int=4, bool=1, int, faiss::gpu::traits>)
                   17.12%  1.56048s   29608  52.704us  26.144us  81.952us  void faiss::gpu::l2SelectMin1<float, int=8, int=256>(faiss::gpu::Tensor<float, int=2, bool=1, int, faiss::gpu::traits::DefaultPtrTraits>, faiss::gpu::l2SelectMin1<float, int=8, int=256, faiss::gpu::Tensor, int=1, bool=1, int, faiss::gpu::traits>, faiss::gpu::l2SelectMin1<float, int=8, int=256, unsigned char, int=1, bool=1, int, faiss::gpu::traits>, faiss::gpu::traits::DefaultPtrTraits, faiss::gpu::l2SelectMin1<float, int=8, int=256, int, int=2, bool=1, int, faiss::gpu::traits>)
                   17.10%  1.55805s   29628  52.586us  11.360us  182.88us  volta_sgemm_128x32_tn
                   13.76%  1.25412s    1250  1.0033ms  722.82us  1.5675ms  void faiss::gpu::pass1SelectLists(void, faiss::gpu::Tensor<int, int=2, bool=1, int, faiss::gpu::traits::DefaultPtrTraits>, faiss::gpu::traits::DefaultPtrTraits, void, void, int, int, faiss::gpu::IndicesOptions, void, void)
                   12.84%  1.17041s    1250  936.33us  890.98us  1.0710ms  void faiss::gpu::pqScanNoPrecomputedMultiPass<int=8, float, float4>(faiss::gpu::Tensor<float, int=2, bool=1, int, faiss::gpu::traits::DefaultPtrTraits>, faiss::gpu::pqScanNoPrecomputedMultiPass<int=8, float, float4, float, int=3, bool=1, int, faiss::gpu::traits>, faiss::gpu::pqScanNoPrecomputedMultiPass<int=8, float, float4, int, int=2, bool=1, int, faiss::gpu::traits>, faiss::gpu::pqScanNoPrecomputedMultiPass<int=8, float, float4, float, int=4, bool=1, int, faiss::gpu::traits>, void, int, faiss::gpu::pqScanNoPrecomputedMultiPass<int=8, float, float4, float, int=3, bool=1, int, faiss::gpu::traits>, faiss::gpu::pqScanNoPrecomputedMultiPass<int=8, float, float4, float, int=1, bool=1, int, faiss::gpu::traits>)
                    7.00%  637.67ms  197892  3.2220us     543ns  11.942ms  [CUDA memcpy HtoD]
                    1.64%  149.75ms   20380  7.3470us  2.4630us  42.080us  void faiss::gpu::sumAlongRows<float, bool=1>(faiss::gpu::Tensor<float, int=1, bool=1, int, faiss::gpu::traits::DefaultPtrTraits>, faiss::gpu::sumAlongRows<float, bool=1, faiss::gpu::Tensor, int=2, bool=1, int, faiss::gpu::traits>)
                    1.49%  135.51ms   65573  2.0660us  1.6310us  531.84us  void faiss::gpu::calcResidual<float, bool=0>(faiss::gpu::Tensor<float, int=2, bool=1, int, faiss::gpu::traits::DefaultPtrTraits>, faiss::gpu::calcResidual<float, bool=0, float, int=2, bool=1, int, faiss::gpu::traits>, faiss::gpu::calcResidual<float, bool=0, int, int=1, bool=1, int, faiss::gpu::traits>, faiss::gpu::traits::DefaultPtrTraits)
                    1.14%  103.45ms   65536  1.5780us  1.4390us  13.120us  void thrust::cuda_cub::core::_kernel_agent<thrust::cuda_cub::parallel_for::ParallelForAgent<thrust::cuda_cub::transform::unary_transform_f<long const , int, thrust::cuda_cub::transform::no_stencil_tag, faiss::gpu::Convert<long, int>, thrust::cuda_cub::__transform::always_true_predicate>, long>, thrust::cuda_cub::transform::unary_transform_f<long const , int, thrust::cuda_cub::__transform::no_stencil_tag, faiss::gpu::Convert<long, int>, thrust::cuda_cub::__transform::always_true_predicate>, long>(int, thrust::cuda_cub::transform::no_stencil_tag)
                    0.72%  66.048ms   66004  1.0000us     671ns  170.37us  [CUDA memcpy DtoH]
                    0.62%  56.355ms    1074  52.471us  7.8080us  798.08us  void faiss::gpu::l2NormRowMajor<float, float4, int, int=8, bool=0, bool=1>(faiss::gpu::Tensor<float4, int=2, bool=1, int, faiss::gpu::traits::DefaultPtrTraits>, faiss::gpu::l2NormRowMajor<float, float4, int, int=8, bool=0, bool=1, float, int=1, bool=1, float4, faiss::gpu::traits>)
                    0.26%  23.392ms    1250  18.713us  15.520us  25.312us  void faiss::gpu::pass2SelectLists(faiss::gpu::Tensor<float, int=2, bool=1, int, faiss::gpu::traits::DefaultPtrTraits>, faiss::gpu::pass2SelectLists<int=128, int=128, int=3, bool=0, int, int=2, bool=1, int, faiss::gpu::traits>, void*, faiss::gpu::Tensor<float, int=2, bool=1, int, faiss::gpu::traits::DefaultPtrTraits>, faiss::gpu::Tensor<float, int=2, bool=1, int, faiss::gpu::traits::DefaultPtrTraits>, int, faiss::gpu::IndicesOptions, faiss::gpu::traits::DefaultPtrTraits, faiss::gpu::pass2SelectLists<int=128, int=128, int=3, bool=0, long, int=2, bool=1, int, faiss::gpu::traits>)
                    0.24%  21.808ms      37  589.40us  72.544us  621.60us  void faiss::gpu::transposeOuter<float, int>(float const , faiss::gpu::transposeOuter<float, int>, int, faiss::gpu::transposeOuter<float, int>, faiss::gpu::transposeOuter<float, int>)
                    0.15%  13.569ms   16253     834ns     543ns  12.768us  [CUDA memset]
                    0.12%  11.226ms    6396  1.7550us  1.5670us  13.184us  [CUDA memcpy DtoD]
                    0.09%  8.0448ms      20  402.24us  228.64us  447.23us  void faiss::gpu::l2SelectMinK<float, int=512, int=8, int=128>(faiss::gpu::Tensor<float, int=2, bool=1, int, faiss::gpu::traits::DefaultPtrTraits>, faiss::gpu::l2SelectMinK<float, int=512, int=8, int=128, faiss::gpu::Tensor, int=1, bool=1, int, faiss::gpu::traits>, faiss::gpu::l2SelectMinK<float, int=512, int=8, int=128, unsigned char, int=1, bool=1, int, faiss::gpu::traits>, faiss::gpu::traits::DefaultPtrTraits, faiss::gpu::l2SelectMinK<float, int=512, int=8, int=128, int, int=2, bool=1, int, faiss::gpu::traits>, int, faiss::gpu::Tensor)
                    0.05%  4.3242ms    1250  3.4590us  2.8480us  14.720us  void cub::DeviceScanKernel<cub::DeviceScanPolicy::Policy600, int, int, cub::ScanTileState<int, bool=1>, thrust::plus, cub::NullType, int>(cub::DeviceScanPolicy::Policy600, int, int, int, int, bool=1, cub::ScanTileState<int, bool=1>)
                    0.04%  4.0452ms    1250  3.2360us  2.4640us  14.016us  faiss::gpu::getResultLengths(faiss::gpu::Tensor<int, int=2, bool=1, int, faiss::gpu::traits::DefaultPtrTraits>, int, int, faiss::gpu::Tensor<int, int=2, bool=1, int, faiss::gpu::traits::DefaultPtrTraits>)
                    0.02%  1.9222ms    1250  1.5370us  1.3440us  12.833us  void cub::DeviceScanInitKernel<cub::ScanTileState<int, bool=1>>(int, int)
                    0.01%  1.2610ms     221  5.7050us  4.1930us  7.6800us  void thrust::cuda_cub::core::_kernel_agent<thrust::cuda_cub::parallel_for::ParallelForAgent<thrust::cuda_cub::transform::unary_transform_f<int const , long, thrust::cuda_cub::__transform::no_stencil_tag, faiss::gpu::Convert<int, long>, thrust::cuda_cub::transform::always_true_predicate>, long>, thrust::cuda_cub::transform::unary_transform_f<int const , long, thrust::cuda_cub::__transform::no_stencil_tag, faiss::gpu::Convert<int, long>, thrust::cuda_cub::transform::always_true_predicate>, long>(long*, thrust::cuda_cub::transform::no_stencil_tag)
                    0.01%  1.2235ms      37  33.068us  6.3680us  37.248us  faiss::gpu::ivfpqAppend(faiss::gpu::Tensor<int, int=1, bool=1, int, faiss::gpu::traits::DefaultPtrTraits>, faiss::gpu::Tensor<int, int=1, bool=1, int, faiss::gpu::traits::DefaultPtrTraits>, faiss::gpu::Tensor<unsigned char, int=2, bool=1, int, faiss::gpu::traits::DefaultPtrTraits>, void**)
                    0.01%  477.79us      37  12.913us  3.4240us  13.472us  void faiss::gpu::transposeAny<unsigned char, unsigned int, int=2, int=-1>(faiss::gpu::TensorInfo<unsigned char, unsigned int>, unsigned int, unsigned char)
                    0.00%  397.73us      37  10.749us  3.0080us  11.649us  void thrust::cuda_cub::core::_kernel_agent<thrust::cuda_cub::parallel_for::ParallelForAgent<thrust::cuda_cub::transform::unary_transform_f<int const , unsigned char, thrust::cuda_cub::__transform::no_stencil_tag, faiss::gpu::Convert<int, unsigned char>, thrust::cuda_cub::transform::always_true_predicate>, long>, thrust::cuda_cub::transform::unary_transform_f<int const , unsigned char, thrust::cuda_cub::__transform::no_stencil_tag, faiss::gpu::Convert<int, unsigned char>, thrust::cuda_cub::transform::always_true_predicate>, long>(unsigned char*, thrust::cuda_cub::transform::no_stencil_tag)
                    0.00%  318.30us      37  8.6020us  3.2000us  9.1840us  faiss::gpu::ivfIndicesAppend(faiss::gpu::Tensor<int, int=1, bool=1, int, faiss::gpu::traits::DefaultPtrTraits>, faiss::gpu::Tensor<int, int=1, bool=1, int, faiss::gpu::traits::DefaultPtrTraits>, faiss::gpu::Tensor<long, int=1, bool=1, int, faiss::gpu::traits::DefaultPtrTraits>, faiss::gpu::IndicesOptions, void*)
                    0.00%  97.183us      37  2.6260us  2.4000us  2.9760us  faiss::gpu::runUpdateListPointers(faiss::gpu::Tensor<int, int=1, bool=1, int, faiss::gpu::traits::DefaultPtrTraits>, faiss::gpu::Tensor<int, int=1, bool=1, int, faiss::gpu::traits::DefaultPtrTraits>, faiss::gpu::Tensor<void, int=1, bool=1, int, faiss::gpu::traits::DefaultPtrTraits>, faiss::gpu::Tensor<void, int=1, bool=1, int, faiss::gpu::traits::DefaultPtrTraits>, int, void, void)
                    0.00%  15.456us       2  7.7280us  7.6800us  7.7760us  void faiss::gpu::transposeAny<float, unsigned int, int=3, int=-1>(faiss::gpu::TensorInfo<float, unsigned int>, unsigned int, float)
                    0.00%  3.5840us       2  1.7920us  1.7600us  1.8240us  void thrust::cuda_cub::core::_kernel_agent<thrust::cuda_cub::parallel_for::ParallelForAgent<thrust::cuda_cub::uninitialized_fill::functor<thrust::device_ptr<void>, void>, unsigned long>, thrust::cuda_cub::__uninitialized_fill::functor<thrust::device_ptr<void>, void>, unsigned long>(thrust::device_ptr<void>, void)
                    0.00%  1.5040us       1  1.5040us  1.5040us  1.5040us  void thrust::cuda_cub::core::_kernel_agent<thrust::cuda_cub::parallel_for::ParallelForAgent<thrust::cuda_cub::__uninitialized_fill::functor<thrust::device_ptr, int>, unsigned long>, thrust::cuda_cub::__uninitialized_fill::functor<thrust::device_ptr, int>, unsigned long>(thrust::device_ptr, int)
      API calls:   53.70%  4.46228s   67047  66.554us  1.0400us  52.440ms  cudaStreamSynchronize
                   18.10%  1.50406s  272559  5.5180us     168ns  8.4946ms  cudaMemcpyAsync
                    8.69%  722.23ms  221017  3.2670us  2.1400us  9.2982ms  cudaLaunchKernel
                    6.54%  543.19ms    8898  61.046us  1.4170us  346.00ms  cudaFree
                    3.78%  314.51ms       1  314.51ms  314.51ms  314.51ms  cudaHostAlloc
                    3.45%  286.33ms  658881     434ns     290ns  333.07us  cudaPointerGetAttributes
                    2.37%  197.17ms  856532     230ns     180ns  304.83us  cudaGetDevice
                    1.16%  96.249ms  739977     130ns      95ns  309.62us  cudaGetLastError
                    0.44%  36.972ms   16253  2.2740us  1.1750us  56.739us  cudaMemsetAsync
                    0.42%  34.524ms    8896  3.8800us  1.4380us  6.4497ms  cudaMalloc
                    0.35%  28.890ms   60506     477ns     293ns  21.269us  cudaOccupancyMaxActiveBlocksPerMultiprocessorWithFlags
                    0.26%  21.885ms       1  21.885ms  21.885ms  21.885ms  cudaFreeHost
                    0.26%  21.532ms   67061     321ns     196ns  48.158us  cudaDeviceGetAttribute
                    0.20%  16.524ms  136594     120ns      95ns  303.15us  cudaPeekAtLastError
                    0.09%  7.0760ms   11370     622ns     326ns  309.61us  cudaEventQuery
                    0.07%  6.1339ms   22740     269ns     207ns  15.645us  cudaStreamGetCaptureInfo
                    0.06%  5.2487ms    9312     563ns     354ns  53.730us  cudaEventRecord
                    0.02%  1.7035ms    2276     748ns     430ns  3.8690us  cudaStreamWaitEvent
                    0.02%  1.3204ms    1750     754ns     300ns  2.5280us  cudaEventCreateWithFlags
                    0.01%  937.49us    1750     535ns     253ns  17.569us  cudaEventDestroy
                    0.00%  304.52us     297  1.0250us      89ns  66.105us  cuDeviceGetAttribute
                    0.00%  115.36us     746     154ns      94ns     940ns  cuGetProcAddress
                    0.00%  73.483us       1  73.483us  73.483us  73.483us  cudaGetDeviceProperties
                    0.00%  54.867us       3  18.289us  9.5980us  28.382us  cuDeviceGetName
                    0.00%  26.657us       4  6.6640us  1.2140us  22.374us  cudaStreamCreateWithFlags
                    0.00%  11.165us       4  2.7910us  1.4510us  6.3230us  cudaStreamDestroy
                    0.00%  9.7120us       4  2.4280us     719ns  5.9370us  cudaDeviceSynchronize
                    0.00%  8.1420us       5  1.6280us     126ns  7.0080us  cudaGetDeviceCount
                    0.00%  4.9230us       1  4.9230us  4.9230us  4.9230us  cuDeviceGetPCIBusId
                    0.00%  3.8010us       2  1.9000us  1.8980us  1.9030us  cuInit
                    0.00%  2.9580us       1  2.9580us  2.9580us  2.9580us  cudaFuncGetAttributes
                    0.00%  2.2540us       5     450ns     131ns  1.2700us  cuDeviceGetCount
                    0.00%  1.5960us       4     399ns     197ns     901ns  cuDeviceGet
                    0.00%     949ns       3     316ns     195ns     513ns  cuDeviceTotalMem
                    0.00%     843ns       1     843ns     843ns     843ns  cudaGetSymbolAddress
                    0.00%     629ns       3     209ns     118ns     383ns  cuDeviceGetUuid
                    0.00%     250ns       2     125ns     123ns     127ns  cuDriverGetVersion

Presburger commented 2 years ago

==34437== NVPROF is profiling process 34437, command: ./test
==34437== Profiling application: ./test
==34437== Profiling result:
            Type  Time(%)      Time   Calls       Avg       Min       Max  Name
 GPU activities:   62.46%  13.6743s   67254  203.32us     672ns  11.811ms  [CUDA memcpy DtoH]
                    7.19%  1.57450s   29608  53.178us  26.049us  75.840us  void faiss::gpu::l2SelectMin1<float, int=8, int=256>(faiss::gpu::Tensor<float, int=2, bool=1, int, faiss::gpu::traits::DefaultPtrTraits>, faiss::gpu::l2SelectMin1<float, int=8, int=256, faiss::gpu::Tensor, int=1, bool=1, int, faiss::gpu::traits>, faiss::gpu::l2SelectMin1<float, int=8, int=256, unsigned char, int=1, bool=1, int, faiss::gpu::traits>, faiss::gpu::traits::DefaultPtrTraits, faiss::gpu::l2SelectMin1<float, int=8, int=256, int, int=2, bool=1, int, faiss::gpu::traits>)
                    5.04%  1.10281s    1250  882.25us  874.05us  889.00us  void faiss::gpu::pqScanNoPrecomputedMultiPass<int=8, float, float4>(faiss::gpu::Tensor<float, int=2, bool=1, int, faiss::gpu::traits::DefaultPtrTraits>, faiss::gpu::pqScanNoPrecomputedMultiPass<int=8, float, float4, float, int=3, bool=1, int, faiss::gpu::traits>, faiss::gpu::pqScanNoPrecomputedMultiPass<int=8, float, float4, int, int=2, bool=1, int, faiss::gpu::traits>, faiss::gpu::pqScanNoPrecomputedMultiPass<int=8, float, float4, float, int=4, bool=1, int, faiss::gpu::traits>, void, int*, faiss::gpu::pqScanNoPrecomputedMultiPass<int=8, float, float4, float, int=3, bool=1, int, faiss::gpu::traits>, faiss::gpu::pqScanNoPrecomputedMultiPass<int=8, float, float4, float, int=1, bool=1, int, faiss::gpu::traits>)
                    4.38%  958.22ms    1250  766.58us  698.53us  858.66us  void faiss::gpu::pass1SelectLists(void, faiss::gpu::Tensor<int, int=2, bool=1, int, faiss::gpu::traits::DefaultPtrTraits>, faiss::gpu::traits::DefaultPtrTraits, void, void, int, int, faiss::gpu::IndicesOptions, void, void*)
                    4.11%  899.52ms    7580  118.67us  81.505us  155.17us  volta_sgemm_128x32_tn
                    2.70%  591.54ms    1250  473.23us  466.53us  485.25us  void faiss::gpu::sumAlongRows<float, bool=0>(faiss::gpu::Tensor<float, int=1, bool=1, int, faiss::gpu::traits::DefaultPtrTraits>, faiss::gpu::sumAlongRows<float, bool=0, faiss::gpu::Tensor, int=2, bool=1, int, faiss::gpu::traits>)
                    2.59%  566.21ms    1287  439.94us  70.112us  686.66us  void faiss::gpu::transposeOuter<float, int>(float const , faiss::gpu::transposeOuter<float, int>, int, faiss::gpu::transposeOuter<float, int>, faiss::gpu::transposeOuter<float, int>)
                    2.49%  544.30ms  197882  2.7500us     543ns  19.054ms  [CUDA memcpy HtoD]
                    2.46%  538.48ms    1250  430.78us  427.20us  438.02us  void faiss::gpu::sumAlongColumns<float4, int=16, int=4, int=4>(faiss::gpu::Tensor<float4, int=1, bool=1, int, faiss::gpu::traits::DefaultPtrTraits>, faiss::gpu::sumAlongColumns<float4, int=16, int=4, int=4, faiss::gpu::Tensor, int=2, bool=1, int, faiss::gpu::traits>)
                    2.22%  486.56ms   22048  22.068us  11.008us  43.008us  volta_sgemm_32x128_tn
                    1.51%  331.15ms    1250  264.92us  256.10us  275.49us  volta_sgemm_32x128_nn
                    0.68%  148.95ms   20380  7.3080us  2.4000us  50.048us  void faiss::gpu::sumAlongRows<float, bool=1>(faiss::gpu::Tensor<float, int=1, bool=1, int, faiss::gpu::traits::DefaultPtrTraits>, faiss::gpu::sumAlongRows<float, bool=1, faiss::gpu::Tensor, int=2, bool=1, int, faiss::gpu::traits>)
                    0.60%  130.99ms   65573  1.9970us  1.6310us  485.06us  void faiss::gpu::calcResidual<float, bool=0>(faiss::gpu::Tensor<float, int=2, bool=1, int, faiss::gpu::traits::DefaultPtrTraits>, faiss::gpu::calcResidual<float, bool=0, float, int=2, bool=1, int, faiss::gpu::traits>, faiss::gpu::calcResidual<float, bool=0, int, int=1, bool=1, int, faiss::gpu::traits>, faiss::gpu::traits::DefaultPtrTraits)
                    0.47%  102.42ms   65536  1.5620us  1.4390us  13.088us  void thrust::cuda_cub::core::_kernel_agent<thrust::cuda_cub::parallel_for::ParallelForAgent<thrust::cuda_cub::transform::unary_transform_f<long const , int, thrust::cuda_cub::transform::no_stencil_tag, faiss::gpu::Convert<long, int>, thrust::cuda_cub::__transform::always_true_predicate>, long>, thrust::cuda_cub::transform::unary_transform_f<long const , int, thrust::cuda_cub::__transform::no_stencil_tag, faiss::gpu::Convert<long, int>, thrust::cuda_cub::__transform::always_true_predicate>, long>(int, thrust::cuda_cub::__transform::no_stencil_tag)
                    0.42%  91.212ms    3500  26.060us  7.2320us  76.897us  void faiss::gpu::l2NormRowMajor<float, float, int, int=8, bool=0, bool=1>(faiss::gpu::Tensor<float, int=2, bool=1, int, faiss::gpu::traits::DefaultPtrTraits>, faiss::gpu::l2NormRowMajor<float, float, int, int=8, bool=0, bool=1, float, int=1, bool=1, float, faiss::gpu::traits>)
                    0.22%  47.383ms    1250  37.906us  36.320us  46.976us  void faiss::gpu::pqResidualVector<float, bool=1>(faiss::gpu::Tensor<float, int=2, bool=1, int, faiss::gpu::traits::DefaultPtrTraits>, faiss::gpu::pqResidualVector<float, bool=1, float, int=2, bool=1, int, faiss::gpu::traits>, faiss::gpu::pqResidualVector<float, bool=1, int, int=2, bool=1, int, faiss::gpu::traits>, int, faiss::gpu::pqResidualVector<float, bool=1, float, int=4, bool=1, int, faiss::gpu::traits>)
                    0.10%  22.816ms    1250  18.253us  17.408us  28.064us  void faiss::gpu::pass2SelectLists(faiss::gpu::Tensor<float, int=2, bool=1, int, faiss::gpu::traits::DefaultPtrTraits>, faiss::gpu::pass2SelectLists<int=128, int=128, int=3, bool=0, int, int=2, bool=1, int, faiss::gpu::traits>, void, faiss::gpu::Tensor<float, int=2, bool=1, int, faiss::gpu::traits::DefaultPtrTraits>, faiss::gpu::Tensor<float, int=2, bool=1, int, faiss::gpu::traits::DefaultPtrTraits>, int, faiss::gpu::IndicesOptions, faiss::gpu::traits::DefaultPtrTraits, faiss::gpu::pass2SelectLists<int=128, int=128, int=3, bool=0, long, int=2, bool=1, int, faiss::gpu::traits>)
                    0.09%  18.748ms      64  292.95us  10.784us  1.2343ms  void faiss::gpu::l2NormRowMajor<float, float4, int, int=8, bool=0, bool=1>(faiss::gpu::Tensor<float4, int=2, bool=1, int, faiss::gpu::traits::DefaultPtrTraits>, faiss::gpu::l2NormRowMajor<float, float4, int, int=8, bool=0, bool=1, float, int=1, bool=1, float4, faiss::gpu::traits>)
                    0.07%  14.318ms   16253     880ns     543ns  9.8240us  [CUDA memset]
                    0.06%  12.170ms    6396  1.9020us  1.5680us  12.992us  [CUDA memcpy DtoD]
                    0.05%  10.243ms    1252  8.1800us  6.4320us  17.952us  void faiss::gpu::transposeAny<float, unsigned int, int=3, int=-1>(faiss::gpu::TensorInfo<float, unsigned int>, unsigned int, float)
                    0.04%  8.3235ms      20  416.17us  212.55us  470.72us  void faiss::gpu::l2SelectMinK<float, int=512, int=8, int=128>(faiss::gpu::Tensor<float, int=2, bool=1, int, faiss::gpu::traits::DefaultPtrTraits>, faiss::gpu::l2SelectMinK<float, int=512, int=8, int=128, faiss::gpu::Tensor, int=1, bool=1, int, faiss::gpu::traits>, faiss::gpu::l2SelectMinK<float, int=512, int=8, int=128, unsigned char, int=1, bool=1, int, faiss::gpu::traits>, faiss::gpu::traits::DefaultPtrTraits, faiss::gpu::l2SelectMinK<float, int=512, int=8, int=128, int, int=2, bool=1, int, faiss::gpu::traits>, int, faiss::gpu::Tensor)
                    0.03%  6.2340ms    1250  4.9870us  2.5600us  12.384us  void cub::DeviceScanKernel<cub::DeviceScanPolicy::Policy600, int, int, cub::ScanTileState<int, bool=1>, thrust::plus, cub::NullType, int>(cub::DeviceScanPolicy::Policy600, int, int, int, int, bool=1, cub::ScanTileState<int, bool=1>)
                    0.03%  5.6586ms    1250  4.5260us  2.7520us  12.160us  faiss::gpu::getResultLengths(faiss::gpu::Tensor<int, int=2, bool=1, int, faiss::gpu::traits::DefaultPtrTraits>, int, int, faiss::gpu::Tensor<int, int=2, bool=1, int, faiss::gpu::traits::DefaultPtrTraits>)
                    0.01%  2.4453ms    1250  1.9560us  1.4080us  12.096us  void cub::DeviceScanInitKernel<cub::ScanTileState<int, bool=1>>(int, int)
                    0.01%  1.2798ms     211  6.0650us  4.2560us  17.760us  void thrust::cuda_cub::core::_kernel_agent<thrust::cuda_cub::parallel_for::ParallelForAgent<thrust::cuda_cub::transform::unary_transform_f<int const , long, thrust::cuda_cub::transform::no_stencil_tag, faiss::gpu::Convert<int, long>, thrust::cuda_cub::__transform::always_true_predicate>, long>, thrust::cuda_cub::transform::unary_transform_f<int const , long, thrust::cuda_cub::__transform::no_stencil_tag, faiss::gpu::Convert<int, long>, thrust::cuda_cub::__transform::always_true_predicate>, long>(long, thrust::cuda_cub::transform::no_stencil_tag)
                    0.01%  1.2657ms      37  34.208us  6.2080us  43.584us  faiss::gpu::ivfpqAppend(faiss::gpu::Tensor<int, int=1, bool=1, int, faiss::gpu::traits::DefaultPtrTraits>, faiss::gpu::Tensor<int, int=1, bool=1, int, faiss::gpu::traits::DefaultPtrTraits>, faiss::gpu::Tensor<unsigned char, int=2, bool=1, int, faiss::gpu::traits::DefaultPtrTraits>, void**)
                    0.00%  491.94us      37  13.295us  3.3600us  16.448us  void faiss::gpu::transposeAny<unsigned char, unsigned int, int=2, int=-1>(faiss::gpu::TensorInfo<unsigned char, unsigned int>, unsigned int, unsigned char)
                    0.00%  395.55us      37  10.690us  3.1040us  11.456us  void thrust::cuda_cub::core::_kernel_agent<thrust::cuda_cub::parallel_for::ParallelForAgent<thrust::cuda_cub::transform::unary_transform_f<int const , unsigned char, thrust::cuda_cub::__transform::no_stencil_tag, faiss::gpu::Convert<int, unsigned char>, thrust::cuda_cub::transform::always_true_predicate>, long>, thrust::cuda_cub::transform::unary_transform_f<int const , unsigned char, thrust::cuda_cub::__transform::no_stencil_tag, faiss::gpu::Convert<int, unsigned char>, thrust::cuda_cub::transform::always_true_predicate>, long>(unsigned char*, thrust::cuda_cub::transform::no_stencil_tag)
                    0.00%  328.61us      37  8.8810us  3.4240us  10.752us  faiss::gpu::ivfIndicesAppend(faiss::gpu::Tensor<int, int=1, bool=1, int, faiss::gpu::traits::DefaultPtrTraits>, faiss::gpu::Tensor<int, int=1, bool=1, int, faiss::gpu::traits::DefaultPtrTraits>, faiss::gpu::Tensor<long, int=1, bool=1, int, faiss::gpu::traits::DefaultPtrTraits>, faiss::gpu::IndicesOptions, void*)
                    0.00%  106.91us      37  2.8890us  2.4000us  12.577us  faiss::gpu::runUpdateListPointers(faiss::gpu::Tensor<int, int=1, bool=1, int, faiss::gpu::traits::DefaultPtrTraits>, faiss::gpu::Tensor<int, int=1, bool=1, int, faiss::gpu::traits::DefaultPtrTraits>, faiss::gpu::Tensor<void, int=1, bool=1, int, faiss::gpu::traits::DefaultPtrTraits>, faiss::gpu::Tensor<void, int=1, bool=1, int, faiss::gpu::traits::DefaultPtrTraits>, int, void, void)
                    0.00%  4.1920us       2  2.0960us  2.0160us  2.1760us  void thrust::cuda_cub::core::_kernel_agent<thrust::cuda_cub::parallel_for::ParallelForAgent<thrust::cuda_cub::uninitialized_fill::functor<thrust::device_ptr<void>, void>, unsigned long>, thrust::cuda_cub::__uninitialized_fill::functor<thrust::device_ptr<void>, void>, unsigned long>(thrust::device_ptr<void>, void)
                    0.00%  1.7600us       1  1.7600us  1.7600us  1.7600us  void thrust::cuda_cub::core::_kernel_agent<thrust::cuda_cub::parallel_for::ParallelForAgent<thrust::cuda_cub::__uninitialized_fill::functor<thrust::device_ptr, int>, unsigned long>, thrust::cuda_cub::__uninitialized_fill::functor<thrust::device_ptr, int>, unsigned long>(thrust::device_ptr, int)
      API calls:   76.95%  18.4609s  273799  67.425us     166ns  19.084ms  cudaMemcpyAsync
                   13.28%  3.18609s   67037  47.527us     956ns  65.879ms  cudaStreamSynchronize
                    3.04%  730.49ms  229747  3.1790us  2.2600us  336.55us  cudaLaunchKernel
                    2.22%  531.73ms    8898  59.758us  1.3970us  346.10ms  cudaFree
                    1.33%  320.06ms       1  320.06ms  320.06ms  320.06ms  cudaHostAlloc
                    1.16%  278.80ms  661411     421ns     291ns  335.22us  cudaPointerGetAttributes
                    0.83%  199.25ms  872592     228ns     179ns  303.96us  cudaGetDevice
                    0.41%  97.584ms  749887     130ns      94ns  337.00us  cudaGetLastError
                    0.15%  35.114ms   16253  2.1600us  1.1770us  55.222us  cudaMemsetAsync
                    0.13%  30.461ms   68006     447ns     276ns  15.692us  cudaOccupancyMaxActiveBlocksPerMultiprocessorWithFlags
                    0.12%  27.625ms    8896  3.1050us  1.4370us  229.80us  cudaMalloc
                    0.09%  21.953ms       1  21.953ms  21.953ms  21.953ms  cudaFreeHost
                    0.09%  20.507ms   67051     305ns     197ns  19.511us  cudaDeviceGetAttribute
                    0.07%  16.489ms  136574     120ns      94ns  48.723us  cudaPeekAtLastError
                    0.04%  8.6702ms   15503     559ns     350ns  312.11us  cudaEventRecord
                    0.03%  7.0505ms   11370     620ns     321ns  312.18us  cudaEventQuery
                    0.03%  6.0012ms   22740     263ns     204ns  15.460us  cudaStreamGetCaptureInfo
                    0.02%  5.0335ms    8477     593ns     417ns  15.901us  cudaStreamWaitEvent
                    0.02%  4.6039ms    7941     579ns     285ns  13.704us  cudaEventCreateWithFlags
                    0.01%  3.1450ms    7941     396ns     248ns  3.7710us  cudaEventDestroy
                    0.00%  239.88us     297     807ns      89ns  48.338us  cuDeviceGetAttribute
                    0.00%  120.67us     746     161ns      93ns  5.0250us  cuGetProcAddress
                    0.00%  75.421us       1  75.421us  75.421us  75.421us  cudaGetDeviceProperties
                    0.00%  50.746us       3  16.915us  9.7000us  26.423us  cuDeviceGetName
                    0.00%  26.569us       4  6.6420us  1.2320us  22.632us  cudaStreamCreateWithFlags
                    0.00%  10.877us       4  2.7190us  1.3720us  6.5040us  cudaStreamDestroy
                    0.00%  6.2480us       4  1.5620us     679ns  2.8250us  cudaDeviceSynchronize
                    0.00%  5.2100us       1  5.2100us  5.2100us  5.2100us  cuDeviceGetPCIBusId
                    0.00%  4.8250us       5     965ns     147ns  3.9070us  cudaGetDeviceCount
                    0.00%  3.2240us       1  3.2240us  3.2240us  3.2240us  cudaFuncGetAttributes
                    0.00%  3.0260us       2  1.5130us  1.4750us  1.5510us  cuInit
                    0.00%  1.3430us       4     335ns     110ns     503ns  cuDeviceGet
                    0.00%  1.3080us       5     261ns     130ns     680ns  cuDeviceGetCount
                    0.00%     976ns       1     976ns     976ns     976ns  cudaGetSymbolAddress
                    0.00%     681ns       3     227ns     211ns     238ns  cuDeviceTotalMem
                    0.00%     404ns       3     134ns     120ns     157ns  cuDeviceGetUuid
                    0.00%     241ns       2     120ns     120ns     121ns  cuDriverGetVersion
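
The two nvprof summaries above differ most in the memory-copy rows. A small Python sketch can put a few of the totals side by side; the thread does not say which configuration each run corresponds to, so the snippet only labels them "first" and "second", and the helper `to_seconds` is a hypothetical name, not an nvprof or faiss API.

```python
import re

# Seconds per unit for the time suffixes that appear in the nvprof output.
_UNIT_SECONDS = {"s": 1.0, "ms": 1e-3, "us": 1e-6, "ns": 1e-9}


def to_seconds(text):
    """Convert an nvprof time string such as '66.048ms' to seconds."""
    match = re.fullmatch(r"([0-9]*\.?[0-9]+)(ns|us|ms|s)", text)
    if match is None:
        raise ValueError(f"unrecognised nvprof time: {text!r}")
    value, unit = match.groups()
    return float(value) * _UNIT_SECONDS[unit]


# Total times copied from the two summaries above: (first run, second run).
totals = {
    "[CUDA memcpy DtoH]": ("66.048ms", "13.6743s"),
    "cudaMemcpyAsync": ("1.50406s", "18.4609s"),
    "cudaStreamSynchronize": ("4.46228s", "3.18609s"),
}

ratios = {}
for name, (first, second) in totals.items():
    ratios[name] = to_seconds(second) / to_seconds(first)
    print(f"{name:<24} {first:>10} -> {second:>10}  x{ratios[name]:.1f}")
```

With these numbers, total device-to-host copy time is roughly 200 times larger in the second run (66.048 ms vs 13.6743 s) even though the call counts are similar (66004 vs 67254), so the average copy is much larger rather than more frequent.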

yanliang567 commented 2 years ago

Update the baseline first.