Recall rate problem - Githubissues

realdalabengba commented 3 years ago

Hi, I created the table with the definition below. The data set has 200 million vectors and each query retrieves the top 1000 results, but only 80% of them match the brute-force search results. Why is the recall so low?

{
  "name": "piv_vec",
  "partition_num": 1,
  "replica_num": 1,
  "engine": {
    "name": "gamma",
    "index_size": 3276800,
    "retrieval_type": "IVFPQ",
    "retrieval_param": {
      "metric_type": "L2",
      "ncentroids": 65536,
      "nsubvector": 128,
      "hnsw": {
        "nlinks": 32,
        "efConstruction": 200,
        "efSearch": 64
      }
    }
  },
  "properties": {
    "pic_vector": {
      "type": "vector",
      "dimension": 512,
      "store_type": "RocksDB",
      "store_param": {
        "cache_size": 1024,
        "compress": {
          "rate": 16
        }
      }
    }
  }
}

zcdb commented 3 years ago

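What is the vector dimension, and what nprobe do you set at search time? In general, recall@1 for IVFPQ is not very high, so around 80% should be normal. You can also enable reranking by setting "quick": false at search time, which can improve recall@1.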
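To make the point about IVFPQ being lossy a bit more concrete, here is a rough sketch of how much each vector is squeezed by product quantization with the parameters from the first post (dimension 512, nsubvector 128). The function name is hypothetical, and `bits_per_code=8` is an assumption (a common default for PQ), not something stated in this thread.

```python
def pq_code_stats(dimension: int, nsubvector: int,
                  bits_per_code: int = 8, bytes_per_float: int = 4) -> dict:
    """Rough size of one PQ code versus the raw float vector.

    bits_per_code=8 is an assumed value, not taken from the thread.
    """
    if dimension % nsubvector != 0:
        raise ValueError("dimension must be divisible by nsubvector")
    return {
        # Number of original dimensions represented by one sub-quantizer code.
        "sub_dim": dimension // nsubvector,
        # Bytes stored per vector in the PQ index.
        "code_bytes": nsubvector * bits_per_code / 8,
        # Bytes of the uncompressed float vector.
        "raw_bytes": dimension * bytes_per_float,
    }


stats = pq_code_stats(dimension=512, nsubvector=128)
print(stats)
```

Under these assumptions each 2048-byte vector is ranked from a 128-byte code, which is why exact reranking on the original vectors ("quick": false) can recover some of the lost recall.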

realdalabengba commented 3 years ago

The vectors are 512-dimensional. I tried nprobe values of 100, 200, and 800, with little change. The search already sets "quick": false.

ljeagle commented 3 years ago

With 200 million vectors in total, how did you obtain the brute-force results?

realdalabengba commented 3 years ago

By searching with is_brute_search=1.
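For anyone reproducing this comparison, a minimal sketch of how the overlap could be scored is shown here. It assumes you already have, for each query, one list of ids from the normal IVFPQ search and one from the search with is_brute_search=1; the function names are hypothetical and no vearch API is called.

```python
from typing import Sequence


def recall_at_k(approx_ids: Sequence, exact_ids: Sequence, k: int) -> float:
    """Fraction of the exact top-k ids that also appear in the approximate top-k."""
    exact_top = set(exact_ids[:k])
    if not exact_top:
        return 0.0
    approx_top = set(approx_ids[:k])
    return len(exact_top & approx_top) / len(exact_top)


def mean_recall(approx_results: Sequence[Sequence],
                exact_results: Sequence[Sequence], k: int) -> float:
    """Average recall@k over all queries."""
    if len(approx_results) != len(exact_results):
        raise ValueError("both result sets must cover the same queries")
    if not approx_results:
        raise ValueError("no queries to evaluate")
    scores = [recall_at_k(a, e, k) for a, e in zip(approx_results, exact_results)]
    return sum(scores) / len(scores)


# Toy data: two queries, top-2 results each.
approx = [[1, 2], [3, 4]]
exact = [[1, 2], [4, 5]]
print(mean_recall(approx, exact, k=2))
```

Averaging over many queries matters here, since a single query can land anywhere between 0 and 1.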

rrjia commented 3 years ago

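I ran into the same problem. With 100 million vectors, searching with a feature vector does not even return that vector itself. Table definition: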

 "dynamic_schema": "strict",
        "partition_num": 3,
        "replica_num": 2,
        "engine": {"name": "gamma",
                   "index_size": 81920,  # [ncentroids * 39, ncentroids * 256]
                   "max_size": 1000000000,
                   "id_type": "Long",
                   "retrieval_type": "IVFPQ",
                   "retrieval_param": {
                       "metric_type": "InnerProduct",
                       "ncentroids": 2048,
                       "nsubvector": 64,
                       "hnsw": {
                           "nlinks": 32,
                           "efConstruction": 200,
                           "efSearch": 64
                       },
                       "opq": {
                           "nsubvector": 64
                       }
                   }
                  },

At search time, nprobe is already set to 2048.
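The inline comment next to index_size in the snippet above suggests a range of [ncentroids * 39, ncentroids * 256]. As a quick sanity check, the sketch below applies that range to the three configurations posted in this thread. The range is taken only from that comment and is not verified against the vearch documentation; the function names are hypothetical.

```python
def index_size_bounds(ncentroids: int) -> tuple:
    """Range from the inline comment: [ncentroids * 39, ncentroids * 256]."""
    return ncentroids * 39, ncentroids * 256


def index_size_in_range(index_size: int, ncentroids: int) -> bool:
    """True if index_size falls inside the commented range."""
    low, high = index_size_bounds(ncentroids)
    return low <= index_size <= high


# (index_size, ncentroids) pairs as posted in this thread.
configs = {
    "piv_vec": (3276800, 65536),
    "rrjia": (81920, 2048),
    "pic512": (1228800, 8192),
}
for name, (index_size, ncentroids) in configs.items():
    print(name, index_size_bounds(ncentroids), index_size_in_range(index_size, ncentroids))
```

All three settings fall inside the commented range, so index_size alone does not look like the outlier here.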

zcdb commented 3 years ago

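When ncentroids is small, combining it with hnsw is not a good fit. Also, about not being able to recall the vector itself: how many queries did you test? It is best to test more queries when evaluating recall.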

rrjia commented 3 years ago

I tested 100 queries, and not one of them found itself.
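A self-retrieval check like this one can be scored with a few lines. The sketch assumes each query is a vector already stored in the table, that its document id is known, and that the search results have been collected as lists of ids; the function name is hypothetical.

```python
from typing import Sequence


def self_hit_rate(query_ids: Sequence, result_ids: Sequence[Sequence], k: int) -> float:
    """Fraction of queries whose own id appears in their top-k results."""
    if len(query_ids) != len(result_ids):
        raise ValueError("need exactly one result list per query")
    if not query_ids:
        raise ValueError("no queries to evaluate")
    hits = sum(1 for qid, ids in zip(query_ids, result_ids) if qid in ids[:k])
    return hits / len(query_ids)


# Toy data: three stored vectors used as queries, top-2 results each.
queries = [1, 2, 3]
results = [[1, 9], [8, 7], [3, 5]]
print(self_hit_rate(queries, results, k=2))
```

A rate of 0.0 over 100 queries, as reported here, points to something systematic rather than ordinary approximation loss.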

realdalabengba commented 3 years ago

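@ljeagle Hi, comparing against is_brute_search=1 is a valid approach, right? What could be causing the low recall? Are index_size and ncentroids set unreasonably? Or is the amount of data per partition too large?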

realdalabengba commented 3 years ago

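"Not one of them found itself": if compression is enabled, the trailing digits (mantissa) of the vector values change when they are stored. Try searching with a vector returned by a previous search and see whether it finds itself.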

rrjia commented 3 years ago

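Compression is not enabled, and searching with a retrieved vector gives the same result: not a single correct hit in the top 100, and the vector cannot even find itself. This is absurd.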

ljeagle commented 3 years ago

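"@ljeagle Hi, comparing against is_brute_search=1 is a valid approach, right? What could be causing the low recall? Are index_size and ncentroids set unreasonably? Or is the amount of data per partition too large?"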

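My guess is that the data set is too large and the model parameters are not well suited to it, which leads to poor recall. Brute-force search over 200 million vectors must be slow; how long does it take to return? And if you reduce the data to 10 million vectors, what recall do you get with the same model?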

ljeagle commented 3 years ago

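The hnsw and ivf parameters we recommended earlier were all measured on about 20 million vectors, where, as I recall, recall was above 95%. With 200 million vectors, nearly 10 times as many, the same model parameters may not even be trained adequately.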
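One rough way to see how the data scale interacts with ncentroids is the average number of vectors per IVF cluster. The sketch below assumes the data is split evenly across partitions and that each partition builds its own index with the configured ncentroids; both are assumptions, not confirmed in this thread, and the function name is hypothetical.

```python
def avg_vectors_per_centroid(total_vectors: int, ncentroids: int,
                             partition_num: int = 1) -> float:
    """Average inverted-list length, assuming an even split across partitions."""
    if ncentroids <= 0 or partition_num <= 0:
        raise ValueError("ncentroids and partition_num must be positive")
    return total_vectors / partition_num / ncentroids


# Data sizes and settings as posted in this thread.
print(avg_vectors_per_centroid(200_000_000, 65536, partition_num=1))
print(avg_vectors_per_centroid(100_000_000, 2048, partition_num=3))
print(avg_vectors_per_centroid(4_600_000, 8192, partition_num=3))
```

The three setups differ by orders of magnitude on this measure, so parameters tuned at one scale need not carry over to another.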

realdalabengba commented 3 years ago

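Brute-force search over 200 million vectors is slow; it takes a little over ten minutes to return. Is there a recommended parameter configuration for this data scale?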

realdalabengba commented 2 years ago

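I switched to a different configuration (table definition below, with 4.6 million vectors written in total) and ran two tests, each with 100 query vectors, nprobe=100, recall=100. Compared with brute-force search (is_brute_search=1), the match rates were 90.5% and 81.14%. That still does not seem ideal?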

{
  "name": "pic512",
  "partition_num": 3,
  "replica_num": 3,
  "engine": {
    "name": "gamma",
    "index_size": 1228800,
    "id_type": "Long",
    "retrieval_type": "IVFPQ",
    "retrieval_param": {
      "metric_type": "L2",
      "ncentroids": 8192,
      "nsubvector": 128
    }
  },
  "properties": {
    "pic_vector": {
      "type": "vector",
      "dimension": 512,
      "store_type": "RocksDB",
      "store_param": {
        "cache_size": 128,
        "compress": {
          "rate": 16
        }
      }
    }
  }
}

vearch / vearch

Recall rate problem #533