stepstone-tech / hnswlib-jna

Native-Like Performance for Nearest Neighbor Search in Java Applications using Hnswlib and Java Native Access
Apache License 2.0

[Question] What's the corresponding add_items in this project compared to nmslib/hnswlib #13

Open LizzyMiao opened 1 year ago

LizzyMiao commented 1 year ago

Hi, I am currently working on a project that needs millions, sometimes even billions, of vectors to be inserted to build up a graph. I followed example.py in https://github.com/nmslib/hnswlib/tree/master with 4000K vectors, using the code below:

import datetime
import hnswlib

# vectorNP, ids, dim and num_elements are assumed to be defined beforehand
p = hnswlib.Index('l2', dim)
print("before build ", datetime.datetime.now())
p.init_index(max_elements = num_elements, ef_construction = 128, M = 16)
p.add_items(vectorNP, ids)
p.save_index("/Users/XXX/Projects/builder/hnsw-embedding-test/python_test/combined.bin")

It took around 2 minutes to finish.

But when I use libhnswlib-jna-x86-64 with 16 cores, via the following code:

      val hnswIndex = new ConcurrentIndex(SpaceName.L2, dimension)
      hnswIndex.initialize(3890521, 16, 128, 42)
      val embeddingRecordsPar = parquet4sReader.toList.par
      embeddingRecordsPar.tasksupport = new ForkJoinTaskSupport(new ForkJoinPool(16))
      embeddingRecordsPar.foreach{ eb =>
        val ba = eb.vectors.head
        if (ba.length > 0) {
          val vector = RawEmbedding.toVector(RichByteArray(ba).asByteBuffer, dimension, "float16")
          hnswIndex.addNormalizedItem(vector, i)
          i = i + 1
        }
      }

it takes around 15-16 minutes (the time is the same if I change ConcurrentIndex to Index or use Index.synchronizedIndex). Both snippets above were run on my local machine. Is there an equivalent of add_items in hnswlib-jna, or any other way to speed up building the graph?
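
One thing worth checking in the Scala snippet, separate from the speed question: `i` is read and incremented from many threads without synchronization, so ids can end up duplicated or skipped. Below is a minimal sketch of one way to avoid that, drawing ids from an atomic counter. It is written in Java against a stand-in interface, not the real index; `ParallelInsertSketch`, `VectorSink`, and `insertAll` are hypothetical names introduced only for illustration, and the one-task-per-vector submission is a simplification.

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class ParallelInsertSketch {

    /** Hypothetical stand-in for the index: anything that accepts (vector, id). */
    public interface VectorSink {
        void add(float[] vector, int id);
    }

    /**
     * Inserts every non-empty vector using a fixed pool of worker threads.
     * Ids come from an AtomicInteger, so each inserted vector gets a unique id
     * even though the inserts run concurrently.
     * Returns the number of ids handed out.
     */
    public static int insertAll(List<float[]> vectors, VectorSink sink, int threads)
            throws InterruptedException {
        AtomicInteger nextId = new AtomicInteger(0);
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (float[] v : vectors) {
            if (v.length == 0) {
                continue; // skip empty vectors, as the original loop does
            }
            // getAndIncrement is atomic: no two tasks can observe the same id
            pool.execute(() -> sink.add(v, nextId.getAndIncrement()));
        }
        pool.shutdown();                            // no further tasks will be submitted
        pool.awaitTermination(1, TimeUnit.HOURS);   // wait for queued inserts to finish
        return nextId.get();
    }
}
```

With the index from the snippet above, the sink would presumably be something like `(v, id) -> hnswIndex.addNormalizedItem(v, id)`. This only addresses id correctness; it is not expected to change the build time by itself.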