matsui528 / nanopq

Pure python implementation of product quantization for nearest neighbor search
MIT License
323 stars 43 forks

Centroid of Centroids using NanoPQ #7

Closed: ashleyabraham closed this issue 4 years ago

ashleyabraham commented 4 years ago

I am looking into computing a centroid of centroids using NanoPQ. Is this possible? I have a first-level NanoPQ model with M=4, K=16, D=24. The codewords it produces have shape (4, 16, 6). Can this output be passed as input to a second-level NanoPQ to calculate a centroid of centroids? I am investigating this because I am processing large datasets and want to reduce processing time.

matsui528 commented 4 years ago

I'm not sure whether such a nested PQ is useful, because a single PQ with larger parameters would usually be better. But the following nested PQ should work.

import nanopq
import numpy as np

N, D = 1000, 24
X = np.random.random((N, D)).astype(np.float32)  # 1,000 24-dim vectors

# Instantiate with M=4 sub-spaces and Ks=16 centroids per sub-space
M, Ks = 4, 16
pq = nanopq.PQ(M=M, Ks=Ks)

# Train codewords
pq.fit(X)

# codewords
# The shape is (4, 16, 6), which means:
# - 4 subspaces
# - 16 codewords for each subspace
# - Each codeword is a 6-dim vector
print(pq.codewords.shape)  

# Given the codewords, train second-level PQ instances
# For each subspace, create a PQ instance, with M=2 and Ks=4
second_level_pqs = []
for m in range(M):
    second_level_pq = nanopq.PQ(M=2, Ks=4)
    second_level_pq.fit(pq.codewords[m])  # Train by corresponding codewords
    second_level_pqs.append(second_level_pq)

# Check
print(second_level_pqs[0].codewords.shape) # shape = (2, 4, 3)
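
The two shapes printed above follow from simple arithmetic: a codebook has shape (M, Ks, D / M). Below is a minimal sketch of that bookkeeping in plain Python. The helper `codebook_shape` is hypothetical, written only for illustration, and is not part of nanopq.

```python
def codebook_shape(D, M, Ks):
    """Return the (M, Ks, Ds) shape of a PQ codebook for D-dim input vectors.

    Hypothetical helper for illustration only; it is not part of nanopq.
    """
    if D % M != 0:
        raise ValueError(f"D={D} must be divisible by M={M}")
    return (M, Ks, D // M)


# First level: 24-dim vectors, M=4 subspaces, Ks=16 codewords per subspace.
first = codebook_shape(D=24, M=4, Ks=16)
print(first)  # (4, 16, 6)

# Second level: each subspace's codebook is a (16, 6) array, i.e. 16 training
# vectors of dimension 6, quantized again with M=2 and Ks=4.
n_train, sub_dim = first[1], first[2]
second = codebook_shape(D=sub_dim, M=2, Ks=4)
print(second)  # (2, 4, 3)

# Number of floats stored by each level (second level summed over 4 subspaces).
first_floats = first[0] * first[1] * first[2]
second_floats = first[0] * (second[0] * second[1] * second[2])
print(first_floats, second_floats)  # 384 96
```

Note that each second-level PQ is trained on only 16 vectors (the first-level codewords of one subspace), so Ks at the second level has to stay well below 16.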