storpipfugl / pykdtree

Fast kd-tree implementation in Python
GNU Lesser General Public License v3.0
215 stars 47 forks source link

Compare performance to faiss #100

Open redhog opened 1 year ago

redhog commented 1 year ago

Add https://github.com/facebookresearch/faiss to the performance comparison matrix

djhoese commented 1 year ago

It wasn't clear to me what an equivalent workflow would be for using faiss. I don't think the originally benchmarking code in the README was put into git so we might have to come up with an equivalent piece of code.

redhog commented 1 year ago

This is how I have compared scipy:s cKDTree to Faiss previously:

import faiss
import scipy.spatial
import numpy as np
import pandas as pd
import datetime

d = 64                           # dimension
nb = 100000                      # database size
nq = 10000                       # nb of queries
np.random.seed(1234)             # make reproducible
xb = np.random.random((nb, d)).astype('float32')
xb[:, 0] += np.arange(nb) / 1000.
xq = np.random.random((nq, d)).astype('float32')
xq[:, 0] += np.arange(nq) / 1000.

k = 4

fstart = datetime.datetime.now()
for a in range(10):
    index = faiss.IndexFlatL2(d)   # build the index
    index.add(xb)                  # add vectors to the index
    fD, fI = index.search(xq, k)     # actual search
    fend = datetime.datetime.now()
flen = fend - fstart

tstart = datetime.datetime.now()
for a in range(10):
    tree = scipy.spatial.cKDTree(xb)
    tD, tI = tree.query(xq, k)
tend = datetime.datetime.now()
tlen = tend - tstart

print("scipy.spatial.cKDTree", tlen, "faiss.IndexFlatL2", flen)

This should be directly portable to pykdtree:s query() function.

redhog commented 1 year ago

To be fair faiss has many other indices than L2, and faiss (and scipy?) are aimed at high dimensional data.

It would be nice to have the comparison code checked into git, say in a docs or contrib folder.

djhoese commented 1 year ago

Yes, I'm all for putting the code in the repository. Can L2 handle any dimensionality? Pykdtree's README used 3D data.

redhog commented 1 year ago

Yeah, faiss.IndexFlatL2 can handle any dimension. I think it's uncommon to use faiss for 2d and 3d data, much more common for > 100 dimensions (word embeddings etc). scipy.spatial.cKDTree can also handle multidimensional data, but I think it's more common to use it for spatial (2d, 3d) data, that's at least what I've been using it for.

djhoese commented 1 year ago

Yeah pykdtree is used heavily by the pyresample and satpy projects I maintain and they are entirely used for Earth geolocation (X, Y, Z). I don't think I'll have time to dedicate to running these comparisons and collecting the statistics for each variation so if someone else has time for this that'd be great. I think the timeit module should be used to run each case multiple time and each should probably be run in a separate script to avoid the memory usage by one persist and affecting the next.

That is of course only testing execution time. It'd be nice to have an estimate on memory usage too.