spotify / annoy

Approximate Nearest Neighbors in C++/Python optimized for memory usage and loading/saving to disk
Apache License 2.0

add_item on tensorflow tensor is extremely slow #498

Open eduard93 opened 4 years ago

eduard93 commented 4 years ago

Consider the following code:

from annoy import AnnoyIndex
import tensorflow as tf
from time import perf_counter

tf.compat.v1.enable_eager_execution()  # enables eager mode on TF 1.x (default on TF 2.x)

dims = 1792
trees = 10000
features = []

for key in range(0, 100):
    features.append(tf.random.uniform([dims]))  # eager tensors, not numpy arrays

t1 = perf_counter()

t = AnnoyIndex(dims, metric='angular')

for key, feature in enumerate(features):
    t.add_item(key, feature)  # passes the EagerTensor directly

t2 = perf_counter()

t.build(trees)

t3 = perf_counter()

print(f"Vector add: {t2 - t1:.2f}")
print(f"Index build: {t3 - t2:.2f}")

It creates a list of 100 tensors, loads them into an Annoy index, and builds the index. This takes a minute on an Intel Core i5-3570K (3.40 GHz).

However, if the tensors are converted to a numpy array first, the same operation takes 0.02 seconds.

The current workaround is to call numpy() on the tensor before passing it to add_item:

    t.add_item(key, feature.numpy())
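
If there are many vectors, a single up-front conversion avoids the per-call overhead as well. A hypothetical batch variant of the workaround (a sketch, assuming the features fit in memory; np.stack is standard numpy):

import numpy as np

# Convert everything once, then feed plain numpy rows to add_item.
features_np = np.stack([f.numpy() for f in features])
for key, feature in enumerate(features_np):
    t.add_item(key, feature)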

Tensorflow: [timing screenshot]

Numpy: [timing screenshot]

Any idea as to why this happens?

Versions: [screenshot]

eduard93 commented 4 years ago

convert_list_to_vector maybe?

  // Each v[z] goes through the Python object protocol: PyObject_GetItem
  // allocates a new Python object per component (for an eager tensor, a
  // scalar tensor produced by a slice op) before it is read as a double.
  for (int z = 0; z < f; z++) {
    PyObject *key = PyInt_FromLong(z);
    PyObject *pf = PyObject_GetItem(v, key);
    (*w)[z] = PyFloat_AsDouble(pf);
    Py_DECREF(key);
    Py_DECREF(pf);
  }

https://github.com/spotify/annoy/blob/master/src/annoymodule.cc#L310
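
The same access pattern can be reproduced from Python. A minimal sketch comparing per-element indexing on an eager tensor vs. its numpy copy (illustrative only; timings will vary):

import tensorflow as tf
from time import perf_counter

v = tf.random.uniform([1792])

t0 = perf_counter()
a = [float(v[z]) for z in range(1792)]     # each v[z] dispatches a slice op
t1 = perf_counter()

v_np = v.numpy()
b = [float(v_np[z]) for z in range(1792)]  # plain C-level array access
t2 = perf_counter()

print(f"tensor: {t1 - t0:.4f}s  numpy: {t2 - t1:.4f}s")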

eduard93 commented 4 years ago

Maybe relevant?

Upgrading to Numpy 1.19.1 did not help.

erikbern commented 4 years ago

odd. my guess is that this is something on the tensorflow side. maybe getting it item by item causes some sort of CPU<->GPU transfer that requires a context switch?
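
One quick way to check the device side of that guess (a sketch, assuming TF 2.x; these are standard TF APIs):

import tensorflow as tf

print(tf.config.list_physical_devices('GPU'))  # empty list means no GPU is visible
v = tf.random.uniform([10])
print(v.device)  # e.g. /job:localhost/replica:0/task:0/device:CPU:0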

eduard93 commented 4 years ago

Might be a tensorflow issue. It is definitely not a CPU<->GPU issue as my test rig does not have a GPU.

eduard93 commented 4 years ago

The issue is present in both Tensorflow versions. Tested in Docker.

Tensorflow 1.15.2 (tensorflow/tensorflow:1.15.2-py3-jupyter): [timing screenshot]

Tensorflow 2.3.0 (tensorflow/tensorflow:latest-jupyter): [timing screenshot]

Amended the script in the OP by adding:

    tf.compat.v1.enable_eager_execution()

Maxl94 commented 1 year ago

I experience the same issue with torch.Tensor, even though the tensors are on the CPU.

Here are some benchmarks:


import torch
from annoy import AnnoyIndex

embedding_dim = 4000

embeddings = torch.rand(4000, embedding_dim, dtype=torch.float32)
embeddings.shape

# Timeit with raw tensors on cpu
%%timeit -n 2 -r 5
nn = AnnoyIndex(embedding_dim, metric="angular")

for idx, vector in enumerate(embeddings):
    nn.add_item(idx, vector)

nn.build(10)

>>> 16.9 s ± 111 ms per loop (mean ± std. dev. of 5 runs, 2 loops each)

# Timeit with raw tensors converted to numpy
%%timeit -n 2 -r 5
nn = AnnoyIndex(embedding_dim, metric="angular")

for idx, vector in enumerate(embeddings.numpy()):
    nn.add_item(idx, vector)

nn.build(10)

>>> 968 ms ± 4.47 ms per loop (mean ± std. dev. of 5 runs, 2 loops each)

It looks like iterating through a numpy array is in general faster, but I am not sure if this explains the difference.

# Iterate through tensor
%%timeit -n 10 -r 50
for idx, vector in enumerate(embeddings):
    pass

>>> 2.49 ms ± 206 µs per loop (mean ± std. dev. of 50 runs, 10 loops each)

# Iterate through np.array
%%timeit -n 10 -r 50
for idx, vector in enumerate(embeddings.numpy()):
    pass

>>> 216 µs ± 14.6 µs per loop (mean ± std. dev. of 50 runs, 10 loops each)
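
Iteration alone costs ~2.5 ms here, so it probably does not account for the ~16 s gap by itself; add_item also reads every component individually (via PyObject_GetItem, see above), which multiplies the per-element conversion cost by embedding_dim. A sketch that isolates that conversion (illustrative; timings will vary):

import torch
from time import perf_counter

row_t = torch.rand(4000)
row_np = row_t.numpy()

t0 = perf_counter()
a = [float(x) for x in row_t]   # each x is a 0-dim torch.Tensor
t1 = perf_counter()

b = [float(x) for x in row_np]  # each x is a numpy scalar
t2 = perf_counter()

print(f"torch row: {t1 - t0:.4f}s  numpy row: {t2 - t1:.4f}s")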