eduard93 opened this issue 4 years ago
`convert_list_to_vector`, maybe?
for (int z = 0; z < f; z++) {
  PyObject *key = PyInt_FromLong(z);
  PyObject *pf = PyObject_GetItem(v, key);
  (*w)[z] = PyFloat_AsDouble(pf);
  Py_DECREF(key);
  Py_DECREF(pf);
}
https://github.com/spotify/annoy/blob/master/src/annoymodule.cc#L310
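For context, the C loop above fetches each element with one `PyObject_GetItem` call plus one `PyFloat_AsDouble` conversion, so when `v` is a framework tensor every access goes through the tensor's heavyweight `__getitem__`. A stdlib-only sketch of that mechanism (the `TensorLike` class is a hypothetical stand-in, not annoy's or torch's actual code):

```python
import time

class TensorLike:
    """Stand-in for a framework tensor: each index hit allocates a new
    wrapper object, loosely mimicking a 0-dim tensor result."""
    def __init__(self, data):
        self._data = list(data)
    def __len__(self):
        return len(self._data)
    def __getitem__(self, i):
        return TensorLike([self._data[i]])  # new object per element access
    def __float__(self):
        return float(self._data[0])

def sum_items(v):
    # Mirrors annoy's loop: one __getitem__ and one float conversion per element.
    total = 0.0
    for z in range(len(v)):
        total += float(v[z])
    return total

raw = [float(i) for i in range(100_000)]
for name, v in (("plain list", raw), ("tensor-like", TensorLike(raw))):
    t0 = time.perf_counter()
    s = sum_items(v)
    print(f"{name}: sum={s:.0f} in {time.perf_counter() - t0:.4f}s")
```

The data is identical in both cases; only the per-element access path differs, which is the same shape of overhead the benchmarks below point at.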
Upgrading to Numpy 1.19.1 did not help.
Odd. My guess is that this is something on the TensorFlow side. Maybe getting it item by item causes some sort of CPU<->GPU transfer that requires a context switch?
Might be a tensorflow issue. It is definitely not a CPU<->GPU issue as my test rig does not have a GPU.
The issue is present in both TensorFlow versions. Tested in Docker:

TensorFlow 1.15.2 (tensorflow/tensorflow:1.15.2-py3-jupyter):

TensorFlow 2.3.0 (tensorflow/tensorflow:latest-jupyter):
Amended the script in the OP by adding:
tf.compat.v1.enable_eager_execution()
I experience the same issue using `pytorch.Tensor`, even though the tensors are on my CPU.
Here are some benchmarks:
embedding_dim = 4000
embeddings = torch.rand(4000, embedding_dim, dtype=torch.float32)
embeddings.shape
# Timeit with raw tensors on cpu
%%timeit -n 2 -r 5
nn = AnnoyIndex(embedding_dim, metric="angular")
for idx, vector in enumerate(embeddings):
    nn.add_item(idx, vector)
nn.build(10)
>>> 16.9 s ± 111 ms per loop (mean ± std. dev. of 5 runs, 2 loops each)
# Timeit with raw tensors converted to numpy
%%timeit -n 2 -r 5
nn = AnnoyIndex(embedding_dim, metric="angular")
for idx, vector in enumerate(embeddings.numpy()):
    nn.add_item(idx, vector)
nn.build(10)
>>> 968 ms ± 4.47 ms per loop (mean ± std. dev. of 5 runs, 2 loops each)
It looks like iterating through a numpy array is in general faster, but I am not sure whether that alone explains the difference.
# Iterate through tensor
%%timeit -n 10 -r 50
for idx, vector in enumerate(embeddings):
    pass
>>> 2.49 ms ± 206 µs per loop (mean ± std. dev. of 50 runs, 10 loops each)
# Iterate through np.array
%%timeit -n 10 -r 50
for idx, vector in enumerate(embeddings.numpy()):
    pass
>>> 216 µs ± 14.6 µs per loop (mean ± std. dev. of 50 runs, 10 loops each)
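One plausible contributor, illustrated with numpy alone (torch/tf are not needed to see it): iterating a 2-D numpy array yields lightweight row views over the same buffer, and per-element reads from those rows come back as plain numpy scalars, whereas indexing a framework tensor constructs a new tensor object on every access.

```python
import numpy as np

embeddings = np.random.rand(4, 5).astype(np.float32)

# Iterating a 2-D ndarray yields row views that share the parent's buffer.
row = next(iter(embeddings))
print(type(row))               # a numpy.ndarray
print(row.base is embeddings)  # True: a view, not a copy

# Per-element access on the row returns a plain numpy scalar, which the
# PyFloat_AsDouble call in annoy's loop can consume cheaply.
print(type(row[0]))            # numpy.float32
```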
Consider the following code:
It creates a list of 100 tensors, loads them into an Annoy index, and builds the index. This takes about a minute on an Intel Core i5-3570K (3.40 GHz).
However, if the tensors are converted to a numpy array first, the same operation takes 0.02 seconds.
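A torch/tf-free sketch of that fast path (here `embeddings` is already an ndarray, standing in for the result of `tensor.numpy()`, and `add_item` is a stub so the sketch runs without annoy installed):

```python
import numpy as np

# Stands in for embeddings.numpy() on a torch / eager-tf tensor: convert the
# whole matrix ONCE, outside the loop.
embeddings = np.random.rand(100, 8).astype(np.float32)

added = []
def add_item(i, v):
    # Stub for AnnoyIndex.add_item; just records what would be indexed.
    added.append((i, v))

for idx, vector in enumerate(embeddings):
    add_item(idx, vector)  # each vector is a cheap numpy row view

print(len(added))  # 100 vectors handed off
```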
The current workaround is to call `numpy()` on the tensor before passing it to `add_item`:

Tensorflow:
Numpy:
Any idea as to why this happens?
Versions: