nmslib / hnswlib

Header-only C++/python library for fast approximate nearest neighbors
https://github.com/nmslib/hnswlib
Apache License 2.0

support index loading from google cloud storage(gcs) #336

Open kimmyjin opened 3 years ago

kimmyjin commented 3 years ago

I used to develop everything on my local machine, but I am moving to GCS now. I am seeing errors when loading the model index from GCS. Is there support for loading an index from cloud storage?

yurymalkov commented 3 years ago

Hi @kimmyjin, I think you should be able to use pickle together with gcs.

kimmyjin commented 3 years ago

Hi @yurymalkov, are you suggesting that we store the index as a .pkl in GCS and load it directly, instead of using the .bin? https://github.com/nmslib/hnswlib/blob/1866a1df7961c42cd4efb0c8ffc665d6209447f9/examples/pyw_hnswlib.py#L41-L44

yurymalkov commented 3 years ago

You can use something like this:


```python
import hnswlib
import numpy as np
import pickle
import tensorflow as tf

dim = 128
num_elements = 10000

# Generating sample data
data = np.float32(np.random.random((num_elements, dim)))
ids = np.arange(num_elements)

# Declaring index
p = hnswlib.Index(space='l2', dim=dim)  # possible options are l2, cosine or ip

# Initializing index - the maximum number of elements should be known beforehand
p.init_index(max_elements=num_elements, ef_construction=200, M=16)

# Element insertion (can be called several times):
p.add_items(data, ids)

# Controlling the recall by setting ef:
p.set_ef(50)  # ef should always be > k

# Query dataset, k - number of closest elements (returns 2 numpy arrays)
labels, distances = p.knn_query(data, k=1)

# Index objects support pickling
# WARNING: serialization via pickle.dumps(p) or p.__getstate__() is NOT thread-safe with p.add_items method!
# Note: ef parameter is included in serialization; random number generator is initialized with random_seed on Index load

# saving to gcs:
with tf.io.gfile.GFile("gs://bucket/tempindex.pickle", "wb") as f:
    pickle.dump(p, f)
# loading from gcs
with tf.io.gfile.GFile("gs://bucket/tempindex.pickle", "rb") as f:
    p_copy = pickle.load(f)

### Index parameters are exposed as class properties:
print(f"Parameters passed to constructor:  space={p_copy.space}, dim={p_copy.dim}")
print(f"Index construction: M={p_copy.M}, ef_construction={p_copy.ef_construction}")
print(f"Index size is {p_copy.element_count} and index capacity is {p_copy.max_elements}")
print(f"Search speed/quality trade-off parameter: ef={p_copy.ef}")
```

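The save/load pattern above does not depend on TensorFlow as such: `pickle.dump` and `pickle.load` only need a binary file-like object. Below is a minimal sketch of that idea. The helper names `save_pickled` and `load_pickled` are hypothetical (they are not part of hnswlib), and a plain dict plus the built-in `open` stand in for the index and for `tf.io.gfile.GFile`, so the example runs without GCS access.

```python
import os
import pickle
import tempfile


def save_pickled(obj, path, opener=open):
    """Pickle obj to path using opener(path, "wb") (hypothetical helper)."""
    with opener(path, "wb") as f:
        pickle.dump(obj, f)


def load_pickled(path, opener=open):
    """Unpickle and return the object stored at path (hypothetical helper)."""
    with opener(path, "rb") as f:
        return pickle.load(f)


# Stand-in for an hnswlib.Index; any picklable object works the same way.
payload = {"space": "l2", "dim": 128, "ef": 50}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "tempindex.pickle")
    save_pickled(payload, path)
    restored = load_pickled(path)

print(restored == payload)
```

With TensorFlow installed, passing `opener=tf.io.gfile.GFile` together with a `gs://` path should behave like the snippet above, assuming the bucket is readable and writable from your environment.
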
kimmyjin commented 3 years ago

This is awesome. I will definitely test it out! Thank you so much @yurymalkov.
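
For the `.bin` route asked about earlier in the thread, one possible workaround is to copy the file to local disk first and load it from there. This assumes `load_index` expects a path on the local filesystem, which would explain the errors with `gs://` paths. The sketch below is not an hnswlib feature: `load_via_local_copy` is a hypothetical helper, and a local file with dummy bytes stands in for the remote object so the example runs without GCS access.

```python
import os
import shutil
import tempfile


def load_via_local_copy(remote_path, load_fn, opener=open):
    """Copy remote_path to a local temp file, then return load_fn(local_path).

    opener(remote_path, "rb") must return a binary file-like object.
    The temp file is deleted after load_fn returns.
    """
    with tempfile.TemporaryDirectory() as tmp:
        local_path = os.path.join(tmp, os.path.basename(remote_path))
        with opener(remote_path, "rb") as src, open(local_path, "wb") as dst:
            shutil.copyfileobj(src, dst)
        return load_fn(local_path)


def read_bytes(path):
    """Stand-in loader: returns the raw file contents."""
    with open(path, "rb") as f:
        return f.read()


with tempfile.TemporaryDirectory() as tmp:
    # Stand-in for a remote object such as gs://bucket/index.bin
    fake_remote = os.path.join(tmp, "index.bin")
    with open(fake_remote, "wb") as f:
        f.write(b"not a real index")
    result = load_via_local_copy(fake_remote, read_bytes)

print(result)
```

For a real index, `opener` could be `tf.io.gfile.GFile` and `load_fn` could be the `load_index` method of an index created with the same `space` and `dim` as the saved one. In that case the index is populated in place, so the helper's return value can be ignored.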