Open kimmyjin opened 3 years ago
Hi @kimmyjin, I think you should be able to use pickle together with gcs.
Hi @yurymalkov, are you suggesting we load a .pkl directly into GCS instead of using the .bin?
https://github.com/nmslib/hnswlib/blob/1866a1df7961c42cd4efb0c8ffc665d6209447f9/examples/pyw_hnswlib.py#L41-L44
You can use something like this:
```python
import hnswlib
import numpy as np
import pickle
import tensorflow as tf

dim = 128
num_elements = 10000

# Generating sample data
data = np.float32(np.random.random((num_elements, dim)))
ids = np.arange(num_elements)

# Declaring index
p = hnswlib.Index(space='l2', dim=dim)  # possible options are l2, cosine or ip

# Initializing index - the maximum number of elements should be known beforehand
p.init_index(max_elements=num_elements, ef_construction=200, M=16)

# Element insertion (can be called several times):
p.add_items(data, ids)

# Controlling the recall by setting ef:
p.set_ef(50)  # ef should always be > k

# Query dataset, k - number of closest elements (returns 2 numpy arrays)
labels, distances = p.knn_query(data, k=1)

# Index objects support pickling
# WARNING: serialization via pickle.dumps(p) or p.__getstate__() is NOT thread-safe with p.add_items method!
# Note: ef parameter is included in serialization; random number generator is initialized with random_seed on Index load

# saving to gcs:
with tf.io.gfile.GFile("gs://bucket/tempindex.pickle", "wb") as f:
    pickle.dump(p, f)

# loading from gcs
with tf.io.gfile.GFile("gs://bucket/tempindex.pickle", "rb") as f:
    p_copy = pickle.load(f)

# Index parameters are exposed as class properties:
print(f"Parameters passed to constructor: space={p_copy.space}, dim={p_copy.dim}")
print(f"Index construction: M={p_copy.M}, ef_construction={p_copy.ef_construction}")
print(f"Index size is {p_copy.element_count} and index capacity is {p_copy.max_elements}")
print(f"Search speed/quality trade-off parameter: ef={p_copy.ef}")
```
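On the thread-safety warning in the snippet: one way to keep pickling from racing with `add_items` is to route both through a single lock. This is only a minimal sketch, assuming the index is never touched except through the wrapper. `LockedIndex` is a hypothetical name, not part of hnswlib, and it works with any picklable object that exposes `add_items`.

```python
import pickle
import threading


class LockedIndex:
    """Hypothetical wrapper (not part of hnswlib): serializes writes and pickling."""

    def __init__(self, index):
        self._index = index
        self._lock = threading.Lock()

    def add_items(self, data, ids=None):
        # Inserts hold the lock, so they cannot overlap with a dump.
        with self._lock:
            self._index.add_items(data, ids)

    def dumps(self):
        # Serialization holds the same lock, so no insert runs mid-dump.
        with self._lock:
            return pickle.dumps(self._index)
```

The bytes returned by `dumps()` can then be written with `tf.io.gfile.GFile(..., "wb")` outside the lock, so the upload itself does not block inserts.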
This is awesome. Will definitely test it out! Thank you so much @yurymalkov.
I used to develop everything on my local machine, but I am moving to GCS now. I am seeing errors when loading the model index from GCS. Is there support for loading an index from cloud storage?
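If you would rather keep the `.bin` file than switch to a pickle: as far as I know, `load_index` expects a path on the local filesystem, so passing it a `gs://` URL directly is likely what fails. Below is a minimal sketch of one workaround, which copies the remote file into a temporary directory first. `load_via_local_copy`, `open_remote` and `load_fn` are hypothetical names for illustration, and `open_remote` is assumed to be something like `tf.io.gfile.GFile`.

```python
import os
import shutil
import tempfile


def load_via_local_copy(remote_path, open_remote, load_fn):
    """Copy remote_path to a temp file, then return load_fn(local_path).

    open_remote: callable(path, mode) returning a binary file object,
                 e.g. tf.io.gfile.GFile (assumption).
    load_fn:     callable taking a local path, e.g. the bound method
                 p.load_index of a prepared hnswlib.Index.
    """
    with tempfile.TemporaryDirectory() as tmp_dir:
        local_path = os.path.join(tmp_dir, "index.bin")
        with open_remote(remote_path, "rb") as src, open(local_path, "wb") as dst:
            shutil.copyfileobj(src, dst)
        # The temp directory is deleted on exit, so load_fn must finish
        # reading the file before it returns.
        return load_fn(local_path)
```

With hnswlib this might look like creating `p = hnswlib.Index(space='l2', dim=128)` and then calling `load_via_local_copy("gs://bucket/tempindex.bin", tf.io.gfile.GFile, p.load_index)`, where the bucket path is a placeholder.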