nmslib / hnswlib

Header-only C++/python library for fast approximate nearest neighbors
https://github.com/nmslib/hnswlib
Apache License 2.0

stream load. Is it possible? #565

Open vinnitu opened 3 months ago

vinnitu commented 3 months ago

I want to load an index from a network resource, but it fails:

import requests
import io
import pickle
import hnswlib

def get_stream(url):
    response = requests.get(url)
    stream_data = response.content
    return io.BytesIO(stream_data)

model = pickle.load(get_stream('http://example.com/model')) # it works

index = hnswlib.Index(space='cosine', dim=128)
index.load_index(get_stream('http://example.com/index.hnsw')) # doesn't work

I got this error:

TypeError: load_index(): incompatible function arguments. The following argument types are supported:
    1. (self: hnswlib.Index, path_to_index: str, max_elements: int = 0, allow_replace_deleted: bool = False) -> None

Invoked with: <hnswlib.Index(space='cosine', dim=128)>, <_io.BytesIO object at 0x7fd364e557c0>

Is loading from a stream a reasonable idea?
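Since the error message shows that load_index() only accepts a str path, one possible workaround is to copy the stream into a temporary file and pass that file's path. The sketch below is not part of hnswlib: the helper name load_index_from_stream is hypothetical, and it only assumes that the index object has a load_index(path, ...) method as in the signature above.

```python
import os
import shutil
import tempfile


def load_index_from_stream(index, stream, **load_kwargs):
    """Copy a binary file-like object to a temp file, then load it by path.

    `index` is any object exposing load_index(path, ...), for example an
    hnswlib.Index. This helper is a hypothetical illustration, not a
    library API.
    """
    # delete=False so the file can be reopened by path on every platform.
    tmp = tempfile.NamedTemporaryFile(suffix=".hnsw", delete=False)
    try:
        shutil.copyfileobj(stream, tmp)  # works for io.BytesIO and similar
        tmp.close()  # flush and release the handle before it is reopened
        index.load_index(tmp.name, **load_kwargs)
    finally:
        tmp.close()  # harmless if the file is already closed
        os.unlink(tmp.name)
    return index
```

With the code from this comment, the call would look like load_index_from_stream(index, get_stream('http://example.com/index.hnsw')). The cost is one extra copy of the index on disk for the duration of the load.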

vinnitu commented 3 months ago

I am not sure, but can we pass io.BytesIO as std::ifstream?

https://github.com/nmslib/hnswlib/blob/3f3429661187e4c24a490a0f148fc6bc89042b3d/hnswlib/bruteforce.h#L152

    // The stream must be non-const: reading from it changes its state.
    void loadIndex(std::ifstream &input, SpaceInterface<dist_t> *s) {
        std::streampos position;

        readBinaryPOD(input, maxelements_);
        readBinaryPOD(input, size_per_element_);
        readBinaryPOD(input, cur_element_count);

        data_size_ = s->get_data_size();
        fstdistfunc_ = s->get_dist_func();
        dist_func_param_ = s->get_dist_func_param();
        size_per_element_ = data_size_ + sizeof(labeltype);
        data_ = (char *) malloc(maxelements_ * size_per_element_);
        if (data_ == nullptr)
            throw std::runtime_error("Not enough memory: loadIndex failed to allocate data");

        input.read(data_, maxelements_ * size_per_element_);

        input.close();
    }
vinnitu commented 3 months ago

As a first step, we could split the function in two:

    // The stream must be non-const: reading from it changes its state.
    void loadStream(std::ifstream &input, SpaceInterface<dist_t> *s) {
        readBinaryPOD(input, maxelements_);
        readBinaryPOD(input, size_per_element_);
        readBinaryPOD(input, cur_element_count);

        data_size_ = s->get_data_size();
        fstdistfunc_ = s->get_dist_func();
        dist_func_param_ = s->get_dist_func_param();
        size_per_element_ = data_size_ + sizeof(labeltype);
        data_ = (char *) malloc(maxelements_ * size_per_element_);
        if (data_ == nullptr)
            throw std::runtime_error("Not enough memory: loadIndex failed to allocate data");

        input.read(data_, maxelements_ * size_per_element_);
    }

    void loadIndex(const std::string &location, SpaceInterface<dist_t> *s) {
        std::ifstream input(location, std::ios::binary);
        std::streampos position;
        loadStream(input, s);
        input.close();
    }
vinnitu commented 3 months ago

The same applies to the HNSW index loader:

https://github.com/nmslib/hnswlib/blob/3f3429661187e4c24a490a0f148fc6bc89042b3d/hnswlib/hnswalg.h#L716

vinnitu commented 3 months ago

Unfortunately, we can't simply do this, because the loading code uses .seekg() and .tellg() (we could simplify the loading code and remove them).

Also, std::ifstream may not be compatible with io.BytesIO, so we might need std::istringstream instead.

What do you think?

drons commented 2 months ago

Take a look at #556