Serverless friendly 1-N implemenation

martinheidegger commented 2 years ago

Common 1-N implementations such as presented in #4 use KD-trees such as sklearn's KMeans. These algorithms keep the entire clustering tree in memory to find a match.

In a lambda environment this will slow down the startup time enormously which we want to prevent by having a data structure for the embeddings that can be quickly loaded in batch for a whole set of embeddings - best through an Index stored in S3 - where the kdtree can be updated without concurrency issues (multiple lambda instances running in parallel).

We need a lookup implementation that loads as much as possible on-demand from a customizable data source and where as additions to the data cause as few writes as possible.

urbien commented 2 years ago

could the index be in DynamoDB?

martinheidegger commented 2 years ago

I identified two basic variants how we could approach this:

have an index that can be efficiently stored/operated in dynamodb or s3
have an index that can be quickly hibernated and loaded in a lambda *

* = AWS lambdas have a fixed size of 512 MB temp storage and a RAM limit as well, these would be also limiting part of the performance of this.

In practice there are a bunch of options for this:

Locality-Sensitive-Hashing (LSH)

I found that DynamoDb supports only classic indexes (string, number, date) and not specialized indexes, which would have been convenient. I have not applied this in practice but in theory we could use space-filling-curves (like the [Hilbert-curve][]) to turn the n-dimensional data into 1-dimensional data (that can be indexed). Looking for an nearest neighbor in the database would require multiple queries to DynamoDb. Without further research I can not tell without further research how many requests to DynamoDB would be necessary to find an entry and by extension how much that would cost or how fast it would be in practice. Also I havn't found a library that implements this technique in my research. But for 3 dimensional data there is a AWS tutorial for z-order index.

I like the idea of LSH for this use, but I am not sure how well it will work.

Note: Falconn is a LSH implementation that unfortunately implements the whole algorithm in-memory. It seems to be better at what it does in comparison to lshkit or slash. Note 2: tlsh may be an implementation that can be used to implement this. Note 3: @agtabesh/lsh is a direct javascript implementation. Quality not yet determined.

Faiss & Milvus

Faiss is a facebook project with a flexible (also complex) architecture that allows to specify various indices and allows the storage of Indexes not fitting into RAM. Using faiss directly is a bit tricky, though there is a reasonably good tutorial on how to use Faiss with DynamoDB.

However - instead of using faiss as is - it may be more reasonable to use Milvus. The Architecture of Milvus may be just the same thing that we would come up with. Milvus is pretty complex and takes a while to properly evaluate.

Simple mmap

Simpler than Faiss is n2, a kd-tree implementation that loads and stores the state quickly using mmap. This should be very fast and would support relatively big trees. We use an indexing (single-threaded)-process that updates the memory file and uploads it to s3 and the actual lookup in lambda's is only used to read the shared s3 file which is kept in the persistent storage of a lambda to be faster.

The problem with this implementation is that with an increasing size of the index, the cold-boot time will slow as well.

Note: ngt is an alternative implementation that works in a similar fashion. Note 2: Annoy is the most popular library for this, but it isn't made for mmap.

KDTree in DynamoDB

Another theoretical approach would be to structure the data in DynamoDb as kd-tree nodes. Kd-trees are quick to read but expensive to update. An single write/update of a kd-tree could potentially require a lot of DynamoDb updates which is why this probably not an effective strategy. But we could store kd-tree data as s3 chunks may be a viable strategy. In the past I have worked with 3-dimensional data that structure the data like this and while the updates are expensive the data can be very fast accessed without loading the entire tree into memory, potentially allowing very fast access to huge data sets.

Unfortunately In my research I haven't found good implementation for this.

tradle / KYCDeepFace