perone / euclidesdb

A multi-model machine learning feature embedding database
http://euclidesdb.readthedocs.io

Questions about Scalability #18

Closed SynchronicitydotAI closed 5 years ago

SynchronicitydotAI commented 5 years ago

Great concept. I am interested in integrating RNN / LSTM support with this. Is that on the development roadmap? I saw a passing reference to it in one of the issues. I am curious about using EuclidesDB for document search, potentially across millions of documents (sentiment analysis and medical coding category correlations). I assume that would rule out your brute-force search method; how would Annoy or the other options scale at that size?
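For context on why brute force becomes a concern at this scale: an exhaustive search compares the query against every stored vector, so the cost of each query grows linearly with the collection size. The following is a minimal pure-Python sketch of that idea, not EuclidesDB's actual implementation; the function names and the toy data are hypothetical.

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def brute_force_search(query, vectors, top_k=2):
    """Return the indices of the top_k vectors closest to the query.

    Every stored vector is visited once, so one query costs
    O(len(vectors) * dim) distance work plus a sort.
    """
    scored = [(l2_distance(query, v), idx) for idx, v in enumerate(vectors)]
    scored.sort()
    return [idx for _, idx in scored[:top_k]]


# Toy "database" of 2-D feature vectors (illustrative values only).
database = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0], [0.9, 1.2]]
nearest = brute_force_search([1.0, 1.0], database, top_k=2)
print(nearest)
```

Approximate indexes such as Annoy or Faiss avoid visiting every vector, trading some recall for much lower query cost.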

perone commented 5 years ago

Hi @SynchronicitydotAI, I'm interested in incorporating other models (especially for NLP), and this will come sooner or later, but there are many challenges because NLP pre-processing is far more complex than image pre-processing. I'm finishing the Faiss integration, and you can find more information about Faiss indexing for 1M or 1G items here. For that number of items, you'll certainly need a lot of memory and good quantization mechanisms such as the ones in Faiss; the Faiss integration will be released in the next version of EuclidesDB.
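To make the memory point concrete, here is a rough back-of-the-envelope sketch comparing raw float32 storage with product-quantization codes. The vector dimension (512) and the number of sub-quantizers (64, at 8 bits each) are assumptions chosen for illustration, not EuclidesDB defaults, and the estimate ignores codebooks and any index overhead.

```python
def flat_index_bytes(num_vectors, dim, bytes_per_value=4):
    """Bytes needed to store every vector uncompressed (float32 by default)."""
    return num_vectors * dim * bytes_per_value


def pq_code_bytes(num_vectors, num_subquantizers, bits_per_code=8):
    """Bytes needed for product-quantization codes only.

    Each vector is stored as one code per sub-quantizer; codebooks and
    other index structures are not counted here.
    """
    return num_vectors * num_subquantizers * bits_per_code // 8


DIM = 512  # assumed feature dimension (hypothetical)
M = 64     # assumed number of PQ sub-quantizers (hypothetical)

for n in (1_000_000, 1_000_000_000):
    flat_gib = flat_index_bytes(n, DIM) / 2**30
    pq_gib = pq_code_bytes(n, M) / 2**30
    print(f"{n:>13,} vectors: flat ~{flat_gib:,.1f} GiB, PQ codes ~{pq_gib:,.2f} GiB")
```

Under these assumptions the codes are 32 times smaller than the raw vectors, which is why quantization matters once the collection reaches hundreds of millions of items.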

SynchronicitydotAI commented 5 years ago

It seems like Annoy supports many millions of records. How does that compare to Faiss indexing?

For the pre-processing steps, I assume you mean word2vec and one-hot encoding?

perone commented 5 years ago

There is a comparison in the Annoy repository itself; just look at the bottom of the README. There are probably other benchmarks around comparing them as well. I would say that Faiss has many more features than Annoy. Regarding the pre-processing, this is very tricky: some models use BPE encodings, etc., so it can get very complex. I'll probably start by supporting very simple models and then expand later.
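As an illustration of why text pre-processing is harder to standardize than image pre-processing, here is a toy sketch of a single byte-pair-encoding (BPE) merge step. The vocabulary is learned from the corpus, so the tokenizer becomes a model-specific artifact that has to be shipped alongside the model. The corpus and function names below are hypothetical, and real BPE implementations handle details (word boundaries, byte fallback, many merge rounds) that this sketch omits.

```python
from collections import Counter


def count_pairs(words):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs


def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


# Toy corpus: each word is a tuple of symbols mapped to its frequency.
corpus = {("l", "o", "w"): 5, ("l", "o", "t"): 2, ("n", "e", "w"): 3}
pairs = count_pairs(corpus)
best = max(pairs, key=pairs.get)   # most frequent adjacent pair
corpus = merge_pair(corpus, best)  # one merge step; real BPE repeats this
print(best, corpus)
```

Image models, by contrast, usually need only a fixed resize and normalization, which is much easier to bake into a database.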

SynchronicitydotAI commented 5 years ago

So can the current incarnation of EuclidesDB be used for CNN-based text classification models?

perone commented 5 years ago

Classification isn't the main goal; EuclidesDB is meant to be a feature database coupled with PyTorch for indexing and search. Once NLP support is integrated, you can of course use it to serve the model and get predictions, but model serving is there for convenience and is not the main purpose of EuclidesDB.