stepstone-tech / hnswlib-jna

Native-Like Performance for Nearest Neighbor Search in Java Applications using Hnswlib and Java Native Access
Apache License 2.0
32 stars · 8 forks

[Question] Memory footprint and is the project still maintained #12

Open nemo83 opened 2 years ago

nemo83 commented 2 years ago

Hello,

I have been a longtime Spotify Annoy user, and I've recently come across HNSW. I'm a Java guy, so I obviously took a look at the pure Java implementation first, and then at this project too.

I'm surprised by the performance, but even more by the memory footprint of this library, and I'm wondering if someone could validate the numbers I'm seeing.

I've loaded about 2.5 million tensors with 1024 dimensions, and from what I can see in JVisualVM, the memory consumption of an idle Spring Java API (just the index loaded) is about 500 MB. Is that possible? (see picture below)

(screenshot: JVisualVM memory chart, 2022-09-16 19:33)

Where is the tensor data stored? On disk?

I also have another question: is this project still maintained?

Thanks, Gio

hussamaa commented 2 years ago

Hi Gio, hope you're well.

When we wrote the binding, we were mostly aiming for the low query times hnswlib provides. I personally don't remember figures we could use for comparison, but I imagine the memory footprint would be similar to using the native library directly. If not, we could optimize that.

I haven't been working with hnswlib for a while, but from what I remember the references are kept in memory. You also have the option to write the state of your index to disk (and restore it).
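
For reference, the save/restore flow looks roughly like this. This is a sketch based on the examples in this repo's README, so treat the exact signatures of `Index`, `save`, and `load` as assumptions that may differ slightly between versions:

```java
import com.stepstone.search.hnswlib.jna.Index;
import com.stepstone.search.hnswlib.jna.SpaceName;

import java.nio.file.Path;
import java.nio.file.Paths;

public class SaveRestoreSketch {
    public static void main(String[] args) {
        int dimensions = 1024;
        int maxElements = 2_500_000;

        // Build the index; the vectors live in native (off-heap) memory.
        Index index = new Index(SpaceName.COSINE, dimensions);
        index.initialize(maxElements);
        float[] vector = new float[dimensions]; // your embedding goes here
        index.addItem(vector, 0);

        // Persist the whole index state to disk...
        Path file = Paths.get("index.dat");
        index.save(file);

        // ...and restore it later, e.g. on service startup.
        Index restored = new Index(SpaceName.COSINE, dimensions);
        restored.load(file, maxElements);
    }
}
```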

Have you tried the Java implementation? How were your figures?

I don't think this project is actively maintained anymore, but it should still work fine. hnswlib has released new updates and improvements that weren't tested with this binding. If you want, I could have a look into that.

Wishing you a great weekend.

Best regards, Hussama

nemo83 commented 2 years ago

Thanks for the very quick reply,

I have indeed tested the Java client, but using double rather than float, and I could not fully build the index with 10 GB of heap. I'm in the middle of rebuilding the full 2.5 million with float; the current projection is that 250k items take about 1.2 GB, so it should all fit in 12 GB. This JNA implementation does everything without, apparently, ever exceeding 1 GB. I guess the native C code allocates and deallocates memory as it goes. What I don't understand is: when I dump the index to disk it is about 11 GB, but in memory it's less than 1 GB... where are those 11 GB kept?

We need to replace Spotify Annoy, and after some very quick tests I can appreciate the power of HNSW in terms of performance and accuracy (the results are much better). I need to pick the right HNSW library/framework to replace Annoy, and I was wondering if this project is still maintained, because it seems to be the best Java solution.

If you were so kind as to test it with the latest hnswlib, that would be amazing, and I would be delighted to pair-program/review the code so that I can start learning and contributing.

I would be using this library in argusnft.com, an AI-powered NFT fake-detection platform. Our goal is to extract embeddings for all the (picture-based) NFTs from all the blockchains and load them into ANN indexes. A pretty huge objective. If you have recommendations or fancy a chat, let me know!

Enjoy the weekend too, and thanks for the super quick reply.

☮️

hussamaa commented 2 years ago

We also moved away from Annoy back then, because updating the index at runtime was not possible, among other problems.

Hmmmmmmmm, yeah, that sounds suspicious indeed 😛 but I believe if there were an issue we would have spotted it in production already. You're right, the memory allocation/freeing happens as it runs (I can't guarantee there are no memory leaks), and since the native library allocates outside the JVM heap, tools like JVisualVM won't show that memory in the heap chart. When dumping the index, it is most likely serializing the entire state space and parameters.

Talking about the Java one, I remember it took a while to build the indexes with large datasets (even in parallel), and the query time wasn't as performant (in comparison to the native one), which made us try out other solutions. Building the index natively (in Python) and restoring it in Java in a few seconds was also a big plus we got from using the binding.
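
To illustrate that restore-and-query path, here is a sketch assuming the `Index`/`QueryTuple` API from the README; an index file written by the Python hnswlib with the same version, space, and dimension should be loadable this way:

```java
import com.stepstone.search.hnswlib.jna.Index;
import com.stepstone.search.hnswlib.jna.QueryTuple;
import com.stepstone.search.hnswlib.jna.SpaceName;

import java.nio.file.Paths;

public class RestoreAndQuerySketch {
    public static void main(String[] args) {
        int dimensions = 1024;
        int maxElements = 2_500_000;

        // Load an index that was built elsewhere (e.g. natively in Python).
        Index index = new Index(SpaceName.COSINE, dimensions);
        index.load(Paths.get("index.dat"), maxElements);

        // Fetch the 10 nearest neighbours of an input vector.
        float[] query = new float[dimensions]; // your embedding goes here
        QueryTuple result = index.knnQuery(query, 10);
        int[] ids = result.getIds();
        float[] distances = result.getCoefficients();
        System.out.println("best id: " + ids[0] + ", distance: " + distances[0]);
    }
}
```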

That sounds like a cool project. I'm not a data science / ML hero; our wizard was @alexcarterkarsus. Alex, would you have any recommendations? Would hnswlib be your go-to library for Gio's use case?

--

I had a quick look, and since I left STST, I'm afraid I can't push updates to the library. I will fork it, upgrade hnswlib, and open a pull request once ready.

Which platform would you be using the library on? Win64? AMD64? ARM64?

Have a nice weekend! Take care!

nemo83 commented 2 years ago

Thanks again for the detailed response; it's really helping.

Nice to meet you @alexcarterkarsus !

I would have another couple of questions:

  1. What was the type of document you guys were indexing? And the tensor dimension?
  2. Is the index file dump interoperable among the hnswlib implementations (Java/JNA/C++/Python)?
  3. Where were you storing the tensors for long-term storage? A relational DB? S3? Asking because we currently have 2.5M tensors now, possibly close to 50M in just a couple of months, and we were wondering if you guys had any recommendations.

So many questions! But very exciting space and this project is providing me with so much knowledge! Thank you!

hussamaa commented 2 years ago

1: leave it to alex 🧠 😛

2: yeah; they use the same code underneath, so it is indeed interoperable across languages 🙌. I'm not so sure whether it is across architectures; I'm afraid it isn't.

3: for our use case, no; the model was periodically recreated due to the constant flow of new data (it was prepared separately and stored somewhere in the cloud, indeed). Yours, I can see, would be more incremental (keeping track of existing NFTs and adding new ones), right?

Yeah, there are some challenges to keep in mind: the bigger the index, the higher the insertion and query times, and maybe fine-tuning the hnswlib parameters can help. Handling 7M was one of our acceptance criteria and it fit for us.
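
Concretely, something like the following. The four-argument `initialize` is based on the README; `setEf` is an assumption on my side, so check whether the version you use exposes it:

```java
import com.stepstone.search.hnswlib.jna.Index;
import com.stepstone.search.hnswlib.jna.SpaceName;

public class TuningSketch {
    public static void main(String[] args) {
        // Higher m and efConstruction improve recall on large indexes,
        // at the cost of memory and insertion time.
        int maxElements = 2_500_000;
        int m = 32;               // graph connectivity per node
        int efConstruction = 400; // build-time search width
        int randomSeed = 100;

        Index index = new Index(SpaceName.COSINE, 1024);
        index.initialize(maxElements, m, efConstruction, randomSeed);

        // Query-time search width: higher ef = better recall, slower queries.
        index.setEf(64);
    }
}
```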

Not sure if you could organize and split it into different models?
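
By splitting I mean something like querying several smaller indexes and merging the top-k results. A rough sketch, using the same assumed API as above and a hypothetical list of shards (the ids would need to be globally unique across shards):

```java
import com.stepstone.search.hnswlib.jna.Index;
import com.stepstone.search.hnswlib.jna.QueryTuple;

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class ShardedSearchSketch {

    record Hit(int id, float distance) {}

    // Ask each shard for k candidates, then keep the global top-k.
    static List<Hit> search(List<Index> shards, float[] query, int k) {
        List<Hit> hits = new ArrayList<>();
        for (Index shard : shards) {
            QueryTuple result = shard.knnQuery(query, k);
            int[] ids = result.getIds();
            float[] distances = result.getCoefficients();
            for (int i = 0; i < ids.length; i++) {
                hits.add(new Hit(ids[i], distances[i]));
            }
        }
        hits.sort(Comparator.comparingDouble(Hit::distance));
        return hits.subList(0, Math.min(k, hits.size()));
    }
}
```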

alexcarterkarsus commented 2 years ago

hey all, sorry for the late reply! Regarding question 1: we built our vectors from a word2vec model, and the dimension is either 50 or 100. The biggest index has around 10M vectors, and it is still quite performant. Cheers, Alex
