triton-inference-server / server

The Triton Inference Server provides an optimized cloud and edge inferencing solution.
https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/index.html
BSD 3-Clause "New" or "Revised" License

Integrate triton with vector database in Python backend #5614

Closed junwang-wish closed 1 year ago

junwang-wish commented 1 year ago

LLM + Vector Database (chroma, qdrant for example) is a powerful combo.

Let's say we have an in-session recommender system based on an LLM that takes a prompt like "user bought <products>, user clicked <products>, recommend" and outputs "<recommended products>". I would need a fast vector database (ideally hosted as a Triton backend for speed) to fetch embeddings for the given product ids and retrieve similar products in the embedding space.

This is easily doable in a single-node Triton deployment, where I can add chroma/qdrant as a Python backend serving as an in-memory vector db and initialize the db with a set of products.

However, is there a better way to do this that scales across regions in k8s and better supports updating the vector db with a new set of products? I presume loading the vector db as a Python backend is not that good of an idea then (although it's the easiest to prototype).

Would love some pointers / suggestions on how people integrate triton with vector dbs :)

rmccorm4 commented 1 year ago

Hi @junwang-wish, thanks for raising this question. I believe there is a similar blog post about this line of design/scale here: https://developer.nvidia.com/blog/offline-to-online-feature-storage-for-real-time-recommendation-systems-with-nvidia-merlin/. Let me know if this helps at all.

CC @spartee who may have some inputs on this

Spartee commented 1 year ago

Hi @junwang-wish!! @rmccorm4 is right, and we've successfully used Redis as a vector database with Triton in multiple use cases. The most built-out example is in the Redis-Recsys repo. Here, Redis is used as a feature store (holding item and user features) as well as a store of item embeddings for vector search.

This Python model runs the query, taking a single FP32 vector as input and producing 64 candidate ids as output (see config.pbtxt).
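For context, a model configuration along those lines might look like the following (a hypothetical sketch; the tensor names, dims, and the actual file in the repo may differ):

```
# Hypothetical config.pbtxt for a Python-backend vector-search model;
# names and dimensions are illustrative, not the Redis-Recsys file.
name: "redis_vss"
backend: "python"
max_batch_size: 0
input [
  {
    name: "user_embedding"
    data_type: TYPE_FP32
    dims: [ 512 ]
  }
]
output [
  {
    name: "candidate_ids"
    data_type: TYPE_STRING
    dims: [ 64 ]
  }
]
instance_group [
  {
    kind: KIND_CPU
  }
]
```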

We also have multiple revisions that we optimized for my GTC talk, which you can find here. The best revision is here; however, I would not start there as it's not the most straightforward example. It combines the feature retrieval and VSS stages into a single model, which has a big impact on the overall throughput of the pipeline and which you can read about here.

For using Redis VSS, you can spin up the Redis-Stack container, which includes the RediSearch module with the vector database components. We also have a K8s deployment with the enterprise version of Redis.

junwang-wish commented 1 year ago

Thanks @Spartee @rmccorm4 for the pointers!

junwang-wish commented 1 year ago

@Spartee I watched your super informative GTC talk, but I have some remaining questions on how to manage the Redis data store in a distributed fashion (it seems to be done via docker compose with a mounted volume in your code example).

So let's use the example I posted above: we have an in-session recommender system based on an LLM that takes "user bought <products>, user clicked <products>, recommend" and outputs "<recommended products>". I would need a fast vector database (ideally hosted as a Triton backend for speed) to:

  1. Fetch embeddings, given product ids
  2. Retrieve similar products in the embedding space

Now suppose I train a new embedding model that creates new versions of product_embs and user_embs, and I need to update them and run an A/B test. How do I upload the data to a distributed, k8s-deployed Triton + Redis vector store?

Spartee commented 1 year ago

So if you plan to manage this at scale, I would highly recommend a separation of infrastructure. I understand the need for speed here, but co-locating the two on the same servers would cause resource contention.

I would start by spinning up a Redis instance (redis-stack is the easiest place to start) and then modify the code I sent over to make a Triton instance call out to the Redis instance to perform queries.

When you need to update, follow these steps (a redis-py sketch of this flow is shown after the list):

  1. Create a new index
  2. Add the new embeddings to that index
  3. When indexing is done, repoint the alias from the old index to the new one (FT.ALIASUPDATE), so queries cut over atomically
  4. Delete the old index.
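A minimal sketch of that flow with the redis-py client, assuming a running redis-stack instance and hypothetical index/alias/field names (products_v1, products_v2, and a "products" alias that the serving path queries):

```python
import numpy as np
import redis
from redis.commands.search.field import TagField, VectorField
from redis.commands.search.indexDefinition import IndexDefinition

r = redis.Redis(host="localhost", port=6379)

# Stand-in for the re-trained product embeddings (id -> 512-dim vector).
new_embeddings = {"sku123": np.random.rand(512)}

# 1. Create the new index over a new key prefix.
r.ft("products_v2").create_index(
    fields=[
        TagField("product_id"),
        VectorField("embedding", "HNSW",
                    {"TYPE": "FLOAT32", "DIM": 512, "DISTANCE_METRIC": "COSINE"}),
    ],
    definition=IndexDefinition(prefix=["product_v2:"]),
)

# 2. Add the new embeddings under that prefix (indexing happens in the background).
for pid, emb in new_embeddings.items():
    r.hset(f"product_v2:{pid}", mapping={
        "product_id": pid,
        "embedding": emb.astype(np.float32).tobytes(),
    })

# 3. Atomically repoint the alias that queries go through to the new index.
r.ft("products_v2").aliasupdate("products")

# 4. Drop the old index (optionally deleting its documents too).
r.ft("products_v1").dropindex(delete_documents=True)
```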

The Triton backend in this case would be the Python backend, using the redis-py client to call out to Redis to perform the queries; you could also use the C++ backend with a C++ Redis client. A minimal sketch of such a Python model is below.
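This is an illustrative sketch, not the Redis-Recsys code; the tensor names, the "products" index alias, and the 64-candidate KNN query are assumptions:

```python
import numpy as np
import redis
import triton_python_backend_utils as pb_utils
from redis.commands.search.query import Query


class TritonPythonModel:
    def initialize(self, args):
        # Connect to the separately deployed Redis instance.
        self.client = redis.Redis(host="redis", port=6379)

    def execute(self, requests):
        responses = []
        for request in requests:
            # Single FP32 query vector produced by the upstream embedding model.
            vec = pb_utils.get_input_tensor_by_name(
                request, "user_embedding").as_numpy().astype(np.float32)

            # KNN vector search against the aliased index; return 64 candidate ids.
            q = (Query("*=>[KNN 64 @embedding $vec AS score]")
                 .return_fields("product_id")
                 .dialect(2))
            docs = self.client.ft("products").search(
                q, query_params={"vec": vec.tobytes()}).docs

            ids = np.array([d.product_id for d in docs], dtype=object)
            out = pb_utils.Tensor("candidate_ids", ids)
            responses.append(pb_utils.InferenceResponse(output_tensors=[out]))
        return responses
```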

I've had a lot of success speeding up this approach by using model_analyzer and perf_analyzer to figure out how many model (Python backend) copies I needed to add.

Feel free to reach out to sam.partee at redis.com if you want to chat about it.

junwang-wish commented 1 year ago

Sounds good, thanks @Spartee. For now I am using the vector DB as a sidecar container with 20 vCPUs and 50 GB RAM on a g4dn.12xlarge (Triton would be the main server container with 4 GPUs and ~100 GB RAM), and the ceiling on the number of vectors is around 50 million 512-dim vectors. So they are on the same instance but in different containers, so there shouldn't be a resource contention problem?

I am trying to achieve P95 latency < 50 ms for retrieving 1000 vectors out of 50 million at QPS ~= 1000. Is this the scale that Redis vector search is designed to handle, assuming my hardware stays the same (20 vCPU, 50 GB RAM)?

Spartee commented 1 year ago

As long as resources are tied (pinned) to the container, that should be fine. If they're on the same node and you're using a single instance of Redis, you can also use Unix socket comms, which will make it faster.
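With redis-py that is just a different connection argument (the socket path below is an assumption and has to match what redis.conf exposes):

```python
import redis

# Unix-domain-socket connection skips the TCP stack when Redis and the Triton
# container share a node; the path must match the unixsocket setting in redis.conf.
client = redis.Redis(unix_socket_path="/var/run/redis/redis.sock")
```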

50 million vectors at 512 dims won't fit on that instance without some type of dimensionality reduction or PQ (not supported...yet). Sizing this out (back of the envelope), you'd need about 200 GB.
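The back-of-the-envelope math, with the 2x factor for index and key overhead being a rough assumption:

```python
# 50M vectors x 512 dims x 4 bytes (FP32) of raw data, plus index/key overhead.
n_vectors, dim, bytes_per_dim = 50_000_000, 512, 4
raw_gb = n_vectors * dim * bytes_per_dim / 1e9  # ~102 GB of raw vectors
total_gb = raw_gb * 2                           # ~205 GB with an assumed 2x overhead
print(f"raw: {raw_gb:.0f} GB, estimated total: {total_gb:.0f} GB")
```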

That latency and QPS would be fine, but if you need to query in large batches, I would look elsewhere at the moment. We are working with NVIDIA on supporting a GPU index with RAFT, so I'll keep you updated on that, but your use case sounds like it might work better with FAISS + PQ for now. I believe it also has batch capability.

In general, the upper end of what Redis OSS is good for right now is about 75-100M vectors. With the NVIDIA work it'll be much higher soon, but until then I might explore FAISS. There is also a Merlin Systems integration with FAISS that may help you out.
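For reference, a minimal FAISS IVF+PQ sketch at roughly that scale might look like the following (nlist, the number of PQ sub-quantizers, and nprobe are assumed starting points to tune, not recommendations from this thread):

```python
import faiss
import numpy as np

d = 512                        # embedding dimension from the use case above
nlist, m, nbits = 4096, 64, 8  # IVF cells and PQ sub-quantizers (assumed, tune these)

quantizer = faiss.IndexFlatL2(d)
index = faiss.IndexIVFPQ(quantizer, d, nlist, m, nbits)

train_vecs = np.random.rand(100_000, d).astype(np.float32)  # stand-in training sample
index.train(train_vecs)
index.add(train_vecs)          # in practice, add the full 50M product embeddings in chunks

index.nprobe = 32              # search-time recall/latency knob
scores, ids = index.search(np.random.rand(8, d).astype(np.float32), 1000)  # batched top-1000 query
```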

junwang-wish commented 1 year ago

Thanks @Spartee. Just in case I should ask: the Redis-Recsys repo used Redis Stack, which is dual-licensed under RSALv2 and SSPL. Does that mean Redis Stack can be used for commercial purposes?

junwang-wish commented 1 year ago

Thx for the pointer @Spartee, but I tried Redis search and unfortunately both the indexing speed and the query speed are a bit slow (https://github.com/RediSearch/RediSearch/issues/3528). What was the size of your problem (number of embeddings) in the Redis-Recsys repo?

Spartee commented 1 year ago

@junwang-wish This is not legal advice, given that I'm not a lawyer, but the license only prohibits someone from hosting redis-stack as a service. If you are using Redis-Stack in a service that does not directly expose the API as a sellable product (i.e. a vector database as a service), then you're fine.

Example: you're an ecommerce company and you want to use Redis-Stack as a vector database in a Triton pipeline that's consumed by your web service to recommend products on your web pages. You're fine.

Example: you're Amazon and you want to make some easy money by hosting OSS projects as a service. You call it Amazon Redis and it's the same API. You're not fine.

On the scaling note, indexing time currently does take longer than expected for some use cases. We usually suggest building the index first and then adding documents to it (which happens in the background). The query speed, however, surprises me; we commonly have best-in-class single-query latency. If you're talking about QPS (as in throughput), though, then there is a scaling limit with the OSS offering. We are actively working on both QPS and indexing-time enhancements.

dyastremsky commented 1 year ago

Closing issue due to inactivity. If you need us to reopen it for follow-up, please let us know.