patterns-ai-core / langchainrb

Build LLM-powered applications in Ruby
https://rubydoc.info/gems/langchainrb
MIT License

Vector search with Qdrant doesn't create vectors #221

Closed. gottlike closed this issue 1 year ago

gottlike commented 1 year ago

I have never used Qdrant before, but when I run this simple example:

require "langchain"

openai = Langchain::LLM::OpenAI.new(api_key: ENV["OPENAI_API_KEY"])

qdrant = Langchain::Vectorsearch::Qdrant.new(
  url: ENV["QDRANT_URL"],
  api_key: nil,
  index_name: ENV["QDRANT_INDEX"],
  llm: openai
)

qdrant.create_default_schema

qdrant.add_texts(
  texts: [
      "Begin by preheating your oven to 375°F (190°C). Prepare four boneless, skinless chicken breasts by cutting a pocket into the side of each breast, being careful not to cut all the way through. Season the chicken with salt and pepper to taste. In a large skillet, melt 2 tablespoons of unsalted butter over medium heat. Add 1 small diced onion and 2 minced garlic cloves, and cook until softened, about 3-4 minutes. Add 8 ounces of fresh spinach and cook until wilted, about 3 minutes. Remove the skillet from heat and let the mixture cool slightly.",
      "In a bowl, combine the spinach mixture with 4 ounces of softened cream cheese, 1/4 cup of grated Parmesan cheese, 1/4 cup of shredded mozzarella cheese, and 1/4 teaspoon of red pepper flakes. Mix until well combined. Stuff each chicken breast pocket with an equal amount of the spinach mixture. Seal the pocket with a toothpick if necessary. In the same skillet, heat 1 tablespoon of olive oil over medium-high heat. Add the stuffed chicken breasts and sear on each side for 3-4 minutes, or until golden brown."
  ]
)

qdrant.similarity_search(
  query: 'chicken',
  k: 5
)

I get the following result back:

{"result"=>
  [{"id"=>"6c0a523b-adea-4061-88c5-82a62a315339",
    "version"=>0,
    "score"=>0.7885697,
    "payload"=>
     {"content"=>
       "Begin by preheating your oven to 375°F (190°C). Prepare four boneless, skinless chicken breasts by cutting a pocket into the side of each breast, being careful not to cut all the way through. Season the chicken with salt and pepper to taste. In a large skillet, melt 2 tablespoons of unsalted butter over medium heat. Add 1 small diced onion and 2 minced garlic cloves, and cook until softened, about 3-4 minutes. Add 8 ounces of fresh spinach and cook until wilted, about 3 minutes. Remove the skillet from heat and let the mixture cool slightly."},
    "vector"=>nil},
   {"id"=>"cb2d6bda-4cec-4226-94a3-11596af910f3",
    "version"=>0,
    "score"=>0.78527427,
    "payload"=>
     {"content"=>
       "In a bowl, combine the spinach mixture with 4 ounces of softened cream cheese, 1/4 cup of grated Parmesan cheese, 1/4 cup of shredded mozzarella cheese, and 1/4 teaspoon of red pepper flakes. Mix until well combined. Stuff each chicken breast pocket with an equal amount of the spinach mixture. Seal the pocket with a toothpick if necessary. In the same skillet, heat 1 tablespoon of olive oil over medium-high heat. Add the stuffed chicken breasts and sear on each side for 3-4 minutes, or until golden brown."},
    "vector"=>nil}],
 "status"=>"ok",
 "time"=>0.000397917}

Note that the vectors are nil for both results. The embeddings are apparently created, since ingesting the text samples takes a moment, but the vectors do not seem to be stored in Qdrant.

Another thing to improve: calling qdrant.create_default_schema fails with an error when the collection already exists. It would make sense to check first whether the collection exists and create it only if it does not.

mattlindsey commented 1 year ago

@gottlike I can see that the result that comes back does include content for 2 recipes, so the vectors are making it into Qdrant. Is the expectation that you get back the vectors themselves too? Sorry, I'm not too familiar with this, but I can probably help.

andreibondarev commented 1 year ago

I believe this call needs to include with_vector: true. See the gem docs: https://github.com/andreibondarev/qdrant-ruby#points
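
A nil "vector" in a search response does not by itself show that nothing was stored: the results above carry similarity scores, and the vectors are only returned when the search asks for them with with_vector: true. As a minimal sketch for inspecting a response of the shape posted above, assuming nothing about the gem itself (the helper name ids_without_vectors is hypothetical, and the sample hash is abbreviated from the issue output):

```ruby
# Hypothetical helper, not part of langchainrb or qdrant-ruby:
# returns the ids of result points whose "vector" field is nil.
def ids_without_vectors(response)
  response.fetch("result", [])
          .select { |point| point["vector"].nil? }
          .map { |point| point["id"] }
end

# Abbreviated copy of the response from the issue (payloads omitted).
response = {
  "result" => [
    { "id" => "6c0a523b-adea-4061-88c5-82a62a315339", "score" => 0.7885697, "vector" => nil },
    { "id" => "cb2d6bda-4cec-4226-94a3-11596af910f3", "score" => 0.78527427, "vector" => nil }
  ],
  "status" => "ok"
}

missing = ids_without_vectors(response)
puts "#{missing.size} of #{response["result"].size} points came back without a vector"
```

If the list is still non-empty once the search requests vectors, that would point at an ingestion problem rather than at the search options.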

gottlike commented 1 year ago

As far as I understand, Qdrant offers both "normal" search over payload data and vector search. I'm no expert myself, but from what I get back from Qdrant it seems that it doesn't have any vectors/embeddings. The GUI doesn't show any vectors for the items either:

[image: screenshot of the Qdrant GUI]

andreibondarev commented 1 year ago

I just hardcoded with_vector: true; it's on the main branch. Could you please try it out?

mattlindsey commented 1 year ago

@andreibondarev That worked for me, but I assume someone should check in the GUI. I don't know how to check that yet.

gottlike commented 1 year ago

Yep, it worked! Still nothing in the Qdrant GUI, but the response I'm getting back now is looking good.

Then the only thing left would be a check for existing collections to prevent the error.

mattlindsey commented 1 year ago

@gottlike Does the error on create_default_schema cause a problem for you?


3.2.1 :023 > qdrant.create_default_schema
 => {"status"=>{"error"=>"Wrong input: Collection `Recipe` already exists!"}, "time"=>0.000206}

andreibondarev commented 1 year ago

@gottlike What if a collection with the same name but a different config already exists? Any reason why this error wouldn't be handled in your application?
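
For illustration, handling this in application code could look roughly like the sketch below. It relies only on the two response shapes that appear in this thread ("status" as the string "ok", or as a hash with an "error" key); the helper name response_error is hypothetical, and matching on the message text is an assumption rather than a documented contract:

```ruby
# Hypothetical helper: extracts the error message from a response hash,
# or returns nil when the status is not an error hash.
def response_error(response)
  status = response["status"]
  status.is_a?(Hash) ? status["error"] : nil
end

# Both sample responses are copied from this thread.
already_exists = { "status" => { "error" => "Wrong input: Collection `Recipe` already exists!" }, "time" => 0.000206 }
ok = { "status" => "ok", "time" => 0.000397917 }

[already_exists, ok].each do |response|
  error = response_error(response)
  if error.nil?
    puts "ok"
  elsif error.include?("already exists")
    # Assumption: the existing collection has a compatible config.
    puts "collection already exists, reusing it"
  else
    raise "unexpected Qdrant error: #{error}"
  end
end
```

This does not answer the question about a same-named collection with a different config; the sketch simply assumes the existing collection is usable.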

gottlike commented 1 year ago

It's not really a problem, but in my eyes it would be much cleaner to do a proper check instead of catching errors. So maybe it would be enough to have separate methods for checking for and creating a collection, instead of just create_default_schema?

andreibondarev commented 1 year ago

@gottlike Want to take a stab at creating the PR? I would just encourage thinking through how this same method can also exist for all of the other supported vector search DBs since we aim to have a common interface here for all of them.

gottlike commented 1 year ago

@andreibondarev Not sure if it fits the other databases, but I'll have a look tomorrow and see if I can come up with something usable.

gottlike commented 1 year ago

@andreibondarev I just opened a PR for handling this a bit more smoothly in Qdrant. Is something like this feasible for the other databases, too?

Edit: Oops, just deleted my fork accidentally, but the code is still there for you to check.

andreibondarev commented 1 year ago

@gottlike I think we could instead introduce a schema method (def schema) that just returns the existing schema, or nil if there is none. You can then use it to decide whether you want to call create_default_schema() or not. How does that sound?

gottlike commented 1 year ago

@andreibondarev Sounds like an even better solution 👍

andreibondarev commented 1 year ago

@gottlike I went ahead and added the get_default_schema() method to most of the vector DBs: https://github.com/andreibondarev/langchainrb/pull/235.
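
A rough sketch of the check-then-create pattern discussed in this thread, written against a stand-in class rather than the real gem. It assumes get_default_schema returns nil when nothing exists yet, as proposed earlier; the actual return value in each vector DB adapter may differ. InMemoryStore and ensure_default_schema are hypothetical names used only for this illustration:

```ruby
# Stand-in for a vector search class; NOT the langchainrb implementation.
class InMemoryStore
  def initialize
    @schema = nil
  end

  # Assumed behaviour: nil when no schema/collection exists yet.
  def get_default_schema
    @schema
  end

  # Mimics the failure from this issue when called a second time.
  def create_default_schema
    raise "schema already exists" if @schema

    @schema = { "name" => "Recipe" }
  end
end

# Hypothetical helper: creates the schema only when none exists.
def ensure_default_schema(store)
  store.get_default_schema || store.create_default_schema
end

store = InMemoryStore.new
first = ensure_default_schema(store)   # creates the schema
second = ensure_default_schema(store)  # reuses it, does not raise
```

With the real classes, the same two calls would be made on the Langchain::Vectorsearch::Qdrant instance from the original example, provided the nil assumption holds there.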

Please let us know if there are any other issues! I'm closing this issue for now.

gottlike commented 1 year ago

@andreibondarev Awesome, thanks!