treygrainger / ai-powered-search

The codebase for the book "AI-Powered Search" (Manning Publications, 2024)
https://aipoweredsearch.com

Add support for Weaviate #180

Open hsm207 opened 3 weeks ago

hsm207 commented 3 weeks ago

Signed-off-by: hsm207 hsm207@users.noreply.github.com

treygrainger commented 1 week ago

@hsm207 - Awesome to see you working on this. If you have any questions, don't hesitate to reach out.

Having done several other engine implementations, here's a checklist of the key things you'll come across:

- Dockerfile / Docker compose configuration
- Install Spark connector (inside the aips-notebooks Dockerfile)
- Collection management: creation/deletion/healthcheck
- Collection schemas:
  - Primitive field types: text, string, keyword, boolean, integer, double
  - location coordinate field
  - dense vector field: dimensions (512, 768), vector encoding/quantization (1bit, 32 bits), and dot_product similarity
  - tokenizers/filters: comma delimited, lower case, whitespace/punctuation, NGram, delimited payload

- Query functionality:
  - sorting, filtering, limit, query fields, return fields
  - multi-field search
  - AND/OR/NOT operators
  - minimum phrase matching
  - query time boosting
  - index time boosting
  - vector search
  - reranking by query
  - highlighting
  - debug/explain
  - spell check/autocomplete
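To make the "Collection schemas" item above a bit more concrete, here is a minimal sketch of how the generic field types might be translated into a Weaviate class definition. This is not code from the repository: `GENERIC_TO_WEAVIATE` and `build_weaviate_class` are hypothetical names, and the specific type and tokenization choices (for example, mapping `string`/`keyword` to a `text` property with `field` tokenization) are assumptions that should be checked against the Weaviate schema documentation.

```python
# Hypothetical mapping from the checklist's generic field types to
# (Weaviate data type, tokenization) pairs. The pairings are assumptions
# made for illustration, not the project's actual mapping.
GENERIC_TO_WEAVIATE = {
    "text": ("text", "word"),        # analyzed full-text field
    "string": ("text", "field"),     # whole value treated as one token
    "keyword": ("text", "field"),
    "boolean": ("boolean", None),
    "integer": ("int", None),
    "double": ("number", None),
    "location": ("geoCoordinates", None),
}


def build_weaviate_class(name, fields, distance="dot"):
    """Build a class-definition dict in the shape of Weaviate's schema JSON.

    `fields` maps field names to the generic type names used above.
    Vectors are assumed to be supplied by the caller, hence vectorizer "none".
    """
    properties = []
    for field_name, generic_type in fields.items():
        if generic_type not in GENERIC_TO_WEAVIATE:
            raise ValueError(f"Unsupported field type: {generic_type}")
        data_type, tokenization = GENERIC_TO_WEAVIATE[generic_type]
        prop = {"name": field_name, "dataType": [data_type]}
        if tokenization is not None:
            prop["tokenization"] = tokenization
        properties.append(prop)
    return {
        "class": name,
        "vectorizer": "none",
        "vectorIndexConfig": {"distance": distance},  # dot-product similarity
        "properties": properties,
    }


products_schema = build_weaviate_class(
    "Products",
    {"name": "text", "upc": "keyword", "in_stock": "boolean"},
)
```

The resulting dict is only a payload; creating the collection would still go through whichever Weaviate client or REST call the engine implementation uses.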

There are some other things, like hybrid search (reciprocal rank fusion), that are already implemented generically at the Collection level. You can override them in the WeaviateCollection to push them down into the engine, since Weaviate has native support built in.
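For reference, reciprocal rank fusion itself is small enough to sketch in a few lines. The version below is a generic illustration, not the repository's Collection-level implementation: the function name `reciprocal_rank_fusion` is hypothetical, and `k=60` is simply a commonly used smoothing constant.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    where rank starts at 1. Returns (doc_id, score) pairs, best first;
    ties are broken by doc_id so the output is deterministic.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Example: fuse a lexical result list with a vector result list.
lexical_results = ["a", "b", "c"]
vector_results = ["c", "a", "d"]
fused = reciprocal_rank_fusion([lexical_results, vector_results])
fused_ids = [doc_id for doc_id, _ in fused]  # ["a", "c", "b", "d"]
```

An engine-specific override would skip this client-side fusion and ask the engine to do the equivalent work in a single hybrid query.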

As mentioned in /engines/README.md, the LTR implementation is required but can be done outside the engine. Happy to chat with you on this if you need a generic implementation. The SparseLexicalSemanticSearch implementation is likewise required, but it's just crafting some very specific Weaviate query syntax for a handful of specific query patterns (popularity boosting, geo radius filtering, etc.). I wouldn't worry about the EntityExtractor or the SemanticKnowledgeGraph, as most engines don't have these built in and you can just treat them as external library calls.

At any rate, hope that's helpful. Let us know if you have any questions we can assist with!