Localization strategy based on semantic vector search

Let's add another implementation for LocalizationStrategy based on semantic vector search.

Implementation plan details:

Langchain: Use langchain abstractions for building this. E.g.,
- use from langchain_core.embeddings import Embeddings as the abstraction for embeddings.
- use from langchain_core.documents import Document as the abstraction for the document to be indexed
- use from langchain_core.vectorstores import VectorStore as the abstraction for the vector store
Milvus: Use from langchain_milvus import Milvus as the vector store
Embedding model:
- add an embedding task to LLM model configurations. Each provider may supply a model_name for embeddings.
- expose a factory method in the LLM interface to fetch embeddings.
  - for openai as the provider, the factory should return from langchain_openai import OpenAIEmbeddings
  - for ollama as the provider, the factory should return from langchain_ollama import OllamaEmbeddings
  - for others, we'll assume from langchain_huggingface import HuggingFaceEmbeddings
Vector store creation and storage:
- create a vector store per project. This will not exist at the time of initial project onboarding.
- add the embedding for the semantic description document (for each code file) in the vector store
  - add the relative file path (in the GitHub repository) as metadata to the Document being embedded and stored in the vector store
- save the vector store in the metadata folder
- for any subsequent use load the already stored vector store (instead of creating it)
- for incremental updates replace the already existing document in the vector store.
Semantic vector search strategy:
- extend from LocalizationStrategy
- construct using project
- use similarity search on vector store from project to get localization results
- note that the result documents should already have the file paths in their metadata.

To implement a localization strategy based on semantic vector search, we need a new class that extends LocalizationStrategy and uses LangChain's abstractions for embeddings, documents, and vector stores. Milvus stores and queries the vector representations of the semantic descriptions of code files.

Here's a step-by-step outline with code snippets for the key changes:

Extend LocalizationStrategy with Semantic Vector Search:

Create a new class SemanticVectorSearchLocalization in localization_strategy.py:

import os
from typing import Dict, List, Optional

from langchain_core.embeddings import Embeddings
from langchain_core.vectorstores import VectorStore
from langchain_milvus import Milvus

class SemanticVectorSearchLocalization(LocalizationStrategy):
    def __init__(self, project_path: str, embedding_model: Embeddings, milvus_uri: Optional[str] = None):
        self.project_path = project_path
        self.embedding_model = embedding_model
        # langchain_milvus.Milvus takes the embedding model as `embedding_function`
        # and the connection details as `connection_args`. It has no load()/create()
        # methods: an existing collection is reused, and a missing one is created
        # on the first insert.
        self.vector_store: VectorStore = Milvus(
            embedding_function=embedding_model,
            connection_args={"uri": milvus_uri or self._get_vector_store_path()},
        )

    def _get_vector_store_path(self) -> str:
        # Assumption: when no server URI is given, fall back to a local Milvus Lite
        # file in the project's metadata folder (Milvus Lite expects a *.db path).
        return os.path.join(self.project_path, 'metadata', 'vector_store.db')

    def localize(self, issue: Dict[str, str], top_n: int) -> List[str]:
        # similarity_search embeds the query text itself, so pass the raw description.
        results = self.vector_store.similarity_search(issue['description'], k=top_n)
        return [result.metadata['file_path'] for result in results]
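
For intuition about what the similarity search inside localize does, here is a minimal sketch that uses only the standard library. The three-dimensional vectors, the file paths, and the helper names cosine_similarity and top_n_file_paths are all made up for illustration; in the real strategy the vectors would come from the configured embedding model and Milvus would do the ranking.

```python
import math
from typing import Dict, List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_n_file_paths(
    query_vector: Sequence[float],
    indexed: Dict[str, Sequence[float]],
    top_n: int,
) -> List[str]:
    """Return the file paths whose vectors are most similar to the query."""
    ranked = sorted(
        indexed,
        key=lambda path: cosine_similarity(query_vector, indexed[path]),
        reverse=True,
    )
    return ranked[:top_n]


# Toy "embeddings" of per-file semantic descriptions, keyed by a
# hypothetical relative file path (the metadata the strategy returns).
INDEXED = {
    "pkg/search.py": [0.9, 0.1, 0.0],
    "pkg/config.py": [0.1, 0.9, 0.0],
    "docs/readme.md": [0.0, 0.1, 0.9],
}

if __name__ == "__main__":
    query = [0.8, 0.3, 0.0]  # toy embedding of an issue description
    print(top_n_file_paths(query, INDEXED, top_n=2))
```

The point of the sketch is only that the issue description and the file descriptions live in the same vector space, and that the strategy returns the file paths attached to the nearest documents.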

Add an Embedding Task to LLM Model Configurations:

Update TaskName in model_configuration_manager.py:

class TaskName(Enum):
    GENERATE_CODE_SUMMARY = "generate_code_summary"
    GENERATE_PACKAGE_SUMMARY = "generate_package_summary"
    GENERATE_REPO_SUMMARY = "generate_repo_summary"
    LOCALIZE = "localize"
    GENERATE_SUGGESTIONS = "generate_suggestions"
    EMBEDDING = "embedding"  # New task for embeddings
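
Each provider also needs a model_name for the new task. The sketch below shows one way that lookup could behave; TaskConfig, EMBEDDING_DEFAULTS, and get_embedding_task_config are hypothetical names, not the project's actual configuration classes, and the model names are just example choices for each provider.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class TaskConfig:
    """Hypothetical stand-in for a per-task model configuration."""
    model_name: str


# Example embedding models per provider; adjust to whatever the project uses.
EMBEDDING_DEFAULTS: Dict[str, TaskConfig] = {
    "openai": TaskConfig(model_name="text-embedding-3-small"),
    "ollama": TaskConfig(model_name="nomic-embed-text"),
    "huggingface": TaskConfig(model_name="sentence-transformers/all-MiniLM-L6-v2"),
}


def get_embedding_task_config(provider: str) -> TaskConfig:
    """Look up the embedding model for a provider.

    Providers without an entry fall back to the Hugging Face default,
    mirroring the "for others" rule in the plan.
    """
    return EMBEDDING_DEFAULTS.get(provider, EMBEDDING_DEFAULTS["huggingface"])
```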

Expose a Factory Method for Fetching Embeddings in api.py:

Add a function to get embeddings model based on the provider:

from langchain_core.embeddings import Embeddings
from langchain_openai import OpenAIEmbeddings
from langchain_ollama import OllamaEmbeddings
from langchain_huggingface import HuggingFaceEmbeddings

def fetch_embedding_model() -> Embeddings:
    task_config = config.get_task_config(PROVIDER, TaskName.EMBEDDING)
    model_name = task_config.model_name

    if PROVIDER == "openai":
        return OpenAIEmbeddings(model=model_name)
    elif PROVIDER == "ollama":
        return OllamaEmbeddings(model=model_name)
    else:
        # HuggingFaceEmbeddings names this parameter model_name, not model.
        return HuggingFaceEmbeddings(model_name=model_name)
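
One possible refinement, sketched here as an assumption rather than as part of the plan, is to import the provider package lazily so that only the package for the selected provider has to be installed. The names _EMBEDDING_SPECS, resolve_embedding_spec, and build_embeddings are hypothetical. Note that HuggingFaceEmbeddings takes the model through the model_name keyword, while the OpenAI and Ollama classes use model.

```python
import importlib
from typing import Any, Dict, Tuple

# provider -> (module, class name, keyword argument that carries the model name)
_EMBEDDING_SPECS: Dict[str, Tuple[str, str, str]] = {
    "openai": ("langchain_openai", "OpenAIEmbeddings", "model"),
    "ollama": ("langchain_ollama", "OllamaEmbeddings", "model"),
}
# Any other provider is assumed to use Hugging Face embeddings.
_DEFAULT_SPEC: Tuple[str, str, str] = (
    "langchain_huggingface",
    "HuggingFaceEmbeddings",
    "model_name",
)


def resolve_embedding_spec(provider: str) -> Tuple[str, str, str]:
    """Return (module, class name, model keyword) for a provider."""
    return _EMBEDDING_SPECS.get(provider, _DEFAULT_SPEC)


def build_embeddings(provider: str, model_name: str) -> Any:
    """Instantiate the embeddings class for a provider.

    The provider package is imported only here, so a missing package for an
    unused provider does not break module import.
    """
    module_name, class_name, model_kwarg = resolve_embedding_spec(provider)
    module = importlib.import_module(module_name)
    embeddings_cls = getattr(module, class_name)
    return embeddings_cls(**{model_kwarg: model_name})
```

With this shape, fetch_embedding_model could reduce to a call such as build_embeddings(PROVIDER, task_config.model_name).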

Usage Example:

When creating an instance of SemanticVectorSearchLocalization, use the factory method to fetch the embedding model:

embedding_model = fetch_embedding_model()
localization_strategy = SemanticVectorSearchLocalization(
    project_path="/path/to/project",
    embedding_model=embedding_model,
    milvus_uri="http://localhost:19530"
)

Vector Store Management:

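The class above only covers search. Vector store creation and incremental updates still need to be implemented: index one Document per code file (its semantic description, with the relative file path as metadata), and on incremental updates replace the existing document for a changed file instead of adding a duplicate.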
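
The replace-on-update behaviour can be sketched without any vector store at all. The toy index below uses only the standard library; ToyProjectIndex and IndexedDocument are hypothetical stand-ins, and embeddings are left out on purpose. The idea it illustrates is to use the relative file path as the document id, so that re-indexing a file overwrites its earlier entry.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class IndexedDocument:
    """Mimics the shape of a document: text plus metadata."""
    page_content: str
    metadata: Dict[str, str]


@dataclass
class ToyProjectIndex:
    """Stand-in for a per-project vector store, keyed by relative file path."""
    documents: Dict[str, IndexedDocument] = field(default_factory=dict)

    def upsert(self, file_path: str, semantic_description: str) -> None:
        # The file path doubles as the document id, so indexing the same
        # file again replaces the old entry instead of adding a duplicate.
        self.documents[file_path] = IndexedDocument(
            page_content=semantic_description,
            metadata={"file_path": file_path},
        )

    def remove(self, file_path: str) -> None:
        # Deleting a file that was never indexed is a no-op.
        self.documents.pop(file_path, None)

    def file_paths(self) -> List[str]:
        return sorted(self.documents)


if __name__ == "__main__":
    index = ToyProjectIndex()
    index.upsert("pkg/search.py", "Implements keyword search.")
    index.upsert("pkg/search.py", "Implements semantic vector search.")
    print(index.file_paths())
```

With a LangChain vector store, one plausible mapping (an assumption, not something the plan specifies) is to call delete(ids=[file_path]) followed by add_documents([document], ids=[file_path]), which relies on the store accepting caller-supplied string ids.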

This code structure sets up a semantic vector search-based localization strategy using Langchain abstractions and Milvus, while allowing for flexible embedding model selection based on the provider configuration.

pdhoolia / se-agent

Localization strategy based on semantic vector search #21