kyma-project / kyma-companion

A tool that brings AI to Kyma
Apache License 2.0
2 stars, 11 forks

Create Script to Pull, Clean, Embed, and Store Kyma Documentation in Hana Vector Database #199

Open Teneroy opened 1 month ago

Teneroy commented 1 month ago

Description

The goal of this task is to develop a script that automates the process of pulling Kyma BTP and Kyma Open-Source documentation (in .md format), filtering for relevant documents, embedding them using a suitable model, and storing the resulting embeddings in the Hana Vector Database. The embedding model used should be carefully selected, with a suggestion to start by exploring OpenAI models, given their success in previous PoC experiments. An appropriate chunking strategy for breaking down the documentation into manageable parts must also be implemented. A plan to trigger this script will be discussed with the team for follow-up tasks.

This task can be parallelized: two people can work on it and split the subtasks however they decide. (Strongly recommended.)

Subtasks

  1. Pull Kyma Documentation:

    • Write a script to pull Kyma BTP and Kyma Open-Source documentation in .md format from their respective sources.
    • Ensure that the script covers all relevant documents for both BTP and Open-Source versions.
  2. Filter Relevant Documentation Files:

    • Implement logic to keep only the relevant documentation files for embedding, based on predefined criteria.
    • Define what constitutes "relevant" documents in the context of Kyma Companion’s needs (e.g., technical reference docs, API documentation, core concepts).
    • Ensure non-relevant files (e.g., examples, license files, or changelogs) are excluded from processing.
  3. Choose an Embedding Model:

    • Research and select an appropriate embedding model for converting the cleaned documentation into vector embeddings.
    • Start by evaluating OpenAI’s embedding models (used previously in PoC) and explore other alternatives if necessary.
  4. Implement Chunking Strategy:

    • Define an initial strategy for breaking down the documentation into smaller chunks to ensure effective and meaningful embeddings.
    • Test chunking strategies for both large and small documentation files to strike a balance between chunk size and relevance.
    • Use PoC experiments as a reference to guide the chunking implementation.
  5. Store Embeddings in Hana Vector Database:

    • Once the documentation is embedded, develop the logic to store the resulting embeddings in the Hana Vector Database.
    • Ensure that all relevant metadata (document title, section, source, etc.) is stored along with the embeddings for easy retrieval.
  6. Propose Triggering Mechanism:

    • As part of this task, propose an efficient method for triggering the script (e.g., manual trigger, automated based on repository changes).
    • Discuss this triggering method with the team to gather input for a follow-up task.
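
As a starting point for subtasks 2 and 4, the filtering and chunking steps could look roughly like the sketch below. It is only an illustration under stated assumptions: the exclusion lists, the `Chunk` fields, the function names `is_relevant` and `chunk_markdown`, and the `max_chars` default are all hypothetical and not taken from the PoC. It splits each file on Markdown headings (levels 1 to 3), then splits oversized sections on paragraph boundaries, and keeps the title, section, and source metadata mentioned in subtask 5. Embedding and storage are deliberately left out.

```python
import re
from dataclasses import dataclass
from pathlib import PurePosixPath

# Hypothetical exclusion rules; the real relevance criteria are still to be defined.
EXCLUDED_NAMES = {"license.md", "changelog.md", "contributing.md", "code_of_conduct.md"}
EXCLUDED_DIRS = {"examples", "node_modules"}

# Matches Markdown headings of level 1 to 3, e.g. "## Installation".
HEADING = re.compile(r"^(#{1,3})\s+(.*\S)\s*$")


@dataclass
class Chunk:
    source: str   # path of the originating .md file
    title: str    # first level-1 heading, or the file stem if there is none
    section: str  # heading the chunk falls under
    text: str     # chunk body to be embedded


def is_relevant(path: str) -> bool:
    """Keep .md files that are not excluded by name or by parent directory."""
    p = PurePosixPath(path)
    if p.suffix.lower() != ".md":
        return False
    if p.name.lower() in EXCLUDED_NAMES:
        return False
    return not any(part.lower() in EXCLUDED_DIRS for part in p.parts[:-1])


def chunk_markdown(path: str, content: str, max_chars: int = 1000) -> list[Chunk]:
    """Split a document by headings, then by paragraphs if a section is too long."""
    title = PurePosixPath(path).stem
    title_found = False
    sections: list[tuple[str, list[str]]] = [(title, [])]
    in_fence = False
    for line in content.splitlines():
        # Track fenced code blocks so that "# comment" lines are not read as headings.
        if line.lstrip().startswith("```"):
            in_fence = not in_fence
        match = None if in_fence else HEADING.match(line)
        if match:
            if len(match.group(1)) == 1 and not title_found:
                title, title_found = match.group(2), True
            sections.append((match.group(2), []))
        else:
            sections[-1][1].append(line)

    chunks: list[Chunk] = []
    for section, lines in sections:
        body = "\n".join(lines).strip()
        if not body:
            continue
        buf = ""
        # Greedily pack paragraphs; a single paragraph longer than max_chars stays whole.
        for para in re.split(r"\n\s*\n", body):
            if buf and len(buf) + len(para) + 2 > max_chars:
                chunks.append(Chunk(path, title, section, buf))
                buf = para
            else:
                buf = f"{buf}\n\n{para}" if buf else para
        chunks.append(Chunk(path, title, section, buf))
    return chunks
```

Each resulting `Chunk` could then be passed to whichever embedding model is selected in subtask 3 and written to the vector store together with its `source`, `title`, and `section` fields. Heading-based splitting is just one candidate; the PoC results should decide the actual strategy and chunk size.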

Acceptance Criteria

mfaizanse commented 2 days ago

Follow-up ticket: https://github.com/kyma-project/kyma-companion/issues/242

mfaizanse commented 2 days ago

Todo(s):