The goal of this task is to develop a script that automates the process of pulling Kyma BTP and Kyma Open-Source documentation (in .md format), filtering for relevant documents, embedding them using a suitable model, and storing the resulting embeddings in the Hana Vector Database. The embedding model used should be carefully selected, with a suggestion to start by exploring OpenAI models, given their success in previous PoC experiments. An appropriate chunking strategy for breaking down the documentation into manageable parts must also be implemented. A plan to trigger this script will be discussed with the team for follow-up tasks.
This task can be parallelized: two people can work on it and split the subtasks however they decide. (Strongly recommended.)
Subtasks
Pull Kyma Documentation:
Write a script to pull Kyma BTP and Kyma Open-Source documentation in .md format from their respective sources.
Ensure that the script covers all relevant documents for both BTP and Open-Source versions.
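A minimal sketch of the pull step, assuming the documentation sources are Git repositories that can be shallow-cloned. The helper names `shallow_clone` and `collect_markdown` are hypothetical, and the repository URLs are left as parameters because the exact sources are not fixed in this task.

```python
import subprocess
from pathlib import Path


def shallow_clone(repo_url: str, dest: Path) -> Path:
    """Clone only the latest commit of repo_url into dest (requires git on PATH)."""
    subprocess.run(
        ["git", "clone", "--depth", "1", repo_url, str(dest)],
        check=True,
    )
    return dest


def collect_markdown(root: Path) -> list[Path]:
    """Return every .md file under root, sorted, skipping the .git directory."""
    return sorted(
        p
        for p in root.rglob("*.md")
        if p.is_file() and ".git" not in p.relative_to(root).parts
    )
```

Calling `collect_markdown(shallow_clone(url, dest))` once per source would give the raw file list that the filtering step then narrows down.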
Filter Relevant Documentation Files:
Implement logic to keep only the relevant documentation files for embedding, based on predefined criteria.
Define what constitutes "relevant" documents in the context of Kyma Companion’s needs (e.g., technical reference docs, API documentation, core concepts, etc.).
Ensure non-relevant files (e.g., examples, license files, or changelogs) are excluded from processing.
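The filtering rule could start as a simple path-based predicate. The sketch below is illustrative only: the excluded directory and file names are placeholder assumptions, since the actual relevance criteria still need to be defined with the team.

```python
from pathlib import PurePosixPath

# Hypothetical exclusion rules; the real criteria are still to be agreed on.
EXCLUDED_DIRS = {"examples", "node_modules", ".github"}
EXCLUDED_NAMES = {"license.md", "changelog.md", "code_of_conduct.md", "contributing.md"}


def is_relevant(rel_path: str) -> bool:
    """Decide whether a repository-relative path should be embedded."""
    path = PurePosixPath(rel_path)
    if path.suffix.lower() != ".md":
        return False
    if path.name.lower() in EXCLUDED_NAMES:
        return False
    # Reject the file if any parent directory is on the exclusion list.
    return not any(part.lower() in EXCLUDED_DIRS for part in path.parts[:-1])
```

Keeping the rules in two plain sets makes them easy to review and extend once "relevant" is defined more precisely.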
Choose an Embedding Model:
Research and select an appropriate embedding model for converting the cleaned documentation into vector embeddings.
Start by evaluating OpenAI’s embedding models (used previously in PoC) and explore other alternatives if necessary.
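One way to compare candidate models is a small retrieval check: embed a handful of documents and labelled questions with each model and measure how often the expected document ranks first. The sketch below is model-agnostic. `embed` is any callable that maps a list of texts to a list of vectors, so a wrapper around an OpenAI model (or any alternative) can be plugged in. The function names and the top-1 metric are assumptions for illustration, not an agreed evaluation method.

```python
import math
from typing import Callable, Sequence

# An embedder maps a batch of texts to one vector per text.
Embedder = Callable[[Sequence[str]], list[list[float]]]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top1_accuracy(
    embed: Embedder,
    docs: Sequence[str],
    labelled_queries: Sequence[tuple[str, int]],
) -> float:
    """Fraction of queries whose most similar document is the expected one."""
    if not labelled_queries:
        raise ValueError("at least one labelled query is required")
    doc_vecs = embed(docs)
    query_vecs = embed([query for query, _ in labelled_queries])
    hits = 0
    for (_, expected), qv in zip(labelled_queries, query_vecs):
        best = max(range(len(docs)), key=lambda i: cosine(qv, doc_vecs[i]))
        hits += best == expected
    return hits / len(labelled_queries)
```

Running the same labelled set through each candidate gives a rough, comparable number per model before committing to one.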
Implement Chunking Strategy:
Define an initial strategy for breaking down the documentation into smaller chunks to ensure effective and meaningful embeddings.
Test chunking strategies for both large and small documentation files to strike a balance between chunk size and relevance.
Use PoC experiments as a reference to guide the chunking implementation.
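As a starting point, a heading-based strategy could look like the sketch below: split each file on Markdown headings, then split oversized sections on paragraph breaks. This is one possible approach, not the strategy from the PoC. `chunk_markdown` and the `max_chars` default are hypothetical. Known limitations: a `#` comment line inside a fenced code block is treated as a heading, and a single paragraph longer than `max_chars` is kept whole.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def chunk_markdown(text: str, max_chars: int = 1000) -> list[dict]:
    """Split on headings, then split oversized sections on paragraph breaks."""
    sections = []
    title, lines = "", []
    for line in text.splitlines():
        match = HEADING.match(line)
        if match:
            # Close the previous section if it has any non-blank content.
            if any(l.strip() for l in lines):
                sections.append((title, "\n".join(lines).strip()))
            title, lines = match.group(2).strip(), []
        else:
            lines.append(line)
    if any(l.strip() for l in lines):
        sections.append((title, "\n".join(lines).strip()))

    chunks = []
    for title, body in sections:
        current = ""
        for para in re.split(r"\n\s*\n", body):
            # Start a new chunk when adding this paragraph would exceed the limit.
            if current and len(current) + len(para) + 2 > max_chars:
                chunks.append({"section": title, "text": current})
                current = para
            else:
                current = f"{current}\n\n{para}" if current else para
        if current:
            chunks.append({"section": title, "text": current})
    return chunks
```

Each chunk keeps its section title, which can later be stored as metadata next to the embedding.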
Store Embeddings in Hana Vector Database:
Once the documentation is embedded, develop the logic to store the resulting embeddings in the Hana Vector Database.
Ensure that all relevant metadata (document title, section, source, etc.) is stored along with the embeddings for easy retrieval.
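The storage step might be sketched as follows, keeping the metadata and the vector in one row. Everything database-specific here is an assumption to verify against the HANA documentation and the chosen driver: the table name `KYMA_DOCS`, its column layout, the `TO_REAL_VECTOR` conversion, and the `?` placeholder style. The `cursor` argument is any Python DB-API cursor.

```python
from dataclasses import dataclass


@dataclass
class EmbeddedChunk:
    title: str
    section: str
    source: str
    text: str
    vector: list[float]


# Assumption: this table already exists; name and columns are hypothetical.
INSERT_SQL = (
    "INSERT INTO KYMA_DOCS (TITLE, SECTION, SOURCE, CONTENT, EMBEDDING) "
    "VALUES (?, ?, ?, ?, TO_REAL_VECTOR(?))"
)


def to_row(chunk: EmbeddedChunk) -> tuple:
    """Flatten a chunk into SQL parameters; the vector becomes '[v1,v2,...]'."""
    vector_text = "[" + ",".join(repr(float(v)) for v in chunk.vector) + "]"
    return (chunk.title, chunk.section, chunk.source, chunk.text, vector_text)


def store_chunks(cursor, chunks: list[EmbeddedChunk]) -> int:
    """Insert all chunks in one batch and return how many rows were sent."""
    rows = [to_row(c) for c in chunks]
    if rows:
        cursor.executemany(INSERT_SQL, rows)
    return len(rows)
```

Passing the cursor in keeps the function independent of how the connection is created and makes it easy to exercise with a stub during development.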
Propose Triggering Mechanism:
As part of this task, propose an efficient method for triggering the script (e.g., manual trigger, automated based on repository changes).
Discuss this triggering method with the team to gather input for a follow-up task.
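Whichever trigger is chosen, the script could avoid re-embedding unchanged files by comparing content hashes against the previous run. The sketch below shows one possible change-detection helper. The manifest format (path mapped to SHA-256 hex digest) and the function names are assumptions for discussion.

```python
import hashlib


def content_hash(text: str) -> str:
    """SHA-256 hex digest of a document's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def diff_manifest(
    previous: dict[str, str], current_docs: dict[str, str]
) -> tuple[list[str], list[str], dict[str, str]]:
    """Compare the last run's manifest with the documents pulled now.

    previous maps path -> hash from the last run; current_docs maps
    path -> file text. Returns (changed_or_new, removed, new_manifest).
    """
    current = {path: content_hash(text) for path, text in current_docs.items()}
    changed = sorted(p for p, h in current.items() if previous.get(p) != h)
    removed = sorted(p for p in previous if p not in current)
    return changed, removed, current
```

Only the `changed` paths would need chunking and embedding again, and the `removed` paths identify rows to delete from the vector store.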
Subtasks
Prepare the Kyma documents: filter and clean up the *.md files automatically.
Choose an embedding model - Mansur
Implement chunking and store the resulting embeddings in the Vector DB - Mansur
Come up with an automatic indexing mechanism
Acceptance Criteria
[x] The script successfully pulls Kyma BTP and Kyma Open-Source documentation in .md format.
[x] Non-relevant files are excluded, and only relevant documentation files are processed.
[x] An appropriate embedding model is selected and used to generate vector embeddings for the documentation.
[x] Documentation is chunked effectively, ensuring relevant embeddings are created.
[x] Embeddings and related metadata are stored in the Hana Vector Database.
[ ] A method for triggering the script is proposed and discussed with the team.