Create Script to Pull, Clean, Embed, and Store Kyma Documentation in Hana Vector Database

Teneroy commented 1 month ago

Description

The goal of this task is to develop a script that automates the process of pulling Kyma BTP and Kyma Open-Source documentation (in .md format), filtering for relevant documents, embedding them using a suitable model, and storing the resulting embeddings in the Hana Vector Database. The embedding model used should be carefully selected, with a suggestion to start by exploring OpenAI models, given their success in previous PoC experiments. An appropriate chunking strategy for breaking down the documentation into manageable parts must also be implemented. A plan to trigger this script will be discussed with the team for follow-up tasks.

This task can be parallelized: two people can work on it and split the subtasks however they decide. (Strongly recommended.)

Subtasks

Pull Kyma Documentation:
- Write a script to pull Kyma BTP and Kyma Open-Source documentation in .md format from their respective sources.
- Ensure that the script covers all relevant documents for both BTP and Open-Source versions.
Filter Relevant Documentation Files:
- Implement logic to keep only the relevant documentation files for embedding, based on predefined criteria.
- Define what constitutes "relevant" documents in the context of Kyma Companion’s needs (e.g., technical reference docs, API documentation, core concepts, etc.).
- Ensure non-relevant files (e.g., examples, license files, or changelogs) are excluded from processing.
Choose an Embedding Model:
- Research and select an appropriate embedding model for converting the cleaned documentation into vector embeddings.
- Start by evaluating OpenAI’s embedding models (used previously in PoC) and explore other alternatives if necessary.
Implement Chunking Strategy:
- Define an initial strategy for breaking down the documentation into smaller chunks to ensure effective and meaningful embeddings.
- Test chunking strategies for both large and small documentation files to strike a balance between chunk size and relevance.
- Use PoC experiments as a reference to guide the chunking implementation.
Store Embeddings in Hana Vector Database:
- Once the documentation is embedded, develop the logic to store the resulting embeddings in the Hana Vector Database.
- Ensure that all relevant metadata (document title, section, source, etc.) is stored along with the embeddings for easy retrieval.
Propose Triggering Mechanism:
- As part of this task, propose an efficient method for triggering the script (e.g., manual trigger, automated based on repository changes).
- Discuss this triggering method with the team to gather input for a follow-up task.
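
To make the filtering and chunking subtasks more concrete, here is a minimal sketch in Python. It is illustrative only and not the chosen implementation: the exclusion rules (`EXCLUDED_NAMES`, `EXCLUDED_DIRS`), the `Chunk` fields, and the heading-based splitting are all assumptions, since the relevance criteria and the chunking strategy are still to be defined. It shows one way to drop non-relevant files and to split a Markdown file at its headings while keeping the title, section, and source as metadata for each chunk.

```python
import re
from dataclasses import dataclass
from pathlib import PurePosixPath

# Hypothetical exclusion rules; the real relevance criteria are still open.
EXCLUDED_NAMES = {"license.md", "changelog.md"}
EXCLUDED_DIRS = {"examples"}


def is_relevant(path: str) -> bool:
    """Keep only .md files that are not in an excluded directory or name list."""
    p = PurePosixPath(path)
    if p.suffix.lower() != ".md":
        return False
    if p.name.lower() in EXCLUDED_NAMES:
        return False
    return not any(part.lower() in EXCLUDED_DIRS for part in p.parts[:-1])


@dataclass
class Chunk:
    title: str    # first level-1 heading of the document (or the source path)
    section: str  # heading under which this chunk appears
    source: str   # path or URL the document was pulled from
    text: str     # chunk body that would be passed to the embedding model


HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def chunk_markdown(text: str, source: str) -> list[Chunk]:
    """Split a Markdown document into one chunk per heading section."""
    title = ""
    section = ""
    buffer: list[str] = []
    chunks: list[Chunk] = []
    in_fence = False

    def flush() -> None:
        body = "\n".join(buffer).strip()
        if body:
            chunks.append(Chunk(title or source, section, source, body))
        buffer.clear()

    for line in text.splitlines():
        # Track fenced code blocks so that "# comment" lines are not
        # mistaken for headings.
        if line.lstrip().startswith("```"):
            in_fence = not in_fence
        match = None if in_fence else HEADING.match(line)
        if match:
            flush()
            heading = match.group(2).strip()
            if len(match.group(1)) == 1 and not title:
                title = heading
            section = heading
        else:
            buffer.append(line)
    flush()
    return chunks
```

Splitting at headings keeps each chunk aligned with one topic, but very long sections would likely still need a size-based split on top of this, which is one of the things the chunking experiments should settle.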

Subtasks

Prepare Kyma documents: filter and clean up the .md files automatically.
Choose an embedding model - Mansur
Implement chunking and store the chunks in the vector DB - Mansur
Come up with an automatic indexing mechanism

Acceptance Criteria

[x] The script successfully pulls Kyma BTP and Kyma Open-Source documentation in .md format.
[x] Non-relevant files are excluded, and only relevant documentation files are processed.
[x] An appropriate embedding model is selected and used to generate vector embeddings for the documentation.
[x] Documentation is chunked effectively, ensuring relevant embeddings are created.
[x] Embeddings and related metadata are stored in the Hana Vector Database.
[ ] A method for triggering the script is proposed and discussed with the team.

mfaizanse commented 2 days ago

Follow-up ticket: https://github.com/kyma-project/kyma-companion/issues/242

mfaizanse commented 2 days ago

Todo(s):

[x] Rate limit and retries
[x] More logs
[x] Default table name
[x] Clean up table in tests
[x] Tests in fetcher
[x] Update documentation for fetcher
[ ] GitHub Actions (optional)
[ ] Follow-up
- [ ] BTP docs
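
For reference, the "Rate limit and retries" item could look roughly like the sketch below. This is a generic retry-with-exponential-backoff helper, not the code that was merged: `with_retries` and its parameters are hypothetical names, and a real version would probably catch only the rate-limit errors raised by the embedding client rather than every exception.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_retries(
    call: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `call`, retrying on failure with exponential backoff and jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except Exception:
            # Give up and re-raise once the last attempt has failed.
            if attempt == max_attempts:
                raise
            # Delay doubles on each attempt: base, 2*base, 4*base, ...
            delay = base_delay * 2 ** (attempt - 1)
            # Jitter spreads out retries from parallel workers.
            sleep(delay + random.uniform(0, delay / 2))
    raise ValueError("max_attempts must be at least 1")
```

The `sleep` parameter is injectable so that tests can run without actually waiting.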
