hackerai-tech / PentestGPT

AI-Powered Automated Penetration Testing Tool
https://pentestgpt.ai/
GNU General Public License v3.0

Improve RAG with Custom Code for Embedding and Metadata Extraction #204

Open RostyslavManko opened 5 months ago

RostyslavManko commented 5 months ago

Description

The quality of AI responses is the most important factor for HackerGPT. To improve the RAG system, we need custom code for text embedding and for extracting metadata (summary, title, and keywords). Embeddings should be generated with text-embedding-3-large at 3072 dimensions. Settings and parameters should be tuned for our data, which consists of guides and tutorials about ethical hacking, and the chunking method should be chosen to produce the best possible vectors. The code must be able to process md and txt files.
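
To make the metadata part of the request more concrete, here is a minimal sketch of title and keyword extraction for md and txt content, using only the Python standard library. Everything in it is an assumption for illustration: the function names, the small stopword list, and the frequency-based keyword heuristic are hypothetical and not part of the existing codebase. Summary generation is left out because it would most likely need an LLM call.

```python
import re
from collections import Counter
from typing import Dict, List, Union

# Hypothetical, deliberately small stopword list; a real one would be larger.
STOPWORDS = {"the", "and", "for", "with", "this", "that", "are", "from", "you", "can"}


def extract_title(text: str) -> str:
    """Use the first non-empty line as the title, dropping markdown '#' markers."""
    for line in text.splitlines():
        stripped = line.strip()
        if stripped:
            return stripped.lstrip("#").strip()
    return ""


def extract_keywords(text: str, top_n: int = 5) -> List[str]:
    """Return the most frequent words of three or more characters, minus stopwords."""
    words = re.findall(r"[a-z][a-z0-9-]{2,}", text.lower())
    counts = Counter(word for word in words if word not in STOPWORDS)
    return [word for word, _ in counts.most_common(top_n)]


def extract_metadata(text: str, source: str) -> Dict[str, Union[str, List[str]]]:
    """Bundle the metadata that would be stored alongside each vector."""
    return {
        "source": source,
        "title": extract_title(text),
        "keywords": extract_keywords(text),
    }
```

A frequency count is only a baseline; for guides about ethical hacking, tool names and technique names would probably deserve extra weight, which this sketch does not attempt.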

Assignee

@fkesheh

Objective

Our goal is to improve the RAG system by creating custom code for text embedding and metadata extraction, which will ultimately enhance the quality of AI responses.

Actions and Considerations (ACC)

  1. Research Best Practices:
    • [ ] Investigate the best possible settings, parameters, and chunking methods for text embedding and metadata extraction, specifically for our kind of data (guides and tutorials about ethical hacking).

  2. Create Custom Code for Text Embedding and Metadata Extraction:
    • [ ] Develop code that uses text-embedding-3-large with 3072 dimensions to create embeddings for our data.
    • [ ] Implement functionality to extract relevant metadata, such as summary, title, and keywords, from the guides and tutorials.
    • [ ] Use chunks of 800 tokens with an overlap of 400 tokens for better vector generation.
    • [ ] Ensure the code supports processing md and txt files.

  3. Integrate Custom Code with RAG System:
    • [ ] Update the RAG code to utilize the new text embeddings and metadata for new vectors.

  4. Testing and Quality Assurance:
    • [ ] Conduct thorough testing to ensure that the custom code works as expected and significantly improves the RAG system's performance.
    • [ ] Test various scenarios, including potential edge cases, to guarantee a robust and reliable solution.
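
The chunking item above (800 tokens with a 400-token overlap) could be sketched roughly as follows. This is an illustration under stated assumptions, not the project's implementation: `chunk_tokens` is a hypothetical helper, and it operates on an already tokenized sequence. In practice the tokens would come from a real tokenizer matching the embedding model (for example via the tiktoken library), which is not shown here.

```python
from typing import List, Sequence


def chunk_tokens(
    tokens: Sequence[str], chunk_size: int = 800, overlap: int = 400
) -> List[List[str]]:
    """Split a token sequence into fixed-size windows that overlap.

    Each window starts (chunk_size - overlap) tokens after the previous one.
    The final window may be shorter than chunk_size.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap
    chunks: List[List[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(list(tokens[start:start + chunk_size]))
        # Stop once a window has reached the end of the sequence,
        # so no trailing window is fully contained in the previous one.
        if start + chunk_size >= len(tokens):
            break
    return chunks
```

Because the overlap is half the chunk size, most tokens end up in two chunks, so the number of vectors (and the embedding cost) is roughly double that of non-overlapping 800-token chunks. That trade-off is worth measuring during the testing step.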

Expected Outcomes

fkesheh commented 5 months ago

In this case, are you referring to the chat with files in the interface or the Pinecone RAG?

RostyslavManko commented 4 months ago

@fkesheh Sorry, I only saw your message today. I was talking about Pinecone RAG.