Epic 7.2: Extended File Support for Flashcard Generator - Digital Guardians

marvelai-org / marvel-ai-backend

This is the Marvel Teaching Assistant AI repo.

MIT License


Epic 7.2: Extended File Support for Flashcard Generator - Digital Guardians #81

Open dinglunz opened 3 months ago

dinglunz commented 3 months ago

Standardized Document Loading:

Custom loaders now return a list of LangChain Document objects, which keeps them compatible with the load_summarize_chain.
Separate sub-loaders handle each document type: PDFs, PowerPoint presentations, text files, JSON files, Markdown files, DOCX files, Azure documents, CSV files, HTML files, Google Sheets, and Google Slides.
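
A minimal sketch of the dispatch idea described above, not the repo's actual code. The Document dataclass is a stand-in for LangChain's Document so the example runs without LangChain installed, and the LOADERS registry, the load_* helpers, and load_documents are hypothetical names. Only three simple formats are shown.

```python
import csv
import io
import json
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Document:
    """Stand-in for LangChain's Document (page_content plus metadata)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def load_text(raw: bytes, source: str) -> List[Document]:
    # Plain text and Markdown: one Document for the whole file.
    return [Document(raw.decode("utf-8"), {"source": source})]


def load_json(raw: bytes, source: str) -> List[Document]:
    # Parse and re-serialize so the content is normalized, readable text.
    data = json.loads(raw.decode("utf-8"))
    return [Document(json.dumps(data, indent=2), {"source": source})]


def load_csv(raw: bytes, source: str) -> List[Document]:
    # One Document per row, with the row index kept in the metadata.
    reader = csv.DictReader(io.StringIO(raw.decode("utf-8")))
    return [
        Document(
            ", ".join(f"{key}: {value}" for key, value in row.items()),
            {"source": source, "row": index},
        )
        for index, row in enumerate(reader)
    ]


# Hypothetical registry: file-type key -> sub-loader.
LOADERS: Dict[str, Callable[[bytes, str], List[Document]]] = {
    "txt": load_text,
    "md": load_text,
    "json": load_json,
    "csv": load_csv,
}


def load_documents(raw: bytes, file_type: str, source: str) -> List[Document]:
    """Pick the sub-loader for file_type and always return a list of Documents."""
    try:
        loader = LOADERS[file_type.lower()]
    except KeyError:
        raise ValueError(f"Unsupported file type: {file_type}") from None
    return loader(raw, source)
```

Because every sub-loader returns the same list-of-Document shape, the downstream summarization step does not need to know which file type it is dealing with.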

Text Splitting for Effective Summarization:

Document text is split into chunks with the RecursiveCharacterTextSplitter.
Chunk size parameters are tuned per document type so that each chunk is a useful unit for summarization.
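
To illustrate the splitting step, here is a simplified stand-in for recursive character splitting. It is not LangChain's RecursiveCharacterTextSplitter: it has no chunk overlap and drops the separator at chunk boundaries. The CHUNK_SIZES values and the split_for_type helper are assumptions, since the issue does not state the tuned values.

```python
from typing import List, Sequence

# Hypothetical per-type chunk sizes in characters (not taken from the repo).
CHUNK_SIZES = {"pdf": 1000, "csv": 300, "md": 500}
DEFAULT_CHUNK_SIZE = 800

# Try coarse separators first, then finer ones, then a hard cut ("").
SEPARATORS = ("\n\n", "\n", " ", "")


def split_recursive(
    text: str, chunk_size: int, separators: Sequence[str] = SEPARATORS
) -> List[str]:
    """Split text into chunks of at most chunk_size characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: cut at fixed character offsets.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks: List[str] = []
    current = ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging pieces while they fit
            continue
        if current:
            chunks.append(current)
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too large: recurse with finer separators.
            chunks.extend(split_recursive(piece, chunk_size, finer))
            current = ""
    if current:
        chunks.append(current)
    return chunks


def split_for_type(text: str, file_type: str) -> List[str]:
    """Split using the chunk size configured for this document type."""
    size = CHUNK_SIZES.get(file_type.lower(), DEFAULT_CHUNK_SIZE)
    return split_recursive(text, size)
```

For example, split_recursive("alpha beta gamma delta", 11) falls through to the space separator and yields two chunks, "alpha beta" and "gamma delta".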

Integration with Map-Reduce Algorithm:

LangChain's load_summarize_chain is used with the Map-Reduce algorithm to summarize multiple documents.
The chain processes both YouTube transcripts and documents from the custom loaders, so both sources feed into the same summary.
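
The Map-Reduce flow can be sketched independently of LangChain as follows. This is an illustration of the general pattern, not the internals of load_summarize_chain. The summarize callable stands in for an LLM call, and map_reduce_summarize and max_chars are hypothetical names. The chunks may come from a transcript or from a custom loader.

```python
from typing import Callable, List

# Stand-in for an LLM call: takes text, returns a shorter summary.
Summarizer = Callable[[str], str]


def map_reduce_summarize(
    chunks: List[str], summarize: Summarizer, max_chars: int = 2000
) -> str:
    """Summarize each chunk (map), then merge the summaries (reduce)."""
    if not chunks:
        return ""

    # Map step: every chunk is summarized independently.
    summaries = [summarize(chunk) for chunk in chunks]

    # Reduce step: combine summaries in groups that fit within max_chars,
    # repeating until a single summary remains.
    while len(summaries) > 1:
        groups: List[List[str]] = []
        current: List[str] = []
        for summary in summaries:
            if current and len("\n".join(current + [summary])) > max_chars:
                groups.append(current)
                current = []
            current.append(summary)
        groups.append(current)

        if len(groups) == len(summaries):
            # No two summaries fit together, so grouping cannot make
            # progress; combine everything in one final call instead.
            return summarize("\n".join(summaries))

        summaries = [summarize("\n".join(group)) for group in groups]

    return summaries[0]
```

The number of reduce rounds grows with the number of chunks, which is why limiting the input set matters for large uploads.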

Enhanced Summarization Process:

To handle large document sets, the summarization chain first selects the K nearest neighbors by semantic relevance, so only a subset of documents is summarized.
Document embeddings are computed with OpenAIEmbeddings, and the documents with the most relevant embeddings are passed on for summarization.
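
A rough sketch of the selection step, under two assumptions the issue does not confirm: that "nearest" means nearest to the centroid of all document embeddings, and that cosine similarity is the relevance measure. The embeddings are passed in as plain lists of floats (in practice they would come from OpenAIEmbeddings), and select_k_nearest is a hypothetical name.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def centroid(vectors: Sequence[Sequence[float]]) -> List[float]:
    """Component-wise mean of the given vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def select_k_nearest(
    texts: Sequence[str], embeddings: Sequence[Sequence[float]], k: int
) -> List[str]:
    """Keep the k texts whose embeddings are closest to the centroid.

    Assumption: closeness to the centroid is used as a proxy for how
    representative a document is of the whole set.
    """
    if len(texts) != len(embeddings):
        raise ValueError("texts and embeddings must have the same length")
    if len(texts) <= k:
        return list(texts)

    center = centroid(embeddings)
    ranked = sorted(
        range(len(texts)),
        key=lambda i: cosine_similarity(embeddings[i], center),
        reverse=True,
    )
    # Restore the original document order for the selected items.
    return [texts[i] for i in sorted(ranked[:k])]
```

Capping the input at k documents bounds the number of map calls regardless of how many documents were uploaded.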

Evaluation of Summary Quality:

Summary quality is evaluated by monitoring two signals: the number of batches created and the length of the resulting summary.
Edge cases such as a very large number of documents are handled by summarizing only the most relevant documents, which keeps the Map-Reduce step from failing.
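
The two monitored signals could be collected along these lines. This is only a sketch: the greedy batching rule, the batch_budget and max_batches thresholds, and the SummaryReport fields are assumptions, not values or names from the repo.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class SummaryReport:
    num_batches: int          # batches needed to cover all chunks
    summary_length: int       # length of the final summary in characters
    compression_ratio: float  # summary length / total input length
    too_many_batches: bool    # True when num_batches exceeds the limit


def count_batches(chunk_lengths: Sequence[int], batch_budget: int) -> int:
    """Greedily pack chunks into batches of at most batch_budget characters.

    A chunk larger than the budget still occupies a batch of its own.
    """
    batches, used = 0, 0
    for length in chunk_lengths:
        if batches == 0 or used + length > batch_budget:
            batches += 1
            used = 0
        used += length
    return batches


def evaluate_summary(
    chunks: List[str],
    summary: str,
    batch_budget: int = 3000,  # hypothetical per-batch character budget
    max_batches: int = 10,     # hypothetical alert threshold
) -> SummaryReport:
    """Report batch count and summary length for one summarization run."""
    lengths = [len(chunk) for chunk in chunks]
    total = sum(lengths)
    num_batches = count_batches(lengths, batch_budget)
    return SummaryReport(
        num_batches=num_batches,
        summary_length=len(summary),
        compression_ratio=len(summary) / total if total else 0.0,
        too_many_batches=num_batches > max_batches,
    )
```

A run flagged with too_many_batches is a candidate for the relevance-based document selection described above.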