llmware-ai / llmware

Unified framework for building enterprise RAG pipelines with small, specialized models
https://llmware-ai.github.io/llmware/
Apache License 2.0
5.78k stars 1.43k forks

Google Docs support #1022

Open: doberst opened 2 weeks ago

doberst commented 2 weeks ago

LLMWare provides extensive built-in parsing for Microsoft document types (PPTX, DOCX, and XLSX), but it does not currently offer a way to parse and ingest Google Docs, Slides, and Sheets, or to connect to Google Drive repositories for storing and accessing documents.

It would be great to have an integrated capability that supports parsing, text chunking, and ingestion of Google document types and repositories. The implementation could take several forms: a de novo parser/text chunker in Python or C/C++ or, more likely, an interface to an existing Google document parser, with the supporting code needed to integrate it seamlessly into LLMWare.
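As a rough illustration of the "interface to an existing parser" option, and not a proposed design: the sketch below assumes the document has already been fetched as JSON in the shape returned by the Google Docs API `documents.get` call (`body.content[].paragraph.elements[].textRun.content`). The helper names `extract_paragraphs` and `chunk_paragraphs` are hypothetical and are not part of LLMWare; authentication, Drive access, tables, Slides, and Sheets are all left out.

```python
from typing import Any, Dict, List


def extract_paragraphs(document: Dict[str, Any]) -> List[str]:
    """Collect paragraph text from a Docs-API-style document dict.

    Hypothetical helper: assumes the JSON shape
    body.content[].paragraph.elements[].textRun.content.
    """
    paragraphs: List[str] = []
    for element in document.get("body", {}).get("content", []):
        paragraph = element.get("paragraph")
        if not paragraph:
            continue  # this sketch skips tables, section breaks, etc.
        # A paragraph may be split into several styled text runs; join them.
        text = "".join(
            run.get("textRun", {}).get("content", "")
            for run in paragraph.get("elements", [])
        ).strip()
        if text:
            paragraphs.append(text)
    return paragraphs


def chunk_paragraphs(paragraphs: List[str], max_chars: int = 400) -> List[str]:
    """Greedily pack whole paragraphs into chunks of about max_chars.

    A single paragraph longer than max_chars is kept whole as its own
    (oversized) chunk rather than being split mid-paragraph.
    """
    chunks: List[str] = []
    current = ""
    for para in paragraphs:
        candidate = f"{current}\n{para}" if current else para
        if not current or len(candidate) <= max_chars:
            current = candidate
        else:
            chunks.append(current)
            current = para
    if current:
        chunks.append(current)
    return chunks
```

The resulting chunks would still need to be handed to LLMWare's ingestion path; how that hookup should look is exactly the open question in this issue.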

EricLiclair commented 1 day ago

@doberst this seems interesting to me. Can you shed some light on what you would suggest for this?

  1. any specific libs that you recommend,
  2. any existing code/class/component/PR in llmware that could be referenced/extended to add support for GDocs.

I'll try to scope out, from my perspective, what changes to add and where, but it might be time-consuming since I'm new to this codebase.

Suggestions for pt. 2 would help speed up the scoping. Pt. 1 will help in better aligning the expected solution.