tmc / langchaingo

LangChain for Go, the easiest way to write LLM-based programs in Go
https://tmc.github.io/langchaingo/
MIT License

documentloaders: ms office docs #1068

Open Struki84 opened 3 days ago

Struki84 commented 3 days ago

I've added a document loader that reads and parses MS Office file types: .doc, .docx, .xls, .xlsx, .ppt, and .pptx. It turned out to be a bit more complicated than I expected, so instead of extracting only the text, I take all of the document data and put it into schema.Document.PageContent.

For Excel files, the metadata identifies the sheets and numbers them. The .docx and .pptx files are just XML, so I didn't extract the text and simply dumped the XML into PageContent. At a later date, maybe someone who understands the file formats better than I do can build a decent document structure into schema.Document{}.
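To make the "dump the XML into PageContent" idea concrete, here is a minimal sketch using only the standard library. It is not the code from this PR. It assumes the usual .docx layout, a ZIP archive whose main body lives in word/document.xml. The `Document` struct is a local stand-in for schema.Document so the example compiles on its own, and `loadDocxXML`, `buildSampleDocx`, and the "part" metadata key are hypothetical names chosen for illustration.

```go
package main

import (
	"archive/zip"
	"bytes"
	"errors"
	"fmt"
	"io"
)

// Document is a local stand-in for langchaingo's schema.Document,
// used here so the sketch has no external dependencies.
type Document struct {
	PageContent string
	Metadata    map[string]any
}

// loadDocxXML (hypothetical helper) opens an in-memory .docx archive and
// returns the raw, unparsed XML of word/document.xml as one Document.
func loadDocxXML(data []byte) ([]Document, error) {
	zr, err := zip.NewReader(bytes.NewReader(data), int64(len(data)))
	if err != nil {
		return nil, fmt.Errorf("open docx archive: %w", err)
	}
	for _, f := range zr.File {
		if f.Name != "word/document.xml" {
			continue
		}
		rc, err := f.Open()
		if err != nil {
			return nil, err
		}
		raw, err := io.ReadAll(rc)
		rc.Close()
		if err != nil {
			return nil, err
		}
		return []Document{{
			PageContent: string(raw),
			// "part" is an illustrative metadata key, not one defined by the PR.
			Metadata: map[string]any{"part": f.Name},
		}}, nil
	}
	return nil, errors.New("word/document.xml not found in archive")
}

// buildSampleDocx writes a tiny archive containing only word/document.xml.
// It is not a valid Word file, just enough to exercise loadDocxXML.
func buildSampleDocx(xmlBody string) ([]byte, error) {
	var buf bytes.Buffer
	zw := zip.NewWriter(&buf)
	w, err := zw.Create("word/document.xml")
	if err != nil {
		return nil, err
	}
	if _, err := w.Write([]byte(xmlBody)); err != nil {
		return nil, err
	}
	if err := zw.Close(); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

func main() {
	data, err := buildSampleDocx("<w:document/>")
	if err != nil {
		panic(err)
	}
	docs, err := loadDocxXML(data)
	if err != nil {
		panic(err)
	}
	fmt.Println(docs[0].Metadata["part"], len(docs[0].PageContent))
}
```

A .pptx could be handled the same way by collecting the slide parts instead of a single file, with one Document per slide.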

At the same time, I think there are some advantages to the LLM having access to the entire document structure, not just the text strings.
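For comparison, a text-only extraction might look roughly like the sketch below. It is not part of this PR. It assumes WordprocessingML, where visible text sits in `w:t` elements and paragraphs are `w:p` elements. `extractText` and `sampleXML` are hypothetical names. Everything other than the text runs (styles, tables, run boundaries) is discarded, which is the information the raw-XML approach keeps.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"io"
	"strings"
)

// wordNS is the WordprocessingML main namespace bound to the "w" prefix.
const wordNS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

// sampleXML is a hand-written fragment for illustration, not a real document.
const sampleXML = `<w:document xmlns:w="` + wordNS + `"><w:body>` +
	`<w:p><w:r><w:t>Hello</w:t></w:r><w:r><w:t> world</w:t></w:r></w:p>` +
	`</w:body></w:document>`

// extractText (hypothetical helper) walks the XML token stream, keeps the
// character data inside w:t elements, and ends each w:p with a newline.
func extractText(doc string) (string, error) {
	dec := xml.NewDecoder(strings.NewReader(doc))
	var out strings.Builder
	inText := false
	for {
		tok, err := dec.Token()
		if err == io.EOF {
			return out.String(), nil
		}
		if err != nil {
			return "", err
		}
		switch t := tok.(type) {
		case xml.StartElement:
			if t.Name.Space == wordNS && t.Name.Local == "t" {
				inText = true
			}
		case xml.CharData:
			if inText {
				out.Write(t)
			}
		case xml.EndElement:
			if t.Name.Space != wordNS {
				continue
			}
			switch t.Name.Local {
			case "t":
				inText = false
			case "p":
				out.WriteByte('\n')
			}
		}
	}
}

func main() {
	text, err := extractText(sampleXML)
	if err != nil {
		panic(err)
	}
	fmt.Print(text)
}
```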

PR Checklist