langchain-ai / langchainjs

🦜🔗 Build context-aware reasoning applications 🦜🔗
https://js.langchain.com/docs/
MIT License

Add document loader for Powerpoint #2979

Closed YuanqianFang closed 11 months ago

YuanqianFang commented 1 year ago

Feature Request We would like to add a PowerPoint document loader to the JavaScript version of LangChain to align it with the Python version.

Motivation The Python version already supports this feature, but the JavaScript version lacks it. Adding it will bring the JavaScript version closer to feature parity with the Python version.

Contribution Our team plans to submit a PR around the end of November and would greatly appreciate any suggestions on implementing this feature.

If no one is currently working on this, would it be possible for my team to take on this feature (issue)? Many thanks!

dosubot[bot] commented 1 year ago

🤖

Thank you for your feature request and your willingness to contribute to langchainjs! Adding a PowerPoint document loader to the JavaScript version sounds like a valuable addition to align it with the Python version.

Your proposed solution of submitting a pull request around the end of November is great! The maintainers of langchainjs will review your contribution and provide feedback if needed. Your efforts will be greatly appreciated.

If you have any questions or need any guidance during the implementation process, feel free to ask. Good luck with your contribution!


This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.

chrisyo-22 commented 1 year ago

Here is a breakdown of our intended solution:

1. PowerPoint Loader

We aim to support both local file systems and web environments, accepting PowerPoint presentations in .ppt and .pptx formats. To accomplish this, we will focus on two primary components. The first is the PowerPoint Loader, which is responsible for loading PowerPoint data into documents that can be passed to an LLM. We intend to create two new files, located at:

langchain/src/document_loaders/web/powerpoint.ts
langchain/src/document_loaders/fs/powerpoint.ts

The PowerPoint loader will have the following structure:

class PowerPointLoader extends BufferLoader {
    constructor(filePathOrBlob, function that loads the PowerPoint parser, whether to split by page) {
        call the BufferLoader constructor with filePathOrBlob
        store the remaining arguments as instance fields
    }
    load() {
        read the buffer contents and metadata based on the type of filePathOrBlob
        call the parse() method to parse the buffer
        return the documents
    }
    loadAndSplit(splitter) {
        load the documents and split them using the specified text splitter
    }
    imports() {
        import the readFile function from the fs/promises module
        on a failed import, throw an error indicating that the fs/promises module is not available in the current environment
    }

    parse(the buffer to be parsed, the metadata of the document) {
        load the PowerPoint from the buffer
        retrieve the text content and join the text items to form the page content
        create document instances from the extracted text content and metadata
        return a promise that resolves to an array of document instances, or to a single concatenated instance
    }
}
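
To make the outline above more concrete, here is a minimal TypeScript sketch of the same shape. It is not the code from the eventual PR. SlideDocument is a stand-in for LangChain's Document type, and SlideTextExtractor, slidesToDocuments, PowerPointLoaderSketch and the metadata keys (source, slide, totalSlides) are hypothetical names chosen for illustration. A real loader would extend BufferLoader rather than stand alone.

```typescript
// Stand-in for LangChain's Document type (assumption: only these fields matter here).
interface SlideDocument {
  pageContent: string;
  metadata: Record<string, unknown>;
}

// Hypothetical parser contract: raw file contents in, one string per slide out.
type SlideTextExtractor = (file: Blob) => Promise<string[]>;

// Build documents from per-slide text: one per slide, or a single joined document.
function slidesToDocuments(
  slides: string[],
  metadata: Record<string, unknown>,
  splitPages: boolean
): SlideDocument[] {
  const base = { ...metadata, totalSlides: slides.length };
  if (!splitPages) {
    return [{ pageContent: slides.join("\n\n"), metadata: base }];
  }
  return slides.map((text, index) => ({
    pageContent: text,
    metadata: { ...base, slide: index + 1 },
  }));
}

class PowerPointLoaderSketch {
  private readonly filePathOrBlob: string | Blob;
  private readonly extractSlides: SlideTextExtractor;
  private readonly splitPages: boolean;

  constructor(
    filePathOrBlob: string | Blob,
    extractSlides: SlideTextExtractor,
    splitPages = true
  ) {
    this.filePathOrBlob = filePathOrBlob;
    this.extractSlides = extractSlides;
    this.splitPages = splitPages;
  }

  // Import fs/promises lazily so the class can still be defined in a browser.
  static async imports() {
    try {
      return await import("node:fs/promises");
    } catch {
      throw new Error("fs/promises is not available in this environment.");
    }
  }

  // Read the source, then delegate text extraction to the injected parser.
  async load(): Promise<SlideDocument[]> {
    let file: Blob;
    let metadata: Record<string, unknown>;
    if (typeof this.filePathOrBlob === "string") {
      const { readFile } = await PowerPointLoaderSketch.imports();
      file = new Blob([await readFile(this.filePathOrBlob)]);
      metadata = { source: this.filePathOrBlob };
    } else {
      file = this.filePathOrBlob;
      metadata = { source: "blob", blobType: file.type };
    }
    const slides = await this.extractSlides(file);
    return slidesToDocuments(slides, metadata, this.splitPages);
  }
}
```

Injecting the extractor keeps the loader independent of whichever parsing backend is chosen, so the file system and web variants could share the same document-building logic.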

2. PowerPoint Parser

The second critical component we intend to develop is the PowerPoint parser, which is responsible for processing PowerPoint data. We plan to create a new file located at:

langchain/src/types/powerpoint-parse.ts 

In this parser, we aim to use the open-source tool "Unstructured" to extract raw data from PowerPoint files. While "Unstructured" is a Python package and not directly usable in a JavaScript/TypeScript environment, it provides an API endpoint that our team can run locally and potentially integrate into our solution. Once we obtain the raw data, we can process it further and pass it to the model.
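
As a rough illustration of the parser idea, the sketch below posts a file to a locally running Unstructured API and groups the returned elements by slide. Several things here are assumptions rather than part of our plan: the local URL, the multipart field name "files", the subset of response fields modelled by UnstructuredElement (in particular that metadata.page_number carries the slide number), and the helper names groupTextBySlide and extractSlides.

```typescript
// Assumed subset of one element in the Unstructured API's JSON response.
interface UnstructuredElement {
  type: string;
  text: string;
  metadata: { page_number?: number };
}

// Group element text by slide number, keeping element order within each slide.
// Elements without a page number are treated as belonging to slide 1.
function groupTextBySlide(elements: UnstructuredElement[]): string[] {
  const bySlide = new Map<number, string[]>();
  for (const element of elements) {
    const slide = element.metadata.page_number ?? 1;
    const texts = bySlide.get(slide) ?? [];
    texts.push(element.text);
    bySlide.set(slide, texts);
  }
  return Array.from(bySlide.entries())
    .sort(([a], [b]) => a - b)
    .map(([, texts]) => texts.join("\n"));
}

// Assumed URL of an Unstructured API instance running on the local machine.
const LOCAL_UNSTRUCTURED_URL = "http://localhost:8000/general/v0/general";

// Send the file to the endpoint and return one string of text per slide.
async function extractSlides(
  file: Blob,
  fileName: string,
  apiUrl: string = LOCAL_UNSTRUCTURED_URL
): Promise<string[]> {
  const form = new FormData();
  form.append("files", file, fileName);
  const response = await fetch(apiUrl, { method: "POST", body: form });
  if (!response.ok) {
    throw new Error(`Unstructured request failed with status ${response.status}`);
  }
  const elements = (await response.json()) as UnstructuredElement[];
  return groupTextBySlide(elements);
}
```

Keeping the grouping step separate from the network call means it can be unit tested with canned responses, without a running Unstructured instance.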

3. Add Tests to PowerPoint Loader

We would also like to add a test file for this code at:

langchain/src/document_loaders/test/powerpoint.test.ts

Some pseudocode for the tests:

import ...
import { FsPowerpointLoader } from "../fs/powerpoint.ts"
import { WebPowerpointLoader } from "../web/powerpoint.ts"

test("Test Powerpoint", async () => {
     ...
     expect(...);
});
...

jacoblee93 commented 1 year ago

Would welcome this!

chrisyo-22 commented 11 months ago

We have now opened our pull request and made significant progress on the PowerPoint loader for langchainjs, which focuses on processing the .pptx format. After thorough analysis, we have decided to prioritize the file system loader over the web loader. This decision stems from the complexities and technical limitations of handling online slide formats like Google Slides, which differ significantly from standard .pptx files. We acknowledge the value of a web loader and will revisit it in the future.

I'd also like to express my gratitude to everyone who contributed to this issue and offered suggestions! This has been an incredible learning experience for both me and my team. 😃