langchain-ai / langchainjs

🦜🔗 Build context-aware reasoning applications 🦜🔗
https://js.langchain.com/docs/
MIT License
12.41k stars 2.1k forks

Issue with PPTX Loader #4000

Closed itsthemoon closed 5 months ago

itsthemoon commented 8 months ago

I am trying to use the PPTX loader (JS version) inside an AWS Lambda function that I've created. I am getting the following errors:

2023-12-16T15:31:51.084Z 1e4c8769-cbe8-44d2-a3f9-100dc59cb705 ERROR [OfficeParser]: Error: ENOENT: no such file or directory, mkdir 'officeParserTemp/tempfiles'

2023-12-16T15:31:51.084Z 1e4c8769-cbe8-44d2-a3f9-100dc59cb705 ERROR Error processing file: [OfficeParser]: Error: ENOENT: no such file or directory, mkdir 'officeParserTemp/tempfiles'

This is because in Lambda you can only write to the /tmp folder, which works fine for every other loader I am using (docx, txt, pdf). However, officeparser, which the PPTX loader uses, tries to create its temp directory at a relative path (officeParserTemp/tempfiles), which resolves to a location I don't have write access to. Does anyone know a solution to this?
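To confirm which directories the function can actually write to, a quick check along these lines may help. This is a minimal sketch using only Node's built-in modules; `isWritable` is a hypothetical helper name, not part of LangChain or officeparser.

```typescript
import { accessSync, constants } from "node:fs";
import { tmpdir } from "node:os";

// Hypothetical helper: returns true if the current process may write to `dir`.
function isWritable(dir: string): boolean {
  try {
    accessSync(dir, constants.W_OK);
    return true;
  } catch {
    return false;
  }
}

// Relative paths such as "officeParserTemp/tempfiles" resolve against
// process.cwd(), so that is the directory that needs to be writable.
console.log("cwd:", process.cwd(), "writable:", isWritable(process.cwd()));
console.log("tmp:", tmpdir(), "writable:", isWritable(tmpdir()));
```

If the working directory reports as not writable while the temp directory does, that would be consistent with the mkdir error above.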

dosubot[bot] commented 8 months ago

🤖

Hey there @itsthemoon! 👋

I'm Dosu, your friendly neighborhood bot, skilled at helping with bugs, answering questions, and guiding you through the maze of contribution. I'm here to lend you a digital hand while we wait for a human maintainer to join us. Ready to dive into the code together? 🏊‍♂️💻

Let's rock this! 🎸

Based on the information provided, it seems like the parseOfficeAsync function from the officeparser module is trying to create temporary files in a directory that doesn't exist in the AWS Lambda environment. Unfortunately, the officeparser module doesn't provide an option to specify the directory where temporary files are created.

However, there might be a workaround for this issue. You could try to modify the PPTXLoader class to change the current working directory to /tmp before calling the parseOfficeAsync function, and then change it back after the function call. Here's an example of how you could do this:

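import { parseOfficeAsync } from "officeparser";
import { chdir, cwd } from "process";
// Import paths below assume the langchain package entrypoints;
// adjust them to match your installed version.
import { Document } from "langchain/document";
import { BufferLoader } from "langchain/document_loaders/fs/buffer";

export class PPTXLoader extends BufferLoader {
  public async parse(
    raw: Buffer,
    metadata: Document["metadata"]
  ): Promise<Document[]> {
    // officeparser creates "officeParserTemp/tempfiles" relative to the
    // current working directory, so switch to the writable /tmp first.
    const originalCwd = cwd();
    chdir("/tmp");

    try {
      const pptx = await parseOfficeAsync(raw, { outputErrorToConsole: true });

      if (!pptx) return [];

      return [
        new Document({
          pageContent: pptx,
          metadata,
        }),
      ];
    } finally {
      // Restore the working directory even if parsing throws.
      chdir(originalCwd);
    }
  }
}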

Please note that this is just a workaround and might not work in all cases. It's also important to handle any errors that might occur when changing the current working directory. If this doesn't solve your issue, please provide more details about your AWS Lambda setup and how you're using the PPTXLoader class.
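The error handling mentioned above could be sketched as a small standalone helper, which also avoids editing the loader class itself. `withCwd` is a hypothetical name introduced here for illustration; it relies only on Node's built-in `process.cwd()` and `process.chdir()`.

```typescript
// Hypothetical helper: run `fn` with the working directory set to `dir`,
// then restore the previous directory whether `fn` resolves or throws.
async function withCwd<T>(dir: string, fn: () => Promise<T>): Promise<T> {
  const originalCwd = process.cwd();
  process.chdir(dir);
  try {
    return await fn();
  } finally {
    // Runs on both success and failure, so the cwd is never left changed.
    process.chdir(originalCwd);
  }
}
```

For instance, `await withCwd("/tmp", () => loader.load())` would wrap an existing loader's `load()` call. Keep in mind that the working directory is process-wide, so any other asynchronous work running at the same time sees the changed directory too.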

I hope this helps! If you have any other questions, feel free to ask.

Best regards, Dosu

This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.

Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.

If you want to continue the conversation, start your reply with @dosu-bot.

itsthemoon commented 8 months ago

@dosu-bot I don't think I can edit the PPTX loader because it is part of the langchain library that I am using.

dosubot[bot] commented 8 months ago

🤖

Based on the information provided, it appears that the PPTXLoader in the langchainjs codebase does not have a method or function that allows for the modification of the directory where temporary files are written. This is confirmed by the source code in langchain/src/document_loaders/fs/pptx.ts.

However, there might be a workaround. You could set the TMPDIR environment variable in your AWS Lambda function to point to the /tmp directory. On POSIX systems, the Node.js os.tmpdir() method returns the directory named by TMPDIR, so libraries that call os.tmpdir() write their temporary files there. Note that this only helps if officeparser resolves its temporary directory through os.tmpdir(); the relative path in the error message (officeParserTemp/tempfiles) suggests it may not.

Here's an example of how you could set this environment variable in your Lambda function:

process.env['TMPDIR'] = '/tmp';

Please note that this is a workaround and might not work in all cases. It would be best if the library provided a way to configure the temporary directory, but as per the current state of the codebase, it does not seem to be possible.
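To see what setting TMPDIR does and does not affect, a short check like the following may be useful. It is a sketch using only Node's built-in `os` and `path` modules; the constant names are illustrative, and the relative path is the one from the error message.

```typescript
import { tmpdir } from "node:os";
import { resolve } from "node:path";

// On POSIX systems os.tmpdir() consults TMPDIR each time it is called
// (Windows uses TEMP/TMP instead).
process.env.TMPDIR = "/tmp";
const systemTmp = tmpdir();

// A relative path is resolved against process.cwd(), not against TMPDIR.
const relativeTarget = resolve("officeParserTemp/tempfiles");

console.log("os.tmpdir():", systemTmp);
console.log("relative path resolves to:", relativeTarget);
```

If the second line points somewhere read-only, changing TMPDIR alone would not be expected to fix the mkdir error.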

If this workaround does not solve your issue, please provide more details about your implementation and I'll do my best to assist you further.

junonya commented 5 months ago

@itsthemoon I had no choice but to fork officeParser to solve the problem. Here is my code: https://github.com/junonya/officeParser/blob/6d37667ec1de7b41323334d977e5a4e3d17dbe31/officeParser.js#L22