run-llama / LlamaIndexTS

Data framework for your LLM applications. Focus on server side solution
https://ts.llamaindex.ai
MIT License

Llamaparse error when parsing docx file #1363

Open jsmusgrave opened 4 days ago

jsmusgrave commented 4 days ago

Llamaparse parsing for docx doesn't work in 0.7.3. This works via the web UI which appears to use the public API. I had hoped 1340 would address this but it has not.

Demonstration code. (Change the file name and the api key env.)

import { LlamaParseReader } from "llamaindex";
import { ParserLanguages } from "@llamaindex/cloud/api/dist";
import fs from "fs";

type LlamaParseReaderParams = Partial<Omit<LlamaParseReader, "language" | "apiKey">>  & {
    language?: ParserLanguages | ParserLanguages[] | undefined;
    apiKey?: string | undefined;
}

async function main() {
    const path = "/tmp/somedoc.docx"
    if (!fs.existsSync(path)) {
        console.error(`File ${path} does not exist`);
        process.exit(1);
    } else {
        console.log(`File ${path} exists`);
    }

    const apiKey = process.env.LLAMAINDEX_KEY;
    const params : LlamaParseReaderParams = { 
        verbose: true,
        parsingInstruction: "Extract the text from the document along with any images and tables.  This is a document for a course and the contents of the images are important.",
        fastMode: false,
        gpt4oMode: true,
        useVendorMultimodalModel: true,
        vendorMultimodalModelName: "anthropic-sonnet-3.5",
        // vendorMultimodalApiKey?: string | undefined;
        premiumMode: true,
        resultType: "markdown", 
        apiKey: apiKey,
        doNotCache: true,
    };

    // set up the llamaparse reader
    const reader = new LlamaParseReader(params);

    const buffer = fs.readFileSync(path);
    const documents = await reader.loadDataAsContent(new Uint8Array(buffer));

    let allText = "";
    documents.forEach(doc => {
        allText += doc.text;
    });

    console.log(allText);
}

main().catch(console.error);

himself65 commented 4 days ago

Checking this

himself65 commented 4 days ago

Fixed in https://github.com/run-llama/LlamaIndexTS/pull/1364. As a workaround, pass the filename as the second parameter to loadDataAsContent so that LlamaParse knows the correct file type. For now, I don't want to detect the file type on the llamaindex-ts side, for performance and bundler considerations. You can use loadData(filePath) for the best compatibility.
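
A minimal sketch of that workaround for content that arrives as a buffer, for example an object fetched from a bucket. The `ContentReader` interface, `filenameFromKey`, and `parseBucketObject` are hypothetical names introduced for illustration. The interface is an assumption modelled on the two-argument `loadDataAsContent` call described in this thread, not the library's published typing.

```typescript
// Minimal shape of the reader used here. This interface is an assumption
// modelled on the call shown in this thread, not the library's own typing.
interface ContentReader {
  loadDataAsContent(
    content: Uint8Array,
    filename?: string,
  ): Promise<{ text: string }[]>;
}

// Hypothetical helper: take the last path segment of a bucket object key.
function filenameFromKey(objectKey: string): string {
  const segments = objectKey.split("/").filter((s) => s.length > 0);
  return segments.length > 0 ? segments[segments.length - 1] : objectKey;
}

// Hypothetical helper: parse bytes fetched from a bucket, passing the
// object's filename along so the file type can be inferred from its extension.
async function parseBucketObject(
  reader: ContentReader,
  bytes: Uint8Array,
  objectKey: string,
): Promise<string> {
  const documents = await reader.loadDataAsContent(bytes, filenameFromKey(objectKey));
  return documents.map((doc) => doc.text).join("");
}
```

If the reader's real signature matches the assumed interface, a `LlamaParseReader` instance could be passed as `reader` directly. The object key usually already carries the original extension, which avoids inspecting the bytes at all.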

jsmusgrave commented 3 days ago

My use case is loading data from a bucket, so I have a buffer (unlike my simplified example above), which is why I'm using loadDataAsContent.

Ex:

const documents = await reader.loadDataAsContent(new Uint8Array(buffer));

I don't believe there's a way to pass the filename to this call or to the LlamaParseReader params.

himself65 commented 3 days ago

I think I have to leave this ticket to the LlamaParse side; I can't do anything more here.

himself65 commented 3 days ago

/cc @hexapode

himself65 commented 3 days ago

I think I should close this. I double-checked on StackBlitz that it should be working now.

LlamaParse has had an internal upgrade that fixes this.

himself65 commented 3 days ago

Please try this. If you have any more issues, please let me know.

https://stackblitz.com/edit/stackblitz-starters-k137wi?file=index.js

jsmusgrave commented 3 days ago

Awesome. Thank you! With default config it works.

The Multi-modal version fails:

Got Error Code: ERROR_DURING_PROCESSING and Error Message: An unknown error occurred during processing. Job id: fee42930-5832-45d1-a9a5-4e0ba126cf9d

himself65 commented 3 days ago

Could you give me the parameters and maybe some sample data?

jsmusgrave commented 2 days ago

import { LlamaParseReader } from "llamaindex";
import fs from "fs";
import { ParserLanguages } from "@llamaindex/cloud/api/dist";

export type LlamaParseReaderParams = Partial<Omit<LlamaParseReader, "language" | "apiKey">>  & {
    language?: ParserLanguages | ParserLanguages[] | undefined;
    apiKey?: string | undefined;
}

async function main() {
    const path = "/tmp/sample.docx";

    if (!fs.existsSync(path)) {
        console.error(`File ${path} does not exist`);
        process.exit(1);
    } else {
        console.log(`File ${path} exists`);
    }

    const apiKey = process.env.LLAMAINDEX_KEY;
    const vendorMultimodalApiKey = process.env.LI_ANTHROPIC_KEY;
    const params : LlamaParseReaderParams = { 
        verbose: true,
        parsingInstruction: "Extract the text from the document along with any details of images and tables.  This is a document for a course and a very detailed description of the contents of the images is important.",
        fastMode: false,
        gpt4oMode: false,
        useVendorMultimodalModel: true,
        vendorMultimodalModelName: "anthropic-sonnet-3.5",
        vendorMultimodalApiKey: vendorMultimodalApiKey,
        premiumMode: true,
        resultType: "markdown", 
        apiKey: apiKey,
        doNotCache: true,
    };

    // set up the llamaparse reader
    const reader = new LlamaParseReader(params);

    const buffer = fs.readFileSync(path);
    const documents = await reader.loadDataAsContent(new Uint8Array(buffer));

    let allText = "";
    documents.forEach(doc => {
        allText += doc.text;
    });

    console.log(allText);
}

// Log any failure from main(); a .then() chained after .catch() would run on every exit.
main().catch((e) => {
    console.error("error", e);
});

Using this file: https://ieeeaccess.ieee.org/wp-content/uploads/2022/01/Access-Template.docx

himself65 commented 1 day ago

I think this is the same issue: the docx is being parsed as a PDF.

himself65 commented 1 day ago

For now, there's a workaround:

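// A .docx file is a ZIP archive, so it starts with the ZIP signature "PK\x03\x04".
// Other ZIP-based formats (.xlsx, .pptx) share this signature.
const magic = [80, 75, 3, 4];
const isZip = magic.every((byte, i) => buffer[i] === byte);
// Pass a .docx filename as the second argument so LlamaParse picks the right file type.
const documents = isZip
  ? await reader.loadDataAsContent(new Uint8Array(buffer), "filename.docx")
  : await reader.loadDataAsContent(new Uint8Array(buffer));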
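
The magic-byte check in the workaround above can be generalized into a small helper that picks a placeholder filename when the original name is not available. This is a rough sketch under stated assumptions: `SIGNATURES` and `guessFilename` are hypothetical names, and mapping the ZIP signature to `.docx` is an assumption, since `.xlsx`, `.pptx`, and plain `.zip` files start with the same bytes.

```typescript
// Known leading byte signatures. ZIP covers .docx but also .xlsx, .pptx and
// plain .zip files, so mapping it to "docx" is an assumption for this use case.
const SIGNATURES: { bytes: number[]; extension: string }[] = [
  { bytes: [0x50, 0x4b, 0x03, 0x04], extension: "docx" }, // "PK\x03\x04" (ZIP)
  { bytes: [0x25, 0x50, 0x44, 0x46], extension: "pdf" }, // "%PDF"
];

// Hypothetical helper: returns a placeholder filename such as "upload.docx",
// or undefined when the buffer matches none of the known signatures.
function guessFilename(buffer: Uint8Array, baseName = "upload"): string | undefined {
  for (const { bytes, extension } of SIGNATURES) {
    if (buffer.length >= bytes.length && bytes.every((b, i) => buffer[i] === b)) {
      return `${baseName}.${extension}`;
    }
  }
  return undefined;
}
```

The result can then be passed as the second argument to `loadDataAsContent` when it is defined, falling back to the single-argument call otherwise. Where the real filename or object key is known, using it is more reliable than guessing from the bytes.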