Open jsmusgrave opened 4 days ago
Checking this
Fixed in https://github.com/run-llama/LlamaIndexTS/pull/1364. As a workaround, you have to pass the filename
as the second parameter so that LlamaParse knows the correct file type.
For now, I don't want to detect the file type on the llamaindex-ts side, for performance/bundler reasons. You can use loadData(filePath)
for the best compatibility.
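A minimal sketch of that workaround for the case where the bytes do not come from a local file path. The `ContentReader` interface only describes the part of the reader this sketch relies on (the two-argument `loadDataAsContent` form mentioned above). `filenameFromKey` and `loadFromObject` are hypothetical helper names used for illustration and are not part of llamaindex.

```typescript
// Assumed reader shape: loadDataAsContent takes the raw bytes plus an
// optional filename that serves as a file type hint.
interface ContentReader<T> {
  loadDataAsContent(content: Uint8Array, filename?: string): Promise<T[]>;
}

// Hypothetical helper: use the last path segment of a storage key as the
// filename, e.g. "courses/week1/sample.docx" -> "sample.docx".
function filenameFromKey(key: string): string {
  const segments = key.split("/").filter((segment) => segment.length > 0);
  return segments.pop() ?? key;
}

// Hypothetical helper: forward the bytes together with the derived filename.
async function loadFromObject<T>(
  reader: ContentReader<T>,
  key: string,
  body: Uint8Array,
): Promise<T[]> {
  return reader.loadDataAsContent(body, filenameFromKey(key));
}
```

With a real reader this would be called as `await loadFromObject(reader, objectKey, bytes)`, where `objectKey` is whatever name the object has in storage.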
My use case is loading data from a bucket, so I have a buffer (unlike my simplified example above). So I'm using loadDataAsContent.
Ex:
const documents = await reader.loadDataAsContent(new Uint8Array(buffer));
I don't believe there's a way to pass the filename to this call or to the LlamaParseReader params.
I'll have to leave this ticket to the LlamaParse side. I can't do anything more here.
/cc @hexapode
I think I should close this. I double-checked on StackBlitz that it should now be working.
LlamaParse had some internal upgrade to fix this.
Please try this. If you have any more issues, please let me know.
https://stackblitz.com/edit/stackblitz-starters-k137wi?file=index.js
Awesome. Thank you! With default config it works.
The Multi-modal version fails:
Got Error Code: ERROR_DURING_PROCESSING and Error Message: An unknown error occurred during processing. Job id: fee42930-5832-45d1-a9a5-4e0ba126cf9d
Could you give me the parameters and maybe some sample data?
import { LlamaParseReader } from "llamaindex";
import fs from "fs";
import { ParserLanguages } from "@llamaindex/cloud/api/dist";
export type LlamaParseReaderParams = Partial<Omit<LlamaParseReader, "language" | "apiKey">> & {
  language?: ParserLanguages | ParserLanguages[] | undefined;
  apiKey?: string | undefined;
};
async function main() {
  const path = "/tmp/sample.docx";
  if (!fs.existsSync(path)) {
    console.error(`File ${path} does not exist`);
    process.exit(1);
  } else {
    console.log(`File ${path} exists`);
  }
  const apiKey = process.env.LLAMAINDEX_KEY;
  const vendorMultimodalApiKey = process.env.LI_ANTHROPIC_KEY;
  const params: LlamaParseReaderParams = {
    verbose: true,
    parsingInstruction: "Extract the text from the document along with any details of images and tables. This is a document for a course and a very detailed description of the contents of the images is important.",
    fastMode: false,
    gpt4oMode: false,
    useVendorMultimodalModel: true,
    vendorMultimodalModelName: "anthropic-sonnet-3.5",
    vendorMultimodalApiKey: vendorMultimodalApiKey,
    premiumMode: true,
    resultType: "markdown",
    apiKey: apiKey,
    doNotCache: true,
  };
  // set up the llamaparse reader
  const reader = new LlamaParseReader(params);
  const buffer = fs.readFileSync(path);
  const documents = await reader.loadDataAsContent(new Uint8Array(buffer));
  let allText = "";
  documents.forEach(doc => {
    allText += doc.text;
  });
  console.log(allText);
}
// Log only when main() actually rejects.
main().catch((e) => {
  console.error("error", e);
});
Using this file: https://ieeeaccess.ieee.org/wp-content/uploads/2022/01/Access-Template.docx
I think this is the same issue: the docx is being parsed as a PDF.
For now, there's a workaround:
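const magic = [80, 75, 3, 4]; // "PK\x03\x04", the ZIP signature that docx files start with
let documents;
if (buffer[0] === magic[0] && buffer[1] === magic[1] && buffer[2] === magic[2] && buffer[3] === magic[3]) {
  // Pass a .docx filename so LlamaParse knows the file type.
  documents = await reader.loadDataAsContent(new Uint8Array(buffer), 'filename.docx');
} else {
  // Not a ZIP container: fall back to the default call.
  documents = await reader.loadDataAsContent(new Uint8Array(buffer));
}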
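The same magic-byte check can be pulled into a small helper that returns a filename hint. This is only a sketch: `startsWith` and `guessFilename` are hypothetical names, the `document.*` filenames are placeholders, and mapping every ZIP container to `.docx` is an assumption that holds only if the caller already knows the buffers are Word documents (xlsx, pptx, and plain zip files start with the same four bytes).

```typescript
// Leading bytes of the two formats discussed in this thread.
const ZIP_MAGIC = [0x50, 0x4b, 0x03, 0x04]; // "PK\x03\x04": docx, xlsx, pptx, zip
const PDF_MAGIC = [0x25, 0x50, 0x44, 0x46]; // "%PDF"

// True when `bytes` begins with every value in `magic`.
function startsWith(bytes: Uint8Array, magic: number[]): boolean {
  if (bytes.length < magic.length) return false;
  return magic.every((value, index) => bytes[index] === value);
}

// Hypothetical helper: returns a filename hint, or undefined when the
// leading bytes match neither signature.
// Assumption: any ZIP container is treated as a .docx file.
function guessFilename(bytes: Uint8Array): string | undefined {
  if (startsWith(bytes, ZIP_MAGIC)) return "document.docx";
  if (startsWith(bytes, PDF_MAGIC)) return "document.pdf";
  return undefined;
}
```

The result can then be passed as the second argument, as in `reader.loadDataAsContent(bytes, guessFilename(bytes))`.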
LlamaParse parsing for docx doesn't work in 0.7.3. It does work via the web UI, which appears to use the public API. I had hoped 1340 would address this, but it has not.
Demonstration code. (Change the file name and the api key env.)