run-llama / llama_parse

Parse files for optimal RAG
https://www.llamaindex.ai
MIT License
3.19k stars, 312 forks

Error parsing HTML #398

Closed: SuryaThiru closed this issue 2 months ago

SuryaThiru commented 2 months ago

Describe the bug

I'm working with HTML documents and am currently running into errors when trying to load them. When I try to load a simple HTML file (attached below) with this code, I get errors.

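from llama_parse import LlamaParse

# API_TOKEN is assumed to hold a LlamaCloud API key; its definition is not shown in this report
parser = LlamaParse(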
    api_key=API_TOKEN,  # can also be set in your env as LLAMA_CLOUD_API_KEY
    result_type="text",  # "markdown" and "text" are available
    num_workers=6,  # if multiple files passed, split in `num_workers` API calls
    verbose=True,
    language="en",  # Optionally you can define a language, default=en
    skip_diagonal_text=True
)

parsed_docs = parser.load_data('llamaparse/test.html')
print(parsed_docs)

Files

HTML upload wasn't allowed so I'm uploading a zip

test.html.zip

Content:

<html>
    <head>
        <title>Test</title>
    </head>
    <body>
        <div>
            <div>dummy content</div>
            <h1>Hello</h1>
            <p>World</p>
            <h2>List</h2>
            <ul>
                <li>Item 1</li>
                <li>Item 2</li>
                <li>Item 3</li>
            </ul>
        </div>
    </body>
</html>
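
As a local sanity check that does not involve LlamaParse at all, the text nodes of the file above can be listed with Python's standard-library `html.parser`. This is only a sketch for confirming that the input is well-formed and seeing roughly what text a `result_type="text"` parse might be expected to contain; the `TextExtractor` class and `extract_text` function are hypothetical names introduced here, and the output is not claimed to match what the service returns.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects the non-whitespace text nodes of an HTML document."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Skip the indentation and newlines between tags.
        text = data.strip()
        if text:
            self.chunks.append(text)


def extract_text(html: str) -> list[str]:
    """Return the document's text nodes in document order."""
    extractor = TextExtractor()
    extractor.feed(html)
    extractor.close()
    return extractor.chunks


# Same content as the attached test.html.
SAMPLE_HTML = """<html>
    <head>
        <title>Test</title>
    </head>
    <body>
        <div>
            <div>dummy content</div>
            <h1>Hello</h1>
            <p>World</p>
            <h2>List</h2>
            <ul>
                <li>Item 1</li>
                <li>Item 2</li>
                <li>Item 3</li>
            </ul>
        </div>
    </body>
</html>"""

if __name__ == "__main__":
    print(extract_text(SAMPLE_HTML))
```

If this runs cleanly on the file, the markup itself is unlikely to be the cause of the failed job.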

Job ID

72934284-787f-4acb-8e24-17355c80beff

Screenshots

Started parsing the file under job_id 72934284-787f-4acb-8e24-17355c80beff
Error while parsing the file 'llamaparse/test.html': Failed to parse the file: 72934284-787f-4acb-8e24-17355c80beff, status: ERROR
[]
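
The output above shows the error being logged while `load_data` still returns an empty list, so a script that only checks for exceptions would continue silently. A minimal sketch of a guard that fails loudly in that case follows; `require_documents` is a hypothetical helper introduced here, not part of the library.

```python
def require_documents(docs, source):
    """Return docs unchanged, or raise if the parser produced nothing.

    `docs` is whatever the parser returned; `source` is only used
    to make the error message easier to trace.
    """
    if not docs:
        raise RuntimeError(f"No documents were parsed from {source!r}")
    return docs


# Assumed usage with the parser from the report:
# parsed_docs = require_documents(
#     parser.load_data("llamaparse/test.html"), "llamaparse/test.html"
# )
```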

Client:

Options: see code above

BinaryBrain commented 2 months ago

Thanks for reporting the bug. It'll be fixed in about 30 minutes.