run-llama / llama_parse

Parse files for optimal RAG
https://www.llamaindex.ai
MIT License
3.3k stars 320 forks

LLamaParse has STOPPED working with SCANNED PDFs #151

Closed maxqper closed 7 months ago

maxqper commented 7 months ago

This used to work, but LlamaParse has stopped working with scanned PDFs.

Returns "NO CONTENT HERE"

maxqper commented 7 months ago

Used to work with multiple use-cases:

(1) scanned PDFs (2) PDFs with annotated text (3) printed emails as PDF

All of them now fail with "NO CONTENT HERE" on pages where there is scanned content.

maxqper commented 7 months ago

See https://github.com/run-llama/llama_parse/issues/137

maxqper commented 7 months ago

This is a pretty poor regression. Our team is removing llamaParse API usage on a weekend for this change in behavior.

hexapode commented 7 months ago

I'm really sorry about that one. It seems our internal release last week broke pages which contain only an OCR image. Working on a fix right now, will let you know as soon as it is fixed.

hexapode commented 7 months ago

This issue should now be fixed.

maxqper commented 7 months ago

We still see the issue. Unfortunately, we cannot share the PDFs, but any scanned PDF, or any PDF with embedded scanned PDFs, still fails with "NO_CONTENT_HERE".

maxqper commented 7 months ago

@hexapode ^^

brianjking commented 7 months ago

Following.

shotyme commented 7 months ago

I just tested it on the playground. There are no results. Will you fix that in the future?

ggjx22 commented 6 months ago

Just tested a 27-page scanned PDF and the following are the results:

Context of the PDF:

Parser (with parsing instructions) hallucinates (truncated to not lengthen this comment):

...
---
# DOCUMENT TYPE
- Invoice

# DOCUMENT PROCESSING COUNTRY
- Australia

# DOCUMENT DATE
- 25/06/2022

# DOCUMENT NUMBER
- INV-123456

# REFERENCE DOCUMENT NUMBER
- PO-987654

# SUPPLIER DETAILS
- NAME: ABC Company Pty Ltd
- ADDRESS: 123 Supplier Street, Supplier City, Supplier Country
- TAX IDENTIFICATION NUMBER: 12345678910

# CUSTOMER DETAILS
- NAME: XYZ Corporation
- ADDRESS: 456 Customer Street, Customer City, Customer Country
- TAX IDENTIFICATION NUMBER: 10987654321

# DELIVERY/SHIPPING LOCATION
- NAME: XYZ Corporation
- ADDRESS: 456 Customer Street, Customer City, Customer Country

# BILLING ADDRESS
- 456 Customer Street, Customer City, Customer Country

# CURRENCY CODE
- AUD

# CURRENCY EXCHANGE RATE
- 1.00

| DESCRIPTION           | QUANTITY | UNIT PRICE | AMOUNT |
|-----------------------|----------|------------|--------|
| Product 1             | 2        | 100.00     | 200.00 |
| Product 2             | 1        | 50.00      | 50.00  |

# TOTALS
- TOTAL AMOUNT EXCLUDING TAX: 250.00
- TOTAL TAX AMOUNT: 25.00
- TOTAL AMOUNT INCLUDING TAX: 275.00
---
...
---
...
---
...
---

Parser (no parsing instructions) hallucinates (truncated to not lengthen this comment):


---

---

---

---

---
01  6#  105
---

---
| |258|76|
|---|---|---|
| |77| |
|66|R= 66|266|66|66|266|
---
     Eitahd  Eitshli  achC
Ja"
---

---

---
KEShi  KESM
---
LLotd  JONES  LLOYDE JONES  LLOvD E JoMES  LLOYD G JONES  LOYD 5 JC *JES  LLOYD 6 IC HES  LLOTD & IONES  LLOTORIONES
---
TS  Fss  F2t  [3
---
Wn  TmA  RCTN3:
---
E A 3:  0Wm  7W  Te MW  91 67 31  737
---

---

---
|ANEK|ANEK|ANEK|ANEK|ANEK|ANEK|ANEK|ANEK|
|---|---|---|---|---|---|---|---|
| | |Ll| | | | |071|
---
0
S        Ee_   2
 37  57   747     79  7
---
   37
0
---
# 545

Vc   77
---
STILLER  STILLER  STILLER  STILLER  STILLER  STILLER  STILLER  STILLER
---
STILLER  STILLER  STILLER  STILLER  STILLER  STLLER  STILER  STILLER
---
STILLER  STILLER  STILLER  STILLER  STILLR  SILLER  STILLER  STILLER
---
STILLER  STILLER  STILLER  STTILLER  STILLER  STILLER  STILLER  STILLER
---
|STILLER|STILLER|STILLER|STILLER|STILLER|STILLER|STILLER|STILLER|
|---|---|---|---|---|---|---|---|
| | | | | | | | |

The parsing prompt for LlamaParse directs the model to identify whether the document is an invoice or purchase order and to extract key information. It emphasizes not making up information, fully parsing large tables, and maintaining original document structure and content integrity. The extracted information should be presented in a structured markdown format, avoiding any form-related documents.

How I am using LlamaParse:

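import os

from llama_index.core import SimpleDirectoryReader
from llama_parse import LlamaParse

# Assumption: the API key is read from the environment.
LLAMA_CLOUD_API_KEY = os.environ['LLAMA_CLOUD_API_KEY']

parser = LlamaParse(
    api_key=LLAMA_CLOUD_API_KEY,
    result_type='markdown',
    parsing_instruction='my-prompt-for-the-parser'  # placeholder for the real prompt
)
# Route every .pdf in the directory through LlamaParse
file_extractor = {'.pdf': parser}

document = SimpleDirectoryReader(
    input_dir='data/',
    file_extractor=file_extractor,
).load_data()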

version: llama-parse-0.4.3
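
For anyone who needs to detect affected documents until this is resolved, here is a minimal sketch that flags pages which came back blank or with the marker strings quoted in this thread. It works on plain strings and does not call LlamaParse. Assumptions: pages in the markdown output are separated by lines containing only `---` (as in the output pasted above), and the marker is spelled either "NO_CONTENT_HERE" or "NO CONTENT HERE". The function names `split_pages` and `empty_page_indices` are hypothetical helpers, not part of llama_parse.

```python
import re

# Marker spellings quoted in this thread; what the service actually emits is an assumption.
EMPTY_MARKERS = ("NO_CONTENT_HERE", "NO CONTENT HERE")


def split_pages(markdown: str) -> list[str]:
    """Split parser output into pages on lines consisting only of '---'.

    Table separator rows such as '|---|---|' are not treated as page breaks,
    but a genuine markdown horizontal rule inside a page would be.
    """
    parts = re.split(r"^---[ \t]*$", markdown, flags=re.MULTILINE)
    return [part.strip() for part in parts]


def empty_page_indices(markdown: str) -> list[int]:
    """Return 0-based indices of pages that are blank or contain an empty-content marker."""
    flagged = []
    for index, page in enumerate(split_pages(markdown)):
        if not page or any(marker in page for marker in EMPTY_MARKERS):
            flagged.append(index)
    return flagged
```

This could be run over the text of each loaded document to decide which files to send to a fallback OCR path. It only catches blank or marker pages, not the hallucinated or garbled output shown above.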

0xthierry commented 6 months ago

any updates? @hexapode

rajender07 commented 5 months ago

Failing for me as well when tested on a scanned PDF. The parser is returning 0 records. Could you please share an update on whether this will be resolved in the near future?

Zabih-khan commented 2 months ago

We still see the issue.
