Closed: maxqper closed this issue 7 months ago
Used to work with multiple use cases:
(1) scanned PDFs (2) PDFs with annotated text (3) printed emails with PDF
All of them now fail with "NO CONTENT HERE" on pages where there is scanned content.
This is a pretty poor regression. Our team is removing LlamaParse API usage on a weekend because of this change in behavior.
I'm really sorry for that one. It seems our internal release last week broke pages which contain only an OCR image. Working on a fix right now, will let you know as soon as it is fixed.
This issue should now be fixed.
We still see the issue. Unfortunately, we cannot share the PDFs. But any scanned PDF, or any PDF with embedded scanned PDFs, still fails with "NO_CONTENT_HERE".
@hexapode ^^
Following.
I just tested it on the playground. There are no results. Will you fix that in the future?
Just tested a 27 page scanned PDF and the following are the results:
Context of the PDF:
Parser (with parsing instructions) hallucinates (truncated to not lengthen this comment):
...
---
# DOCUMENT TYPE
- Invoice
# DOCUMENT PROCESSING COUNTRY
- Australia
# DOCUMENT DATE
- 25/06/2022
# DOCUMENT NUMBER
- INV-123456
# REFERENCE DOCUMENT NUMBER
- PO-987654
# SUPPLIER DETAILS
- NAME: ABC Company Pty Ltd
- ADDRESS: 123 Supplier Street, Supplier City, Supplier Country
- TAX IDENTIFICATION NUMBER: 12345678910
# CUSTOMER DETAILS
- NAME: XYZ Corporation
- ADDRESS: 456 Customer Street, Customer City, Customer Country
- TAX IDENTIFICATION NUMBER: 10987654321
# DELIVERY/SHIPPING LOCATION
- NAME: XYZ Corporation
- ADDRESS: 456 Customer Street, Customer City, Customer Country
# BILLING ADDRESS
- 456 Customer Street, Customer City, Customer Country
# CURRENCY CODE
- AUD
# CURRENCY EXCHANGE RATE
- 1.00
| DESCRIPTION | QUANTITY | UNIT PRICE | AMOUNT |
|-----------------------|----------|------------|--------|
| Product 1 | 2 | 100.00 | 200.00 |
| Product 2 | 1 | 50.00 | 50.00 |
# TOTALS
- TOTAL AMOUNT EXCLUDING TAX: 250.00
- TOTAL TAX AMOUNT: 25.00
- TOTAL AMOUNT INCLUDING TAX: 275.00
---
...
---
...
---
...
---
Parser (no parsing instructions) hallucinates (truncated to not lengthen this comment):
---
---
---
---
---
01 6# 105
---
---
| |258|76|
|---|---|---|
| |77| |
|66|R= 66|266|66|66|266|
---
Eitahd Eitshli achC
Ja"
---
---
---
KEShi KESM
---
LLotd JONES LLOYDE JONES LLOvD E JoMES LLOYD G JONES LOYD 5 JC *JES LLOYD 6 IC HES LLOTD & IONES LLOTORIONES
---
TS Fss F2t [3
---
Wn TmA RCTN3:
---
E A 3: 0Wm 7W Te MW 91 67 31 737
---
---
---
|ANEK|ANEK|ANEK|ANEK|ANEK|ANEK|ANEK|ANEK|
|---|---|---|---|---|---|---|---|
| | |Ll| | | | |071|
---
0
S Ee_ 2
37 57 747 79 7
---
37
0
---
# 545
Vc 77
---
STILLER STILLER STILLER STILLER STILLER STILLER STILLER STILLER
---
STILLER STILLER STILLER STILLER STILLER STLLER STILER STILLER
---
STILLER STILLER STILLER STILLER STILLR SILLER STILLER STILLER
---
STILLER STILLER STILLER STTILLER STILLER STILLER STILLER STILLER
---
|STILLER|STILLER|STILLER|STILLER|STILLER|STILLER|STILLER|STILLER|
|---|---|---|---|---|---|---|---|
| | | | | | | | |
The parsing prompt for LlamaParse directs the model to identify whether the document is an invoice or purchase order and to extract key information. It emphasizes not making up information, fully parsing large tables, and maintaining original document structure and content integrity. The extracted information should be presented in a structured markdown format, avoiding any form-related documents.
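How I am using LlamaParse:
from llama_index.core import SimpleDirectoryReader
from llama_parse import LlamaParse

# LLAMA_CLOUD_API_KEY is assumed to be defined elsewhere (e.g. read from the environment)
parser = LlamaParse(
    api_key=LLAMA_CLOUD_API_KEY,
    result_type='markdown',
    parsing_instruction='my-prompt-for-the-parser'
)

# Route every .pdf file found in data/ through LlamaParse
file_extractor = {'.pdf': parser}
document = SimpleDirectoryReader(
    input_dir='data/',
    file_extractor=file_extractor,
).load_data()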
version: llama-parse-0.4.3
any updates? @hexapode
Failing for me as well when tested on a scanned PDF. The parser returns 0 records. Could you please share an update on whether this will be resolved in the near future?
We still see the issue.
It used to work before, but LlamaParse has stopped working with scanned PDFs.
Returns "NO CONTENT HERE"
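For anyone who needs to guard against this in the meantime, here is a minimal sketch of a check that flags pages which came back empty or containing only the placeholder. It assumes the placeholder appears verbatim in the parsed text, spelled either "NO_CONTENT_HERE" or "NO CONTENT HERE" as quoted in this thread. The helper name `find_failed_pages` is hypothetical and not part of LlamaParse.

```python
import re

# Assumption: the placeholder appears verbatim in the parsed text, with
# either underscores or spaces (both spellings are quoted in this thread).
PLACEHOLDER = re.compile(r"NO[_ ]CONTENT[_ ]HERE")


def find_failed_pages(page_texts):
    """Return indices of pages that are empty or contain only the placeholder."""
    failed = []
    for index, text in enumerate(page_texts):
        # Remove the placeholder, then check whether any real content is left.
        remainder = PLACEHOLDER.sub("", text).strip()
        if not remainder:
            failed.append(index)
    return failed


if __name__ == "__main__":
    pages = [
        "# Invoice\nTotal: 275.00",
        "NO_CONTENT_HERE",
        "",
        "NO CONTENT HERE\n",
    ]
    print(find_failed_pages(pages))
```

With the snippet posted earlier, this could probably be called as `find_failed_pages([doc.text for doc in document])`, assuming each loaded document exposes its content as `.text`. Note that it would not catch the hallucinated output reported above, only empty or placeholder pages.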