run-llama / llama_parse

Parse files for optimal RAG
https://www.llamaindex.ai
MIT License

Does Llama Parse skip reprocessing already parsed documents? #187

Open NeevrajKB opened 1 month ago

NeevrajKB commented 1 month ago

Does Llama Parse detect documents, or parts of documents, that have already been parsed and skip reparsing them? If not, how can this be implemented on the user's end? It would prevent unnecessary reprocessing and save costs for enterprise users.

ggjx22 commented 1 month ago

I am also interested in this. I have tried using the parsing instruction to "suppress" repeated data in the output, but it does not seem to help.

Suppose a document has its header/summary section repeated on every page; embedding all of those repeats is going to be costly for large documents in RAG applications.

Take the sample document below as an example. Prompts that tell the parser not to output information repeated between pages (e.g. the document date and number) are simply not effective. The parser is also sometimes imperfect at understanding the relationship between the parties in the document, despite prompts that explain how to identify them.

MultiPageInvoice.pdf

This is the markdown result (not using gpt-4o):

# DOCUMENT TYPE
- Invoice

# SUPPLIER DETAILS
- NAME: Abstractors and Design Co.
- ADDRESS: Suite 8, 611 Maine St, San Francisco CA 94105
- TAX IDENTIFICATION NUMBER: N/A

# CUSTOMER DETAILS
- NAME: Ronald Davis
- ADDRESS: N/A
- TAX IDENTIFICATION NUMBER: N/A

# DELIVERY/SHIPPING LOCATION
- NAME: N/A
- ADDRESS: N/A

# BILLING ADDRESS
- Full Billing Address: Suite 8, 611 Maine St, San Francisco CA 94105

# DOCUMENT DETAILS
- DOCUMENT PROCESSING COUNTRY: N/A
- DOCUMENT DATE: 12 March 2020
- DOCUMENT NUMBER: INVOICE 00000135
- REFERENCE DOCUMENT NUMBER: F0016

# CURRENCY
- CURRENCY CODE: N/A
- CURRENCY EXCHANGE RATE: N/A

| Qty | Item | Amount |
|-----|------|--------|
| 3 | ACP101 Accounting Package | $1,350.00 |
| | Annual Subscription to Premier Version with Tax, Inventory and Payroll Plugins | |
| 4.5 | ACP101T Online Training | $495.00 |
| | Hours of Training in Premier Version - Interactive Demos with Q&A Sessions | |
| 10 | ACP101S Standard Support | $1,100.00 |
| | Initial Hours allocated for access to email and phone support for Premier Version | |
| 6 | ACP101C Screen Customization | $660.00 |
| | Hours spent customizing screens in Premier Version for client requirements | |
| 4.5 | ACP101R Report Customization | $495.00 |
| | Hours spent customizing reports in Premier Version for client requirements | |

# TOTALS
- TOTAL AMOUNT EXCLUDING TAX: N/A
- TOTAL TAX AMOUNT: N/A
- TOTAL DISCOUNT AMOUNT: N/A
- TOTAL AMOUNT INCLUDING TAX: N/A
---
# DOCUMENT TYPE
- Invoice

# DOCUMENT PROCESSING COUNTRY
- United States

# DOCUMENT DATE
- 12 March 2020

# DOCUMENT NUMBER
- 00000135

# REFERENCE DOCUMENT NUMBER
- F0016

# SUPPLIER DETAILS
- NAME: Metal Legal Finance
- ADDRESS: 154-164 The Embarcadero, San Francisco, CA 94105
- TAX IDENTIFICATION NUMBER: NA

# CUSTOMER DETAILS
- NAME: NA
- ADDRESS: NA
- TAX IDENTIFICATION NUMBER: NA

# DELIVERY/SHIPPING LOCATION
- NAME: NA
- ADDRESS: NA

# BILLING ADDRESS
- 154-164 The Embarcadero, San Francisco, CA 94105

# CURRENCY CODE
- NA

# CURRENCY EXCHANGE RATE
- NA

| Qty | Item                                      | Amount  |
|-----|-------------------------------------------|---------|
| 2   | ACP101I System Imports                    | $220.00 |
|     | Hours spent importing customer records    |         |
|     | into Premier Version                      |         |
|-----|-------------------------------------------|---------|
| 3   | ACP100 Accounting Package                 | $900.00 |
|     | Annual Subscription to Standard Version   |         |
|     | of Accounts System                        |         |
|-----|-------------------------------------------|---------|
| 4.5 | ACP100T Online Training                   | $495.00 |
|     | Hours of Training in Standard Version -   |         |
|     | Interactive Demos with Q&A Sessions       |         |
|-----|-------------------------------------------|---------|

# TOTAL AMOUNT EXCLUDING TAX
- $5,715.00

# TOTAL TAX AMOUNT
- NA

# TOTAL DISCOUNT AMOUNT
- NA

# TOTAL AMOUNT INCLUDING TAX
- $5,715.00
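
Since the parsing instruction did not suppress the repeats, one option is to trim them after parsing, before embedding. The following is a minimal sketch under stated assumptions: pages are separated by `---` lines and sections start with `# ` headings, as in the output above. `drop_repeated_sections` is a hypothetical helper, not part of LlamaParse, and it only catches sections that repeat verbatim (ignoring case and whitespace).

```python
import re


def drop_repeated_sections(markdown: str, page_sep: str = "---") -> str:
    """Drop sections whose heading and body already appeared earlier.

    Hypothetical post-processing helper (not a LlamaParse API).
    """
    seen = set()
    kept_pages = []
    # A page break is a line consisting only of the separator.
    pages = re.split(rf"^{re.escape(page_sep)}[ \t]*$", markdown, flags=re.MULTILINE)
    for page in pages:
        kept = []
        # Split before each "# " heading so the heading stays with its body.
        for section in re.split(r"(?m)^(?=# )", page):
            # Normalise whitespace and case so trivial differences still match.
            key = " ".join(section.split()).lower()
            if not key or key in seen:
                continue
            seen.add(key)
            kept.append(section.strip())
        kept_pages.append("\n\n".join(kept))
    return f"\n{page_sep}\n".join(kept_pages)
```

On the output above this would remove only the second `# DOCUMENT TYPE` section, because the other repeated fields (date, number) are laid out differently on each page. Catching those would need field-level or fuzzy matching, which this sketch does not attempt.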

NeevrajKB commented 1 month ago

I'm wondering: if I hosted a commercial app with this and users uploaded the same files again along with new ones, I think that would incur a huge cost on our end!
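
One way to guard against this on the user's end, independent of whatever the service caches, is to key parse results by a hash of the file contents. This is a minimal sketch, not a LlamaParse feature: `parse_with_cache`, `parse_fn`, and the `.parse_cache` directory are hypothetical names, and `parse_fn` stands in for whatever call actually sends the file to the parser and returns its text.

```python
import hashlib
import json
from pathlib import Path
from typing import Callable


def parse_with_cache(
    file_path: str,
    parse_fn: Callable[[str], str],
    cache_dir: str = ".parse_cache",
) -> str:
    """Return the parsed text, calling parse_fn only for unseen content."""
    data = Path(file_path).read_bytes()
    # Hash the bytes, not the name, so a re-upload under a new name still hits.
    digest = hashlib.sha256(data).hexdigest()
    cache_file = Path(cache_dir) / f"{digest}.json"

    if cache_file.exists():
        return json.loads(cache_file.read_text(encoding="utf-8"))["text"]

    text = parse_fn(file_path)  # the only place a paid parse would happen
    cache_file.parent.mkdir(parents=True, exist_ok=True)
    cache_file.write_text(
        json.dumps({"source": Path(file_path).name, "text": text}),
        encoding="utf-8",
    )
    return text
```

If the parsing instruction or other settings vary between calls, they would need to be folded into the hash as well; otherwise a cached result produced under different settings would be returned. This only detects byte-identical files, not documents that are partly the same.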

sofi444 commented 1 month ago

That's what it claims to do.

invalidate_cache=False, # If set to true, the cache will be ignored and the document re-processes. All document are kept in cache for 48hours after the job was completed to avoid processing 2 time the same document.

(It is False by default.) However, I have not been able to skip already parsed documents, even when explicitly setting invalidate_cache=False.

Would be great to get some pointers @logan-markewich if you have the time.