run-llama / llama_parse

Parse files for optimal RAG
https://www.llamaindex.ai
MIT License

Textract result in blog post #94

Open ThomasDelteil opened 3 months ago

ThomasDelteil commented 3 months ago

image

I am curious about what the red highlights mean in this picture, notably for Textract. The output of the Textract API is (near-)perfect on that document, so I am wondering where the degradation might come from.

Screenshot 2024-03-18 at 16 36 14

You might want to check out some of the document-to-text approaches we have made available through our Textractor client library, in case they are useful:

https://aws-samples.github.io/amazon-textract-textractor/notebooks/document_linearization_to_markdown_or_html.html
https://aws-samples.github.io/amazon-textract-textractor/notebooks/textractor_for_large_language_models.html
https://aws-samples.github.io/amazon-textract-textractor/notebooks/tabular_data_linearization.html

Thanks,

(disclaimer: I work on Textract table recognition)
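For readers following the notebooks linked above, here is a minimal sketch of the kind of document-to-Markdown linearization they describe. It assumes the `amazon-textract-textractor` package is installed and AWS credentials are configured. The function name `textract_to_markdown`, the file path, and the profile name are illustrative placeholders, not part of the original comment.

```python
def textract_to_markdown(file_source, profile_name=None):
    """Run AWS Textract on a document and return a Markdown linearization.

    Hypothetical helper for illustration only. Requires the third-party
    amazon-textract-textractor package and valid AWS credentials.
    """
    # Imported lazily so this module can be loaded without the package.
    from textractor import Textractor
    from textractor.data.constants import TextractFeatures

    extractor = Textractor(profile_name=profile_name)

    # LAYOUT and TABLES ask Textract for reading order and table structure,
    # which the linearization relies on to emit headings and tables.
    document = extractor.analyze_document(
        file_source=file_source,
        features=[TextractFeatures.LAYOUT, TextractFeatures.TABLES],
    )
    return document.to_markdown()


# Example call (placeholder path and profile, needs AWS access):
# print(textract_to_markdown("report_page.png", profile_name="default"))
```

Comparing this output with the benchmark text for the same page might help narrow down where the reported degradation comes from.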

hexapode commented 3 months ago

I believe we are using https://textract.readthedocs.io/en/stable/, not AWS Textract. We should make that clearer.