Textract result in blog post

I am curious about what the red highlight means in this picture, notably for Textract. The output of the Textract API is (near-)perfect on that document, so I am wondering where the degradation might come from.

Screenshot 2024-03-18 at 16 36 14

You might want to check some of the document-to-text approaches we have made available through our Textractor client library, in case they are useful:

https://aws-samples.github.io/amazon-textract-textractor/notebooks/document_linearization_to_markdown_or_html.html

https://aws-samples.github.io/amazon-textract-textractor/notebooks/textractor_for_large_language_models.html

https://aws-samples.github.io/amazon-textract-textractor/notebooks/tabular_data_linearization.html
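For reference, here is a minimal sketch of what linearizing a document to Markdown-like text with Textractor might look like, loosely following the notebooks linked above. The file name `document.png`, the region `us-west-2`, and the helper `linearize_to_markdown` are placeholders I made up for illustration, and the linearization options are just one plausible choice. It assumes `amazon-textract-textractor` is installed and AWS credentials are configured.

```python
# Hypothetical option set; tune these to your document.
LINEARIZATION_OPTIONS = {
    "hide_figure_layout": True,             # skip figure regions in the text output
    "title_prefix": "# ",                   # render titles as Markdown H1
    "section_header_prefix": "## ",         # render section headers as Markdown H2
    "table_linearization_format": "markdown",
}


def linearize_to_markdown(file_source, region_name="us-west-2"):
    """Call Textract on a single-page image and return linearized text.

    `file_source` and `region_name` are placeholders; this makes a real
    (billable) Textract API call.
    """
    # Imported lazily so the options above can be inspected without the library.
    from textractor import Textractor
    from textractor.data.constants import TextractFeatures
    from textractor.data.text_linearization_config import TextLinearizationConfig

    extractor = Textractor(region_name=region_name)
    # LAYOUT gives reading order and headers, TABLES gives table structure.
    document = extractor.analyze_document(
        file_source=file_source,
        features=[TextractFeatures.LAYOUT, TextractFeatures.TABLES],
    )
    config = TextLinearizationConfig(**LINEARIZATION_OPTIONS)
    return document.get_text(config=config)


if __name__ == "__main__":
    print(linearize_to_markdown("document.png"))
```

Requesting the layout and table features is what lets the linearizer keep headers and tables in reading order rather than emitting a flat list of lines, which may matter when comparing against other parsers.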

Thanks,

(disclaimer: I work on Textract table recognition)

run-llama / llama_parse

Textract result in blog post #94