I am curious about what the red highlights mean in this picture, notably for Textract. The output of the Textract API is near-perfect on that document, so I am wondering where the degradation might come from.
You might want to check out some of the document-to-text approaches we have made available through our Textractor client library, in case they are useful:
https://aws-samples.github.io/amazon-textract-textractor/notebooks/document_linearization_to_markdown_or_html.html
https://aws-samples.github.io/amazon-textract-textractor/notebooks/textractor_for_large_language_models.html
https://aws-samples.github.io/amazon-textract-textractor/notebooks/tabular_data_linearization.html
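For a quick idea of what the linearization covered in those notebooks looks like, here is a minimal sketch rather than a reference implementation. It assumes the `amazon-textract-textractor` package is installed and that AWS credentials are available under a local profile. The helper name `linearize_to_markdown` and the `"default"` profile are placeholders of mine, and the specific config values are just one plausible choice, so please check them against the notebooks for your installed version.

```python
def linearize_to_markdown(file_source: str, profile_name: str = "default") -> str:
    """Run Textract on a single-page image and return its text as Markdown-like output.

    Hypothetical helper for illustration; calling it sends a request to the
    Textract API and therefore needs valid AWS credentials.
    """
    # Imported inside the function so this module can be loaded
    # even where textractor is not installed.
    from textractor import Textractor
    from textractor.data.constants import TextractFeatures
    from textractor.data.text_linearization_config import TextLinearizationConfig

    extractor = Textractor(profile_name=profile_name)

    # LAYOUT gives reading order and headers; TABLES gives table structure.
    document = extractor.analyze_document(
        file_source=file_source,
        features=[TextractFeatures.LAYOUT, TextractFeatures.TABLES],
    )

    # Assumed settings: Markdown-style prefixes for titles and section
    # headers, and tables rendered as Markdown instead of plain text.
    config = TextLinearizationConfig(
        hide_figure_layout=True,
        title_prefix="# ",
        section_header_prefix="## ",
        table_linearization_format="markdown",
    )
    return document.get_text(config=config)
```

Calling `linearize_to_markdown("page.png")` on the page from the picture and comparing the result with the text used in the evaluation might help narrow down whether the degradation comes from the API output itself or from how it was converted to text.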
thanks,
(disclaimer: I work on Textract table recognition)