ministryofjustice / analytical-platform

Analytical Platform • This repository is defined and managed in Terraform

https://docs.analytical-platform.service.justice.gov.uk

MIT License

✨ Access to Textract #4877

Closed lalithanagarur closed 1 month ago

lalithanagarur commented 3 months ago

Describe the feature request.

Get access to use Textract in our webapp.

Describe the context.

BOLD and LAA want to build a document summarisation tool. We are currently in our investigative/exploratory stage. We have created a simple dev webapp and want to test access to Bedrock and Textract. Please could we get access to Textract?

Value / Purpose

We tested Textract during a hackathon project, and it gave better results than the alternatives we tried.

User Types

BOLD analysts and LAA analysts

darren1988 commented 2 months ago

To be discussed at refinement. The request originally came from BOLD, and the data science team now has a use case for it too.

joeprinold commented 2 months ago

We are about to start a project in the Probation Data Science team that would really benefit from this functionality.

We are working in collaboration with Probation Digital to try to improve the Pre-Sentence Report writing process for operational staff. Essentially this will involve taking large sets of documents and looking for ways to summarise them or extract data from them. The first step will be extracting the text from those documents, especially where the documents are handwritten. Our research indicates that open source systems perform this task considerably worse than Textract. Overall, the project aims to deliver significant efficiency savings for the probation staff who write these reports.

The project will kick off at the end of September with a Turing intern who will be joining to help out with this work.

Happy to provide any extra information that would be useful.

df-just commented 2 months ago

Just wanted to add support for this.

We're currently running a few proofs of concept that use Textract to perform OCR on images and non-machine-readable documents. It's performing much better than our previous iterations using Tesseract, and the implementation was simpler too.
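
For anyone following the thread, here is a minimal sketch of what a Textract OCR call like this might look like. It assumes boto3 is installed and that the calling role has been granted `textract:DetectDocumentText`. The helper names `extract_lines` and `ocr_image` and the `eu-west-2` region are illustrative assumptions, not taken from any of the projects mentioned here.

```python
from typing import Any


def extract_lines(response: dict[str, Any]) -> list[str]:
    """Return the text of every LINE block in a Textract response, in order."""
    return [
        block["Text"]
        for block in response.get("Blocks", [])
        if block.get("BlockType") == "LINE" and "Text" in block
    ]


def ocr_image(image_bytes: bytes, region_name: str = "eu-west-2") -> list[str]:
    """Send one image to Textract's synchronous API and return its lines.

    Assumption: the caller has AWS credentials with permission to call
    textract:DetectDocumentText. The default region is also an assumption.
    """
    import boto3  # imported lazily so extract_lines works without AWS set up

    client = boto3.client("textract", region_name=region_name)
    response = client.detect_document_text(Document={"Bytes": image_bytes})
    return extract_lines(response)
```

Keeping the response parsing separate from the API call means the parsing can be exercised against a saved response before access is granted.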

laura-auburn commented 2 months ago

Adding support for this too.

My team are working on a project for the Parole Board reading in lots of (currently publicly available) PDF files containing tables, images, charts etc. We wanted to start using Textract as research suggests it would perform better than some of the other standard PDF loaders we've been experimenting with (e.g. pymupdf, pdfplumber, pypdf and pdfminer). These loaders are ok(ish) for our PoC but we are seeing limitations with them already so it would be good to be able to experiment with Textract on the AP. Happy to provide more info if needed.