kedro-org / kedro-plugins

First-party plugins maintained by the Kedro team.

Apache License 2.0

PDF Read DataSet #188

Open yassineAlouini opened 1 year ago

yassineAlouini commented 1 year ago

Description

A dataset that can be used to read PDF documents and extract relevant parts such as tables, figures, and textual content.

Context

This could be useful for projects that need to read PDF files and extract the content within.

Possible Implementation

Various third-party libraries could be used. Some are based on OCR and some on the structure of the PDF file format. Here are some options:

Structural: https://github.com/pymupdf/PyMuPDF
OCR: https://github.com/Layout-Parser/layout-parser

astrojuanlu commented 1 year ago

Hi @yassineAlouini, thanks for your interest in Kedro! I'll add one more alternative to the mix: https://pypi.org/project/pypdf/

I expect that, as with any scraping activity, it would be tricky to fine-tune such a dataset so that it ends up extracting the desired information. I think we should make it very clear that it wouldn't do any magic, and that ultimately the user would be responsible for properly configuring the underlying library so that the results are as desired, as well as for performing any validation afterwards. Does that sound reasonable?
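To make the "no magic" point concrete, here is a minimal sketch of what such a dataset could look like. Everything named here is hypothetical: `PDFTextDataset`, `extract_text_with_pypdf`, and the `extractor` parameter are illustrative and not part of Kedro or kedro-plugins. The class only mirrors the load/save/describe shape of a Kedro dataset; a real contribution would presumably subclass `kedro.io.AbstractDataset` and follow the plugin's conventions. The default extractor assumes pypdf is installed and uses its `PdfReader` and `extract_text()` calls.

```python
from pathlib import Path
from typing import Any, Callable, Dict, List


def extract_text_with_pypdf(path: Path) -> List[str]:
    """Return the text of each page using pypdf (imported lazily)."""
    from pypdf import PdfReader  # third-party dependency: pip install pypdf

    reader = PdfReader(str(path))
    # Scanned or image-only pages may yield no text, so normalise to "".
    return [page.extract_text() or "" for page in reader.pages]


class PDFTextDataset:
    """Hypothetical read-only dataset that returns one string per PDF page.

    Assumption: a real implementation would subclass
    kedro.io.AbstractDataset; this sketch only mimics its interface.
    """

    def __init__(
        self,
        filepath: str,
        extractor: Callable[[Path], List[str]] = extract_text_with_pypdf,
    ) -> None:
        self._filepath = Path(filepath)
        # The user chooses and configures the underlying library here.
        self._extractor = extractor

    def load(self) -> List[str]:
        if not self._filepath.is_file():
            raise FileNotFoundError(f"No PDF found at {self._filepath}")
        return self._extractor(self._filepath)

    def save(self, data: Any) -> None:
        raise NotImplementedError("PDFTextDataset is read-only")

    def describe(self) -> Dict[str, Any]:
        name = getattr(self._extractor, "__name__", repr(self._extractor))
        return {"filepath": str(self._filepath), "extractor": name}
```

Passing the extractor in as a callable keeps the dataset itself thin: swapping pypdf for PyMuPDF or an OCR-based tool would only mean supplying a different function, and any validation of the extracted text would still happen in the user's pipeline.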

And a final question: would you like to try contributing it? 😃

yassineAlouini commented 1 year ago

@astrojuanlu That sounds reasonable indeed, and thanks for the additional library. 👍 We are currently discussing this internally with colleagues and will let you know if we can contribute something. 👌

astrojuanlu commented 10 months ago

A user suggested https://github.com/Unstructured-IO/unstructured as another alternative.