Open yassineAlouini opened 1 year ago
Hi @yassineAlouini, thanks for your interest in Kedro! I'd add one more alternative to the mix: https://pypi.org/project/pypdf/
I expect that, as with any scraping activity, it would be tricky to fine-tune such a dataset so that it ends up extracting the desired information. I think we should make it very clear that it wouldn't do any magic, and that ultimately the user would be responsible for properly configuring the underlying library so that the results are as desired, as well as for performing any validation afterwards. Does that sound reasonable?
And a final question: would you like to try to contribute it? 😃
@astrojuanlu That sounds reasonable indeed and thanks for the additional library. :+1: We are discussing this internally with other colleagues at the moment and will let you know if we can contribute something. :ok_hand:
A user suggested https://github.com/Unstructured-IO/unstructured as an alternative.
Description
A dataset that can be used to read PDF documents and extract relevant parts such as tables, figures, and textual content.
Context
This could be useful for projects that need to read PDF files and extract the content within.
Possible Implementation
Various third-party libraries could be used. Some are based on OCR and some on the structure of the PDF file format. Here are some options:
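To make the idea more concrete, here is a minimal sketch of what a text-only PDF dataset might look like when built on pypdf, one of the libraries mentioned in this thread. The class name `PDFTextDataset`, the `page_reader` parameter, and the helper `_read_pages_with_pypdf` are hypothetical names chosen for illustration; they are not part of Kedro or pypdf. The sketch is deliberately written as a plain class so that it runs without Kedro installed. A real contribution would presumably subclass Kedro's abstract dataset base class and follow its load/save/describe conventions instead.

```python
from pathlib import Path
from typing import Any, Callable, Dict, List, Optional


def _read_pages_with_pypdf(filepath: str) -> List[str]:
    """Hypothetical helper: extract text page by page using pypdf."""
    # Imported lazily so the class below can be used with another backend
    # even when pypdf is not installed (pip install pypdf).
    from pypdf import PdfReader

    reader = PdfReader(filepath)
    # extract_text() can yield no text for scanned (image-only) pages,
    # so fall back to an empty string in that case.
    return [page.extract_text() or "" for page in reader.pages]


class PDFTextDataset:
    """Hypothetical read-only dataset returning the text of each PDF page.

    Assumption: this mirrors the load/save/describe shape of a Kedro
    dataset, but it does not inherit from any Kedro class.
    """

    def __init__(
        self,
        filepath: str,
        page_reader: Optional[Callable[[str], List[str]]] = None,
    ) -> None:
        self._filepath = Path(filepath)
        # The extraction backend is pluggable, because configuring the
        # underlying library is ultimately the user's responsibility.
        self._page_reader = page_reader or _read_pages_with_pypdf

    def load(self) -> List[str]:
        """Return one string of extracted text per page."""
        return self._page_reader(str(self._filepath))

    def save(self, data: Any) -> None:
        """Writing PDFs is out of scope for this sketch."""
        raise NotImplementedError("PDFTextDataset is read-only in this sketch.")

    def describe(self) -> Dict[str, Any]:
        """Return the dataset's configuration for logging or debugging."""
        return {"filepath": str(self._filepath)}
```

With pypdf installed, `PDFTextDataset("report.pdf").load()` would return a list with one string per page (`report.pdf` being a placeholder path). Keeping the backend behind `page_reader` leaves room for an OCR-based or structure-based extractor later, and any validation of the extracted text would still be up to the user.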