Description:
This PR refactors the pdf_extract.py script to improve readability and maintainability of the code.
To leave the existing code untouched, the refactoring lives in a new app.py script and a new app_tools library.
app.py runs the same pipeline as pdf_extract.py.
The app_tools library contains the refactored implementation of each pipeline step.
If you find this useful, app.py can replace pdf_extract.py.
Motivation:
I love the project and would like to thank you for the great work.
This refactoring is a first step toward building an API with FastAPI and Docker.
Main changes:
The app.py script has been added; it runs the same pipeline as pdf_extract.py.
The app_tools library has been added; it contains the classes and methods that perform each step of the pipeline:
pdf.py: Provides a set of tools for working with PDF files.
layout_analysis.py: Detects the layout of each page in a document image.
formula_analysis.py: Handles formula detection and recognition in images.
ocr_analysis.py: OCR processor, responsible for text recognition.
table_analysis.py: Table processor, used for table recognition in documents.
visualize.py: Generates visualizations of the document layout.
config.py: Configures model parameters and logging.
utils.py: Saves results as JSON.
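As a rough illustration of how the modules listed above could be chained, here is a minimal sketch. The names used (PageResult, layout_step, run_pipeline, save_json, and so on) are hypothetical stand-ins and not the actual app_tools API. The step order simply follows the order of the list above, which is an assumption about the real pipeline, and each step returns placeholder data so the snippet runs without any models.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class PageResult:
    """Collected output for one page (hypothetical structure)."""
    page_index: int
    regions: List[dict] = field(default_factory=list)


# Each step receives the page image and appends its findings to the result.
# Real steps would call the models; these only record placeholder entries.
def layout_step(image, result):
    result.regions.append({"step": "layout", "type": "text"})


def formula_step(image, result):
    result.regions.append({"step": "formula", "latex": ""})


def ocr_step(image, result):
    result.regions.append({"step": "ocr", "text": ""})


def table_step(image, result):
    result.regions.append({"step": "table", "html": ""})


# Assumed order; a tuple so the default cannot be mutated by accident.
PIPELINE = (layout_step, formula_step, ocr_step, table_step)


def run_pipeline(images, steps=PIPELINE):
    """Run every step on every page image and return plain dicts."""
    results = []
    for index, image in enumerate(images):
        page = PageResult(page_index=index)
        for step in steps:
            step(image, page)
        results.append(asdict(page))
    return results


def save_json(results, path):
    """Write the results to disk, similar in spirit to utils.py."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(results, fh, ensure_ascii=False, indent=2)
```

Keeping each step behind the same call signature is one possible design; it lets a later API layer run only a subset of steps by passing a different `steps` sequence.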
Functionality impact: No change to existing functionality is expected, as the refactoring does not introduce new features or modify existing ones.
Instructions for Reviewers:
Review app.py and the app_tools modules to ensure that the logic has been ported correctly.
Verify that there are no observable changes in the system's behavior when running the tests.
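One possible way to check for behavioral changes, assuming both pdf_extract.py and app.py write their results as JSON, is to compare the two output files structurally. The sketch below is not part of this PR; the function names and the tolerance value are hypothetical. Numbers are compared with a small tolerance because model scores and coordinates may differ slightly between runs.

```python
import json
import math


def _is_number(value):
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def diff_json(a, b, path="$", tol=1e-6):
    """Return a list of human-readable differences between two JSON values."""
    if isinstance(a, dict) and isinstance(b, dict):
        diffs = []
        for key in sorted(set(a) | set(b)):
            if key not in a:
                diffs.append(f"{path}.{key}: missing in first")
            elif key not in b:
                diffs.append(f"{path}.{key}: missing in second")
            else:
                diffs.extend(diff_json(a[key], b[key], f"{path}.{key}", tol))
        return diffs
    if isinstance(a, list) and isinstance(b, list):
        if len(a) != len(b):
            return [f"{path}: length {len(a)} != {len(b)}"]
        diffs = []
        for i, (x, y) in enumerate(zip(a, b)):
            diffs.extend(diff_json(x, y, f"{path}[{i}]", tol))
        return diffs
    if _is_number(a) and _is_number(b):
        close = math.isclose(a, b, rel_tol=tol, abs_tol=tol)
        return [] if close else [f"{path}: {a} != {b}"]
    return [] if a == b else [f"{path}: {a!r} != {b!r}"]


def compare_files(old_path, new_path, tol=1e-6):
    """Load two JSON result files and list their differences."""
    with open(old_path, encoding="utf-8") as f_old, \
            open(new_path, encoding="utf-8") as f_new:
        return diff_json(json.load(f_old), json.load(f_new), tol=tol)
```

An empty list from `compare_files` on the outputs of both scripts for the same input PDF would suggest the port preserved behavior for that document.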
Example of Use: