opendatalab / PDF-Extract-Kit

A Comprehensive Toolkit for High-Quality PDF Content Extraction
https://pdf-extract-kit.readthedocs.io/zh-cn/latest/index.html
GNU Affero General Public License v3.0
5.27k stars, 357 forks

Refactoring of `pdf_extract.py` script #114

Open AdevGarcia opened 1 month ago

AdevGarcia commented 1 month ago

Description: This PR refactors the pdf_extract.py script to improve the readability and maintainability of the code. To leave the current code untouched, it adds a new app.py script and an app_tools package. app.py runs the same process as pdf_extract.py, and app_tools holds the refactored versions of the individual steps:

app_tools
|- pdf.py
|- layout_analysis.py
|- formula_analysis.py
|- ocr_analysis.py
|- table_analysis.py
|- visualize.py
|- config.py
|- utils.py

If you find it useful, app.py can take the place of pdf_extract.py.
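To illustrate the idea behind the split, here is a minimal sketch of how a script like app.py could chain the per-step modules. It is not the code in this PR: `PageResult`, `make_step`, and `run_pipeline` are hypothetical names, the steps are stubs that only record that they ran, and the step order is assumed from the order of the files listed in app_tools.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class PageResult:
    """Hypothetical per-page container; not the structure used in the PR."""
    page_index: int
    steps_run: List[str] = field(default_factory=list)


# A step takes a page result and returns it, possibly enriched.
Step = Callable[[PageResult], PageResult]


def make_step(name: str) -> Step:
    """Build a stub step that only records its name on the page."""
    def step(page: PageResult) -> PageResult:
        page.steps_run.append(name)
        return page
    return step


# Assumed order, taken from the module names listed in app_tools.
STEP_NAMES = [
    "layout_analysis",
    "formula_analysis",
    "ocr_analysis",
    "table_analysis",
]
STEPS = [make_step(name) for name in STEP_NAMES]


def run_pipeline(num_pages: int, steps: List[Step]) -> List[PageResult]:
    """Run every step, in order, on every page."""
    results = []
    for index in range(num_pages):
        page = PageResult(page_index=index)
        for step in steps:
            page = step(page)
        results.append(page)
    return results


results = run_pipeline(2, STEPS)
print(results[0].steps_run)
```

Keeping each step behind the same small interface is what would let a single step be swapped or tested without touching the others.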

Motivation: I love the project and would like to thank you for the great work. This refactoring lays the groundwork for building an API with FastAPI and Docker.

Main changes:

Functionality impact: No change to existing functionality is expected, as the refactoring does not introduce new features or modify existing ones.

Instructions for Reviewers:

Example of Use:

python app.py --pdf 1706.03762.pdf
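For reference, a command line like the one above could be parsed roughly as follows. This is a sketch using the standard argparse module, not the actual contents of app.py: only the `--pdf` flag comes from the example, while `build_parser` and the help text are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser; only the --pdf flag is taken from the example."""
    parser = argparse.ArgumentParser(
        prog="app.py",
        description="Extract content from a PDF file.",
    )
    parser.add_argument(
        "--pdf",
        required=True,
        help="Path to the PDF file to process.",
    )
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch
# can be run as-is.
args = build_parser().parse_args(["--pdf", "1706.03762.pdf"])
print(args.pdf)
```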