jennis0 / burdoc

Advanced PDF parsing for python
MIT License


Burdoc: Advanced PDF Parsing for Python

A python library for extracting structured text, images, and tables from PDFs with context and reading order.

![PyPI - Python Version](https://img.shields.io/pypi/pyversions/burdoc) ![Build](https://img.shields.io/github/actions/workflow/status/jennis0/burdoc/python-package.yml) [![Documentation Status](https://readthedocs.org/projects/burdoc/badge/?version=latest)](https://burdoc.readthedocs.io/en/latest/?badge=latest) [![codecov](https://codecov.io/gh/jennis0/burdoc/branch/main/graph/badge.svg?token=7X7146BQ72)](https://codecov.io/gh/jennis0/burdoc) ![Issues](https://img.shields.io/github/issues/jennis0/burdoc) ![License](https://img.shields.io/github/license/jennis0/burdoc)

Table Of Contents

About the Project

Why Another PDF Parsing Library?

Excellent question! Between pdfminer, PyMuPDF, Tika, and many others there is a plethora of tools for parsing PDFs, but nearly all are focused on the initial step of pulling out raw content, not on representing the document's actual meaning. Burdoc's goal is to generate a rich semantic representation of a PDF, including headings, reading order, tables, and images, that can be used for downstream processing.

Key Features

Limitations

Quickstart

More detailed information on running Burdoc can be found in the documentation: https://burdoc.readthedocs.io/en/latest/

Prerequisites

ML Prerequisites

The transformer-based table detection used by Burdoc by default can be quite slow on CPU, often taking several seconds per page. You'll see a large performance increase by running it on a GPU. To avoid messing around with package versions after the fact, it's generally better to install GPU drivers and a GPU-accelerated version of PyTorch first, if available.
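As a rough illustration of the GPU advice above, the sketch below checks whether PyTorch can see a CUDA device before deciding whether to run the ML table finder. The helper name `cuda_available` is hypothetical and not part of Burdoc; only `torch.cuda.is_available()` is a real PyTorch call, and the `skip_ml_table_finding` name is borrowed from the Library section of this README.

```python
def cuda_available() -> bool:
    """Return True if PyTorch is installed and can see a CUDA device."""
    try:
        import torch  # imported lazily so the check also works without PyTorch
    except ImportError:
        return False
    return bool(torch.cuda.is_available())


# Assumption: on CPU-only machines you may prefer to skip the slow ML table finder.
skip_ml_table_finding = not cuda_available()
print(f"CUDA available: {cuda_available()}, skip ML tables: {skip_ml_table_finding}")
```

The resulting flag could then be passed to the parser or mirrored on the command line with `--no-ml-tables`. Whether that trade-off is worthwhile depends on how important table detection is for your documents.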

Installation

To install burdoc from pip

pip install burdoc

To build it directly from source

git clone https://github.com/jennis0/burdoc
cd burdoc
pip install .

Developer Install

To reproduce the development environment for running builds, tests, etc. use

pip install "burdoc[dev]"

or

git clone https://github.com/jennis0/burdoc
cd burdoc
pip install -e ".[dev]"

Usage

Burdoc can be used as a library or directly from the command line, depending on your use case.

Command Line

usage: burdoc [-h] [--pages PAGES] [--html] [--detailed] [--no-ml-tables] [--images] [--single-threaded] [--profile] [--debug] in_file [out_file]

positional arguments:
  in_file            Path to the PDF file you want to parse
  out_file           Path to file to write output to. Defaults to [in-file-stem].json/[in-file-stem].html

optional arguments:
  -h, --help         show this help message and exit
  --pages PAGES      List of pages to process. Accepts comma separated list and ranges specified with '-'
  --html             Output a simple HTML representation of the document, rather than the JSON content.
  --detailed         Include BoundingBoxes and font statistics in the output to aid onward processing
  --no-ml-tables     Turn off ML table finding. Defaults to False.
  --images           Extract images from PDF and store in output. This can lead to very large output JSON files. Default is False
  --single-threaded  Force Burdoc to run in single-threaded mode. Default to off
  --profile          Dump timing information at end of processing
  --debug            Dump debug messages to log
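To make the `--pages` format concrete, here is a minimal sketch of how a spec combining a comma-separated list with `-` ranges could be expanded into page numbers. This is not Burdoc's own parser: the function name `parse_pages` is hypothetical, and treating ranges as inclusive at both ends is an assumption, as is leaving the question of 0- or 1-based page numbering to the caller.

```python
def parse_pages(spec: str) -> list[int]:
    """Expand a page spec such as "1,3-5" into a sorted list of unique page numbers."""
    pages: set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas such as "1,,2"
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if start > end:
                raise ValueError(f"Invalid page range: {part!r}")
            # Assumption: ranges include both endpoints.
            pages.update(range(start, end + 1))
        else:
            pages.add(int(part))
    return sorted(pages)


print(parse_pages("1,3-5"))
```

Under these assumptions, a spec of `1,3-5` selects four pages, and duplicates or overlapping ranges collapse to a single entry.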

Library

import logging

from burdoc import BurdocParser

parser = BurdocParser(
    detailed=False,  # Include detailed information such as font statistics and bounding boxes in the output
    skip_ml_table_finding=False,  # Set to True to skip the ML table finding algorithms
    ignore_images=False,  # Don't extract any images from the document. Much faster but prone to errors if images are used as layout elements
    max_threads=None,  # Maximum number of threads to run (int or None). None uses default system limits, 1 forces single-threaded mode
    log_level=logging.INFO,  # Standard logging level; logging.INFO is 20
    show_pages=False,  # Draw each page as it's extracted with extraction information laid on top. Primarily for debugging
)
content = parser.read('file.pdf')
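If you want the library route to produce a file like the command line does, one option is to serialise the parsed content yourself. The sketch below assumes the value returned by `parser.read` is JSON-serialisable (the CLI writes JSON, but the exact return type is not stated here) and that `BurdocParser()` can be constructed with its defaults. The helper names `save_as_json` and `parse_to_json` are hypothetical, and writing the file next to the PDF is a choice made for this example.

```python
import json
from pathlib import Path


def save_as_json(content, pdf_path: str) -> Path:
    """Write parsed content next to the PDF as <stem>.json and return the path."""
    out_path = Path(pdf_path).with_suffix(".json")
    with out_path.open("w", encoding="utf-8") as handle:
        json.dump(content, handle, indent=2, ensure_ascii=False)
    return out_path


def parse_to_json(pdf_path: str) -> Path:
    """Parse a PDF with Burdoc's default settings and save the result as JSON."""
    # Imported lazily so save_as_json stays usable without Burdoc installed.
    from burdoc import BurdocParser

    # Assumption: read() returns plain dicts/lists that json.dump can handle.
    content = BurdocParser().read(pdf_path)
    return save_as_json(content, pdf_path)
```

Calling `parse_to_json('file.pdf')` would then write `file.json` alongside the input. Note that extracting images can make the JSON very large, as the `--images` help text warns.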

Roadmap

Current issues I'd like to address are:

Built With

Contributing

Contributions are what make the open source community such an amazing place to learn, inspire, and create. Any contributions you make are greatly appreciated.

Creating A Pull Request

  1. Fork the Project
  2. Create your Feature Branch (git checkout -b feature/AmazingFeature)
  3. Commit your Changes (git commit -m 'Add some AmazingFeature')
  4. Push to the Branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request

License

Distributed under the MIT License. See LICENSE for more information.

Authors

Acknowledgements