pypdfium2-team / pypdfium2

Python bindings to PDFium
https://pypdfium2.readthedocs.io/

pypdfium2


pypdfium2 is an ABI-level Python 3 binding to PDFium, a powerful and liberal-licensed library for PDF rendering, inspection, manipulation and creation.

It is built with ctypesgen and external PDFium binaries. The custom setup infrastructure provides a seamless packaging and installation process. A wide range of platforms is supported with pre-built packages.

pypdfium2 includes helpers to simplify common use cases, while the raw PDFium/ctypes API remains accessible as well.

Installation

Runtime Dependencies

As of this writing, pypdfium2 has no mandatory runtime dependencies apart from Python itself.

However, some optional support model features require additional packages:

Setup Magic

As pypdfium2 requires a C extension and has custom setup code, there are some special features to consider. Note that the APIs below may change at any time and are mostly of internal interest.

[^platform_ids]: Intended for packaging, so that wheels can be crafted for any platform without access to a native host.

Usage

Support model

Here are some examples of using the support model API.

Raw PDFium API

While helper classes conveniently wrap the raw PDFium API, it may still be accessed directly and is available in the namespace pypdfium2.raw. Lower-level helpers that may aid with using the raw API are provided in pypdfium2.internal.

import pypdfium2.raw as pdfium_c
import pypdfium2.internal as pdfium_i

Since PDFium is a large library, many components are not covered by helpers yet. You may seamlessly interact with the raw API while still using helpers where available. When used as a ctypes function parameter, helper objects automatically resolve to the underlying raw object (but you may still access it explicitly if desired):

permission_flags = pdfium_c.FPDF_GetDocPermissions(pdf.raw)  # explicit
permission_flags = pdfium_c.FPDF_GetDocPermissions(pdf)      # implicit
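
The implicit form presumably relies on ctypes's `_as_parameter_` protocol, whereby an object exposing that attribute can be passed where a ctypes argument is expected. The following is a minimal sketch of the mechanism using only the standard library. `Helper` and `read_value` are hypothetical stand-ins rather than pypdfium2 code, and a ctypes callback plays the role of a raw library function:

```python
import ctypes

# Prototype of a C-callable function taking an int pointer and returning an int.
# A Python callback stands in for a raw library function here (assumption).
PROTO = ctypes.CFUNCTYPE(ctypes.c_int, ctypes.POINTER(ctypes.c_int))

@PROTO
def read_value(ptr):
    # Dereference the pointer handed over by ctypes
    return ptr[0]

class Helper:
    """Hypothetical helper wrapping a raw ctypes pointer."""

    def __init__(self, value):
        self.raw = ctypes.pointer(ctypes.c_int(value))

    @property
    def _as_parameter_(self):
        # ctypes consults this attribute when the object is passed as an argument
        return self.raw

helper = Helper(42)
explicit = read_value(helper.raw)  # pass the raw pointer directly
implicit = read_value(helper)      # ctypes resolves helper._as_parameter_
```

Both calls hand the same pointer to the function, so either style can be used depending on how explicit the calling code should be.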

For PDFium docs, please look at the comments in its public header files.[^pdfium_docs] A large variety of examples of how to interface with the raw API using ctypes is already provided by the support model source code. Nonetheless, the following guide may be helpful to get started with the raw API, especially for developers who are not yet familiar with ctypes.

[^pdfium_docs]: Unfortunately, no recent HTML-rendered docs are available for PDFium at the moment.

[^bindings_decl]: From the auto-generated bindings file. We maintain a reference copy at autorelease/bindings.py. Or if you have an editable install, there will also be src/pypdfium2_raw/bindings.py.

Command-line Interface

pypdfium2 also ships with a simple command-line interface, providing access to key features of the support model in a shell environment (e.g. rendering, content extraction, document inspection, page rearranging, ...).

The primary motivation behind this is to have a nice testing interface, but it may be helpful in a variety of other situations as well. Usage should be largely self-explanatory, assuming a minimum of familiarity with the command-line.

Licensing

Important: This is NOT LEGAL ADVICE, and there is ABSOLUTELY NO WARRANTY for any information provided in this document or elsewhere in the pypdfium2 project, including earlier revisions.

pypdfium2 itself is available under the terms and conditions of Apache-2.0 / BSD-3-Clause. Documentation and examples of pypdfium2 are licensed under CC-BY-4.0.

PDFium is available under a BSD-style license that can be found in its LICENSE file. Various other open-source licenses apply to dependencies bundled with PDFium. These also have to be shipped alongside binary redistributions. Copies of identified licenses are provided in LicenseRef-PdfiumThirdParty.txt. There is no guarantee of completeness, and PDFium's dependencies might change over time. Please notify us if you think a relevant license is missing.

pypdfium2 includes SPDX headers in source files. License information for data files is provided in .reuse/dep5 as per the reuse standard.

To the author's knowledge, pypdfium2 is one of the rare Python libraries that are capable of PDF rendering while not being covered by copyleft licenses (such as the GPL).[^liberal_pdf_renderlibs]

[^liberal_pdf_renderlibs]: The only other liberal-licensed PDF rendering libraries known to the author are pdf.js (JavaScript) and Apache PDFBox (Java), but Python binding packages either do not exist yet or are unsatisfactory. However, we wrote some gists that show it would be possible in principle: pdfbox (+ setup), pdfjs.

Issues

While using pypdfium2, you might encounter bugs or missing features. In this case, feel free to open an issue or discussion thread. If applicable, include details such as tracebacks, OS and CPU type, as well as the versions of pypdfium2 and any dependencies used. However, please note our response policy.

Roadmap:

Known limitations

Incompatibility with CPython 3.7.6 and 3.8.1

pypdfium2 built with mainstream ctypesgen cannot be used with releases 3.7.6 and 3.8.1 of the CPython interpreter due to a regression that broke ctypesgen-created string handling code.

Since version 4, pypdfium2 is built with a patched fork of ctypesgen that removes ctypesgen's problematic string code.
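
For builds that still rely on mainstream ctypesgen, a caller could detect the two affected releases up front. The following is a hypothetical guard, not part of pypdfium2; the function name and the set of versions are taken solely from the description above:

```python
import sys

# Releases named above as incompatible with mainstream ctypesgen output
AFFECTED_RELEASES = {(3, 7, 6), (3, 8, 1)}

def is_affected_interpreter(version_info=None):
    """Return True if the given (or running) interpreter release is an affected one."""
    if version_info is None:
        version_info = sys.version_info
    # Compare only (major, minor, micro), ignoring release level and serial
    return tuple(version_info[:3]) in AFFECTED_RELEASES
```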

Risk of unknown object lifetime violations

As outlined in the raw API section, it is essential that Python-managed resources remain available as long as they are needed by PDFium.

The problem is that the Python interpreter may garbage collect objects with reference count zero at any time, so an unreferenced but still required object may either by chance stay around long enough or disappear too soon, resulting in non-deterministic memory issues that are hard to debug. If the timeframe between reaching reference count zero and removal is sufficiently large and roughly consistent across different runs, it is even possible that mistakes regarding object lifetime remain unnoticed for a long time.

Although we intend to develop helpers carefully, it cannot be fully excluded that unknown object lifetime violations are still lurking around somewhere, especially if unexpected requirements were not documented by the time the code was written.
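
The general pattern can be illustrated with ctypes alone, without involving PDFium. In this minimal sketch, the integer address stands in for a pointer retained by a C library, and `BufferOwner` is a hypothetical name. A bare address does not count as a reference, so the owning object has to keep the buffer referenced for as long as the address is in use:

```python
import ctypes

class BufferOwner:
    """Hypothetical holder that keeps a buffer referenced while its address is in use."""

    def __init__(self, data: bytes):
        # The ctypes array owns the memory; holding it as an attribute keeps it alive
        self._buffer = (ctypes.c_char * len(data)).from_buffer_copy(data)
        # A bare address (as a C library would retain it) does NOT keep the buffer alive
        self.address = ctypes.addressof(self._buffer)
        self.size = len(data)

    def read(self) -> bytes:
        # Safe only because self._buffer is still referenced by this object
        return ctypes.string_at(self.address, self.size)

owner = BufferOwner(b"%PDF-1.7")
header = owner.read()
```

Had only the address been stored, the array could be freed as soon as its last reference went away, and reading from the address afterwards would be undefined behaviour.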

Missing raw PDF access

As of this writing, PDFium's public interface does not provide access to the raw PDF data structure (see issue 1694). It does not expose APIs to read/write PDF dictionaries, streams, name/number trees, etc. Instead, it merely offers a predefined set of abstracted functions. This considerably limits the library's potential, compared to other products such as pikepdf.

Limitations of ABI bindings

PDFium's non-public backend would provide extended capabilities, including raw access, but it is not exported into the ABI and is written in C++ (not pure C), so we cannot use it with ctypes. This means it is out of scope for this project.

Also, while ABI bindings tend to be more convenient, they have some technical drawbacks compared to API bindings (see e.g. 1, 2).

Development

Contributions

We may accept contributions, but only if our code quality expectations are met.

Policy:

Long lines

The pypdfium2 codebase does not hard wrap long lines. It is recommended to set up automatic word wrap in your text editor, e.g. VS Code:

editor.wordWrap = bounded
editor.wordWrapColumn = 100

Docs

pypdfium2 provides API documentation using Sphinx, which can be rendered to various formats, including HTML:

sphinx-build -b html ./docs/source ./docs/build/html/
./run build  # short alias

Built docs are primarily hosted on readthedocs.org (RTD). The build may be configured using a .readthedocs.yaml file (see instructions) and the administration page on the web interface. RTD supports hosting multiple versions, so we currently have one linked to the main branch and another to stable. New builds are automatically triggered by a webhook whenever you push to a linked branch.

Additionally, one doc build can be hosted on GitHub Pages. It is implemented with a CI workflow, which is supposed to be triggered automatically on release. This gives us full control over the build environment and the commands used, whereas RTD may be less liberal in this regard.

Testing

pypdfium2 contains a small test suite to verify the library's functionality. It is written with pytest:

./run test

Note that ...

To get code coverage statistics, you may call

./run coverage

Sometimes, it can also be helpful to test code on many PDFs.[^testing_corpora] In this case, the command-line interface and find come in handy:

# Example A: Analyse PDF images (in the current working directory)
find . -name '*.pdf' -exec bash -c "echo \"{}\" && pypdfium2 pageobjects \"{}\" --filter image" \;
# Example B: Parse PDF table of contents
find . -name '*.pdf' -exec bash -c "echo \"{}\" && pypdfium2 toc \"{}\"" \;
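
The same loop can also be written in Python, which avoids nested shell quoting. This is a minimal sketch, assuming the `pypdfium2` executable is on `PATH`; the function names are hypothetical, and only the subcommands and options already shown above are used:

```python
import subprocess
from pathlib import Path

def find_pdfs(root):
    """Recursively collect PDF files below root, sorted for a reproducible order."""
    return sorted(p for p in Path(root).rglob("*.pdf") if p.is_file())

def build_command(pdf_path, subcommand, *extra_args):
    """Assemble the argument list for one CLI call (list form avoids shell quoting)."""
    return ["pypdfium2", subcommand, str(pdf_path), *extra_args]

def run_on_corpus(root, subcommand, *extra_args):
    """Run the CLI on each PDF and return a mapping of path to exit code."""
    results = {}
    for pdf_path in find_pdfs(root):
        print(pdf_path)
        proc = subprocess.run(build_command(pdf_path, subcommand, *extra_args))
        results[pdf_path] = proc.returncode
    return results
```

Under these assumptions, `run_on_corpus(".", "pageobjects", "--filter", "image")` would correspond to Example A and `run_on_corpus(".", "toc")` to Example B. The returned exit codes make it easy to list the files on which a command failed.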

[^testing_corpora]: For instance, one could use the testing corpora of open-source PDF libraries (pdfium, pikepdf/ocrmypdf, mupdf/ghostscript, tika/pdfbox, pdfjs, ...)

Release workflow

The release process is fully automated using Python scripts and scheduled release workflows. You may also trigger the workflow manually using the GitHub Actions panel or the gh command-line tool.

Python release scripts are located in the folder setupsrc/pypdfium2_setup, along with custom setup code:

The autorelease script has some peculiarities maintainers should know about:

In case of necessity, you may also forego autorelease/CI and do the release manually, which will roughly work like this (though ideally it should never be needed):

If something went wrong with commit or tag, you can still revert the changes:

# perform an interactive rebase to change history (substitute $N_COMMITS with the number of commits to drop or modify)
git rebase -i HEAD~$N_COMMITS
git push --force
# delete local tag (substitute $TAGNAME accordingly)
git tag -d $TAGNAME
# delete remote tag
git push --delete origin $TAGNAME

Faulty PyPI releases may be yanked using the web interface.

Prominent Embedders

pypdfium2 is used by prominent embedders such as langchain, nougat, pdfplumber, and doctr.

This results in pypdfium2 being part of a large dependency tree.

Thanks to[^thanks_to]

... and further code contributors (GitHub stats).

If you have somehow contributed to this project but we forgot to mention you here, please let us know.

[^thanks_to]: People listed in this section may not necessarily have contributed any copyrightable code to the repository. Some have rather helped with ideas, or contributions to dependencies of pypdfium2.

History

PDFium

The PDFium code base was originally developed as part of the commercial Foxit SDK, before being acquired and open-sourced by Google, who have maintained PDFium independently ever since, while Foxit continue to develop their SDK closed-source.

pypdfium2

pypdfium2 is the successor of pypdfium and pypdfium-reboot.

The initial pypdfium package was inspired by wowpng, the first known proof-of-concept Python binding to PDFium using ctypesgen. It had to be updated manually, which did not happen frequently. There were no platform-specific wheels, only a single wheel that contained binaries for 64-bit Linux, Windows and macOS.

pypdfium-reboot then added a script to automate binary deployment and bindings generation to simplify regular updates. However, it was still not platform-specific.

pypdfium2 is a full rewrite of pypdfium-reboot to build platform-specific wheels and consolidate the setup scripts. Further additions include ...