pypdfium2-team / pypdfium2

Python bindings to PDFium
https://pypdfium2.readthedocs.io/

Expanded examples of working with the library beyond rendering to images? #49

Closed matthewlenz closed 2 years ago

matthewlenz commented 2 years ago

I deal with a lot of PDFs of varying quality, and from what I can see PDFium/pypdfium2 seems to handle PDFs that other open source renderers (ghostscript, poppler, etc.) cannot. I've been looking at the old Foxit SDK docs, and it appears that (assuming it wasn't removed from the project when Google purchased it) it's possible to use PDFium to do things like import pages from one or more PDFs into a new PDF (PDF merging) and import images into a new PDF. Are you able to confirm whether that is possible with pypdfium2, and would you be interested in receiving a bounty to provide examples in pypdfium2? I'd much rather contribute funds to improve your project than seek out a closed source/commercial solution.

adam-huganir commented 2 years ago

@matthewlenz In theory, what could be added to this library is almost limitless within the constraints of what Chrome can do, so there is definitely a lot of room for new features.

@mara004 Not sure what your time commitment is for this project. I most likely won't have time to work on big features myself, but I would love to use them :)

matthewlenz commented 2 years ago

Even lower-level API examples like the "Using the PDFium API" example in the README.md would meet my needs. I'm not looking for new helpers (i.e. render_pdf, render_page, etc.).

mara004 commented 2 years ago

For some background, my primary use case and the reason why I maintain this library is rendering PDFs (and reading the table of contents). However, you are right that PDFium has a lot more capabilities; I just haven't really looked into them yet.

If you wish to see more examples of low-level API use, I would suggest taking a look at the source code of pdfbrain, kuafu and Extract-URLs. These are applications that access the low-level PDFium API with ctypes, though they are still using the predecessor of this project. Someone also reported to me that they search text in PDFs using PyPDFium2, for instance.

In theory, you should be able to do anything with these bindings that PDFium can do, but ctypes may be quite cumbersome, so functions with complex arguments or callbacks are probably hard to use. I haven't seen an example of merging PDFs with PyPDFium yet, but it may be possible. If you can point me at the corresponding section in the PDFium documentation, or better an example of use in C, I'll experiment with it. As far as I can tell, Google haven't removed any functionality compared to the PDFium variant distributed by Foxit.

However, I'd like to note that libraries such as pikepdf (or PyMuPDF, if you don't mind the AGPL) may be more suitable and easier to use for content transformation tasks like merging PDFs. I'm working with pikepdf privately and can highly recommend it as a well-maintained, feature rich, stable and widely used library with a pythonic API.

Concerning funding, I'm working on this only as a hobby project. Currently, I rather need time and interest than money.

Regarding your issues with ghostscript and poppler, this sounds like problems with specific PDFs. I'd recommend filing bug reports with these projects in case you haven't done so yet. For almost any PDF library, there are a few document structures that cause issues. PDFium is affected by that as well: there are plenty of examples in its bug tracker.

mara004 commented 2 years ago

I searched the PDFium source code a bit, and it looks like there are some functions for merging PDFs defined in fpdf_ppo.h, e.g. FPDF_ImportPagesByIndex(). The most complex part will be saving the document, though. There is a function FPDF_SaveAsCopy(), but it requires an interface for file writing access, which means you'd have to set up an FPDF_FILEWRITE structure and implement a callback function...
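
To make the callback requirement more concrete, here is a minimal sketch of the general pattern using plain ctypes, without PDFium. `DemoFileWrite` and `make_writer` are hypothetical names invented for this illustration; the struct only mimics the shape of FPDF_FILEWRITE (a version field plus a WriteBlock function pointer), and the final call merely simulates what a C library would do when it hands a chunk of bytes to the callback.

```python
import ctypes
import io

# Hypothetical stand-in for a struct like FPDF_FILEWRITE (illustration only).
class DemoFileWrite(ctypes.Structure):
    pass

# Callback signature: returns non-zero on success.
WriteBlockType = ctypes.CFUNCTYPE(
    ctypes.c_int,                   # return value
    ctypes.POINTER(DemoFileWrite),  # pointer to the struct itself
    ctypes.c_void_p,                # pointer to the data to write
    ctypes.c_ulong,                 # number of bytes to write
)

# The fields are set after the class exists so the callback type can refer to it.
DemoFileWrite._fields_ = [
    ("version", ctypes.c_int),
    ("WriteBlock", WriteBlockType),
]

def make_writer(buffer):
    """Wrap a Python file-like object in a C-callable write function."""
    def write_block(_filewrite, data, size):
        # data arrives as a raw address; copy `size` bytes out of it.
        buffer.write(ctypes.string_at(data, size))
        return 1
    return WriteBlockType(write_block)

output = io.BytesIO()
filewrite = DemoFileWrite()
filewrite.version = 1
callback = make_writer(output)  # keep a reference while the struct is in use
filewrite.WriteBlock = callback

# Simulate the library invoking the callback with a chunk of bytes.
chunk = ctypes.create_string_buffer(b"%PDF-1.7\n")
filewrite.WriteBlock(ctypes.byref(filewrite), ctypes.cast(chunk, ctypes.c_void_p), 9)
```

With the real bindings, the struct type would come from pypdfium2 and the library itself would invoke the callback, possibly many times per save.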

matthewlenz commented 2 years ago

Thanks for the feedback. I'll give pikepdf a try to see what kind of results it can produce with my sample data.

mara004 commented 2 years ago

For what it's worth, I've just created a basic example on how to merge PDFs with pypdfium2. I have no idea what I'm doing but it appears to work. I'll do some more testing tomorrow and then add it to the examples.

#! /usr/bin/env python3
# SPDX-FileCopyrightText: 2022 geisserml <geisserml@gmail.com>
# SPDX-License-Identifier: Apache-2.0 OR BSD-3-Clause

import ctypes
import argparse
import pypdfium2 as pdfium

def _merge_pdfs(pdf_documents):

    dest_doc = pdfium.FPDF_CreateNewDocument()

    # Each document is inserted at index 0, so iterate in reverse to keep the input order.
    for src_doc in reversed(pdf_documents):
        page_count = pdfium.FPDF_GetPageCount(src_doc)
        # Build a C int array holding every page index of the source document.
        IntArray = ctypes.c_int * page_count
        page_indices = IntArray(*range(page_count))
        pdfium.FPDF_ImportPagesByIndex(dest_doc, src_doc, page_indices, page_count, 0)

    return dest_doc

class _WriteBlock:

    def __init__(self, file_handle):
        self.file_handle = file_handle

    def __call__(self, _filewrite, data, size):
        buffer = ctypes.cast(data, ctypes.POINTER(ctypes.c_ubyte * size))
        self.file_handle.write(buffer.contents)
        return 1

def _save_pdf(pdf_document, output_path):

    with open(output_path, 'wb') as file_handle:

        filewrite = pdfium.FPDF_FILEWRITE()
        filewrite.version = 1  # the FPDF_FILEWRITE interface version must be 1
        WriteFunctionType = ctypes.CFUNCTYPE(pdfium.FPDF_BOOL, ctypes.POINTER(pdfium.FPDF_FILEWRITE), ctypes.POINTER(None), ctypes.c_ulong)
        filewrite.WriteBlock = WriteFunctionType(_WriteBlock(file_handle))

        # The merged document is newly created, so write it in full rather than incrementally.
        pdfium.FPDF_SaveAsCopy(pdf_document, ctypes.byref(filewrite), pdfium.FPDF_NO_INCREMENTAL)

def parse_args():
    parser = argparse.ArgumentParser(
        description = "Merge PDF files with PyPDFium2.",
    )
    parser.add_argument(
        'input_paths',
        nargs = '+',
    )
    parser.add_argument(
        '--output-path', '-o',
        required = True,
    )
    return parser.parse_args()

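def main(input_paths, output_path):
    pdfs = [pdfium.open_pdf(inp) for inp in input_paths]
    merged_doc = _merge_pdfs(pdfs)
    _save_pdf(merged_doc, output_path)
    # Close the merged document as well as the inputs to release their resources.
    pdfium.FPDF_CloseDocument(merged_doc)
    for pdf in pdfs:
        pdfium.close_pdf(pdf)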

if __name__ == '__main__':
    args = parse_args()
    main(
        input_paths = args.input_paths,
        output_path = args.output_path,
    )
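
The merge function in the script iterates over the inputs in reverse while always inserting at index 0. As a rough illustration of why that preserves the original order, here is a pure-Python simulation in which lists of page labels stand in for documents. The helper names `import_pages` and `merge_in_order` are hypothetical, and the sketch assumes that FPDF_ImportPagesByIndex() inserts the imported pages as one contiguous block at the given index.

```python
def import_pages(dest, src_pages, index):
    """Mimic inserting a block of pages into dest at the given index."""
    dest[index:index] = src_pages

def merge_in_order(documents):
    """Insert each document at the front, starting with the last one."""
    dest = []
    for pages in reversed(documents):
        import_pages(dest, pages, 0)
    return dest

docs = [["a1", "a2"], ["b1"], ["c1", "c2"]]
merged = merge_in_order(docs)
```

Iterating forwards and inserting at the current end of the destination would be an equivalent design; it just requires tracking the running page count.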

mara004 commented 2 years ago

Now I come to think about it, I'd actually like to add this as a support model, at least the PDF saving part. Thanks for motivating me to look into this.

mara004 commented 2 years ago

25a54b8

mara004 commented 2 years ago

I also added an example for n-up compositing to the tests that might be interesting:


import os
import ctypes
from os.path import join
import pypdfium2 as pdfium
# TestFiles and OutputDir are provided by the project's test suite.

def test_save_pdf_tofile():

    src_pdf = pdfium.open_pdf(TestFiles.multipage)

    # perform n-up compositing
    dest_pdf = pdfium.FPDF_ImportNPagesToOne(
        src_pdf,
        ctypes.c_float(1190),  # width
        ctypes.c_float(1684),  # height
        ctypes.c_size_t(2),    # number of horizontal pages
        ctypes.c_size_t(2),    # number of vertical pages
    )

    output_path = join(OutputDir, 'n-up.pdf')
    with open(output_path, 'wb') as file_handle:
        pdfium.save_pdf(dest_pdf, file_handle)

    for pdf in (src_pdf, dest_pdf):
        pdfium.close_pdf(pdf)

    assert os.path.isfile(output_path)
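
The output size of 1190 x 1684 in this test matches a 2 x 2 grid of A4 pages (roughly 595 x 842 PDF points each). Whether that was the intended reasoning is an assumption on my part, but the arithmetic can be sketched as follows. `nup_sheet_size` and `sheets_needed` are hypothetical helpers, not part of pypdfium2.

```python
import math

# Approximate A4 dimensions in PDF points (1/72 inch), rounded to integers.
A4_WIDTH_PT = 595
A4_HEIGHT_PT = 842

def nup_sheet_size(page_width, page_height, columns, rows):
    """Sheet size that fits a columns x rows grid of unscaled pages."""
    return page_width * columns, page_height * rows

def sheets_needed(page_count, columns, rows):
    """Number of output sheets when each holds columns * rows source pages."""
    return math.ceil(page_count / (columns * rows))

sheet = nup_sheet_size(A4_WIDTH_PT, A4_HEIGHT_PT, 2, 2)
```

Passing a smaller output size to the same grid should shrink each source page to fit its cell, though I have not verified the exact scaling behaviour.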

mara004 commented 2 years ago

@matthewlenz Not sure if you are still interested in pypdfium2, but I recently rewrote the Readme. It now contains a new section on how to use the raw API, listing all common tasks (arrays, pointers, casting, string buffers, callbacks, data transfer, ...) and things one needs to be careful about (object lifetime). Also, the support model is now much better and more extensive than it was back in January, so its source code also provides a lot of examples.
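
For readers who have not used ctypes before, here is a small sketch of three of the tasks mentioned above (arrays, pointer casting, string buffers), written against plain ctypes only. It is not taken from the Readme. The UTF-16LE part assumes the usual PDFium convention of returning text as NUL-terminated UTF-16LE, and the buffer here is filled by hand to simulate such a result.

```python
import ctypes

# Arrays: build a C int array from a Python range.
page_count = 4
IntArray = ctypes.c_int * page_count
indices = IntArray(*range(page_count))

# Pointers and casting: view a raw byte buffer through a typed array pointer.
raw = ctypes.create_string_buffer(bytes([1, 2, 3, 4]), 4)
as_ubytes = ctypes.cast(raw, ctypes.POINTER(ctypes.c_ubyte * 4)).contents
values = list(as_ubytes)

# String buffers: a C API typically fills a caller-allocated buffer.
# Simulate a UTF-16LE result with a two-byte NUL terminator.
encoded = "Title".encode("utf-16-le") + b"\x00\x00"
text_buffer = ctypes.create_string_buffer(encoded, len(encoded))
decoded = text_buffer.raw[:-2].decode("utf-16-le")  # strip the terminator
```

Regarding object lifetime, the main point to remember is that ctypes objects passed to C (buffers, callbacks) must stay referenced on the Python side for as long as the C code may access them.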

matthewlenz commented 2 years ago

Very cool. Thanks!

mara004 commented 1 year ago

Just writing down my thoughts because it once came up in this thread: My personal situation has changed slightly, and I am now considering taking donations. However, I'm not quite sure how to set that up, so it may take some time.