[Discussion] PyPDF2 ❤️ pdfrw

MartinThoma commented 2 years ago

I recently became the maintainer of PyPDF2. While it will take me some time to increase test coverage, merge or close PRs, and deal with GitHub tickets (issues), I'm looking forward to a new major release in which I can deprecate some parts.

One thing I'd like to do is change the interface. It starts with simple things: replacing reader.getNumPages() with len(reader), changing the camelCase method names to snake_case, and adding type annotations.
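As a rough illustration of that kind of interface change, here is a minimal sketch. The `Reader` class below is a hypothetical stand-in, not PyPDF2's actual class; it only shows how an old camelCase method could keep working while pointing users to `len(reader)`.

```python
import warnings
from typing import List


class Reader:
    """Hypothetical reader used for illustration; not PyPDF2's real class."""

    def __init__(self, pages: List[str]) -> None:
        self._pages = pages

    def __len__(self) -> int:
        # New, idiomatic way to ask for the page count.
        return len(self._pages)

    def getNumPages(self) -> int:
        # Old camelCase name kept as a thin wrapper that warns on use.
        warnings.warn(
            "getNumPages() is deprecated; use len(reader) instead",
            DeprecationWarning,
            stacklevel=2,
        )
        return len(self)


reader = Reader(["page 1", "page 2", "page 3"])
print(len(reader))
```

Keeping the old name as a warning wrapper for one major release gives callers time to migrate before it is removed.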

I was wondering how big the difference between pdfrw and PyPDF2 is, and how the two are related. Some things look super similar. Does pdfrw have its origins as a fork of PyPDF2?

Maybe it's possible to join efforts. I could imagine merging pdfrw into PyPDF2, creating a shared "base library" for both projects, or sharing parts of the test suite between both projects.

pmaupin commented 2 years ago

Hello, Martin!

I have to admit to neglecting pdfrw lately, but it's possible that at some point I could be reenergized, and there certainly might be some room for collaboration. pdfrw used to be included in several Linux distros, but then I think maybe it wasn't working with later versions of Python 3 (like maybe 3.8?) so they dropped it. Kind of snowballed from there.

pdfrw is my own original work for the most part, although: a) I think I might have taken some compression/decomp code from pypdf? Don't remember, the code should say. I certainly looked at other implementations; and b) I definitely added a couple of methods to make pdfrw compatible with pypdf for really simple things.

Having said that, and with the caveat that I haven't looked at either package in a while, I will say that from what I remember (back when I was working on this):

1) pdfrw operates at a much lower level than pypdf, e.g.
2) pypdf is much more full featured, with, e.g. lots and lots of convenience methods, and
3) pdfrw is much faster than pypdf

If these things are still true (you could certainly try routing a couple of pdfs through each, and let me know if you have issues or what the heck the later python problem is, I dunno), and if the pypdf architecture would allow this, then it might (bearing in mind that I'm pulling all this out of my ass right now) make sense to use some of pdfrw as the underlying base of pypdf.

I probably won't have much time over the next few months, but I am always open to answering questions, and if you wanted to, for example merge the projects somehow, I don't have a real problem with that, other than that I would like the core functionality to remain clean and fast, and (from the outside looking in) it looks like a lot of code was merged into pypdf in a topsy-turvy fashion that emphasized getting something working over keeping it clean, maintainable, and fast.

Good luck and best regards, Pat

On Sat, Apr 9, 2022 at 9:15 AM Martin Thoma @.***> wrote:


abubelinha commented 1 year ago

3) pdfrw is much faster than pypdf

I can confirm this. I was using pypdf to extract some pages of a big pdf to create smaller files.

When I found this issue I wanted to check the speed difference, and followed this pdfrw example in order to compare them: https://github.com/pmaupin/pdfrw/blob/master/examples/subset.py
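A comparison like this can be timed with a small harness. The sketch below is not the script used in this thread: the backends are placeholder functions that never touch a PDF, and names such as `measure`, `compare`, `dedup_stub` and `copy_stub` are made up for illustration. A real run would replace the stubs with functions that subset pages using pdfrw and pypdf and return the written bytes.

```python
import time
from typing import Callable, Dict, List, Tuple

# A backend takes a list of page numbers and returns the output file's bytes.
Backend = Callable[[List[int]], bytes]


def measure(backend: Backend, pages: List[int]) -> Tuple[float, float]:
    """Return (output size in KB, elapsed seconds) for one backend run."""
    start = time.perf_counter()
    data = backend(pages)
    elapsed = time.perf_counter() - start
    return len(data) / 1024, elapsed


def compare(backends: Dict[str, Backend], pages: List[int]) -> Dict[str, Tuple[float, float]]:
    """Run every backend on the same page list."""
    return {name: measure(fn, pages) for name, fn in backends.items()}


# Placeholder backends (assumptions, not real library calls).
def dedup_stub(pages: List[int]) -> bytes:
    # Pretends each distinct page costs 1 KB.
    return b"x" * 1024 * len(set(pages))


def copy_stub(pages: List[int]) -> bytes:
    # Pretends each requested page costs 1 KB, repeats included.
    return b"x" * 1024 * len(pages)


results = compare({"dedup_stub": dedup_stub, "copy_stub": copy_stub}, [4, 6, 8, 9] * 2)
for name, (kb, seconds) in results.items():
    print(f"{name}: {kb:.2f} KB output size, took {seconds:.3f} seconds")
```

Passing the same page list to every backend keeps the runs comparable, and `time.perf_counter()` is used because it is intended for measuring short intervals.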

I adapted it to just pass a list of page numbers to subset from a given pdf with 340 pages (EDIT: 58 MB scanned book, 175 KB/page on average). I noticed the following:

speed difference was huge: pdfrw was many times faster than pypdf in my first tests ... and then I realized the difference increases with the number of extracted pages: pdfrw's run time is barely affected by the number of pages, whereas pypdf's run time (in seconds) increases a lot.

--------------------------------------------------
4 DIFFERENT PAGES (repeated N=1, 2 or 3 times):
N only affects pypdf in both time and output size
--------------------------------------------------
N=1, pageslist: [4, 6, 8, 9]
pdfrw: 1478.83 KB output size, took 1.61 seconds
pypdf: 1486.26 KB output size, took 6.684 seconds
pypdf_time / pdfrw_time = 4.15 ratio
--------------------------------------------------
N=2, pageslist: [4, 6, 8, 9, 4, 6, 8, 9]
pdfrw: 1479.90 KB output size, took 1.146 seconds
pypdf: 2972.23 KB output size, took 13.163 seconds
pypdf_time / pdfrw_time = 11.49 ratio
--------------------------------------------------
N=3, pageslist: [4, 6, 8, 9, 4, 6, 8, 9, 4, 6, 8, 9]
pdfrw: 1480.97 KB output size, took 1.127 seconds
pypdf: 4458.29 KB output size, took 19.644 seconds
pypdf_time / pdfrw_time = 17.43 ratio
--------------------------------------------------
NOW 8 DIFFERENT PAGES (repeated N=1, 2 or 3 times):
Different pages number only affects pdfrw output size, but not its speed
Pages number (no matter they are repeated or different) affects pypdf in both time and output size
--------------------------------------------------
N=1, pageslist: [4, 6, 8, 9, 193, 1, 16, 18]
pdfrw: 2774.33 KB output size, took 1.691 seconds
pypdf: 2790.06 KB output size, took 13.073 seconds
pypdf_time / pdfrw_time = 7.73 ratio
--------------------------------------------------
N=2, pageslist: [4, 6, 8, 9, 193, 1, 16, 18, 4, 6, 8, 9, 193, 1, 16, 18]
pdfrw: 2776.51 KB output size, took 1.181 seconds
pypdf: 5580.01 KB output size, took 26.387 seconds
pypdf_time / pdfrw_time = 22.34 ratio
--------------------------------------------------
N=3, pageslist: [4, 6, 8, 9, 193, 1, 16, 18, 4, 6, 8, 9, 193, 1, 16, 18, 4, 6, 8, 9, 193, 1, 16, 18]
pdfrw: 2778.69 KB output size, took 1.171 seconds
pypdf: 8369.96 KB output size, took 39.936 seconds
pypdf_time / pdfrw_time = 34.1 ratio
--------------------------------------------------

memory consumption is much higher in pypdf; e.g., if I do N=5 it eats all my RAM (whereas pdfrw is not affected at all)
also, the pdf files generated by pdfrw are much smaller: in particular, repeated pages do not affect pdfrw's output at all, whereas pypdf multiplies the output file size and script time
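The scaling in the 4-page runs can be checked with a little arithmetic. The sketch below only re-uses the numbers from the log above; the function names are made up for illustration.

```python
from typing import List, Tuple

# (N, pdfrw KB, pdfrw seconds, pypdf KB, pypdf seconds), copied from the 4-page runs.
RUNS: List[Tuple[int, float, float, float, float]] = [
    (1, 1478.83, 1.61, 1486.26, 6.684),
    (2, 1479.90, 1.146, 2972.23, 13.163),
    (3, 1480.97, 1.127, 4458.29, 19.644),
]


def time_ratios(runs) -> List[float]:
    """pypdf time divided by pdfrw time for each N."""
    return [round(py_s / rw_s, 2) for _, _, rw_s, _, py_s in runs]


def size_growth(runs, column: int) -> List[float]:
    """Output size for each N relative to the N=1 run (column 1 = pdfrw, 3 = pypdf)."""
    base = runs[0][column]
    return [round(run[column] / base, 2) for run in runs]


print("time ratios:      ", time_ratios(RUNS))
print("pdfrw size growth:", size_growth(RUNS, 1))
print("pypdf size growth:", size_growth(RUNS, 3))
```

The pypdf output size grows roughly in proportion to N, while the pdfrw output size stays essentially constant, which matches the observations in the list above.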

@MartinThoma the difference is so amazing that I'd say there is something wrong with pypdf's memory usage. I tested this on Windows 7, Python 3.8.

Regards @abubelinha

EDIT: a more recent version of the test is now posted here: https://github.com/py-pdf/benchmarks/issues/7

pmaupin / pdfrw

[Discussion] PyPDF2 ❤️ pdfrw #232