Open stefan6419846 opened 1 year ago
Hi Stefan, I am interested in volunteering to get this done too. The outline you propose looks like a good plan. Are you the owner of the pdfrw organization?
cc/ @sarnold
We should aim to get the test suite running in CI too.
> Are you the owner of the pdfrw organization?
Yes, I am. For the time being, it is just a placeholder.
> We should aim to get the test suite running in CI too.
It mostly does in this repository. Some tests have been disabled, although they should be safe to re-enable as there is no visual difference (the output changed due to internal changes in reportlab, for example).
Pinging @t-houssian and @Lucas-C who might be interested in this too.
Thanks for the ping :)
@MartinThoma may also be interested in the subject: he is the maintainer of https://github.com/py-pdf/pypdf (formerly PyPDF2), and "currently helping to clean up the Python PDF ecosystem", to quote one of his recent emails 😊
Maybe the best option would be to maintain this package inside the https://github.com/py-pdf organization, which is already active?
> Maybe the best option would be to maintain this package inside the https://github.com/py-pdf organization, which is already active?
I think that's a great idea. It could bring greater visibility and increased collaboration between projects.
@federicobond Thanks for the ping as well! I sadly don't have the time currently to help out much on this but do think that what @Lucas-C has mentioned is a great idea. I used this fork in a project of mine called fillpdf (https://github.com/t-houssian/fillpdf). I released this fork because it was the best I could find to use in my project.
I created that fillpdf project because of how hard it was to work in the current pdf filling libraries so I think any clean up and making things more user friendly would be awesome. Feel free as well to add fillpdf to the ecosystem and use any of the code from it.
Best of luck y'all!
Thanks for pinging me :hugs: Yes, I've spent quite some time since April 2022 merging PyPDF2 back into pypdf + setting up CI/tests + docs + merging over 100 PRs + fixing several hundred issues. Now we have at least one other super active pypdf developer again and I hope that PyPDF3 / PyPDF4 developers and users will move back to pypdf :crossed_fingers:
It seems to me that pdfrw is solving a sub-set of the problems that pypdf is solving. For this reason I would love if the two projects (and especially the developers around them) could converge. I approached Patrick Maupin in April 2022. Sadly I don't know pdfrw well enough to really judge if merging the two would be reasonably possible.
I was thinking that we might be able to define a "pypdf-core" which is similar to pdfrw, but nobody did any work in that direction so far. I'm also uncertain about which use-cases current pdfrw users actually have. Looking at SO, I'd rather recommend them to use pypdf.
Another activity besides blog posts + answering questions is nudging the fpdf / fpdf2 people to make their relationship clear to the community. @Lucas-C and I recently received a super nice e-mail from the original author; I'm hopeful here :tada:
I'd be open to moving pdfrw into the py-pdf GitHub organization. I would love an exchange between PDF-related projects / developers and sharing of issues/solutions/test cases.
The official git seems to be https://github.com/pmaupin/pdfrw (1700 stars) whereas https://github.com/sarnold/pdfrw only has 24 stars. I'm interested in bringing the Python-PDF communities closer together, not in fracturing the communities even more. So I'd rather not move https://github.com/sarnold/pdfrw into py-pdf at this stage.
What does https://github.com/pdfrw do? I don't see anything in there.
Thank you for your input @MartinThoma, very appreciated!
> It seems to me that pdfrw is solving a sub-set of the problems that pypdf is solving. For this reason I would love if the two projects (and especially the developers around them) could converge.
I agree this would be very desirable for the ecosystem.
> I'm also uncertain about which use-cases current pdfrw users actually have. Looking at SO, I'd rather recommend them to use pypdf.
I can add my 2 cents here: we began using PyPDF a few years ago at our company to include a stamp on each page of some files that are uploaded to our system. Its performance was pretty bad: it took whole seconds and consumed quite a bit of memory to process moderately long files. We ended up switching to pdfrw and saw a huge improvement. This may no longer hold, but pdfrw worked well enough for us and was easy enough to debug that we have stayed with it since.
> I would love an exchange between PDF-related projects / developers and sharing of issues/solutions/test cases.
That would be awesome! Also increasing the bus factor for these projects.
> The official git seems to be https://github.com/pmaupin/pdfrw (1700 stars) whereas https://github.com/sarnold/pdfrw only has 24 stars.
I believe sarnold's fork is just pmaupin master + some small fixes/improvements, most of which we would need to land into master eventually (someone correct me if I'm wrong). Other than that, the projects haven't really diverged.
> What does https://github.com/pdfrw do? I don't see anything in there.
I believe it's just @stefan6419846 squatting the name in case it was going to be used.
> I would love an exchange between PDF-related projects / developers and sharing of issues/solutions/test cases.
> That would be awesome! Also increasing the bus factor for these projects.
I totally agree! 😊
In fact, maybe we could consider merging https://github.com/PyFPDF (which is mostly fpdf2) into https://github.com/py-pdf?
I'm all for joining efforts, and I'd be happy to help on other PDF libraries!
Would you be open to this @MartinThoma? This is not the main topic here, but I'm using the opportunity to drop this idea 😋
Also, maybe at some point the org should have a code of conduct & some project management guidelines? I'm thinking about some basic directions on how to handle issues, reviews, releases, etc.
(edit:) I see that the only public member of the py-pdf org is Matthew Peveler: https://github.com/orgs/py-pdf/people
Are you not a member of the org, @MartinThoma?
Having public org membership, and being able to know clearly who has the rights to release new versions seems important to me 😊
> What does https://github.com/pdfrw do? I don't see anything in there.
> I believe it's just @stefan6419846 squatting the name in case it was going to be used.
This is correct. I just created this organization to block the name when thinking about the future of the project and creating this issue as well. As responses have been quite sparse until yesterday (with my e-mails to Patrick and Steve being unanswered for nearly two months now as well), I have not taken this further yet.
I am open to moving this to the aforementioned py-pdf organization nevertheless.
> I was thinking that we might be able to define a "pypdf-core" which is similar to pdfrw, but nobody did any work in that direction so far. I'm also uncertain about which use-cases current pdfrw users actually have. Looking at SO, I'd rather recommend them to use pypdf.
Speaking of my use-case: I mostly use pdfrw for working with PDF forms.
> In fact, maybe we could consider merging https://github.com/PyFPDF (which is mostly fpdf2) into https://github.com/py-pdf? I'm all for joining efforts, and I'd be happy to help on other PDF libraries!
Sounds awesome to me! We should talk about permissions/expectations beforehand, though. I would suggest that you open an issue/discussion in https://github.com/PyFPDF/fpdf2 to discuss this :-)
The two roles I can give are:
I would make you @Lucas-C an owner of py-pdf, but would appreciate if we had a discussion before adding new owners (for members, I don't care too much)
Although owners have all permissions on all repositories, I would expect them/me not to interfere with them except if the repository's maintainer(s) are inactive for a long time (e.g. 3 months?) or if something security-critical happens (e.g. a dependency was introduced that is malicious/typo-squatting). As both pypdf and fpdf are pretty big, we should write such things down within py-pdf (maybe make a GitHub page at https://py-pdf.github.io/ )
> [pypdf's] performance was pretty bad: it took whole seconds and consumed quite a bit of memory to process moderately long files. We ended up switching to pdfrw and saw a huge improvement.
I've heard that before :thinking: When I have some time I need to create benchmarks + investigate that :detective:
> I would make you @Lucas-C an owner of py-pdf, but would appreciate if we had a discussion before adding new owners (for members, I don't care too much)
> Although owners have all permissions on all repositories, I would expect them/me not to interfere with them except if the repository's maintainer(s) are inactive for a long time (e.g. 3 months?) or if something security-critical happens (e.g. a dependency was introduced that is malicious/typo-squatting). As both pypdf and fpdf are pretty big, we should write such things down within py-pdf (maybe make a GitHub page at py-pdf.github.io )
Sounds great to me! 😊 I'll open this issue during the week, when I have some time available.
Hey, I'm just a user, but I know how hard it is to keep a project going, so from a user perspective: do what you got to do! Also: thank you for your continued work. It is appreciated.
I'm so happy this is moving along! 😄
As for pdfrw, should we wait until @Lucas-C becomes a py-pdf owner to discuss next steps?
Hi!
I described how I plan for fpdf2 to migrate to @py-pdf in this announcement: https://github.com/PyFPDF/fpdf2/discussions/752
I'd be happy to get feedback from you all 😊
I am not a developer, just a pypdf user.
I compared pdfrw and pypdf for extracting pages from a big pdf into smaller files.
pdfrw was the clear winner (much less time used; also better output file size optimization when stuff was repeated, I think).
Unfortunately I just posted my test output but later on modified my script and didn't keep my comparison code.
So certainly there might be errors in the way I coded my pypdf extraction test, but I think you guys might look further into this.
Thanks for your amazing job @abubelinha
This month we discovered+fixed a couple of issues that affect file size ( https://github.com/py-pdf/pypdf/pull/1926 , https://github.com/py-pdf/pypdf/pull/1906 ). If you can come up with a nice comparison script or a good test scenario, I could add it to https://github.com/py-pdf/benchmarks
I'm all for an open and fair assessment of the qualities of different libraries. This benchmark allowed us to improve the text extraction quality of pypdf a lot. Maybe we can do something similar for other workflows / operations.
edit: Recently I've been spending less time on open source. If you make a PR to https://github.com/py-pdf/benchmarks that might help :sweat_smile:
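For anyone who wants to contribute such a comparison script, a minimal timing harness might look like the sketch below. It is only an outline under stated assumptions: the `benchmark` helper and the two placeholder workloads are hypothetical and do not come from py-pdf/benchmarks; a real comparison would replace the lambdas with calls into each library's page-extraction code.

```python
import statistics
import time
from typing import Callable, Dict


def benchmark(tasks: Dict[str, Callable[[], object]], repeats: int = 5) -> Dict[str, float]:
    """Return the median wall-clock time in seconds for each named task."""
    results = {}
    for name, task in tasks.items():
        samples = []
        for _ in range(repeats):
            start = time.perf_counter()
            task()
            samples.append(time.perf_counter() - start)
        # The median is less sensitive to one-off slow runs than the mean.
        results[name] = statistics.median(samples)
    return results


# Placeholder workloads (assumption): swap in the real library calls here.
tasks = {
    "library_a": lambda: sum(range(10_000)),
    "library_b": lambda: sum(range(100_000)),
}

for name, seconds in benchmark(tasks).items():
    print(f"{name}: {seconds * 1000:.3f} ms")
```

Running every task several times and reporting the median keeps a single slow run (disk cache miss, background load) from skewing the comparison.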
EDIT: @MartinThoma I am not sure if your last post was an answer to mine or a general comment
As I said, I am not a developer. I do not use git, so PRs are pretty unknown to me. But I was able to remember and reproduce that test and posted the code here: https://github.com/py-pdf/benchmarks/issues/7
Thank you for clarifying and for sharing your benchmarking code. I will eventually add the idea to https://github.com/py-pdf/benchmarks . It might just take some time (and I will list you as a co-author of that PR, so you get credit for it :-) )
@t-houssian @Lucas-C @MartinThoma what is the status of moving this repo to the py-pdf org? I found a fix for one of the bugs in #17 and would like to add it to the project. I do not want to fragment pdfrw further by adding another unmaintained fork.
@sarnold is this project still maintained or archived?
> @t-houssian @Lucas-C @MartinThoma what is the status of moving this repo to the py-pdf org?
Good question.
I have just moved fpdf2 to the py-pdf GitHub org.
I'll be away for a few days, but when I'm back I volunteer to set up a py-pdf/pdfrw repository, based on this fork, with maybe extra commits from https://github.com/PyFPDF/pdfrw, and a GitHub Actions pipeline running tests.
Would you agree with this suggestion @MartinThoma & @MasterOdin?
Makes sense, happy to help get the GH Actions pipelines set up.
This fork already has GitHub actions set up, so this part should be relatively easy in theory.
Nevertheless, some of the tests have apparently been disabled for now and might need further evaluation: https://github.com/sarnold/pdfrw/commits/master/tests/expected.txt . I did some research some months ago about the actual differences in the PDF files (related to the more recent reportlab package etc.) and as far as I remember, most of the (visual) results were practically identical (I am currently on vacation and thus have no access to my experimental code).
Just for the record: Valid reference files generated by Python 3.5 (and partly Python 3.6) might be downloaded from the artifacts at https://github.com/stefan6419846/pdfrw_reference_python36/actions/
> Would you agree with [having pdfrw in the py-pdf organization]?
I want a healthy Python / PDF ecosystem and I want to avoid having lots of small projects with tons of overlap.
Maintenance:
Unique Selling Point: pdfrw can make modifications to PDF files, similar to pypdf. However, pdfrw is a lot faster. Besides the speed, I don't know of a single feature that pdfrw supports which pypdf does not.
Community: As it has big overlaps with pypdf, I take it as a comparison
Maintainer support for project transfer:
Summary
I'm uncertain. I think pdfrw must have some very good ideas regarding parsing of PDFs built-in. However, I don't see a single feature that pdfrw supports and pypdf doesn't. I'm also not certain how good the community support of pdfrw/ pdfrw2 is and if we could maintain it well.
Given those first impressions, I think I'd rather try to improve pypdf with ideas from pdfrw + help the community make a switch than move pdfrw to py-pdf.
@Lucas-C Does fpdf2 use pdfrw2? If that is the case, I can see an inherent interest of you to take care of pdfrw. If you want to take care of it then, I'd be ok with it :-)
However, we should try to get some option to release a new version on PyPI. I'm currently observing how this does not work well with camelot-py :smiling_face_with_tear:
I completely forgot this: https://github.com/pmaupin/pdfrw/issues/232
If pdfrw is the basis of many other projects, I'd also say it would fit well into py-pdf.
More download stats:
https://pypistats.org/packages/pdfrw - 7% still use python 2 😱
https://pypistats.org/packages/pdfrw about 4% of python 2 users
> Maintainer support for project transfer
I wrote to both Patrick and Steve in February when I initially opened this issue to get their opinion about an organization-based approach and the future maintenance in general, but never received any public or private response from them. There might be different reasons for this.
I tried to contact Patrick Maupin 5 days ago, but haven't received a response so far. I would wait 2 weeks in total.
If somebody wants to take on the work of maintaining pdfrw, we could do the following:
> @Lucas-C Does fpdf2 use pdfrw2? If that is the case, I can see an inherent interest of you to take care of pdfrw. If you want to take care of it then, I'd be ok with it :-)
No, fpdf2 does not rely on pdfrw.
fpdf2 just has a documentation page on how to combine both libs: https://py-pdf.github.io/fpdf2/CombineWithPdfrw.html
All things considered, I'm not particularly interested in maintaining pdfrw and agree that it is probably better to focus on pypdf as a replacement.
I think I'm even going to get rid of that Combine with pdfrw page in fpdf2 documentation.
I made a quick performance comparison between pdfrw & pypdf for 2 specific use cases that we have in fpdf2 documentation:
Those are the execution times of running those scripts on my computer, using a 4.8MB base PDF document with 47 pages:
$ time ./add_on_page_with_pdfrw.py
real 0m1,649s
$ time ./add_on_page_with_pypdf.py
real 0m32,082s
$ time ./add_new_page_with_pdfrw.py
real 0m2,769s
$ time ./add_new_page_with_pypdf.py
real 0m47,247s
Based on those results, pdfrw can be 20 times faster than pypdf for those use cases!
To me, this seems like a severe limitation of pypdf 😢
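As a quick sanity check on the "20 times" figure, the ratios can be recomputed from the `time` output quoted above. The snippet below is only a small arithmetic sketch; the `to_seconds` and `speedups` helpers are hypothetical names introduced here, and the only data used are the four `real` values from this comment.

```python
import re

# Wall-clock values copied from the `time` output above
# (the comma is a locale-specific decimal separator).
TIMINGS = {
    "add_on_page": {"pdfrw": "0m1,649s", "pypdf": "0m32,082s"},
    "add_new_page": {"pdfrw": "0m2,769s", "pypdf": "0m47,247s"},
}


def to_seconds(real: str) -> float:
    """Convert a bash `time` value such as '0m32,082s' to seconds."""
    match = re.fullmatch(r"(\d+)m(\d+)[.,](\d+)s", real)
    if match is None:
        raise ValueError(f"unrecognised time format: {real!r}")
    minutes, whole, frac = match.groups()
    return int(minutes) * 60 + float(f"{whole}.{frac}")


def speedups(timings):
    """Map each use case to the pypdf/pdfrw wall-clock ratio."""
    return {
        case: to_seconds(t["pypdf"]) / to_seconds(t["pdfrw"])
        for case, t in timings.items()
    }


for case, ratio in speedups(TIMINGS).items():
    print(f"{case}: pdfrw is about {ratio:.1f}x faster")
# add_on_page: pdfrw is about 19.5x faster
# add_new_page: pdfrw is about 17.1x faster
```

So on this machine and this 47-page document, the measured gap is roughly 17x to 19.5x, which is consistent with the rounded figure above.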
@MartinThoma: what do you think is the bottleneck here for pypdf?
Edit: the scripts I used can be found there: https://github.com/py-pdf/fpdf2/tree/master/tutorial (they require the source & destination PDF files to be specified as arguments)
I am aware of the speed difference. I've actually already created a benchmark for it: https://github.com/py-pdf/benchmarks#watermarking-speed
Sadly, I cannot pinpoint a single simple reason for that difference. I think a part of the reason is that we represent floats with FloatObject, which (I think) might be more heavyweight than it needs to be.
> I think a part of the reason is that we represent floats with FloatObject, which (I think) might be more heavyweight than it needs to be.
I spent an hour investigating the performance of pdf_benchmark.library_code.pypdf_watermarking, and I think the issue is probably more that pypdf ALWAYS decodes/parses content streams, and that objects are cloned "all the time". For example, PageObject._merge_page() makes repeated calls to ContentStream.__init__(), which itself calls EncodedStreamObject.get_data().
Whereas pdfrw does not bother to clone objects (cf. PageMerge.add()) and it does not parse streams (cf. https://github.com/pmaupin/pdfrw/blob/master/pdfrw/pdfreader.py#L7).
Maybe pypdf should lazy-parse content streams? That is, only parse them if it has to access / alter those streams.
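To make the lazy-parsing idea concrete, here is a minimal, library-independent sketch. It is not pypdf's or pdfrw's actual code: `LazyContentStream`, its `operations` property and the line-based "parser" are all hypothetical stand-ins, and zlib is used only because Flate is the most common stream filter. The point is just that a stream nobody touches is written back byte-for-byte without ever being decoded.

```python
import zlib


class LazyContentStream:
    """Hypothetical wrapper: keeps the raw bytes and decodes only on first access."""

    def __init__(self, raw: bytes):
        self._raw = raw
        self._operations = None  # parsed form, filled in lazily
        self.parse_count = 0     # instrumentation for the demo below

    @property
    def operations(self):
        if self._operations is None:
            self.parse_count += 1
            data = zlib.decompress(self._raw)
            # Stand-in for real content-stream parsing: one token list per line.
            self._operations = [line.split() for line in data.splitlines() if line]
        return self._operations

    def serialize(self) -> bytes:
        # Untouched streams are returned verbatim, with no decode/encode round trip.
        if self._operations is None:
            return self._raw
        data = b"\n".join(b" ".join(op) for op in self._operations)
        return zlib.compress(data)


raw = zlib.compress(b"q\n1 0 0 1 10 20 cm\n/Fm0 Do\nQ\n")

untouched = LazyContentStream(raw)
print(untouched.serialize() == raw, untouched.parse_count)  # True 0

edited = LazyContentStream(raw)
edited.operations.append([b"n"])  # first access triggers the one and only parse
print(edited.parse_count)  # 1
```

The trade-off is that errors in a malformed stream surface later (on first access) rather than at load time, which a real implementation would have to account for.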
What do you think of this @MartinThoma?
Thanks for your work on this fork, which seems to be the most active and up-to-date one.
Unfortunately, GitHub makes it hard to work with forks or even discover them as they usually are hidden in the search results and in-repository search for forks is not available. Additionally, while there is a package on PyPI, it is out-of-date and does not correspond to this repository directly.
What are your plans for the future of your fork? I considered working on my own fork to keep this package available for my use cases, but with your existing work this could become easier. What I am currently thinking of: