lebedov / python-pdfbox

Python interface to Apache PDFBox command-line tools.

update python-pdfbox to support PDFBox 3.* #27

Open lebedov opened 3 years ago

lebedov commented 3 years ago

The command-line interface to PDFBox was changed in version 3.*.
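To illustrate the kind of change involved, here is a minimal sketch of how a wrapper might build the text-extraction command for either major version. It assumes the invocations `ExtractText <input> <output>` for PDFBox 2.x and `export:text -i <input> -o <output>` for PDFBox 3.x, as I understand the upstream command-line docs. The function name `build_extract_text_cmd` and the jar file names are hypothetical and not part of python-pdfbox.

```python
from typing import List


def build_extract_text_cmd(jar_path: str, major: int, src: str, dst: str) -> List[str]:
    """Return the argv for extracting text with the PDFBox app jar.

    Assumption: 2.x uses positional arguments after the tool name,
    while 3.x uses a subcommand with -i/-o options.
    """
    base = ["java", "-jar", jar_path]
    if major >= 3:
        # PDFBox 3.x style: subcommand plus explicit input/output flags
        return base + ["export:text", "-i", src, "-o", dst]
    # PDFBox 2.x style: tool name followed by positional paths
    return base + ["ExtractText", src, dst]
```

The resulting list could be passed to `subprocess.run(cmd, check=True)`. Other tools would need their own mapping, since this only covers text extraction.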

fakabbir commented 1 year ago

Currently, a fork of python-pdfbox is available that works smoothly: pip install python-pdfbox-v2

mara004 commented 1 year ago

@fakabbir Your fork currently just pins to pdfbox 2.0.28. That is a workaround, not a solution. (FWIW, I believe porting the CLI entrypoint calls shouldn't be too difficult.)

Apart from that, removing the pdfbox download logic, as seen in https://github.com/lebedov/python-pdfbox/commit/613a1a55f5abf131f525f3fc5be22513e0b8313b, isn't nice. It would be better to adjust it to download from the v2 release branch only. If offline distributability is desired, you could build a wheel package bundling the jar.
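A rough sketch of what "download from the v2 release branch only" could look like on the version-selection side: given a list of available version strings (however they are obtained), pick the newest one within a given major version. The helper names `parse_version` and `latest_in_major` are hypothetical, and this does not reflect how python-pdfbox currently discovers versions.

```python
from typing import Iterable, Optional, Tuple


def parse_version(text: str) -> Optional[Tuple[int, ...]]:
    """Parse 'X.Y.Z' into a tuple of ints; return None for non-numeric versions."""
    try:
        return tuple(int(part) for part in text.split("."))
    except ValueError:
        # Pre-releases such as '3.0.0-alpha2' are skipped in this sketch
        return None


def latest_in_major(versions: Iterable[str], major: int = 2) -> Optional[str]:
    """Return the highest version whose major component equals `major`."""
    candidates = []
    for version in versions:
        parsed = parse_version(version)
        if parsed and parsed[0] == major:
            candidates.append((parsed, version))
    if not candidates:
        return None
    # Tuples compare numerically, so 2.0.28 ranks above 2.0.9
    return max(candidates)[1]
```

Comparing integer tuples rather than raw strings matters here, because a plain string sort would rank "2.0.9" above "2.0.28".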

mara004 commented 1 year ago

@fakabbir Also, python-pdfbox main already had a better workaround with #29, so why did you submit #32 after that? See also https://github.com/lebedov/python-pdfbox/pull/32#issuecomment-1607589987

fakabbir commented 1 year ago

@mara004 As far as I remember, #29 was not merged or working when I discovered the breaking changes due to pdfbox v3. If #29 is working now, it's great and we can discard #32.

The other issue is that the package looks for the jar at runtime, which makes it unpredictable and network-dependent. To resolve that, I created the fork as python-pdfbox-v2.

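I think the best option would be to have the following:

  • An option to also download the jar file as static during pip install.
  • Migration to pdfbox-v3 should be available.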

I am not sure @lebedov is still maintaining the project, so python-pdfbox-v2 exists as a workaround, intended only for environments that are not production or high-risk.

Do you have any plans to maintain this repo in the future?

mara004 commented 1 year ago

Thank you, those are all good considerations and I think I'm on the same page.

Concerning the python-pdfbox < 3 workaround, I just tested the latest release on PyPI and it seems to work without any problems. Actually, #29 just looks like a minor fixup that has not been released yet; the main code for this already existed.

> The other issue is that the package looks for the jar at runtime, which makes it unpredictable and network-dependent. To resolve that, I created the fork as python-pdfbox-v2.
>
>   • An option to also download the jar file as static during pip install.
>   • Migration to pdfbox-v3 should be available.

Agreed. As I hinted at above and in #10, I think it would be best to refactor the code to download pdfbox on setup and also build wheels which bundle pdfbox.
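As a sketch of the lookup side of that idea: if the jar is bundled in the wheel or fetched at setup time, the runtime only needs to search local directories and never touch the network. The function name `find_jar`, the search order, and the `pdfbox-app-*.jar` file pattern are assumptions for illustration, not existing python-pdfbox behaviour.

```python
from pathlib import Path
from typing import Iterable, Optional


def find_jar(search_dirs: Iterable[Path], pattern: str = "pdfbox-app-*.jar") -> Optional[Path]:
    """Return the first jar matching `pattern` in the given directories, or None.

    Directories are checked in order, so a bundled location listed first
    takes precedence over a cache directory listed later.
    """
    for directory in search_dirs:
        if not directory.is_dir():
            continue
        # Sort for a deterministic pick if several jars are present
        matches = sorted(directory.glob(pattern))
        if matches:
            return matches[0]
    return None
```

A caller could pass something like `[Path(__file__).parent / "jars", Path.home() / ".cache" / "pdfbox"]` (both paths hypothetical) and raise a clear error when the result is `None`, instead of downloading at runtime.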

> I am not sure @lebedov is still maintaining the project,

Hmm, yes, looks as if python-pdfbox development might have halted.

> so python-pdfbox-v2 exists as a workaround, intended only for environments that are not production or high-risk.

Ah, I see. Sorry. Yes, for the purpose of avoiding downloads at runtime, such a fork makes sense as a workaround.

> Do you have any plans to maintain this repo in the future?

Not this repo, but I have a weak ambition to more or less restart from scratch with setup infrastructure and a few API-based helpers. I doubt very much that I will have the time, though, and if I do, I won't be able to put in as much effort as I did for pypdfium2.

See also https://github.com/pypdfium2-team/pypdfium2/discussions/230

I've also experimented with a few gists:
https://gist.github.com/mara004/51c3216a9eabd3dcbc78a86d877a61dc
https://gist.github.com/mara004/881d0c5a99b8444fd5d1d21a333b70f8