pts / pdfsizeopt

PDF file size optimizer
GNU General Public License v2.0
750 stars 65 forks source link

Major refactorings #142

Open KOLANICH opened 4 years ago

KOLANICH commented 4 years ago

1 there are too much text in few files. The text editor I use just lags. The code should be split into different files. 2 most of cli tools should be replaced by using bindings. Calling cli tools may be insecure.

  1. There are some code related to pdf parsing. pdfminer has the code, and probably of higher quality. It should substitute as much of this tool as possible and make sense.
  2. Java tools should be used through JPype. It is faster and more secure, than using CLI
  3. 139: python 2 should be dropped and all the filenames stuff should be done via pathlib

  4. 103: pip install pdfsizeopt

  5. likely heavy refactoring is needed. I haven't dived deep, but projects with such poor code quality often have poor architecture.

I could have sent some PRs on some points but I am completely unsure if they will be merged. I have a lot of prior negative experience (the worst thing in free software ecosystem IMHO) with people just saying "I'm not interested, say anything about it and I will ban you", "I decide it is not needed", "python 3 is unneeded, I am OK with 2" and similar things and just closing PRs.

zvezdochiot commented 4 years ago

Hi @KOLANICH .

Зря ты здесь что-то спрашиваешь. Щас начнётся стандартная песня данного персонажа за бабло, пожертвования и т.д. и т.п.

Спрашивать стоит здесь: https://github.com/rbrito/pdfsizeopt . Хоть какой то отклик (нечастый, но) будет.

pts commented 4 years ago

Thank you for these observations. The state of this repo (pts/pdfsizeopt) is a result of these requirements:

Also please note:

Implementing your suggestions would be easier in a project with different requirements. It would be an honor for me if such a project inspired by pdfsizeopt was started. However, it's unlikely that I would contribute much to such a project, because I think that the requirements above are very important, and they make pdfsizeopt uniquely useful. So the little time I have to contribute will go to projects which meet the requirements above.

1 there are too much text in few files. The text editor I use just lags. The code should be split into different files.

This is a matter of preference. Some authors prefer larger small files, others prefer smaller source files. For me it would be OK to split main.py to 3, 4 or 5 files, but I would be less productive afterwards.

2 most of cli tools should be replaced by using bindings. Calling cli tools may be insecure.

This is a tradeoff. In my experience, most bindings do not work reliably, because backends tend to be incompatibly upgraded over time. If pdfsizeopt starts using bindings, they will stop working in a year, and maintainers will be busy just updating bindings, possibly introducing incompatibility with other versions of the backends. The end result is many users abandoning pdfsizeopt, because it doesn't work for them.

  1. There are some code related to pdf parsing. pdfminer has the code, and probably of higher quality. It should substitute as much of this tool as possible and make sense.

Thank you for recommending pdfminer! In pdfsizeopt, PDF parsing (in the PdfObj class) is coupled tightly with the PDF optimizations (rest of main.py). To replace PDF parsing, the entire main.py has to be rewritten, almost from scratch. Thus it's easier to start a new Python project, inspired by pdfsizeopt, but with a brand new codebase.

There are some tradeoffs implicitly encoded in the current PDF parser, for example the automatic recovery of some corrupt PDFs, and lazy parsing of objects. Reproducing all these nuances would need way too much work (possibly changing code in pdfminer), and failing to reproduce the recovery features would break the workflow of existing users.

  1. Java tools should be used through JPype. It is faster and more secure, than using CLI

Citation needed. I'm not aware of any reason why JPype is more secure than running java -jar .... I estimate that for pdfsizeopt, the Java startup speed difference is negligible (less than 0.5 second, and the Java tool would run for several seconds).

  1. python 2 should be dropped and all the filenames stuff should be done via pathlib

Let's continue this discussion in https://github.com/pts/pdfsizeopt/issues/139.

  1. 103 (pip install pdfsizeopt)

Let's continue this discussion in https://github.com/pts/pdfsizeopt/issues/103.

  1. likely heavy refactoring is needed. I haven't dived deep, but projects with such poor code quality often have poor architecture.

How much refactoring is needed depends on what your plan is with the refactored code. For some plans such as adding many new features (including lossy image optimizations, lossy vector graphics optimizations, TrueType-to-CFF conversion), pdfsizeopt would indeed benefit from heavy refactoring.

KOLANICH commented 4 years ago

Citation needed.

Just my software developing experience. In one of my projects (setuptools plugin for Kaitai Struct) where I have replaced CLI with directly tapping into the compiler (so called "JVM backend", but I don't distribute it because of license issues with KS itself) it brought great benefit for bulk processing since JVM and app initialization is done only once.

For security the same argument as with bindings.

pdfsizeopt (with all of its binary dependencies) should be easy to install

For python tools the easiest way to install them is pip - the standard package manager for all the platforms.

In my experience, most bindings do not work reliably, because backends tend to be incompatibly upgraded over time.

In my experience the most problem with bindings is that they are almost always a cext. This means a cext has to be recompiled for each python major release, and it often requires some code modification. Another pain with cexts is that they are often (pypy and graalpython have some support for cexts) CPython-specific and don't work with awesome other implementations.

To replace PDF parsing, the entire main.py has to be rewritten, almost from scratch.

Unfortunately, yes, an it is too much work for me for this project.

Reproducing all these nuances would need way too much work (possibly changing code in pdfminer), and failing to reproduce the recovery features would break the workflow of existing users.

Absolutely, and it has to be done. I guess pdfminer community may be also interested in parsing corrupt PDFs (though feels like not enough, my PR was closed on the basis I have not provided them with the PDFs, but I haven't because it would have been a copyright violation and crafting them intentionally is too much work).

pts commented 4 years ago

pdfsizeopt (with all of its binary dependencies) should be easy to install

For python tools the easiest way to install them is pip - the standard package manager for all the platforms.

Sure, it would be awesome to make pip install pdfsizeopt just work. The only hard part is to make it install the dependencies (external commands: imgdataopt (C), jbig2 (C++) and the right version of Ghostscript (C with a dozen library dependencies); maybe also gcc, libc6-dev, python3-dev) as well. I think this not possible with pip install. I'd be happy to be proven wrong, especially by a proof-of-concept, doing Ghostscript 9.10 (the hardest).

As of now, pdfsizeopt is easy to install for most users (they just have to follow the instructions in the README), and the installed software is reproducible and deterministic, because it doesn't depend on programs and libraries already installed (and their versions). I can't see a way to do this with pip install.

KOLANICH commented 4 years ago

I can't see a way to do this with pip install.

To do these the packages can be created for the binary tools. For linux we can rely on distro package manager for them. For pip, the binaries can be packed into the packages, you only need a naming scheme to distinguish them from python packages.