pymupdf / PyMuPDF

PyMuPDF is a high performance Python library for data extraction, analysis, conversion & manipulation of PDF (and other) documents.
https://pymupdf.readthedocs.io
GNU Affero General Public License v3.0
5.8k stars 536 forks source link

mupdf: object out of range (0 0 R) #990

Closed fgallaire closed 3 years ago

fgallaire commented 3 years ago

Can't save a file. Error:

mupdf: object out of range (0 0 R); xref size 946

Warnigs:

first object in xref is not free
... repeated 2 times...
object out of range (0 0 R); xref size 946
JorjMcKie commented 3 years ago

Please:

fgallaire commented 3 years ago

@JorjMcKie file too big for GitHub, please download it here: https://store8.gofile.io/download/q4602U/213cb979c2fb48c50b33369556d016c5/xref_problem.pdf

JorjMcKie commented 3 years ago

please download it here:

I won't do that until you provide all mandatory information for a bug. Just providing a giant file (which presumably has structure issues) is not enough. You obviously still did not read the issue template for bug submission. What is needed for accepting a bug:

fgallaire commented 3 years ago

@JorjMcKie I like the fact you are very responsive, but ISTM you are far too stiff about process. I provide a faulty file making the bug reproductible, which is clearly a plus for the project, and you say it is too "giant" for you... Are you serious about a 30 MB, not 30 GB, pdf ? I don't give you the system information because my understanding of the problem tells me it's not a platform dependent but a file related problem. A reproducer script ? Can't save a file. i.e. open the file then try to save it, with PyMuPDF which I suspected you know how to use, do you really think the wording is ambiguous ?

JorjMcKie commented 3 years ago

Hm, I am German, so I like rules, hahaha 😎

do you really think the wording is ambiguous ?

Definitely yes! BTW: I never said, the file is too big to download - in fact I nevertheless did download it, in the meantime.

Here is my argument, and I do hope you will understand my point of view:

  1. If you enter a bug, then this implies you assume, that PyMuPDF software or documentation does not behave as announced.
  2. So, to confirm your proposition, I do need all material to reproduce the error you observed. This must be more than just a file - whether broken or not. If broken, then PyMuPDF might still not react adequately - depending on how it was used. You cannot expect me to guess. After all, broken PDFs are daily business!

If you "just" need help for how to deal with a broken file, then this is not a bug of PyMuPDF, but somebody or something else's bug. This is perfectly fine - I am always willing and approachable to help.

In your case, the file is indeed broken, and the error message is quite clear: item 0 of the xref table (corresponding to 0 0 R) has been used - which is illegal.

If you simply do doc.save("file.pdf") you will see your exception. But if you do doc.save("file.pdf", garbage=x) with x > 0, then everything will just work fine - so your statement "can't save a file" is simply not correct. So, in order to understand what happened on your machine, I need to know what you did - again: the file alone is not enough.

Confirm the above by using MuPDF's cli tool:

mutool clean xref_problem.pdf

will break with the same error message. But

mutool clean -g xref_problem.pdf

will work and produce a cleaned, error-free PDF. The latter command is equivalent to doc.save("out.pdf", garbage=1).

JorjMcKie commented 3 years ago

Additional comment: If you modify that broken PDF by e.g. adding a page, and then save the PDF incrementally, then this will also work. So I re-iterate: what do you mean by "can't save a file"?

fgallaire commented 3 years ago

First of all thanks for all, MuPDF is a powerful tool, and PyMuPDF gives us an easy way to use it. I'm french/anarchist/lawyer, and always stressed by rules :-) So I agree with you there's no bug. But I thinks there is a problem in the default behavior. Who cares about garbage/incremental save options ? PyMuPDF just save me the file if you can !

fgallaire commented 3 years ago

Save incrementally may fix this problem, but won't work on dirty file.

Could we say that garbage=3, deflate=True is a reasonable default ? Which options do you personally use ?

To not change the default .save() behavior for backward compatibility reasons, why not a high-level wrapper for the more sane and reasonable save defaults with a new optionless method .sane_save() which just works great most of the time ?

fgallaire commented 3 years ago

Why a broken file as it doesn't raise the isDirty flag ?

If 1 = remove unused (unreferenced) objects works it's a problem to me: why unused objects should prevents PyMuPDF to save a file ? A warning couldn't be enough ?

JorjMcKie commented 3 years ago

I am happy we are now entering a constructive discussion - thank you for your comments, questions and suggestions! We could have had this earlier with a decent, complete bug report - if I may allude to my initial reaction 😉 for one final time.

  1. is_dirty is exclusively set after changes to a PDF. The problem does not make the file unusable for reading purposes and consequently does not require any repair activity. MuPDF correctly assesses the severity here - and is in agreement with tools like PDF XChange, which also issue similar warnings after open.
  2. An incremental save does work, but does not solve the problem - the xref table stays incorrect. This is necessarily so, because we are appending to the file only, and not rewriting it. BTW the file contains a number of incremental saves (4 in total), and the problem was introduced by one of them (the second one). This error (assigning an object number of 0 to something) only by pure chance does not lead to usability deficiencies in this case, because the so-defined object carries no meaningful information at all - just nonsense.
  3. MuPDF does detect the problem during open and issues a warning. I am however suppressing the display of MuPDF warnings and only store them away, for optional access. This behavior has been requested a long time ago and cannot be reverted now. Many users even voted for having an option to also suppress errors (if they do not prevent meaningful use - i.e. lead to exceptions).
  4. The large number of save options fully reflects what MuPDF has to offer in this area. This granularity is one of its unique differentiators compared to other tools. Strictly speaking: updating a PDF always leaves garbage behind. Garbage collection options allow you to granually fine-tune the amount of stuff being thrown away when rewriting this file. And rewriting always takes place for non-incremental saves. Because of this special error situation, MuPDF's standard way to save - the softest at all possible: collect none of any garbage - does not work, because the xref table header (item 0 - abused as object number 0 here) needs to exist, because it will be updated.
  5. Maybe needless to mention: all garbage collection does not only destroy information (although unused, it may still be important in some legal context!), but also requires extra time ... So for good reasons, introducing a different default behavior is a no-go.

So overall for me, the MuPDF-induced behavior is very adequate. It may be surprising in this case though, because we did not see the warning at open time - it was suppressed.

The only thing I can offer though is introducing an option which disables the warning suppression, so they get displayed again ... e.g. to be set right after importing the package.

fgallaire commented 3 years ago

Thanks for the technical explanations.

The large number of save options fully reflects what MuPDF has to offer in this area. This granularity is one of its unique differentiators compared to other tools.

MuPDF is free software, more powerful, lighter and fastest than other tools, so granularity is probably not its more important differentiator. So a new "easy save" method with sane defaults garbage=3, deflate=True could be a nice shortcut (high-level API improvement). BTW about the garbage option, the granularity doesn't seem good with the "includes prev option" model, why not a binary mask as the permissions one ?

JorjMcKie commented 3 years ago

"includes prev option" model

That's MuPDF's design decision, which I am simply mirroring. I have no choice here, even if I wanted to have it different - which I don't: I do like it the way it is. But you are certainly free to submit this suggestion directly to them: https://bugs.ghostscript.com/enter_bug.cgi.

As per the "easy save" - why not. We already have saveIncr() which is a shortcut for incremental saves. Not to mention that you are of course free to use functools.partial or lambda to achieve this yourself.

fgallaire commented 3 years ago

New ez_save() method in PyMuPDF 1.18.12