Closed fgallaire closed 3 years ago
Please:
@JorjMcKie file too big for GitHub, please download it here: https://store8.gofile.io/download/q4602U/213cb979c2fb48c50b33369556d016c5/xref_problem.pdf
please download it here:
I won't do that until you provide all mandatory information for a bug. Just providing a giant file (which presumably has structure issues) is not enough. You obviously still did not read the issue template for bug submission. What is needed for accepting a bug:
print(fitz.__doc__)
is sufficient.@JorjMcKie I like the fact you are very responsive, but ISTM you are far too stiff about process. I provide a faulty file making the bug reproductible, which is clearly a plus for the project, and you say it is too "giant" for you... Are you serious about a 30 MB, not 30 GB, pdf ? I don't give you the system information because my understanding of the problem tells me it's not a platform dependent but a file related problem. A reproducer script ? Can't save a file. i.e. open the file then try to save it, with PyMuPDF which I suspected you know how to use, do you really think the wording is ambiguous ?
Hm, I am German, so I like rules, hahaha 😎
do you really think the wording is ambiguous ?
Definitely yes! BTW: I never said, the file is too big to download - in fact I nevertheless did download it, in the meantime.
Here is my argument, and I do hope you will understand my point of view:
If you "just" need help for how to deal with a broken file, then this is not a bug of PyMuPDF, but somebody or something else's bug. This is perfectly fine - I am always willing and approachable to help.
In your case, the file is indeed broken, and the error message is quite clear: item 0 of the xref table (corresponding to 0 0 R
) has been used - which is illegal.
If you simply do doc.save("file.pdf")
you will see your exception. But if you do doc.save("file.pdf", garbage=x)
with x > 0, then everything will just work fine - so your statement "can't save a file" is simply not correct.
So, in order to understand what happened on your machine, I need to know what you did - again: the file alone is not enough.
Confirm the above by using MuPDF's cli tool:
mutool clean xref_problem.pdf
will break with the same error message. But
mutool clean -g xref_problem.pdf
will work and produce a cleaned, error-free PDF. The latter command is equivalent to doc.save("out.pdf", garbage=1)
.
Additional comment: If you modify that broken PDF by e.g. adding a page, and then save the PDF incrementally, then this will also work. So I re-iterate: what do you mean by "can't save a file"?
First of all thanks for all, MuPDF is a powerful tool, and PyMuPDF gives us an easy way to use it. I'm french/anarchist/lawyer, and always stressed by rules :-) So I agree with you there's no bug. But I thinks there is a problem in the default behavior. Who cares about garbage/incremental save options ? PyMuPDF just save me the file if you can !
Save incrementally may fix this problem, but won't work on dirty file.
Could we say that garbage=3, deflate=True
is a reasonable default ? Which options do you personally use ?
To not change the default .save()
behavior for backward compatibility reasons, why not a high-level wrapper for the more sane and reasonable save defaults with a new optionless method .sane_save()
which just works great most of the time ?
Why a broken file as it doesn't raise the isDirty
flag ?
If 1 = remove unused (unreferenced) objects
works it's a problem to me: why unused objects should prevents PyMuPDF to save a file ? A warning couldn't be enough ?
I am happy we are now entering a constructive discussion - thank you for your comments, questions and suggestions! We could have had this earlier with a decent, complete bug report - if I may allude to my initial reaction 😉 for one final time.
is_dirty
is exclusively set after changes to a PDF. The problem does not make the file unusable for reading purposes and consequently does not require any repair activity. MuPDF correctly assesses the severity here - and is in agreement with tools like PDF XChange, which also issue similar warnings after open.So overall for me, the MuPDF-induced behavior is very adequate. It may be surprising in this case though, because we did not see the warning at open time - it was suppressed.
The only thing I can offer though is introducing an option which disables the warning suppression, so they get displayed again ... e.g. to be set right after importing the package.
Thanks for the technical explanations.
The large number of save options fully reflects what MuPDF has to offer in this area. This granularity is one of its unique differentiators compared to other tools.
MuPDF is free software, more powerful, lighter and fastest than other tools, so granularity is probably not its more important differentiator.
So a new "easy save" method with sane defaults garbage=3, deflate=True
could be a nice shortcut (high-level API improvement).
BTW about the garbage option, the granularity doesn't seem good with the "includes prev option" model, why not a binary mask as the permissions
one ?
"includes prev option" model
That's MuPDF's design decision, which I am simply mirroring. I have no choice here, even if I wanted to have it different - which I don't: I do like it the way it is. But you are certainly free to submit this suggestion directly to them: https://bugs.ghostscript.com/enter_bug.cgi.
As per the "easy save" - why not. We already have saveIncr()
which is a shortcut for incremental saves. Not to mention that you are of course free to use functools.partial
or lambda
to achieve this yourself.
New ez_save()
method in PyMuPDF 1.18.12
Can't save a file. Error:
Warnigs: