noDRM / DeDRM_tools

DeDRM tools for ebooks

Corruption in PDF files when DRM removed #104

Open patjldeiM0IP opened 1 year ago

patjldeiM0IP commented 1 year ago

Question / bug report

I'm seeing corruption when removing DRM from PDF files, affecting at least the Title, Author, and Table of Contents. See the images below for an illustration. Also included are images (and a log) of the same files having DRM removed with Apprentice Harper's 6.8.x plugin, where the corruption doesn't occur.

NoDRM: https://i.imgur.com/PnkoMMh.png https://i.imgur.com/juubDMT.png https://i.imgur.com/UDEf7Xz.png

Apprentice: https://i.imgur.com/r2Vy9AR.png https://i.imgur.com/ZntKU96.png https://i.imgur.com/QinZsbT.png

Which version of Calibre are you running?

4.23

Which version of the DeDRM plugin are you running?

v10.0.3

If applicable, which version of the Kindle software are you running?

No response

Log output

Log when using NoDRM plugin

calibre 4.23  embedded-python: True is64bit: True
Linux-5.4.0-122-generic-x86_64-with-debian-bullseye-sid Linux ('64bit', 'ELF')
('Linux', '5.4.0-122-generic', '#138-Ubuntu SMP Wed Jun 22 15:00:31 UTC 2022')
Python 2.7.16
Linux: ('debian', 'bullseye/sid', '')
Interface language: en_GB
Successfully initialized third party plugins: DeDRM (10, 0, 3) && Obok DeDRM (10, 0, 3)
Not controlling automatic hidpi scaling
devicePixelRatio: 1.0
logicalDpi: 111.0 x 111.0
physicalDpi: 102.669473684 x 102.741573034
Using calibre Qt style: True
[0.00] Starting up...
[0.00] Initializing db...
[0.01] db initialized
[0.01] Constructing main UI...
DEBUG:    0.0 obok::utilities.py - loading translations
DEBUG:    0.0 obok::dialogs.py - loading translations
DEBUG:    0.0 obok::config.py - loading translations
DEBUG:    0.0 obok::action_err.py - loading translations
Looking for desktop notifier support from: org.freedesktop.Notifications
org.freedesktop.Notifications found in 0.0 seconds
[1.09] main UI initialized...
[1.09] Started up in 1.09 seconds with 6 books
DeDRM v10.0.3: Trying to decrypt Katherine Seligman - At the Edge of the Haight.pdf
DeDRM v10.0.3: Trying to decrypt John H Seinfeld & Spyros N Pandis - Atmospheric Chemistry and Physics (3e).pdf
DeDRM v10.0.3: Katherine Seligman - At the Edge of the Haight.pdf is a PDF ebook with encryption EBX_HANDLER
DeDRM v10.0.3: Katherine Seligman - At the Edge of the Haight.pdf is a PDF ebook (EBX) for [REDACTED]
DeDRM v10.0.3: Trying encryption key adeptkey
DeDRM v10.0.3: Decrypted with key adeptkey after 2.0 seconds
DeDRM v10.0.3: Finished after 2.0 seconds
Added Seligman_At the Edge_1st ptg_3P.indd to db in: 0.0
DeDRM v10.0.3: John H Seinfeld & Spyros N Pandis - Atmospheric Chemistry and Physics (3e).pdf is a PDF ebook with encryption EBX_HANDLER
DeDRM v10.0.3: John H Seinfeld & Spyros N Pandis - Atmospheric Chemistry and Physics (3e).pdf is a PDF ebook (EBX) for [REDACTED]
DeDRM v10.0.3: Trying encryption key adeptkey
DeDRM v10.0.3: Decrypted with key adeptkey after 56.2 seconds
DeDRM v10.0.3: Finished after 56.2 seconds
Added Atmospheric C9-†õflak›°ÞóÌfòjÖ»çŽà%Î to db in: 0.8
Added 2 books in 57.9 seconds

 

Log when using apprenticeharper plugin

calibre 4.23  embedded-python: True is64bit: True
Linux-5.4.0-122-generic-x86_64-with-debian-bullseye-sid Linux ('64bit', 'ELF')
('Linux', '5.4.0-122-generic', '#138-Ubuntu SMP Wed Jun 22 15:00:31 UTC 2022')
Python 2.7.16
Linux: ('debian', 'bullseye/sid', '')
Interface language: en_GB
Successfully initialized third party plugins: DeDRM (6, 8, 0) && Obok DeDRM (10, 0, 3)
Not controlling automatic hidpi scaling
devicePixelRatio: 1.0
logicalDpi: 111.0 x 111.0
physicalDpi: 102.669473684 x 102.741573034
Using calibre Qt style: True
[0.00] Starting up...
[0.00] Initializing db...
[0.01] db initialized
[0.01] Constructing main UI...
DEBUG:    0.0 obok::utilities.py - loading translations
DEBUG:    0.0 obok::dialogs.py - loading translations
DEBUG:    0.0 obok::config.py - loading translations
DEBUG:    0.0 obok::action_err.py - loading translations
Looking for desktop notifier support from: org.freedesktop.Notifications
org.freedesktop.Notifications found in 0.0 seconds
[1.04] main UI initialized...
[1.04] Started up in 1.04 seconds with 6 books
DeDRM v6.8.0: Trying to decrypt Katherine Seligman - At the Edge of the Haight.pdf
DeDRM v6.8.0: Katherine Seligman - At the Edge of the Haight.pdf is a PDF ebook
DeDRM v6.8.0: Trying Encryption key adeptkey
DeDRM v6.8.0: Trying to decrypt John H Seinfeld & Spyros N Pandis - Atmospheric Chemistry and Physics (3e).pdf
DeDRM v6.8.0: John H Seinfeld & Spyros N Pandis - Atmospheric Chemistry and Physics (3e).pdf is a PDF ebook
DeDRM v6.8.0: Trying Encryption key adeptkey
DeDRM v6.8.0: Finished after 0.6 seconds
Added Seligman_At the Edge_1st ptg_3P.indd to db in: 0.1
DeDRM v6.8.0: Finished after 36.7 seconds
Added Atmospheric Chemistry and Physics: From Air Pollution to Climate Change to db in: 0.1
Added 2 books in 37.8 seconds
patjldeiM0IP commented 1 year ago

I'd like to add that I've also tried 10.0.0 and 10.0.2 with the same results as 10.0.3

noDRM commented 1 year ago

Okay, that's not supposed to happen...

Is the corruption deterministic? Meaning, when you take one PDF and run it through the plugin multiple times, is the corruption identical?

Can you check whether this also occurs with Python 3, meaning Calibre 5 or newer? If you can't update your main Calibre install, there are also portable versions of Calibre.

The old 6.8.X versions were mainly for Python 2, while the newer versions (and all 10.X ones) are mainly for Python 3 but should in theory be backwards-compatible with Python 2, so maybe there's a bug in that code.
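To illustrate the kind of Python 2 vs. Python 3 difference that can bite byte-handling code: indexing a byte string gives a 1-character `str` on Python 2 but an `int` on Python 3. The sketch below is only an example of that class of bug, not a confirmed cause of this issue; `byte_values` and `xor_bytes` are hypothetical helper names and are not taken from the plugin.

```python
import sys


def byte_values(data):
    """Return the bytes of `data` as a list of ints on both Python 2 and 3.

    Indexing a byte string yields a 1-character str on Python 2 but an int
    on Python 3; going through bytearray yields ints on both.
    """
    return list(bytearray(data))


def xor_bytes(data, key):
    """XOR `data` with a repeating, non-empty `key` (hypothetical helper)."""
    k = byte_values(key)
    mixed = bytearray(b ^ k[i % len(k)] for i, b in enumerate(byte_values(data)))
    return bytes(mixed)


if __name__ == "__main__":
    sample = b"\x01\x02\x03"
    # repr(sample[0]) is where the two interpreters disagree.
    print(sys.version_info[0], byte_values(sample), repr(sample[0]))
```

Code that does arithmetic on `data[i]` directly behaves differently on the two interpreters, which is why normalising through `bytearray` is a common way to keep one code path for both.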

Would you mind sharing the original PDF (with DRM) and your DER key file with me so I can try to reproduce this on my machine? My email is on my Github profile.

patjldeiM0IP commented 1 year ago

Sorry for the delay in replying, real life got in the way.

If the same PDF is run through the plugin multiple times, the visible corruption (i.e., title/author/ToC text) will be identical, but the file will have a different SHA256 hash each time (in contrast to when Calibre 6/Py3 is used and the hash is identical each time).

I installed an isolated version of Calibre 6 to test with, and for every PDF file I've tried, DRM was successfully removed with no hint of corruption.

I'm sorry but I'm not comfortable sharing copyrighted material or my key, but of 14 random PDFs I tried:
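For anyone wanting to repeat the determinism check, here is a minimal sketch of how two output files from separate runs could be compared. The function names `sha256_of` and `first_difference` and any file paths are assumptions for illustration; they are not part of the plugin or Calibre.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of the file at `path`, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def first_difference(path_a, path_b):
    """Return the offset of the first differing byte, or None if identical.

    If one file is a prefix of the other, the length of the shorter file
    is returned.
    """
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        a, b = fa.read(), fb.read()
    for i, (x, y) in enumerate(zip(bytearray(a), bytearray(b))):
        if x != y:
            return i
    if len(a) != len(b):
        return min(len(a), len(b))
    return None
```

If the hashes differ, the offset from `first_difference` gives a starting point for looking at which part of the file changes between runs.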

Two show corruption in Author
One corruption in Title
Eight corruption in TOC (100% of those with a TOC)
Three fail to open

So if you have a PDF with a TOC, that should hopefully allow you to replicate it with Py2.

The logs of the three that failed to open after DRM was removed are below. They all seem to fail in the same manner, one that isn't present when using Calibre 6/Py3.

calibre 4.23  embedded-python: True is64bit: True
Linux-5.4.0-122-generic-x86_64-with-debian-bullseye-sid Linux ('64bit', 'ELF')
('Linux', '5.4.0-122-generic', '#138-Ubuntu SMP Wed Jun 22 15:00:31 UTC 2022')
Python 2.7.16
Linux: ('debian', 'bullseye/sid', '')
Interface language: en_GB
Successfully initialized third party plugins: DeDRM (10, 0, 3) && Obok DeDRM (10, 0, 3)
Not controlling automatic hidpi scaling
devicePixelRatio: 1.0
logicalDpi: 111.0 x 111.0
physicalDpi: 102.669473684 x 102.741573034
Using calibre Qt style: True
[0.00] Starting up...
[0.00] Initializing db...
[0.01] db initialized
[0.01] Constructing main UI...
DEBUG:    0.0 obok::utilities.py - loading translations
DEBUG:    0.0 obok::dialogs.py - loading translations
DEBUG:    0.0 obok::config.py - loading translations
DEBUG:    0.0 obok::action_err.py - loading translations
Looking for desktop notifier support from: org.freedesktop.Notifications
org.freedesktop.Notifications found in 0.0 seconds
[1.18] main UI initialized...
[1.18] Started up in 1.18 seconds with 8 books
DeDRM v10.0.3: Trying to decrypt Ian Goodyer - Crisis Music – The Cultural Politics of Rock Against Racism.pdf
DeDRM v10.0.3: Ian Goodyer - Crisis Music – The Cultural Politics of Rock Against Racism.pdf is a PDF ebook with encryption EBX_HANDLER
DeDRM v10.0.3: Ian Goodyer - Crisis Music – The Cultural Politics of Rock Against Racism.pdf is a PDF ebook (EBX) for UUID [REDACTED]
DeDRM v10.0.3: Trying encryption key adeptkey
DeDRM v10.0.3: Decrypted with key adeptkey after 2.9 seconds
DeDRM v10.0.3: Finished after 2.9 seconds
Internal Error: xref num 1135 not found but needed, try to reconstruct<0a>
Syntax Error: Top-level pages object is wrong type (null)
Command Line Error: Wrong page range given: the first page (1) can not be after the last page (0).
pdfinfo errored out with return code: 99
Traceback (most recent call last):
  File "site-packages/calibre/customize/ui.py", line 428, in get_file_type_metadata
  File "site-packages/calibre/customize/builtins.py", line 343, in get_metadata
  File "site-packages/calibre/ebooks/metadata/pdf.py", line 129, in get_metadata
ValueError: Could not read info dict from PDF
Added Ian Goodyer to db in: 0.1
Added 1 books in 3.5 seconds
DeDRM v10.0.3: Trying to decrypt Jorge Garcia-Robles - The Stray Bullet – William S Burroughs in Mexico.pdf
DeDRM v10.0.3: Jorge Garcia-Robles - The Stray Bullet – William S Burroughs in Mexico.pdf is a PDF ebook with encryption EBX_HANDLER
DeDRM v10.0.3: Jorge Garcia-Robles - The Stray Bullet – William S Burroughs in Mexico.pdf is a PDF ebook (EBX) for UUID [REDACTED]
DeDRM v10.0.3: Trying encryption key adeptkey
DeDRM v10.0.3: Decrypted with key adeptkey after 0.6 seconds
DeDRM v10.0.3: Finished after 0.6 seconds
Internal Error: xref num 603 not found but needed, try to reconstruct<0a>
Syntax Error: Top-level pages object is wrong type (null)
Command Line Error: Wrong page range given: the first page (1) can not be after the last page (0).
pdfinfo errored out with return code: 99
Traceback (most recent call last):
  File "site-packages/calibre/customize/ui.py", line 428, in get_file_type_metadata
  File "site-packages/calibre/customize/builtins.py", line 343, in get_metadata
  File "site-packages/calibre/ebooks/metadata/pdf.py", line 129, in get_metadata
ValueError: Could not read info dict from PDF
Added Jorge Garcia-Robles to db in: 0.0
Added 1 books in 1.0 seconds
DeDRM v10.0.3: Trying to decrypt Nomography – On the Invention of Norms Considered as One of the Fine Arts.pdf
DeDRM v10.0.3: Nomography – On the Invention of Norms Considered as One of the Fine Arts.pdf is a PDF ebook with encryption EBX_HANDLER
DeDRM v10.0.3: Nomography – On the Invention of Norms Considered as One of the Fine Arts.pdf is a PDF ebook (EBX) for UUID [REDACTED]
DeDRM v10.0.3: Trying encryption key adeptkey
DeDRM v10.0.3: Decrypted with key adeptkey after 0.2 seconds
DeDRM v10.0.3: Finished after 0.2 seconds
Internal Error: xref num 408 not found but needed, try to reconstruct<0a>
Syntax Error: Top-level pages object is wrong type (null)
Command Line Error: Wrong page range given: the first page (1) can not be after the last page (0).
pdfinfo errored out with return code: 99
Traceback (most recent call last):
  File "site-packages/calibre/customize/ui.py", line 428, in get_file_type_metadata
  File "site-packages/calibre/customize/builtins.py", line 343, in get_metadata
  File "site-packages/calibre/ebooks/metadata/pdf.py", line 129, in get_metadata
ValueError: Could not read info dict from PDF
Added Nomography – On the Invention of Norms Considered as One of the Fine Arts to db in: 0.0
Added 1 books in 0.6 seconds
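
The `xref num N not found` errors above point at the cross-reference data in the output file. As a rough first-pass check (a sketch only, and not what pdfinfo does internally), the script below looks at whether the last `startxref` offset in a PDF points at something that looks like an xref table or an object. The function name `check_startxref` and its return labels are made up for this example, and it does not verify that individual object numbers are present.

```python
import re


def check_startxref(path):
    """Classify what the last startxref offset in a PDF points at.

    Returns "xref-table", "object", "bad-offset", or "no-startxref".
    """
    with open(path, "rb") as f:
        data = f.read()
    # A PDF can contain several startxref entries; the last one is used.
    offsets = re.findall(br"startxref\s+(\d+)", data)
    if not offsets:
        return "no-startxref"
    offset = int(offsets[-1])
    if offset >= len(data):
        return "bad-offset"
    head = data[offset:offset + 32]
    if head.startswith(b"xref"):
        return "xref-table"  # classic cross-reference table
    if re.match(br"\d+\s+\d+\s+obj", head):
        return "object"  # possibly a cross-reference stream
    return "bad-offset"
```

Running this on the same book as output by Calibre 4 and by Calibre 6 could show whether the two files already disagree at this level.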
noDRM commented 1 year ago

Thanks for the report. I got one PDF with a corrupted TOC and 41df9ecda04083905364dedf033b1bc8df9573b6 fixes that for me. Would you mind re-testing your PDFs to see if they work as well? You can download the new version from https://github.com/noDRM/DeDRM_tools/suites/7695663815/artifacts/321597181

patjldeiM0IP commented 1 year ago

> Thanks for the report. I got one PDF with a corrupted TOC and 41df9ec fixes that for me, would you mind re-testing your PDFs if they work as well? You can download the new version from https://github.com/noDRM/DeDRM_tools/suites/7695663815/artifacts/321597181

The PDFs that were showing corruption are no longer corrupt with that version, although it doesn't help for the PDFs mentioned in comment 1193258014 - those are still broken and unable to be opened at all.

noDRM commented 1 year ago

So all the corruption is gone, but you still have three PDFs that can't be opened at all after the DRM is removed in Calibre 4 with my plugin, yet they can be opened without issues in Calibre 5 or 6, or in Calibre 4 with the original plugin from Apprentice Harper?

Interesting ...

Unfortunately this is really difficult to debug without having access to the PDF files in question. I will review the PDF-related code and compare it with the 6.8.0 version again to see if there are any obvious issues, but I don't think that will have a high chance of success. I tested all the available Adobe test PDFs and they're now all working fine in Calibre 4.

I will try to add a ton more logging output to the PDF code to hopefully figure out what's going wrong - I'll give you a couple more test versions soon.

EDIT: For the broken PDFs, where do they "come from"? Fulfilled by an eReader (which one?), downloaded by Adobe Digital Editions (which version?), or downloaded with the ACSM calibre plugin? And is ADE able to open them (with DRM) just fine?

noDRM commented 1 year ago

Would you mind testing if the issue is fixed with this version? https://github.com/noDRM/DeDRM_tools/suites/7702574567/artifacts/322128899

patjldeiM0IP commented 1 year ago

Still broken with that version:

DeDRM v10.0.3: Trying to decrypt Ian Goodyer - Crisis Music – The Cultural Politics of Rock Against Racism.pdf
DeDRM v10.0.3: Trying to decrypt Jorge Garcia-Robles - The Stray Bullet – William S Burroughs in Mexico.pdf
DeDRM v10.0.3: Trying to decrypt Nomography – On the Invention of Norms Considered as One of the Fine Arts.pdf
DeDRM v10.0.3: Nomography – On the Invention of Norms Considered as One of the Fine Arts.pdf is a PDF ebook with encryption EBX_HANDLER
DeDRM v10.0.3: Nomography – On the Invention of Norms Considered as One of the Fine Arts.pdf is a PDF ebook (EBX) for UUID [REDACTED]
DeDRM v10.0.3: Trying encryption key adeptkey
DeDRM v10.0.3: Jorge Garcia-Robles - The Stray Bullet – William S Burroughs in Mexico.pdf is a PDF ebook with encryption EBX_HANDLER
DeDRM v10.0.3: Decrypted with key adeptkey after 0.3 seconds
DeDRM v10.0.3: Finished after 0.3 seconds
DeDRM v10.0.3: Jorge Garcia-Robles - The Stray Bullet – William S Burroughs in Mexico.pdf is a PDF ebook (EBX) for UUID [REDACTED]
DeDRM v10.0.3: Trying encryption key adeptkey
Internal Error: xref num 408 not found but needed, try to reconstruct<0a>
Syntax Error: Top-level pages object is wrong type (null)
Command Line Error: Wrong page range given: the first page (1) can not be after the last page (0).
pdfinfo errored out with return code: 99
Traceback (most recent call last):
  File "site-packages/calibre/customize/ui.py", line 428, in get_file_type_metadata
  File "site-packages/calibre/customize/builtins.py", line 343, in get_metadata
  File "site-packages/calibre/ebooks/metadata/pdf.py", line 129, in get_metadata
ValueError: Could not read info dict from PDF
Added Nomography – On the Invention of Norms Considered as One of the Fine Arts to db in: 0.0
DeDRM v10.0.3: Decrypted with key adeptkey after 0.9 seconds
DeDRM v10.0.3: Finished after 0.9 seconds
Internal Error: xref num 603 not found but needed, try to reconstruct<0a>
Syntax Error: Top-level pages object is wrong type (null)
Command Line Error: Wrong page range given: the first page (1) can not be after the last page (0).
pdfinfo errored out with return code: 99
Traceback (most recent call last):
  File "site-packages/calibre/customize/ui.py", line 428, in get_file_type_metadata
  File "site-packages/calibre/customize/builtins.py", line 343, in get_metadata
  File "site-packages/calibre/ebooks/metadata/pdf.py", line 129, in get_metadata
ValueError: Could not read info dict from PDF
Added Jorge Garcia-Robles to db in: 0.0
DeDRM v10.0.3: Ian Goodyer - Crisis Music – The Cultural Politics of Rock Against Racism.pdf is a PDF ebook with encryption EBX_HANDLER
DeDRM v10.0.3: Ian Goodyer - Crisis Music – The Cultural Politics of Rock Against Racism.pdf is a PDF ebook (EBX) for UUID [REDACTED]
DeDRM v10.0.3: Trying encryption key adeptkey
DeDRM v10.0.3: Decrypted with key adeptkey after 3.5 seconds
DeDRM v10.0.3: Finished after 3.5 seconds
Internal Error: xref num 1135 not found but needed, try to reconstruct<0a>
Syntax Error: Top-level pages object is wrong type (null)
Command Line Error: Wrong page range given: the first page (1) can not be after the last page (0).
pdfinfo errored out with return code: 99
Traceback (most recent call last):
  File "site-packages/calibre/customize/ui.py", line 428, in get_file_type_metadata
  File "site-packages/calibre/customize/builtins.py", line 343, in get_metadata
  File "site-packages/calibre/ebooks/metadata/pdf.py", line 129, in get_metadata
ValueError: Could not read info dict from PDF
Added Ian Goodyer to db in: 0.1
Added 3 books in 4.2 seconds
noDRM commented 1 year ago

Damnit. I don't think there's much I can do about that then, without having access to the PDF and key.

I'm going to leave this issue open as this is definitely a bug, but until I either run into this myself with one of my PDF files, or someone else who is comfortable sharing the PDF and key with me does, this is unlikely to get solved.

At least the bug with the silent corruption (where there's no error but the ToC is messed up) is fixed. Bugs like these are even worse than just "it doesn't work"...

Thanks for your testing, I will let you know if I find something else.