Downloading + extracting PDFs

jakecminihan commented 2 years ago

Hi there!

I've been playing around with this and it's great, thank you for your work. I can successfully upload documents to my RM, but I can't figure out how to download ZipDocuments and extract them. I'd like to create a script that automatically encrypts the notebooks I make and returns them to the RM Cloud. Here's the code I have working so far; no idea whether this bit is even right!

from rmapy.api import Client, ZipDocument

rm = Client()

rm.renew_token()

Doc = [ i for i in rm.get_meta_items() if i.VissibleName == "TestDoc" ][0]
print(Doc)

Printing gives <rmapy.document.Document document ID>.

downloaded_file = rm.download(Doc)
print(downloaded_file)

Again, this yields <rmapy.document.ZipDocument document ID>. Looking at the code, I'm not sure what I should do next. Thank you!

subutux commented 2 years ago

Aha, the missing documentation 😄!

You're right on track!

rmapy.document.ZipDocument has a dump() method that writes the document's contents to a zip file.

In your example you could do something like this:

downloaded_file = rm.download(Doc)
downloaded_file.dump("/where/to/store/document.zip")
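Once dump() has written the archive, the standard library is enough to unpack it. A minimal sketch, assuming the dumped file is an ordinary zip archive; extract_dump and the paths in the usage comment are illustrative names, not part of rmapy:

```python
import zipfile
from pathlib import Path


def extract_dump(zip_path, dest_dir):
    """Unpack a dumped document archive and return its member names, sorted."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path, "r") as zf:
        zf.extractall(dest)
        return sorted(zf.namelist())


# Hypothetical usage, reusing the path from the example above:
# names = extract_dump("/where/to/store/document.zip", "/where/to/store/document")
```

The returned names make it easy to see which files (for example the PDF) are available for further processing.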

After you're done encrypting, you can follow the upload guide to send it back again. It should be something like this:

from rmapy.document import ZipDocument
from rmapy.api import Client

# loads the zip back into a ZipDocument class
to_upload = ZipDocument(file="/where/to/store/document_encrypted.zip")

rm = Client()
rm.renew_token()
rm.upload(to_upload)

Let me know if you need any more pointers!

jakecminihan commented 2 years ago

Thank you for replying 😄 I'm now 90% of the way there, but I get a really weird error when using ZipDocument:

Traceback (most recent call last):
  File "test2.py", line 112, in <module>
    to_upload = ZipDocument(file= (zip_path + ".zip"))
  File "/home/jake/.local/lib/python3.6/site-packages/rmapy/document.py", line 247, in __init__
    self.load(file)
  File "/home/jake/.local/lib/python3.6/site-packages/rmapy/document.py", line 329, in load
    with zf.open(f"{self.ID}.content", 'r') as content:
  File "/usr/lib/python3.6/zipfile.py", line 1375, in open
    zinfo = self.getinfo(name)
  File "/usr/lib/python3.6/zipfile.py", line 1304, in getinfo
    'There is no item named %r in the archive' % name)
KeyError: "There is no item named 'bfef0d76-c4ce-4ed3-8912-a4fcf2bf5fe2.content' in the archive"

The ID in the "There is no item named" message changes each time, and I can't work out what this error is or why it's happening! The dumped zip and the encrypted zip have the same file structure (a .content, a .pagedata and a .pdf, then a subfolder with 3 .rm files and 3 .json files) and the same IDs. I'm sure it's something I've overlooked, but I have no idea what it is!
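The KeyError itself is raised by the standard zipfile module: the traceback shows load() asking the archive for a member called "<ID>.content", and zipfile raises KeyError when no member has exactly that name. A small sketch for checking what an archive really contains; has_member and member_names are hypothetical helpers, not rmapy functions:

```python
import zipfile


def member_names(zip_path):
    """List every member name stored in the archive."""
    with zipfile.ZipFile(zip_path, "r") as zf:
        return zf.namelist()


def has_member(zip_path, name):
    """Return True if the archive holds a member with exactly this name."""
    with zipfile.ZipFile(zip_path, "r") as zf:
        try:
            zf.getinfo(name)  # raises KeyError for an unknown member name
        except KeyError:
            return False
        return True
```

If has_member(path, doc_id + ".content") is True while ZipDocument still fails, then the ID being looked up is not the one stored in the archive.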

subutux commented 2 years ago

I think the zip file needs to have the same UUID as the files inside the zip. So: <uuid>.zip. Could you post the full content of your test2.py?
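To see which UUID an archive carries, the name of its .content member can be read back and compared with the archive's file name. A sketch assuming the layout described earlier in the thread (a single top-level "<uuid>.content" file); content_uuid and name_matches_uuid are illustrative helpers:

```python
import zipfile
from pathlib import Path, PurePosixPath


def content_uuid(zip_path):
    """Return the stem of the single top-level '*.content' member."""
    with zipfile.ZipFile(zip_path, "r") as zf:
        stems = [
            PurePosixPath(name).stem
            for name in zf.namelist()
            if name.endswith(".content") and "/" not in name
        ]
    if len(stems) != 1:
        raise ValueError(f"expected one top-level .content member, found {len(stems)}")
    return stems[0]


def name_matches_uuid(zip_path):
    """Check whether the archive is named '<uuid>.zip' for the UUID inside it."""
    return Path(zip_path).stem == content_uuid(zip_path)
```

If the ZipDocument constructor also accepts an explicit ID (an assumption; rmapy/document.py is the place to check), passing the value from content_uuid() alongside the file may be worth trying, since the traceback shows load() looking up self.ID.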


jakecminihan commented 2 years ago

I've just tried implementing that, and it still won't work! It would appear that the UUID is changing from the downloaded document? Here is my code - apologies if it's not great, I'm more than happy to elaborate on any sections that aren't clear! For reference, I'm using a dummy pdf downloaded from https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf.

Code

# Import dependencies
from curses import meta
from rmapy.api import Client, ZipDocument
import zipfile
import os
from os.path import basename
import glob
from PyPDF2 import PdfFileWriter, PdfFileReader
from tqdm import tqdm

# Complete startup checks
rm = Client()

rm.renew_token()

# Obtain data from specific notebook (todo: work out how to pull multiple files from folder)
books = [ i for i in rm.get_meta_items() if i.VissibleName == "dummy" ][0]
metadata = books.to_dict()
#print(metadata)
doc_id = metadata['ID']
print("The UUID is:", doc_id)
name = books.VissibleName

# Download doc(s)
downloaded_file = rm.download(books)

# Delete old file now it's not needed any more
#rm.delete(books) # <- this is not undoable!!!!!

# Unzip to PC
# Set up folder paths
download_path = "/mnt/d/Users/Jake/Documents/PDF Encryption/RM Zips/" + doc_id + ".zip"
extraction_path = "/mnt/d/Users/Jake/Documents/PDF Encryption/To Encrypt/" + doc_id
# Dump to PC and extract
downloaded_file.dump(download_path)
with zipfile.ZipFile(download_path, 'r') as zip_ref:
    zip_ref.extractall(extraction_path)

# Find PDF
os.chdir(extraction_path)
#print(os.getcwd())

# Create function to encrypt PDFs
def encrypt_pdfs():
    # Create array of PDFs in dir & number
    pdfs = glob.glob("*.pdf")
    number = len(glob.glob("*.pdf"))

    # Iterate over all found pdfs
    for i in tqdm(range(number)):
        # Load file as pdf
        file = pdfs[i]
        file = PdfFileReader(file)

        # Is the file already encrypted? If so, do nothing. Could optimise by checking before loop and removing entries that are?
        if file.isEncrypted == True:
            pass

        # Because this plugin is weirdly made, I have to technically create a *new* PDF. Probably not very efficient!
        else:
            # Create new output PDF
            output = PdfFileWriter()

            # Determine no. of pages in file
            num = file.numPages

            # Iterate over all pages of the PDF and append to a new document
            for index in range(num):
                page = file.getPage(index)
                output.addPage(page)

            # Encrypt the output with a password    
            output.encrypt("PlaceHolderPassword")

            # Write document to file
            with open(pdfs[i], "wb") as f:
                output.write(f)

# Run new function
encrypt_pdfs()

# Create new zip output folder directory
zip_path = "/mnt/d/Users/Jake/Documents/PDF Encryption/Zipped Files/" + doc_id

def zipfolder(foldername, target_dir):            
    zipobj = zipfile.ZipFile(foldername + '.zip', 'w')
    rootlen = len(target_dir)
    for base, dirs, files in os.walk(target_dir):
        for file in files:
            fn = os.path.join(base, file)
            zipobj.write(fn, fn[rootlen:])

zipfolder(zip_path, extraction_path + "/")

# Upload to remarkable
to_upload = ZipDocument(file= ("/mnt/d/Users/Jake/Documents/PDF Encryption/Zipped Files/" + doc_id + ".zip"))

rm.upload(to_upload)

The output of this is:

The UUID is: ab011988-22d4-4eda-bf21-334bb65558c4
100%|█████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 53.86it/s]
Traceback (most recent call last):
  File "test2.py", line 103, in <module>
    to_upload = ZipDocument(file= ("/mnt/d/Users/Jake/Documents/PDF Encryption/Zipped Files/" + doc_id + ".zip"))
  File "/home/jake/.local/lib/python3.6/site-packages/rmapy/document.py", line 247, in __init__
    self.load(file)
  File "/home/jake/.local/lib/python3.6/site-packages/rmapy/document.py", line 329, in load
    with zf.open(f"{self.ID}.content", 'r') as content:
  File "/usr/lib/python3.6/zipfile.py", line 1375, in open
    zinfo = self.getinfo(name)
  File "/usr/lib/python3.6/zipfile.py", line 1304, in getinfo
    'There is no item named %r in the archive' % name)
KeyError: "There is no item named '38f0fcc6-a403-4cbc-b949-c719375cc303.content' in the archive"
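Separately from the KeyError, the zipfolder helper above never closes the archive explicitly and derives member names by slicing strings. A sketch of the same step using a context manager and os.path.relpath, so the archive is finalised before anything reads it back; zip_folder is an illustrative name, and the sketch assumes zip_path lies outside source_dir:

```python
import os
import zipfile


def zip_folder(source_dir, zip_path):
    """Write every file under source_dir into zip_path with relative member names."""
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for base, _dirs, files in os.walk(source_dir):
            for file_name in files:
                full_path = os.path.join(base, file_name)
                # Store forward-slash names relative to source_dir, e.g. "<uuid>/0.rm".
                arcname = os.path.relpath(full_path, source_dir).replace(os.sep, "/")
                zf.write(full_path, arcname)
    return zip_path
```

This only tidies up the zipping step; it does not by itself explain why the ID in the error differs from the printed UUID.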

jakecminihan commented 2 years ago

I've just realised: I think the change happens at output.write. Because of the way the encrypter has to work, the result is technically a new PDF document. I guess I'd then have to find the new UUID of the doc and rename the associated files too. Any idea how I could find the new UUID?

jakecminihan commented 2 years ago

OK, here's a breakdown of what is currently going on:

I can download PDFs, encrypt them, and re-upload them, but only using ZipDocument(doc="...") and not ZipDocument(file="..."), which means that I lose all annotations when doing this.
It looks like the document ID for the PDF changes. I first thought this was due to the encryption process, so I added code to find the new ID after encryption (creating x = ZipDocument(doc="...") and looking at the output of x should give the new ID), then renamed all the files in the directory to use this new ID before zipping. This didn't work, so I think perhaps the zipping process might change the ID as well?
I'm also unsure how to download folders of notebooks from the reMarkable and how to convert .rm files to PDF, but that's an extension for later and beyond this issue thread.
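For the renaming step in the second point, note that a directory name contains no dot, so taking everything from the first "." onwards does not work for the subfolder that holds the .rm files. A sketch that renames every top-level entry named after the old ID, files and folder alike; rename_document_entries is a hypothetical helper, and whether renaming alone is enough for the upload to be accepted is not established in this thread:

```python
import os


def rename_document_entries(folder, old_id, new_id):
    """Rename '<old_id>' and '<old_id>.<ext>' entries in folder to use new_id."""
    renamed = []
    for entry in sorted(os.listdir(folder)):
        if entry == old_id or entry.startswith(old_id + "."):
            suffix = entry[len(old_id):]  # "" for the folder, ".content", ".pdf", ...
            target = new_id + suffix
            os.rename(os.path.join(folder, entry), os.path.join(folder, target))
            renamed.append(target)
    return renamed
```

Entries that are not named after the old ID are left untouched, which avoids accidentally renaming unrelated files in the same folder.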

My current code is:

# Import dependencies
from curses import meta
from re import I
from rmapy.api import Client, ZipDocument
import zipfile
import os
from os.path import basename
import glob
from PyPDF2 import PdfFileWriter, PdfFileReader
from tqdm import tqdm
import shutil

# Complete startup checks
rm = Client()

rm.renew_token()

# Attempt to upload a file
#rawDocument = ZipDocument(doc='TestPDF.pdf')

#print(rawDocument.metadata)

#print(rm.get_meta_items())

# Obtain items from specific notebook (todo: work out how to pull multiple files from folder)
books = [ i for i in rm.get_meta_items() if i.VissibleName == "dummy" ][0]
metadata = books.to_dict()
#print(metadata)
doc_id = metadata['ID']
print("The UUID is:", doc_id)
name = books.VissibleName

# Download doc(s)
downloaded_file = rm.download(books)

# Delete old file now it's not needed any more
#rm.delete(books) # <- this is not undoable!!!!!

# Unzip to PC
# Set up folder paths
download_path = "/mnt/d/Users/Jake/Documents/PDF Encryption/RM Zips/" + doc_id + ".zip"
extraction_path = "/mnt/d/Users/Jake/Documents/PDF Encryption/To Encrypt/" + doc_id
# Dump to PC and extract
downloaded_file.dump(download_path)
with zipfile.ZipFile(download_path, 'r') as zip_ref:
    zip_ref.extractall(extraction_path)

# Find PDF
os.chdir(extraction_path)
#print(os.getcwd())

# Create function to encrypt PDFs
def encrypt_pdfs():
    # Create array of PDFs in dir & number
    pdfs = glob.glob("*.pdf")
    number = len(glob.glob("*.pdf"))

    # Iterate over all found pdfs
    for i in tqdm(range(number)):
        # Load file as pdf
        file = pdfs[i]
        file = PdfFileReader(file)

        # Is the file already encrypted? If so, do nothing. Could optimise by checking before loop and removing entries that are?
        if file.isEncrypted == True:
            pass

        # Because this plugin is weirdly made, I have to technically create a *new* PDF. Probably not very efficient!
        else:
            # Create new output PDF
            output = PdfFileWriter()

            # Determine no. of pages in file
            num = file.numPages

            # Iterate over all pages of the PDF and append to a new document
            for index in range(num):
                page = file.getPage(index)
                output.addPage(page)

            # Encrypt the output with a password    
            output.encrypt("Hi")

            # Write document to file
            with open(pdfs[i], "wb") as f:
                output.write(f)

# Run new function
encrypt_pdfs()

# Rename document
os.rename("/mnt/d/Users/Jake/Documents/PDF Encryption/To Encrypt/" + doc_id + "/" + doc_id + ".pdf", "/mnt/d/Users/Jake/Documents/PDF Encryption/To Encrypt/" + doc_id + "/" + name + ".pdf")

# Convert doc to remarkable to find new UUID
to_upload = ZipDocument(doc= ("/mnt/d/Users/Jake/Documents/PDF Encryption/To Encrypt/" + doc_id + "/" + name + ".pdf"))
new_id = (str(to_upload))[-37:-1]
print("New ID is ", new_id)
shutil.copytree(("/mnt/d/Users/Jake/Documents/PDF Encryption/To Encrypt/" + doc_id), ("/mnt/d/Users/Jake/Documents/PDF Encryption/To Encrypt/" + new_id))

# Rename all other documents in folder
os.chdir("/mnt/d/Users/Jake/Documents/PDF Encryption/To Encrypt/" + new_id)
for i in os.listdir():
    src = i
    extension = src.find(".")
    extension = src[extension:]
    dst = new_id + extension
    os.rename(src,dst)

# Create new zip output folder directory
zip_path = "/mnt/d/Users/Jake/Documents/PDF Encryption/Zipped Files/" + new_id
target_path = extraction_path = "/mnt/d/Users/Jake/Documents/PDF Encryption/To Encrypt/" + new_id

def zipfolder(foldername, target_dir):            
    zipobj = zipfile.ZipFile(foldername + '.zip', 'w')
    rootlen = len(target_dir)
    for base, dirs, files in os.walk(target_dir):
        for file in files:
            fn = os.path.join(base, file)
            zipobj.write(fn, fn[rootlen:])

zipfolder(zip_path, extraction_path + "/")

to_upload = ZipDocument(file= ("/mnt/d/Users/Jake/Documents/PDF Encryption/Zipped Files/" + new_id + ".zip"))

#rm.upload(to_upload)

subutux / rmapy

Downloading + extracting PDFs #30
