subutux / rmapy

An unofficial Python module for interacting with the reMarkable Cloud
http://rmapy.readthedocs.io/
MIT License
125 stars, 48 forks

Downloading + extracting PDFs #30

Open jakecminihan opened 2 years ago

jakecminihan commented 2 years ago

Hi there!

I've been playing around using this and it's great, thank you for your work. I can successfully upload documents to my RM, but I can't figure out how to download ZipDocuments and extract them - I'd like to create a script that can automatically encrypt notebooks I make and return them to the RM Cloud. Here's the code I have working - no idea if this bit is even right or not!

from rmapy.api import Client, ZipDocument

rm = Client()

rm.renew_token()

Doc = [ i for i in rm.get_meta_items() if i.VissibleName == "TestDoc" ][0]
print(Doc)

Printing gives <rmapy.document.Document document ID>.

downloaded_file = rm.download(Doc)
print(downloaded_file)

Again, this yields <rmapy.document.ZipDocument document ID>. Looking at the code, I'm not sure what I should do next. Thank you!

subutux commented 2 years ago

Aha, the missing documentation 😄!

You're right on track!

rmapy.document.ZipDocument has a dump() method that writes the contents to a zip file.

In your example you could do something like this:

downloaded_file = rm.download(Doc)
downloaded_file.dump("/where/to/store/document.zip")
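Before editing a dumped archive it can help to see what is inside it. The following is a minimal sketch using only the standard library `zipfile` module; `list_archive` is a hypothetical helper name, not part of rmapy, and the path you pass in is whatever you gave to `dump()`.

```python
import zipfile


def list_archive(zip_path):
    """Return the sorted entry names stored in the zip file at zip_path."""
    with zipfile.ZipFile(zip_path, "r") as zf:
        return sorted(zf.namelist())


# Hypothetical usage, with the path used in the dump() call above:
# for name in list_archive("/where/to/store/document.zip"):
#     print(name)
```

The entry names are what matter later on: they carry the document's UUID as a prefix, so it is worth noting them down before changing anything.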

After you're done encrypting, you can follow the guide to upload it back again, but it should be something like this:

from rmapy.document import ZipDocument
from rmapy.api import Client

# loads the zip back into a ZipDocument class
to_upload = ZipDocument(file="/where/to/store/document_encrypted.zip")

rm = Client()
rm.renew_token()
rm.upload(to_upload)

Let me know if you need any more pointers!

jakecminihan commented 2 years ago

Thank you for replying 😄 I'm now 90% of the way there, but I get a really weird error when using ZipDocument:

Traceback (most recent call last):
  File "test2.py", line 112, in <module>
    to_upload = ZipDocument(file= (zip_path + ".zip"))
  File "/home/jake/.local/lib/python3.6/site-packages/rmapy/document.py", line 247, in __init__
    self.load(file)
  File "/home/jake/.local/lib/python3.6/site-packages/rmapy/document.py", line 329, in load
    with zf.open(f"{self.ID}.content", 'r') as content:
  File "/usr/lib/python3.6/zipfile.py", line 1375, in open
    zinfo = self.getinfo(name)
  File "/usr/lib/python3.6/zipfile.py", line 1304, in getinfo
    'There is no item named %r in the archive' % name)
KeyError: "There is no item named 'bfef0d76-c4ce-4ed3-8912-a4fcf2bf5fe2.content' in the archive"

The ID in the "There is no item named" message changes each time, and I can't work out what this error is or why it's happening! The dumped zip and the encrypted zip have the same file structure (a .content, a .pagedata and a .pdf, then a subfolder with three .rm files and three .json files) and the same IDs. I'm sure it's something I've overlooked, but I have no idea what it is!
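The last frames of the traceback are plain `zipfile` behaviour: `load` asks the archive for an entry named `f"{self.ID}.content"`, and `zipfile` raises `KeyError` when no entry has that name. A minimal sketch of that lookup, using only the standard library; the helper names and the short IDs are made up for illustration and are not part of rmapy.

```python
import io
import zipfile


def make_archive(names):
    """Build an in-memory zip containing one empty entry per name."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for name in names:
            zf.writestr(name, b"")
    return buf.getvalue()


def has_entry(zip_bytes, name):
    """Return True if the archive contains an entry called name."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes), "r") as zf:
        try:
            zf.getinfo(name)  # the same call that fails in the traceback
        except KeyError:
            return False
        return True
```

If the ID held by the object differs from the UUID prefix used inside the archive, the lookup fails in this way even though the archive itself is well formed.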

subutux commented 2 years ago

I think the zip file needs the same UUID as the files in the zip content. So: <uuid>.zip. Could you post your test2.py content in full?
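A quick way to check this suggestion is to compare the zip's file name with the UUID prefix of the `.content` entry inside it. This is a sketch under the assumption that the archive holds exactly one top-level `.content` entry, as described earlier in the thread; both function names are hypothetical.

```python
import os
import zipfile


def archive_uuid(zip_path):
    """Return the stem of the single top-level '.content' entry, or None."""
    suffix = ".content"
    with zipfile.ZipFile(zip_path, "r") as zf:
        stems = [
            name[: -len(suffix)]
            for name in zf.namelist()
            if name.endswith(suffix) and "/" not in name
        ]
    # Anything other than exactly one match is treated as "unknown"
    return stems[0] if len(stems) == 1 else None


def name_matches_contents(zip_path):
    """Return True if '<stem>.zip' matches the '<stem>.content' entry inside."""
    stem = os.path.splitext(os.path.basename(zip_path))[0]
    return archive_uuid(zip_path) == stem
```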

(Sent by email in reply to jakecminihan's comment, 12 Feb 2022, 14:20.)

jakecminihan commented 2 years ago

I've just tried implementing that, and it still won't work! It would appear that the UUID is changing from the downloaded document? Here is my code - apologies if it's not great, I'm more than happy to elaborate on any sections that aren't clear! For reference, I'm using a dummy pdf downloaded from https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf.

Code

# Import dependencies
from rmapy.api import Client, ZipDocument
import zipfile
import os
from os.path import basename
import glob
from PyPDF2 import PdfFileWriter, PdfFileReader
from tqdm import tqdm

# Complete startup checks
rm = Client()

rm.renew_token()

# Obtain data from specific notebook (todo: work out how to pull multiple files from folder)
books = [ i for i in rm.get_meta_items() if i.VissibleName == "dummy" ][0]
metadata = books.to_dict()
#print(metadata)
doc_id = metadata['ID']
print("The UUID is:", doc_id)
name = books.VissibleName

# Download doc(s)
downloaded_file = rm.download(books)

# Delete old file now it's not needed any more
#rm.delete(books) # <- this is not undoable!!!!!

# Unzip to PC
# Set up folder paths
download_path = "/mnt/d/Users/Jake/Documents/PDF Encryption/RM Zips/" + doc_id + ".zip"
extraction_path = "/mnt/d/Users/Jake/Documents/PDF Encryption/To Encrypt/" + doc_id
# Dump to PC and extract
downloaded_file.dump(download_path)
with zipfile.ZipFile(download_path, 'r') as zip_ref:
    zip_ref.extractall(extraction_path)

# Find PDF
os.chdir(extraction_path)
#print(os.getcwd())

# Create function to encrypt PDFs
def encrypt_pdfs():
    # Create array of PDFs in dir & number
    pdfs = glob.glob("*.pdf")
    number = len(glob.glob("*.pdf"))

    # Iterate over all found pdfs
    for i in tqdm(range(number)):
        # Load file as pdf
        file = pdfs[i]
        file = PdfFileReader(file)

        # Is the file already encrypted? If so, do nothing. Could optimise by checking before loop and removing entries that are?
        if file.isEncrypted == True:
            pass

        # Because this plugin is weirdly made, I have to technically create a *new* PDF. Probably not very efficient!
        else:
            # Create new output PDF
            output = PdfFileWriter()

            # Determine no. of pages in file
            num = file.numPages

            # Iterate over all pages of the PDF and append to a new document
            for index in range(num):
                page = file.getPage(index)
                output.addPage(page)

            # Encrypt the output with a password    
            output.encrypt("PlaceHolderPassword")

            # Write document to file
            with open(pdfs[i], "wb") as f:
                output.write(f)

# Run new function
encrypt_pdfs()

# Create new zip output folder directory
zip_path = "/mnt/d/Users/Jake/Documents/PDF Encryption/Zipped Files/" + doc_id

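def zipfolder(foldername, target_dir):
    rootlen = len(target_dir)
    # Use a context manager so the archive is always finalised and closed
    with zipfile.ZipFile(foldername + '.zip', 'w') as zipobj:
        for base, dirs, files in os.walk(target_dir):
            for file in files:
                fn = os.path.join(base, file)
                # Store paths relative to target_dir so entries sit at the archive root
                zipobj.write(fn, fn[rootlen:])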

zipfolder(zip_path, extraction_path + "/")

# Upload to remarkable
to_upload = ZipDocument(file= ("/mnt/d/Users/Jake/Documents/PDF Encryption/Zipped Files/" + doc_id + ".zip"))

rm.upload(to_upload)

The output to this is:

The UUID is: ab011988-22d4-4eda-bf21-334bb65558c4
100%|██████████| 1/1 [00:00<00:00, 53.86it/s]
Traceback (most recent call last):
  File "test2.py", line 103, in <module>
    to_upload = ZipDocument(file= ("/mnt/d/Users/Jake/Documents/PDF Encryption/Zipped Files/" + doc_id + ".zip"))
  File "/home/jake/.local/lib/python3.6/site-packages/rmapy/document.py", line 247, in __init__
    self.load(file)
  File "/home/jake/.local/lib/python3.6/site-packages/rmapy/document.py", line 329, in load
    with zf.open(f"{self.ID}.content", 'r') as content:
  File "/usr/lib/python3.6/zipfile.py", line 1375, in open
    zinfo = self.getinfo(name)
  File "/usr/lib/python3.6/zipfile.py", line 1304, in getinfo
    'There is no item named %r in the archive' % name)
KeyError: "There is no item named '38f0fcc6-a403-4cbc-b949-c719375cc303.content' in the archive"

jakecminihan commented 2 years ago

I've just realised - I think the change comes when I output.write - it's technically a new PDF document I think, because of the way the encrypter has to work. I guess I'd then have to find the new UUID of the doc and change the names of the associated documents too. Any idea how I could find the new UUID?
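One way to test that hypothesis without touching rmapy is to overwrite a file in an extracted folder, zip the folder again, and compare the entry names. The sketch below uses only the standard library; `zip_directory` is a hypothetical helper, and the output zip should live outside the folder being zipped so it does not include itself.

```python
import os
import zipfile


def zip_directory(src_dir, zip_path):
    """Zip every file under src_dir, storing paths relative to src_dir."""
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for base, _dirs, files in os.walk(src_dir):
            for fname in files:
                full = os.path.join(base, fname)
                # Zip entry names always use forward slashes
                arcname = os.path.relpath(full, src_dir).replace(os.sep, "/")
                zf.write(full, arcname)
    return zip_path
```

Rewriting the bytes of `<uuid>.pdf` does not rename the file, so the entry names come out unchanged. That suggests the changing ID comes from somewhere else; the traceback points at `self.ID` on the new ZipDocument object rather than at anything inside the PDF.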

jakecminihan commented 2 years ago

OK, here's a breakdown of what is currently going on:

My current code is:

# Import dependencies
from rmapy.api import Client, ZipDocument
import zipfile
import os
from os.path import basename
import glob
from PyPDF2 import PdfFileWriter, PdfFileReader
from tqdm import tqdm
import shutil

# Complete startup checks
rm = Client()

rm.renew_token()

# Attempt to upload a file
#rawDocument = ZipDocument(doc='TestPDF.pdf')

#print(rawDocument.metadata)

#print(rm.get_meta_items())

# Obtain items from specific notebook (todo: work out how to pull multiple files from folder)
books = [ i for i in rm.get_meta_items() if i.VissibleName == "dummy" ][0]
metadata = books.to_dict()
#print(metadata)
doc_id = metadata['ID']
print("The UUID is:", doc_id)
name = books.VissibleName

# Download doc(s)
downloaded_file = rm.download(books)

# Delete old file now it's not needed any more
#rm.delete(books) # <- this is not undoable!!!!!

# Unzip to PC
# Set up folder paths
download_path = "/mnt/d/Users/Jake/Documents/PDF Encryption/RM Zips/" + doc_id + ".zip"
extraction_path = "/mnt/d/Users/Jake/Documents/PDF Encryption/To Encrypt/" + doc_id
# Dump to PC and extract
downloaded_file.dump(download_path)
with zipfile.ZipFile(download_path, 'r') as zip_ref:
    zip_ref.extractall(extraction_path)

# Find PDF
os.chdir(extraction_path)
#print(os.getcwd())

# Create function to encrypt PDFs
def encrypt_pdfs():
    # Create array of PDFs in dir & number
    pdfs = glob.glob("*.pdf")
    number = len(glob.glob("*.pdf"))

    # Iterate over all found pdfs
    for i in tqdm(range(number)):
        # Load file as pdf
        file = pdfs[i]
        file = PdfFileReader(file)

        # Is the file already encrypted? If so, do nothing. Could optimise by checking before loop and removing entries that are?
        if file.isEncrypted == True:
            pass

        # Because this plugin is weirdly made, I have to technically create a *new* PDF. Probably not very efficient!
        else:
            # Create new output PDF
            output = PdfFileWriter()

            # Determine no. of pages in file
            num = file.numPages

            # Iterate over all pages of the PDF and append to a new document
            for index in range(num):
                page = file.getPage(index)
                output.addPage(page)

            # Encrypt the output with a password    
            output.encrypt("Hi")

            # Write document to file
            with open(pdfs[i], "wb") as f:
                output.write(f)

# Run new function
encrypt_pdfs()

# Rename document
os.rename("/mnt/d/Users/Jake/Documents/PDF Encryption/To Encrypt/" + doc_id + "/" + doc_id + ".pdf", "/mnt/d/Users/Jake/Documents/PDF Encryption/To Encrypt/" + doc_id + "/" + name + ".pdf")

# Convert doc to remarkable to find new UUID
to_upload = ZipDocument(doc= ("/mnt/d/Users/Jake/Documents/PDF Encryption/To Encrypt/" + doc_id + "/" + name + ".pdf"))
new_id = to_upload.ID  # read the UUID from the object instead of slicing its string form
print("New ID is ", new_id)
shutil.copytree(("/mnt/d/Users/Jake/Documents/PDF Encryption/To Encrypt/" + doc_id), ("/mnt/d/Users/Jake/Documents/PDF Encryption/To Encrypt/" + new_id))

# Rename all other documents in folder
os.chdir("/mnt/d/Users/Jake/Documents/PDF Encryption/To Encrypt/" + new_id)
for src in os.listdir():
    # splitext returns "" for names without a dot (e.g. the subfolder),
    # so those entries are renamed to new_id exactly
    extension = os.path.splitext(src)[1]
    dst = new_id + extension
    os.rename(src, dst)

# Create new zip output folder directory
zip_path = "/mnt/d/Users/Jake/Documents/PDF Encryption/Zipped Files/" + new_id
target_path = extraction_path = "/mnt/d/Users/Jake/Documents/PDF Encryption/To Encrypt/" + new_id

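def zipfolder(foldername, target_dir):
    rootlen = len(target_dir)
    # Use a context manager so the archive is always finalised and closed
    with zipfile.ZipFile(foldername + '.zip', 'w') as zipobj:
        for base, dirs, files in os.walk(target_dir):
            for file in files:
                fn = os.path.join(base, file)
                # Store paths relative to target_dir so entries sit at the archive root
                zipobj.write(fn, fn[rootlen:])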

zipfolder(zip_path, extraction_path + "/")

to_upload = ZipDocument(file= ("/mnt/d/Users/Jake/Documents/PDF Encryption/Zipped Files/" + new_id + ".zip"))

#rm.upload(to_upload)
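The copy-and-rename steps above can also be done on the archive itself, without extracting to disk. This is a minimal sketch using only the standard library; `rename_uuid_in_zip` is a hypothetical helper, and it assumes every entry that belongs to the document starts with the old UUID.

```python
import io
import zipfile


def rename_uuid_in_zip(zip_bytes, old_id, new_id):
    """Return new zip bytes with the old_id prefix of entry names replaced by new_id."""
    out = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(zip_bytes), "r") as src, \
            zipfile.ZipFile(out, "w", zipfile.ZIP_DEFLATED) as dst:
        for info in src.infolist():
            name = info.filename
            if name.startswith(old_id):
                name = new_id + name[len(old_id):]
            # Copy the entry's bytes unchanged under the new name
            dst.writestr(name, src.read(info))
    return out.getvalue()
```

The result can be wrapped in `io.BytesIO` or written to a file before loading. A possibly simpler route, which is an assumption to verify against `__init__` in rmapy/document.py, would be to give ZipDocument the original UUID when constructing it, if the constructor accepts an ID argument, so that `load` looks for the names already in the archive.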