vladholubiev / serverless-libreoffice

Run LibreOffice in AWS Lambda to create PDFs & convert documents
https://vladholubiev.com/serverless-libreoffice

Python Example #34


bvandermeersch commented 4 years ago

It seems the Python example located here no longer works:

https://github.com/vladgolubev/serverless-libreoffice/blob/master/STEP_BY_STEP.md

It seems tar/zip are no longer available in the Amazon Linux 2 Python 3.8 runtime.

I've also attempted to package it as a layer, but I get this error when running it:

sh: /opt/instdir/program/soffice: No such file or directory

Even though it clearly is there.

Anyone else get this working?

bvandermeersch commented 4 years ago

I had to use the AWS Lambda Python 3.6 runtime to make this work.

The Python 3.8 runtime for AWS Lambda no longer includes curl, tar, or zip, among other packages.
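Since the missing curl/tar/zip are only shell binaries, one workaround is to do the download and extraction entirely with the Python standard library. This is a minimal sketch, not code from this repo: the function names and the `dest` default are mine, and note that `tarfile` autodetects gzip/bz2/xz but not the brotli (`.tar.br`) archive this project ships, which would still need the third-party `brotli` package.

```python
# Sketch: replace the curl/tar binaries (missing from the Python 3.8
# runtime) with stdlib equivalents. Names and paths are placeholders.
import io
import tarfile
import urllib.request

def extract_tar_bytes(data, dest):
    # tarfile stands in for the tar binary; "r:*" autodetects
    # gzip/bz2/xz compression (brotli is NOT covered)
    with tarfile.open(fileobj=io.BytesIO(data), mode="r:*") as tf:
        tf.extractall(dest)

def fetch_and_extract(url, dest="/tmp"):
    # urllib stands in for curl for the download step
    with urllib.request.urlopen(url) as resp:
        extract_tar_bytes(resp.read(), dest)
```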

vcrusselle commented 4 years ago

I think I got to the same spot you did. I changed /opt/instdir/program/soffice to /opt/instdir/program/soffice.bin, but then I got

sh: instdir/program/soffice.bin: Permission denied
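"Permission denied" usually means the execute bit was lost when the files were unpacked (zip extraction does not always preserve Unix modes). A hedged sketch of restoring it from Python, without needing the chmod binary; the helper name is mine:

```python
# Sketch: restore the execute bit on soffice.bin after extraction.
# The path is the one from this thread; the function name is made up.
import os
import stat

def make_executable(path):
    # Add the user/group execute bits on top of the existing mode
    st = os.stat(path)
    os.chmod(path, st.st_mode | stat.S_IXUSR | stat.S_IXGRP)

# make_executable("/tmp/instdir/program/soffice.bin")
```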

While this is probably laughable, here is my code:

import boto3
import os

s3_bucket = boto3.resource("s3").Bucket("************")
convertCommand = "instdir/program/soffice.bin --headless --invisible --nodefault --nofirststartwizard --nolockcheck --nologo --norestore --convert-to pdf --outdir /tmp"

client = boto3.client('s3')
resource = boto3.resource('s3')

def download_dir(client, resource, dist, local='/tmp', bucket='your_bucket'):
    paginator = client.get_paginator('list_objects')
    for result in paginator.paginate(Bucket=bucket, Delimiter='/', Prefix=dist):
        if result.get('CommonPrefixes') is not None:
            for subdir in result.get('CommonPrefixes'):
                download_dir(client, resource, subdir.get('Prefix'), local, bucket)
        for file in result.get('Contents', []):
            dest_pathname = os.path.join(local, file.get('Key'))
            if not os.path.exists(os.path.dirname(dest_pathname)):
                os.makedirs(os.path.dirname(dest_pathname))
            resource.meta.client.download_file(bucket, file.get('Key'), dest_pathname)

def lambda_handler(event, context):
    print("Starting Process")
    print("Starting Download")
    download_dir(client, resource, 'instdir/', '/tmp', bucket='********')
    print("Download Complete")

    for record in event['Records']:
        bucket = record['s3']['bucket']['name']
        key = record['s3']['object']['key']

        print("Starting Conversion")
        print(os.system("cd /tmp && ls"))
        print(os.system("cd /tmp/instdir && ls"))
        print(os.system("cd /tmp/instdir/program && ls"))
        # Execute libreoffice to convert input file

        # Note: sudo is not available in the Lambda runtime
        os.system(f"cd /tmp && sudo {convertCommand} {key}")
        print("Conversion Complete")

        # Save converted object in S3
        print("Starting Save")
        outputFileName, _ = os.path.splitext(key)
        outputFileName = outputFileName  + ".pdf"
        f = open(f"/tmp/{outputFileName}","rb")
        s3_bucket.put_object(Key=outputFileName,Body=f,ACL="private")
        print("Saving Complete")
        f.close()

I was not able to figure out how to get around the missing curl, tar, and other dependencies, so I decompressed the file and uploaded it to an S3 bucket. I work for a company that has to have relatively few dependencies because we work with sensitive data all the time. So I went through all the steps here, but have hit a snag with the 3.8 solution. Guess I will have to settle for the 3.6 solution for now.

vcrusselle commented 4 years ago

So while I could have worked with the npm module to get tar and brotli to decompress the file, I decided to decompress it locally on my machine (using PeaZip) and upload just a zip of the files to a different S3 bucket from the drop bucket and the output bucket.

Note: The zip decompression is handled in memory and not saved to the filesystem until extractall("/tmp"), so you may need to allocate roughly 200-300 MB more memory to this function if you are under the max. I used 1600 MB for this code example.
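If that extra memory is a problem, a hedged alternative is to stream the archive to /tmp with `download_file` and let `ZipFile` read it from disk, so the whole zip never sits in a `BytesIO` buffer. A minimal sketch; the helper names and archive path are mine, not from this thread:

```python
# Sketch: extract the LibreOffice archive from disk instead of RAM.
# Bucket/key names are placeholders.
import os
from zipfile import ZipFile

def extract_zip(archive_path, dest):
    # zipfile reads entries lazily from the file, keeping peak memory low
    with ZipFile(archive_path) as z:
        z.extractall(dest)

def fetch_instdir(bucket_name, key, dest="/tmp"):
    import boto3  # imported here so the module loads without boto3
    archive = os.path.join(dest, "instdir.zip")
    # download_file streams the object to disk rather than into memory
    boto3.client("s3").download_file(bucket_name, key, archive)
    extract_zip(archive, dest)
```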

Below is working Python 3.8 code:

import boto3
import os
from zipfile import ZipFile
from io import BytesIO

s3_bucket = boto3.resource("s3").Bucket("************-output") #output bucket
zip_obj = boto3.resource("s3").Object(bucket_name="*********-pdf", key="instdir.zip") #bucket that has your zip file in
buffer = BytesIO(zip_obj.get()["Body"].read())
z = ZipFile(buffer)
z.extractall("/tmp")

convertCommand = "instdir/program/soffice.bin --headless --norestore --invisible --nodefault --nofirststartwizard --nolockcheck --nologo --convert-to 'pdf:writer_pdf_Export' --outdir /tmp"
resource = boto3.resource('s3')

def lambda_handler(event,context):

    for record in event['Records']:
        bucket = record['s3']['bucket']['name']
        key = record['s3']['object']['key'].replace("+"," ")

        # Execute libreoffice to convert input file
        print("Elevating Permissions")
        os.system("chmod u+x /tmp/instdir/program/soffice.bin")
        print("Permissions Elevated")
        print("Downloading File")
        resource.meta.client.download_file(bucket, f"{key}", f"/tmp/{key}")
        print("File Downloaded")
        print("Starting Conversion")
        # Not sure why you have to run this twice, but it works on the second run consistently
        os.system(f"cd /tmp && {convertCommand} '{key}'")
        os.system(f"cd /tmp && {convertCommand} '{key}'")
        print("Conversion Complete")

        # Save converted object in S3
        print("Starting Save")
        outputFileName, path = os.path.splitext(key)
        outputFileName = outputFileName  + ".pdf"
        f = open(f"/tmp/{outputFileName}","rb")
        s3_bucket.put_object(Key=outputFileName,Body=f,ACL="private")
        print("Saving Complete")
        f.close()
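The doubled `os.system` call above papers over soffice failing on its first invocation (LibreOffice commonly needs one run to set up its user profile on a cold start). A hedged alternative is to retry until the expected PDF actually appears instead of always running twice; the function name and attempt count below are my own assumptions:

```python
# Sketch: retry the conversion until the output PDF exists, rather than
# unconditionally running soffice twice.
import os
import subprocess

def convert_with_retry(command, workdir, expected_pdf, attempts=2):
    """Run the soffice conversion command, retrying because the first
    invocation can fail while LibreOffice initializes its profile."""
    for _ in range(attempts):
        subprocess.run(command, shell=True, cwd=workdir)
        if os.path.exists(expected_pdf):
            return True
    return os.path.exists(expected_pdf)
```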