laiso / site2pdf

Generate comprehensive PDFs of entire websites, ideal for RAG.
MIT License
157 stars, 7 forks

Offering site2pdf Online #6

Open laiso opened 1 month ago

laiso commented 1 month ago

I am considering offering site2pdf online for mobile users and non-technical users. The service would let users enter a website URL, convert the site into a PDF, and download the result. To get there quickly, I evaluated Cloudflare's Browser Rendering API (a managed Puppeteer server).

https://gist.github.com/laiso/c36ac504afb2715831ef9410853753fb/

This code uses Cloudflare Workers to provide the following functions:

  1. Extract links from the specified URL and add PDF generation tasks to a queue.
  2. Select a random browser session using Puppeteer.
  3. Fetch messages from the queue, visit URLs using the browser session, generate PDFs, and save them in segments to the R2 bucket.
  4. Retrieve multiple PDF files from the R2 bucket, merge them, and return a single PDF file.
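
As a rough illustration of step 1, the sketch below collects same-origin links from a page and turns them into queue messages. It is not taken from the gist: the `PdfTask` shape, the `extractLinks` and `buildTasks` helpers, and the zero-padded R2 key scheme are all hypothetical names chosen for this example. The regex-based link extraction is a simplification; the real code may read links from the rendered DOM instead.

```typescript
// Hypothetical message shape for the PDF-generation queue (an assumption, not the gist's).
interface PdfTask {
  jobId: string;
  index: number; // position of the page in the final merged PDF
  url: string;
  r2Key: string; // object key for this page's PDF segment
}

// Collect unique same-origin links from an HTML string, resolved against the page URL.
function extractLinks(html: string, pageUrl: string): string[] {
  const origin = new URL(pageUrl).origin;
  const links: string[] = [];
  const hrefPattern = /<a\s[^>]*?href\s*=\s*["']([^"']+)["']/gi;
  let match: RegExpExecArray | null;
  while ((match = hrefPattern.exec(html)) !== null) {
    let resolved: URL;
    try {
      resolved = new URL(match[1], pageUrl);
    } catch (_err) {
      continue; // skip malformed hrefs
    }
    resolved.hash = ""; // "#section" variants point at the same page
    if (resolved.origin === origin && links.indexOf(resolved.href) === -1) {
      links.push(resolved.href);
    }
  }
  return links;
}

// Build one queue message per URL. Zero-padded keys sort lexicographically,
// so listing the segments by key returns them in page order for merging.
function buildTasks(jobId: string, urls: string[]): PdfTask[] {
  return urls.map((url, index) => ({
    jobId,
    index,
    url,
    r2Key: `${jobId}/${("00000" + index).slice(-5)}.pdf`,
  }));
}
```

In a Worker, each task would then be sent to the queue and later consumed by the step that renders the page and writes the segment to R2 under `r2Key`.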

Discovered Issues

  1. Resource Constraints of the Browser Rendering API

    • Only two browser instances can run simultaneously, and each one occupies a consumer worker. Adding more queue consumers therefore does not increase throughput. The API seems to be intended for in-house use rather than for a public service.
  2. Execution Time and Memory Constraints of Cloudflare Workers

    • PDF generation consumes significant resources, and the execution time and memory limits of Cloudflare Workers are too tight for it. These limits are a major obstacle.
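
To make the first issue concrete, here is a back-of-the-envelope sketch of why extra queue consumers do not help. The function name `estimateSeconds` and the per-page render time used in the usage note are assumptions for illustration, not measurements; only the limit of two simultaneous browser instances comes from the findings above.

```typescript
// Rough wall-clock estimate for rendering `pageCount` pages.
// Parallelism is capped by the browser limit, no matter how many
// queue consumers are configured.
function estimateSeconds(
  pageCount: number,
  secondsPerPage: number, // assumed average render time per page
  queueConsumers: number,
  browserLimit: number
): number {
  if (pageCount <= 0) return 0;
  const sessions = Math.min(queueConsumers, browserLimit);
  if (sessions < 1) throw new Error("at least one session is required");
  const rounds = Math.ceil(pageCount / sessions);
  return rounds * secondsPerPage;
}
```

Under these assumptions, `estimateSeconds(200, 5, 2, 2)` and `estimateSeconds(200, 5, 10, 2)` both give 500 seconds: going from 2 to 10 consumers changes nothing while the browser limit stays at two.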

Future Actions

  1. Consideration of Alternative Cloud Services

    • Firecrawl and Jina Reader are deployed to Fly.io and Cloud Functions. These platforms offer more resources and can scale out PDF generation tasks, which makes them a better fit.
  2. Development of a Desktop Application #9

    • Create a desktop application using Electron, so that users generate PDFs with their own machine's resources. This approach avoids cloud resource constraints and enables smoother PDF generation.

Welding-Torch commented 1 month ago

Yeah do this please, it would make using the project straightforward