tinyCodes1 / pdftext

@pdf/pdftext - Simple PDF Text Extraction Module

@pdf/pdftext is a lightweight module for extracting text from PDFs, built on PDF.js. It can be used in several environments, including Deno, web browsers, and the command line.

Features

Deno Usage

import { pdfText } from 'jsr:@pdf/pdftext';

// Deno.readFileSync returns a Uint8Array; pass its underlying ArrayBuffer
const pdfBuffer: ArrayBuffer = Deno.readFileSync('./path/to/pdf').buffer;
const page: { [pageno: number]: string } = await pdfText(pdfBuffer);

// To get page 1
console.log(`Page 1 text: ${page[1]}`);

// To get all pages
// console.log(`All page text: \n${page[0]}`);
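
As the example above shows, the returned object keeps the combined text under key 0 and each individual page under its page number. The following is a minimal sketch of walking the individual pages in order. The helper listPages and the sample data are hypothetical and not part of @pdf/pdftext; in practice the map would come from await pdfText(pdfBuffer).

```typescript
// Shape returned by pdfText(), as described in the usage example above.
type PageMap = { [pageno: number]: string };

// Hypothetical helper (not part of @pdf/pdftext): returns the individual
// pages in ascending order, skipping key 0, which holds the combined text.
function listPages(pages: PageMap): { pageNo: number; text: string }[] {
  return Object.keys(pages)
    .map(Number)
    .filter((pageNo) => pageNo > 0)
    .sort((a, b) => a - b)
    .map((pageNo) => ({ pageNo, text: pages[pageNo] }));
}

// Stand-in data for illustration only.
const samplePages: PageMap = {
  0: "First page\nSecond page",
  1: "First page",
  2: "Second page",
};

for (const { pageNo, text } of listPages(samplePages)) {
  console.log(`Page ${pageNo}: ${text.length} characters`);
}
```

Sorting the keys numerically keeps the output in page order regardless of how the object was built.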

Browser Usage

  1. Download Module: Download the pdftext.js module using curl or a similar utility:
    curl -L -O -C- https://jsr.io/@pdf/pdftext/1.2.8/src/pdftext.js
  2. Minimal HTML Page:
    <script type="module">
    import { pdfText } from './pdftext.js';

    document.getElementById('file-input').addEventListener('change', async (event) => {
      const file = event.target.files[0];
      const pdfBuffer = await file.arrayBuffer();
      const page = await pdfText(pdfBuffer);

      // To get page 1
      console.log(`Page 1: ${page[1]}`);

      // To get all pages
      // console.log(`All page: \n${page[0]}`);
    });
    </script>
    
    <input type="file" id="file-input" />
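
A file input accepts any file type, so it may help to check the buffer before handing it to pdfText. The sketch below tests for the "%PDF-" header that PDF files start with. The helpers looksLikePdf and asciiBuffer are hypothetical and not part of @pdf/pdftext, and the check is a quick sanity test rather than full validation.

```typescript
// Hypothetical helper (not part of @pdf/pdftext): returns true when the
// buffer starts with the ASCII bytes "%PDF-".
function looksLikePdf(buffer: ArrayBuffer): boolean {
  const magic = [0x25, 0x50, 0x44, 0x46, 0x2d]; // "%PDF-"
  if (buffer.byteLength < magic.length) return false;
  const head = new Uint8Array(buffer, 0, magic.length);
  return magic.every((byte, i) => head[i] === byte);
}

// Illustration-only helper: builds an ArrayBuffer from an ASCII string,
// standing in for the result of file.arrayBuffer().
function asciiBuffer(text: string): ArrayBuffer {
  const buffer = new ArrayBuffer(text.length);
  const view = new Uint8Array(buffer);
  for (let i = 0; i < text.length; i++) view[i] = text.charCodeAt(i);
  return buffer;
}

console.log(looksLikePdf(asciiBuffer("%PDF-1.7\n")));
console.log(looksLikePdf(asciiBuffer("<html>")));
```

In the change handler, such a check could run right after file.arrayBuffer() resolves, returning early when it fails.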
    

Command-Line Usage

i. With Deno Installed

  1. Install or Update:
    deno install -frgA jsr:@pdf/pdftext/pdftxt
  2. Usage: Extract text from a PDF:
    pdftxt sample.pdf
  3. Uninstall: To remove the command-line tool:
    deno uninstall pdftxt

ii. Without Deno Installed

  1. Download Executable: Download the appropriate version from the releases page.
  2. Usage:
    • Windows:
      .\pdftxt.exe sample.pdf
    • Linux:
      ./pdftxt sample.pdf

Testing

TypeScript code for testing purposes:


import { test } from "jsr:@pdf/pdftext";
console.log(await test());

License

This project is licensed under the MIT License.

Dependencies

Some dependencies of this project may be licensed under different terms. In particular, PDF.js is licensed under the Apache License 2.0.