@pdf/pdftext - Simple PDF Text Extraction Module

@pdf/pdftext is a lightweight module for extracting text from PDFs, built on PDF.js. It can be used in various environments, including:

- Deno
- Browser
- Command-line
  - With Deno installed
  - Without Deno installed

Features

- Easy to use from Deno, the browser, and the command line.
- PDF text extraction with minimal configuration.
- Cross-platform support for Windows, Linux, and macOS.

Deno Usage

import { pdfText } from 'jsr:@pdf/pdftext';

const pdfBuffer: ArrayBuffer = Deno.readFileSync('./path/to/pdf');
const page: { [pageno: number]: string } = await pdfText(pdfBuffer);

// To get page 1
console.log(`Page 1 text: ${page[1]}`);

// To get all pages
// console.log(`All page text: \n${page[0]}`);
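
To work with each page in turn, the result object can be walked in page order. The sketch below is only an illustration: `listPages` and `PageMap` are hypothetical names, not part of @pdf/pdftext, and it assumes (from the comments in the usage above) that key `0` holds the text of all pages while keys `1`, `2`, ... hold the individual pages.

```typescript
// Shape of the result as shown in the usage above: page number -> text.
// Assumption: key 0 holds the text of all pages, as the usage comments suggest.
type PageMap = { [pageno: number]: string };

// Hypothetical helper (not part of @pdf/pdftext): returns [pageNumber, text]
// pairs for the individual pages in ascending order, skipping key 0.
function listPages(pages: PageMap): Array<[number, string]> {
  return Object.keys(pages)
    .map(Number)
    .filter((n) => Number.isInteger(n) && n >= 1)
    .sort((a, b) => a - b)
    .map((n): [number, string] => [n, pages[n]]);
}

// Stand-in data; in real use this would be the value returned by pdfText().
const sample: PageMap = { 0: "first\nsecond", 1: "first", 2: "second" };
for (const [pageNo, text] of listPages(sample)) {
  console.log(`Page ${pageNo}: ${text.length} characters`);
}
```

In real code, `sample` would be replaced by the object returned from `await pdfText(pdfBuffer)`.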

Browser Usage

Download Module: Download the pdftext.js module using curl or a similar utility:
```
curl -L -O -C- https://jsr.io/@pdf/pdftext/1.2.8/src/pdftext.js
```

Minimal HTML Page:

<script type="module">
import { pdfText } from './pdftext.js';

document.getElementById('file-input').addEventListener('change', async (event) => {
const file = event.target.files[0];
const pdfBuffer = await file.arrayBuffer();
const page = await pdfText(pdfBuffer);

// To get page 1
console.log(`Page 1: ${page[1]}`);

// To get all pages
// console.log(`All page: \n${page[0]}`);
});
</script>

<input type="file" id="file-input" />
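
Because a file input accepts any file, it may be worth checking that the selection looks like a PDF before handing it to `pdfText`. The sketch below is a hypothetical helper, not part of @pdf/pdftext; it tests for the `%PDF-` signature that PDF files normally begin with. This is a strict check, so a file with stray bytes before the header would be rejected.

```typescript
// Hypothetical helper (not part of @pdf/pdftext): returns true when the data
// starts with the ASCII signature "%PDF-".
function looksLikePdf(bytes: Uint8Array): boolean {
  const signature = [0x25, 0x50, 0x44, 0x46, 0x2d]; // "%PDF-"
  if (bytes.length < signature.length) return false;
  return signature.every((byte, i) => bytes[i] === byte);
}

// Stand-in data; in the page above this would be
// new Uint8Array(await file.arrayBuffer()).
const pdfLike = new Uint8Array([0x25, 0x50, 0x44, 0x46, 0x2d, 0x31, 0x2e, 0x37]);
const plainText = new Uint8Array([0x68, 0x65, 0x6c, 0x6c, 0x6f]); // "hello"
console.log(looksLikePdf(pdfLike), looksLikePdf(plainText));
```

In the change handler, the check could run on `new Uint8Array(pdfBuffer)` and skip the `pdfText` call when it returns `false`.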

Command-Line Usage

i. With Deno Installed

Install or Update:

deno install -frgA jsr:@pdf/pdftext/pdftxt

Usage: Extract text from a PDF:
```
pdftxt sample.pdf
```
Uninstall: To remove the command-line tool:
```
deno uninstall -g pdftxt
```

ii. Without Deno Installed

Download Executable: Download the appropriate version from the releases page.

Usage:

Windows:
```
.\pdftxt.exe sample.pdf
```
Linux:
```
./pdftxt sample.pdf
```

Testing

TypeScript code for testing purposes:


import { test } from "jsr:@pdf/pdftext";
console.log(await test());

License

This project is licensed under the MIT License.

Dependencies

Some dependencies of this project may be licensed under different terms. In particular, PDF.js is licensed under the Apache License 2.0.
