shmublu / pdf2txt

PDF2Text

This script parses a PDF file and classifies each line as either a paragraph or a header. The classified lines are then written to an output text file.
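This README does not describe how lines are classified, so the following is only a minimal sketch of one plausible heuristic, not the script's actual logic. The function name `classify_line` and the `max_header_words` threshold are hypothetical; the rule assumed here is that a short line that starts with a capital letter and does not end in sentence punctuation is a header.

```python
def classify_line(line: str, max_header_words: int = 8) -> str:
    """Return "header" or "paragraph" using an assumed, illustrative heuristic.

    This is NOT taken from pdf2text.py; it only shows the kind of rule
    a line classifier might apply.
    """
    text = line.strip()
    words = text.split()
    if not words:
        # Treat empty lines as paragraph content (an arbitrary choice here).
        return "paragraph"

    is_short = len(words) <= max_header_words
    starts_capitalized = text[0].isupper()
    ends_like_sentence = text[-1] in ".,;:!?"

    if is_short and starts_capitalized and not ends_like_sentence:
        return "header"
    return "paragraph"


if __name__ == "__main__":
    for sample in ["Optional Arguments", "The classified lines are then written to a file."]:
        print(f"{classify_line(sample):9s} | {sample}")
```

A real implementation would likely also use layout information from the PDF, such as font size or weight, which plain text alone cannot provide.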

Requirements

Usage

Run the script from the command line, passing the path to the input PDF file and, optionally, the output file name.

python pdf2text.py /path/to/input.pdf /path/to/output.txt

Optional Arguments

--max_pages: Maximum number of pages to process (default: 100).

--merge_headers: Whether to merge headers (default: True).

python pdf2text.py /path/to/input.pdf /path/to/output.txt --max_pages 50 --merge_headers False
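The command line above could be parsed along the following lines. This is a hedged sketch using the standard `argparse` module, not the script's real parser: the helper names `str_to_bool` and `build_parser`, the positional argument names, and the `None` default for the output path are all assumptions. Only the flag names and their defaults (100 and True) come from this README.

```python
import argparse


def str_to_bool(value: str) -> bool:
    """Convert "True"/"False" style strings to a bool (hypothetical helper)."""
    lowered = value.strip().lower()
    if lowered in {"true", "1", "yes"}:
        return True
    if lowered in {"false", "0", "no"}:
        return False
    raise argparse.ArgumentTypeError(f"expected True or False, got {value!r}")


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented interface (assumed layout)."""
    parser = argparse.ArgumentParser(description="Convert a PDF to classified text.")
    parser.add_argument("input_pdf", help="path to the input PDF file")
    # The output path is optional; the real default name is not documented.
    parser.add_argument("output_txt", nargs="?", default=None,
                        help="path to the output text file")
    parser.add_argument("--max_pages", type=int, default=100,
                        help="maximum number of pages to process")
    parser.add_argument("--merge_headers", type=str_to_bool, default=True,
                        help="whether to merge headers")
    return parser


# Parse the example invocation from this README.
args = build_parser().parse_args([
    "/path/to/input.pdf", "/path/to/output.txt",
    "--max_pages", "50", "--merge_headers", "False",
])
print(args.max_pages, args.merge_headers)
```

An explicit converter is used for `--merge_headers` because `type=bool` would treat the string "False" as true: any non-empty string is truthy in Python.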