This script parses a PDF file and classifies its lines into paragraphs or headers. The classified lines are then written to an output text file.
Requirements: the openparse library and the argparse library (argparse ships with the Python standard library).
The script can be executed from the command line. You need to provide the path to the input PDF file and, optionally, the output file name.
python pdf2text.py /path/to/input.pdf /path/to/output.txt
--max_pages: Maximum number of pages to process (default: 100).
--merge_headers: Whether to merge headers (default: True).
python pdf2text.py /path/to/input.pdf /path/to/output.txt --max_pages 50 --merge_headers False
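The command-line interface described above could be declared with argparse roughly as follows. This is only a sketch, not the script's actual source: the names `build_parser`, `str2bool`, `input_pdf`, and `output_txt`, and the fallback output name `output.txt`, are assumptions made for illustration. The explicit `str2bool` converter is there because a value such as `--merge_headers False` arrives as a string, and argparse's `type=bool` would treat any non-empty string, including "False", as true.

```python
import argparse


def str2bool(value: str) -> bool:
    """Convert a command-line string such as 'True' or 'False' to a bool."""
    lowered = value.strip().lower()
    if lowered in {"true", "t", "yes", "y", "1"}:
        return True
    if lowered in {"false", "f", "no", "n", "0"}:
        return False
    raise argparse.ArgumentTypeError(f"expected a boolean value, got {value!r}")


def build_parser() -> argparse.ArgumentParser:
    """Build a parser matching the options listed in this README (hypothetical layout)."""
    parser = argparse.ArgumentParser(
        description="Parse a PDF and write its classified lines to a text file."
    )
    parser.add_argument("input_pdf", help="Path to the input PDF file.")
    # The default output name below is an assumption, not taken from the script.
    parser.add_argument(
        "output_txt",
        nargs="?",
        default="output.txt",
        help="Path to the output text file (optional).",
    )
    parser.add_argument(
        "--max_pages",
        type=int,
        default=100,
        help="Maximum number of pages to process.",
    )
    parser.add_argument(
        "--merge_headers",
        type=str2bool,
        default=True,
        help="Whether to merge headers (True or False).",
    )
    return parser


# Parse the same arguments as the second example command above.
args = build_parser().parse_args(
    ["/path/to/input.pdf", "/path/to/output.txt", "--max_pages", "50", "--merge_headers", "False"]
)
print(args.max_pages, args.merge_headers)
```

Passing an explicit argument list to `parse_args` keeps the sketch runnable without a real command line; in a script, calling `parse_args()` with no arguments reads from `sys.argv` instead.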