metebalci / pdftitle

a utility to extract the title from a PDF file
GNU General Public License v3.0

Algorithm "eliot" implemented incorrectly #31

Open crh23 opened 2 years ago

crh23 commented 2 years ago

In the implementation of the "eliot" algorithm, the y coordinates are sorted low-to-high: https://github.com/metebalci/pdftitle/blob/5ebc1a0ec3f347e5a257485bc6ce43a9f12798ba/pdftitle.py#L543-L548

Since the origin of a PDF page is the bottom-left corner, larger y values are nearer the top of the page, so the y coordinates should be sorted high-to-low, as in the implementation of the "original" algorithm: https://github.com/metebalci/pdftitle/blob/5ebc1a0ec3f347e5a257485bc6ce43a9f12798ba/pdftitle.py#L492-L494

Sorting could be done as:

selected_blocks = sorted(selected_blocks, key=lambda b: (-b[3], b[2]))

since Python compares tuples lexicographically: -b[3] is the primary sort key and b[2] only breaks ties.
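To illustrate the difference, here is a minimal sketch, not the actual pdftitle code. The block layout `(text, font_size, x, y)` and the coordinate values are made up for the example. The only thing taken from the suggestion above is that index 3 is the y coordinate and index 2 is the secondary key (assumed here to be x).

```python
# Hypothetical text blocks: (text, font_size, x, y).
# This layout is an assumption for illustration; only b[2] and b[3]
# are used by the sort keys below.
blocks = [
    ("with two lines title", 17.0, 150.0, 620.0),  # second title line, lower on the page
    ("A", 17.0, 200.0, 640.0),                     # first title line, left part
    ("document", 17.0, 215.0, 640.0),              # first title line, right part
]


def sort_low_to_high(selected_blocks):
    """Ascending y: lines nearer the bottom of the page come first."""
    return sorted(selected_blocks, key=lambda b: (b[3], b[2]))


def sort_high_to_low(selected_blocks):
    """Descending y (top of page first), then ascending b[2] within a line."""
    return sorted(selected_blocks, key=lambda b: (-b[3], b[2]))


wrong_title = " ".join(b[0] for b in sort_low_to_high(blocks))
fixed_title = " ".join(b[0] for b in sort_high_to_low(blocks))

print(wrong_title)  # with two lines title A document
print(fixed_title)  # A document with two lines title
```

Negating only the first tuple element reverses the line order while keeping left-to-right order within each line, which `reverse=True` on the whole key would not do.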

metebalci commented 1 year ago

Note to myself: find/create a PDF showing this error, and then fix.

user202729 commented 9 months ago

Creating one isn't too difficult. Compile the following with PDFLaTeX:

%! TEX program = pdflatex
\documentclass{article}
\begin{document}

\title{A \texttt{document} \\ with two lines title}
\author{Author name}
\date{February 30, 2023}
\maketitle

\end{document}

The output is, as the bug predicts, withtwolinestitleAdocument: the second line of the title comes before the first.

On another note, the spaces also disappear from the output, even within the second line, which the sorting bug does not affect. I have no idea why.