jwilk-archive / pdf2djvu

PDF to DjVu converter
GNU General Public License v2.0

Rotated text issue #6

Open jwilk opened 16 years ago

jwilk commented 16 years ago

Issue reported by gaiason@yahoo.com at Google Code:

What steps will reproduce the problem?

  1. Get hold of a PDF page with rotated text, or simply rotate a document clockwise or counter-clockwise.
  2. Convert the PDF page using PDF to DjVu GUI version 1.0 or 1.1
  3. View the converted DjVu page result to see the problem.

What is the expected output? What do you see instead?

I expect the text and the text coordinates of the rotated text to be captured and displayed correctly. Instead, I see one big lump of text whose coordinates are set to the start and end of the whole block of rotated text.

What version of the product are you using? On what operating system?

PDF to DjVu GUI version 1.0 and 1.1 on Windows XP.

Please provide any additional information below.

http://www.djvu.org/forum/phpbb/viewtopic.php?p=1135&&sid=4fc56a4adfc23e656ba88a463e8e2750#1135

Cheers, Gaiason

jwilk commented 16 years ago

Text extraction was indeed broken. I fixed it, but rotated text is still extracted incorrectly. That's probably because of a DjVuLibre bug.

jwilk commented 16 years ago

See http://sf.net/tracker/?func=detail&aid=1969580&group_id=32953&atid=406583.

jwilk commented 15 years ago

pdftotext handles rotated text fine, so reimplementing its algorithm (rather than relying on DjVuLibre) would solve the problem. In the comparison below, pdf2djvu loses the space between the words on the second line:

$ pdftotext rotated-lorem.pdf - | grep L
Lorem ipsum
Lorem ipsum

$ pdf2djvu -q rotated-lorem.pdf | djvutxt - | grep L
Lorem ipsum 
Loremipsum 

Attachment: rotated-lorem.pdf
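
The comparison above can be automated so that a regression is easy to spot. The following is only a minimal sketch, not part of pdf2djvu: the function names (`clean_lines`, `find_mismatches`, and the two `extract_*` helpers) are hypothetical, and it assumes that both tools emit lines in the same order so they can be paired by position. The command lines are the ones used in the session above.

```python
import subprocess
import sys
from itertools import zip_longest


def extract_with_pdftotext(pdf_path):
    """Run `pdftotext <pdf> -` (as in the session above) and return its output."""
    result = subprocess.run(["pdftotext", pdf_path, "-"],
                            check=True, capture_output=True)
    return result.stdout.decode("utf-8", errors="replace")


def extract_with_pdf2djvu(pdf_path):
    """Run `pdf2djvu -q <pdf> | djvutxt -` (as in the session above)."""
    djvu = subprocess.run(["pdf2djvu", "-q", pdf_path],
                          check=True, capture_output=True)
    text = subprocess.run(["djvutxt", "-"], input=djvu.stdout,
                          check=True, capture_output=True)
    return text.stdout.decode("utf-8", errors="replace")


def clean_lines(text):
    """Split into lines, strip surrounding whitespace, drop empty lines."""
    # Form feeds are treated as line breaks (assumption: they only separate pages).
    lines = text.replace("\f", "\n").splitlines()
    return [line.strip() for line in lines if line.strip()]


def find_mismatches(reference, candidate):
    """Pair lines by position (an assumption) and report the ones that differ.

    Each result is (index, reference_line, candidate_line, only_spacing),
    where only_spacing is True when the lines differ in whitespace alone,
    e.g. "Lorem ipsum" versus "Loremipsum".
    """
    pairs = zip_longest(clean_lines(reference), clean_lines(candidate),
                        fillvalue="")
    mismatches = []
    for index, (ref, cand) in enumerate(pairs):
        if ref != cand:
            only_spacing = "".join(ref.split()) == "".join(cand.split())
            mismatches.append((index, ref, cand, only_spacing))
    return mismatches


if __name__ == "__main__" and len(sys.argv) == 2:
    # Requires pdftotext, pdf2djvu and djvutxt to be installed.
    pdf = sys.argv[1]
    for item in find_mismatches(extract_with_pdftotext(pdf),
                                extract_with_pdf2djvu(pdf)):
        print(item)
```

Fed the two outputs quoted above, `find_mismatches` flags only the second line, and marks it as a spacing-only difference. Lines that differ in more than whitespace are reported with the last field set to `False`.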