utf8 encoding failed on emdash '—'

joprice commented 4 years ago

I'm trying to use Pdftext.utf8_of_pdfdocstring on a string extracted from a Pdfops.Op_Tj after parsing a pdf file, and getting the following error:

codepoint_of_pdfdocencoding: bad text string (char 148)

The resulting string has invalid characters like 'Š' in place of the expected em-dashes.

johnwhitington commented 4 years ago

Can you supply an example file?

joprice commented 4 years ago

http://www.dominiopublico.gov.br/download/texto/bn000012.pdf

johnwhitington commented 4 years ago

This just loads a webpage for me...

joprice commented 4 years ago

Weird, it did the same for me on one access but then worked on the next. Here's the file:

bn000012.pdf

johnwhitington commented 4 years ago

(0x97 is the emdash in Windows Code page 1252. In unicode, 0x97 is "end of guarded area".)

The strings associated with Op_Tj are not PDF docstrings, because they are inside the page content. Instead, you have to convert them through the font encoding:

58 0 obj <</Type/Font/Subtype/TrueType/FirstChar 32/LastChar 242/Widths[250 333 420 0 0 0 0 0 0 0 0 0 250 333 250 0 0 0 0 0 0 0 0 0 0 0 333 0 0 0 0 0 0 611 0 667 722 0 0 0 0 0 0 0 556 833 667 0 611 722 0 0 556 0 0 0 0 0 0 0 0 0 0 0 0 500 500 444 500 444 278 0 500 278 278 0 278 722 500 500 500 0 389 389 278 500 444 0 444 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 500 500 0 0 0 0 0 444 0 0 0 0 0 0 0 0 500]/Encoding/WinAnsiEncoding/BaseFont/MIHHPB+TimesNewRoman,Italic/FontDescriptor 30 0 R>> endobj

Normally, there is a nice /ToUnicode entry to help you get straight to unicode codepoints, but not in this old-fashioned PDF. Luckily, however, although this PDF font has been subset, the codepoints have been retained (see all the 0s in the /Widths above). So you just need to convert from /WinAnsiEncoding to UTF8.

CamlPDF should have the information inside it for this -- use the tables in Pdfglyphlist.mli (perhaps inverting some of them...) to go Win codepoint --> /charname --> Unicode codepoint. Then convert the sequence of Unicode codepoints to UTF8 (again, camlpdf can do this).
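To make the Win codepoint --> Unicode codepoint --> UTF8 chain concrete, here is a minimal standalone sketch that does not use camlpdf's tables at all. It assumes the Windows-1252 mapping is a good enough stand-in for /WinAnsiEncoding (the two differ in a few slots), and the names cp1252_high, codepoint_of_winansi and utf8_of_winansi are made up for illustration. Unassigned bytes are mapped to U+FFFD, which is also just an assumption. It needs OCaml 4.06 or later for Buffer.add_utf_8_uchar.

```ocaml
(* Windows-1252 bytes 0x80..0x9F mapped to Unicode codepoints.
   0xFFFD (replacement character) marks bytes the code page leaves unassigned. *)
let cp1252_high =
  [| 0x20AC; 0xFFFD; 0x201A; 0x0192; 0x201E; 0x2026; 0x2020; 0x2021;
     0x02C6; 0x2030; 0x0160; 0x2039; 0x0152; 0xFFFD; 0x017D; 0xFFFD;
     0xFFFD; 0x2018; 0x2019; 0x201C; 0x201D; 0x2022; 0x2013; 0x2014;
     0x02DC; 0x2122; 0x0161; 0x203A; 0x0153; 0xFFFD; 0x017E; 0x0178 |]

(* Outside 0x80..0x9F the byte value already equals the Unicode codepoint. *)
let codepoint_of_winansi byte =
  if byte >= 0x80 && byte <= 0x9F then cp1252_high.(byte - 0x80) else byte

(* Convert a whole WinAnsi-encoded string to UTF-8, one byte at a time. *)
let utf8_of_winansi s =
  let b = Buffer.create (String.length s) in
  String.iter
    (fun c ->
      Buffer.add_utf_8_uchar b
        (Uchar.of_int (codepoint_of_winansi (Char.code c))))
    s;
  Buffer.contents b
```

With this, the byte 0x97 becomes U+2014 (the em-dash) and is written out as the three UTF-8 bytes E2 80 94. For real documents the tables in Pdfglyphlist are the better source, since they follow the PDF definition of the encoding.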

johnwhitington commented 4 years ago

(Sometimes you may find an old PDF using subset fonts where such text extraction is simply impossible. This is easy to check by copy-and-paste from your PDF reader -- if you get garbage, camlpdf won't be able to do it either.)

joprice commented 4 years ago

Thanks for the info! I'll try that out.

joprice commented 4 years ago

I'm confused how I get a hold of the font associated with the text. I tried matching on Pdfops.Op_Tf, but the font there is a string reference like "/TT2". I'm following some of the examples that use Pdfops.parse_operators, but maybe that's the wrong approach.

johnwhitington commented 4 years ago

Fonts are part of the /Resources for the /Page, not part of the page contents. You find /TT2 in the /Font dictionary in the /Resources for the /Page. The Pdftext module will help here, I think?

It looks like you will have to do some work, though, since cpdf -extract-text can't get the text from this document, which means Pdftext doesn't know how to extract text from it -- but you can use Pdftext to read the fonts...
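As a rough sketch of that lookup (it needs the camlpdf library, and find_font is a hypothetical helper name, not something camlpdf provides): the string carried by Pdfops.Op_Tf is a key into the /Font dictionary of the page's /Resources, and Pdf.lookup_direct follows any indirect references along the way.

```ocaml
(* Requires the camlpdf library. *)

(* Resolve a font resource name such as "/TT2" (the string carried by
   Pdfops.Op_Tf) to its font dictionary. Returns None if the resources
   have no /Font entry or the name is not present in it. *)
let find_font pdf resources name =
  match Pdf.lookup_direct pdf "/Font" resources with
  | Some fonts -> Pdf.lookup_direct pdf name fonts
  | None -> None
```

For a page obtained from Pdfpage.pages_of_pagetree, this would be called as find_font pdf page.Pdfpage.resources "/TT2", and the resulting object can then be handed to the font-reading functions in Pdftext.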

johnwhitington commented 4 years ago

e.g. in Pdftext:

(** Table of all the entries in an encoding. *)
val table_of_encoding : encoding -> (int, string) Hashtbl.t

looks useful

joprice commented 4 years ago

I see. I'll try that. I started questioning my whole approach for a second.

joprice commented 4 years ago

Actually I followed your tips and tried out some of the functions in Pdftext, and I think it works by just using existing functions:


let font = lookup_font pdf page.Pdfpage.resources f in
let extractor = Pdftext.text_extractor_of_font pdf font

...

let to_utf8 content extractor =
  let codepoints = Pdftext.codepoints_of_text extractor content in
  Pdftext.utf8_of_codepoints codepoints

johnwhitington commented 4 years ago

Excellent. Looks like I need to fix -extract-text in cpdf then. Leaving this bug open for that purpose.

johnwhitington / camlpdf

utf8 encoding failed on emdash '—' #39