Closed joprice closed 4 years ago
Can you supply an example file?
This just loads a webpage for me...
Weird it did the same for me on one access but then worked on the next. Here's the file
(0x97 is the emdash in Windows Code page 1252. In unicode, 0x97 is "end of guarded area".)
The strings associated with Op_Tj are not PDF docstrings, because they are inside the page content. Instead, you have to convert them through the font encoding:
58 0 obj <</Type/Font/Subtype/TrueType/FirstChar 32/LastChar 242/Widths[250 333 420 0 0 0 0 0 0 0 0 0 250 333 250 0 0 0 0 0 0 0 0 0 0 0 333 0 0 0 0 0 0 611 0 667 722 0 0 0 0 0 0 0 556 833 667 0 611 722 0 0 556 0 0 0 0 0 0 0 0 0 0 0 0 500 500 444 500 444 278 0 500 278 278 0 278 722 500 500 500 0 389 389 278 500 444 0 444 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 500 500 0 0 0 0 0 444 0 0 0 0 0 0 0 0 500]/Encoding/WinAnsiEncoding/BaseFont/MIHHPB+TimesNewRoman,Italic/FontDescriptor 30 0 R>> endobj
Normally, there is a nice /ToUnicode entry to help you get straight to unicode codepoints, but not in this old-fashioned PDF. Luckily, however, although this PDF font has been subset, the codepoints have been retained (see all the 0s in the /Widths above). So you just need to convert from /WinAnsiEncoding to UTF8.
CamlPDF should have the information inside it for this -- use the tables in Pdfglyphlist.mli (perhaps inverting some of them...) to go Win codepoint --> /charname --> UTF codepoint. Then convert the sequence of UTF codepoints to UTF8 (again camlpdf can do this)
(Sometimes you may find an old PDF using subset fonts where such text extraction is simply impossible. This is easy to check by copy-and-paste from your PDF reader -- if you get garbage, camlpdf won't be able to do it either.)
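The chain described above (Win codepoint, then glyph name, then Unicode codepoint, then UTF-8) can be sketched with the OCaml standard library alone. This is only an illustration: the two lookup functions are small hand-written tables invented for this example, not the tables in Pdfglyphlist, and the fallback for bytes outside 0x80-0x9F assumes they share their number with the Unicode codepoint, which is only approximately true of /WinAnsiEncoding.

```ocaml
(* Hypothetical, partial table: WinAnsi byte -> glyph name.
   Only a few entries from the 0x80-0x9F range are listed. *)
let glyph_name_of_winansi = function
  | 0x85 -> Some "/ellipsis"
  | 0x91 -> Some "/quoteleft"
  | 0x92 -> Some "/quoteright"
  | 0x93 -> Some "/quotedblleft"
  | 0x94 -> Some "/quotedblright"
  | 0x96 -> Some "/endash"
  | 0x97 -> Some "/emdash"
  | _ -> None

(* Hypothetical, partial table: glyph name -> Unicode codepoint. *)
let codepoint_of_glyph_name = function
  | "/ellipsis" -> Some 0x2026
  | "/quoteleft" -> Some 0x2018
  | "/quoteright" -> Some 0x2019
  | "/quotedblleft" -> Some 0x201C
  | "/quotedblright" -> Some 0x201D
  | "/endash" -> Some 0x2013
  | "/emdash" -> Some 0x2014
  | _ -> None

(* Win codepoint -> glyph name -> Unicode codepoint. *)
let codepoint_of_winansi byte =
  match glyph_name_of_winansi byte with
  | Some name -> codepoint_of_glyph_name name
  | None ->
    (* Assumption: outside 0x80-0x9F the byte value equals the codepoint. *)
    if byte < 0x80 || byte >= 0xA0 then Some byte else None

(* Encode the resulting codepoints as UTF-8 (needs OCaml 4.06 or later).
   Bytes missing from the tables become U+FFFD. *)
let utf8_of_winansi s =
  let b = Buffer.create (String.length s) in
  String.iter
    (fun c ->
      let cp =
        match codepoint_of_winansi (Char.code c) with
        | Some cp -> cp
        | None -> 0xFFFD
      in
      Buffer.add_utf_8_uchar b (Uchar.of_int cp))
    s;
  Buffer.contents b
```

With these tables, the byte 0x97 becomes the three UTF-8 bytes E2 80 94, which is the em dash (U+2014).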
Thanks for the info! I'll try that out.
I'm confused about how to get hold of the font associated with the text. I tried matching on Pdfops.Op_Tf, but the font there is a string reference like "/TT2". I'm following some of the examples that use Pdfops.parse_operators, but maybe that's the wrong approach.
Fonts are part of the /Resources for the /Page, not part of the page contents. You find /TT2 in the /Font dictionary in /Resources for the /Page. The Pdftext module will help here, I think?
It looks like you will have to do some work, though, since cpdf -extract-text can't get the text from this document, which means Pdftext doesn't know how to extract text from it -- but you can use Pdftext to read the fonts...
e.g. in Pdftext,
(** Table of all the entries in an encoding. *)
val table_of_encoding : encoding -> (int, string) Hashtbl.t
looks useful
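Since the font name in a Tf operator is only a key into the page's resources, one way to know which font applies to each Tj string is to walk the operators in order and remember the most recent Tf. A minimal sketch of that bookkeeping follows. The `op` type is a toy stand-in invented for this example, loosely modeled on the Op_Tf and Op_Tj constructors mentioned above, and is not the real Pdfops.t. It also ignores the fact that q/Q save and restore the current font.

```ocaml
(* Toy stand-in for content-stream operators (hypothetical, not Pdfops.t). *)
type op =
  | Tf of string * float  (* select a font resource name and size *)
  | Tj of string          (* show a string of character codes *)
  | Other                 (* any operator this sketch does not care about *)

(* Pair each shown string with the font resource name in effect, if any. *)
let strings_with_fonts ops =
  let step (current, acc) = function
    | Tf (name, _) -> (Some name, acc)
    | Tj s -> (current, (current, s) :: acc)
    | Other -> (current, acc)
  in
  let _, acc = List.fold_left step (None, []) ops in
  List.rev acc
```

Each resulting name (for example "/TT2") would then be looked up in the page's resources to find the font object whose encoding decodes the string.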
I see. I'll try that. I started questioning my whole approach for a second.
Actually I followed your tips and tried out some of the functions in Pdftext, and I think it works by just using existing functions:
let font = lookup_font pdf page.Pdfpage.resources f in
let extractor = Pdftext.text_extractor_of_font pdf font
...
let to_utf8 content extractor =
  let codepoints =
    Pdftext.codepoints_of_text extractor content
  in
  Pdftext.utf8_of_codepoints codepoints
Excellent. Looks like I need to fix -extract-text in cpdf then. Leaving this bug open for that purpose.
I'm trying to use Pdftext.utf8_of_pdfdocstring on a string extracted from a Pdfops.Op_Tj after parsing a PDF file, and I'm getting the following error: codepoint_of_pdfdocencoding: bad text string (char 148)
The resulting string has invalid characters like 'Š' in place of the expected em-dashes.