The attached PDF contains an Identity-H encoded font that fails to decode correctly with its supplied ToUnicode CMap.
This is apparent when the file is processed with pdf-tag-dump.raku. The following script explores the problem a little further:
use PDF::Font::Loader;
use PDF::Font::Loader::FontObj;
use PDF::COS::Dict;
use PDF::Lite;
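# Open the attached PDF and fetch the font dictionary for F9 from the first page's resources
my PDF::Lite $pdf .= open: "/tmp/SSRN-id4337484.pdf";
my PDF::COS::Dict:D $dict = $pdf.page(1)<Resources><Font><F9>;
# Load a font object from that dictionary
my PDF::Font::Loader::FontObj:D $font = PDF::Font::Loader.load-font: :$dict;
# Encoded text shown with this font, two bytes per character code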
my $str = "\x[3]~\0\x[4]\x[1]\x[F]\x[1]µ\x[1]l\x[1]u\x[1]\x[1E]\x[1]]\x[1]o\x[3]U\0\x[3]\x[1]\x[1E]\x[1]\x[9A]\0\x[3]\x[1]\x[2]\x[1]o\x[3]X\x[3]U\0\x[3]\x[3]î\x[3]ì\x[3]î\x[3]í\x[3]V\0";
# Decode each two-byte code separately and join the results
say $str.comb(/../).map({$font.decode($_, :str)}).join;
This produces "(bukmeiletal2021", whereas the rendered text is "(Abukmeil, et al., 2021". The decoded output drops the leading "A" along with the spaces, commas, and period.
Attachment: SSRN-id4337484.pdf