sambitdash / PDFIO.jl

PDF Reader Library for Native Julia.

Error in `merge_encoding!` when extracting text #103

Status: Closed (nilshg closed this issue 2 years ago)

nilshg commented 2 years ago

While following the "interactive usage" demo in the README, I hit an error when trying to extract the text:

julia> pdPageExtractText(stdout, page)

MethodError: no method matching merge_encoding!(::Dict{UInt8, Char}, ::PDFIO.Cos.ID{CosName}, ::PDFIO.Cos.CosDocImpl, ::PDFIO.Cos.ID{CosDict})
Closest candidates are:
  merge_encoding!(::Dict{UInt8, Char}, ::CosName, ::CosDoc, ::PDFIO.Cos.IDDRef{CosDict}) at C:\Users\ngudat\.julia\packages\PDFIO\KxUq6\src\PDFonts.jl:59
  merge_encoding!(::Dict{UInt8, Char}, ::Union{PDFIO.PD.FontMMType1, PDFIO.PD.FontType1}, ::CosDoc, ::PDFIO.Cos.IDDRef{CosDict}) at C:\Users\ngudat\.julia\packages\PDFIO\KxUq6\src\PDFonts.jl:88
  merge_encoding!(::Union{Nothing, Dict{UInt8, Char}, PDFIO.PD.CMap}, ::PDFIO.PD.FontType, ::CosDoc, ::PDFIO.Cos.IDDRef{CosDict}) at C:\Users\ngudat\.julia\packages\PDFIO\KxUq6\src\PDFonts.jl:86

Unfortunately I can't share the PDF. Is there anything I can do to diagnose this? This is using Julia 1.7.1 on Windows 10, with PDFIO at version 0.1.13.
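
For readers hitting a similar MethodError, one generic way to narrow it down without sharing the file is to compare the concrete argument types against the methods that exist. The sketch below uses toy stand-ins only: `ToyName`, `ToyWrapped` and `toy_encode` are hypothetical names, not PDFIO definitions, and merely mimic the shape of the error above (a wrapped name passed where a bare name is expected). `typeof`, `applicable` and `methods` are standard Julia functions.

```julia
# Toy stand-ins (hypothetical; NOT PDFIO's real types).
struct ToyName
    val::Symbol
end

struct ToyWrapped{T}
    obj::T
end

# Only a bare ToyName is accepted, similar to the "closest candidates" list.
toy_encode(n::ToyName) = n.val

arg = ToyWrapped(ToyName(:StandardEncoding))

# 1. What concrete type is actually being passed?
println(typeof(arg))

# 2. Does any method accept it? `applicable` checks without calling.
println(applicable(toy_encode, arg))      # false: no method for the wrapper
println(applicable(toy_encode, arg.obj))  # true: the inner value matches

# 3. List the signatures that do exist.
for m in methods(toy_encode)
    println(m.sig)
end
```

Running the same three checks against the real function and arguments (here, `merge_encoding!` and the values passed to it) shows which argument position is the mismatch, which is information that can be reported without disclosing the document.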

nilshg commented 2 years ago

As an additional data point, R's pdftools can extract the text from this file using its pdf_text function.

sambitdash commented 2 years ago

Please share the call stack. It may not be enough information on its own, but it gives me something to look at.

nilshg commented 2 years ago

Sorry, I appreciate this isn't ideal without being able to share the document! Here's the full stack:

Stacktrace:
  [1] get_unicode_mapping(doc::PDFIO.Cos.CosDocImpl, font::PDFIO.Cos.ID{CosDict})
    @ PDFIO.PD C:\Users\ngudat\.julia\packages\PDFIO\KxUq6\src\PDFonts.jl:146
  [2] PDFont
    @ C:\Users\ngudat\.julia\packages\PDFIO\KxUq6\src\PDFonts.jl:411 [inlined]
  [3] get_pd_font!(doc::PDFIO.PD.PDDocImpl, cosfont::PDFIO.Cos.ID{CosDict})
    @ PDFIO.PD C:\Users\ngudat\.julia\packages\PDFIO\KxUq6\src\PDDocImpl.jl:112
  [4] get_font(page::PDFIO.PD.PDPageImpl, fontname::CosName)
    @ PDFIO.PD C:\Users\ngudat\.julia\packages\PDFIO\KxUq6\src\PDPage.jl:313
  [5] evalContent!(pdo::PDPageElement{:Tf}, state::PDFIO.PD.GState{:PDFIO})
    @ PDFIO.PD C:\Users\ngudat\.julia\packages\PDFIO\KxUq6\src\PDPageElement.jl:774
  [6] evalContent!
    @ C:\Users\ngudat\.julia\packages\PDFIO\KxUq6\src\PDPageElement.jl:657 [inlined]
  [7] evalContent!(pdo::PDPageTextObject, state::PDFIO.PD.GState{:PDFIO})
    @ PDFIO.PD C:\Users\ngudat\.julia\packages\PDFIO\KxUq6\src\PDPageElement.jl:719
  [8] evalContent!
    @ C:\Users\ngudat\.julia\packages\PDFIO\KxUq6\src\PDPageElement.jl:657 [inlined]
  [9] pdPageEvalContent(page::PDFIO.PD.PDPageImpl, state::PDFIO.PD.GState{:PDFIO})
    @ PDFIO.PD C:\Users\ngudat\.julia\packages\PDFIO\KxUq6\src\PDPage.jl:145
 [10] pdPageEvalContent
    @ C:\Users\ngudat\.julia\packages\PDFIO\KxUq6\src\PDPage.jl:144 [inlined]
 [11] pdPageExtractText(io::IJulia.IJuliaStdio{Base.PipeEndpoint}, page::PDFIO.PD.PDPageImpl)
    @ PDFIO.PD C:\Users\ngudat\.julia\packages\PDFIO\KxUq6\src\PDPage.jl:178
 [12] top-level scope
    @ In[43]:1
 [13] eval
    @ .\boot.jl:373 [inlined]
 [14] include_string(mapexpr::typeof(REPL.softscope), mod::Module, code::String, filename::String)
    @ Base .\loading.jl:1196

sambitdash commented 2 years ago

Seriously, I need to see this document.

cosDocGetObject does not return an indirect object. Unless I see the document, I would be wary of making any changes to the code.

You can mail me at: sambitdash -at- gmail -dot- com if it's ok. I promise to delete the document once I address the issue.

nilshg commented 2 years ago

Sorry, the document is part of disclosure in litigation, so it would be illegal for me to share it with third parties. Unless there's any further (non-identifying) information I can extract from the document that would help you narrow the issue down, I guess I'll just have to close this and begrudgingly use R!

sambitdash commented 2 years ago

Can you provide the pdDocGetInfo dump? Then I can at least look at the creator, check whether I have access to that software, create a PDF with it, and see what may be going wrong.

sambitdash commented 2 years ago

You can add the following to your code as a workaround:

import PDFIO.PD: merge_encoding!

merge_encoding!(mapping::Dict{UInt8, Char}, encoding::ID{CosName}, 
                             cosDoc::CosDoc, font::PDFIO.Cos.IDDRef{CosDict}) = 
      merge_encoding!(mapping, get(encoding), cosDoc, font)

Unfortunately, I cannot add this code to the production build unless I understand why such a change is needed.

nilshg commented 2 years ago

I'm afraid you'll struggle to reproduce this, as I believe the file is an automated dump of information in SAP:

Dict{String, Union{CDDate, String, CosObject}} with 4 entries:
  "Producer"     => "SAP NetWeaver 740 "   
  "Author"       => "xxxxxxx "              # Author name redacted
  "CreationDate" => D:2021xxxxxxxxxxZ       # Exact day/time redacted
  "Creator"      => "Form Zxxx_INVOICE EN"  # Invoice group redacted

With regard to your proposed workaround, I initially got an "ID not defined" error. I assumed ID was defined somewhere in PDFIO and added it to the import statement like so: import PDFIO.PD: merge_encoding!, ID. That fixed the undefined error, but now I'm getting the following error:

MethodError: no method matching merge_encoding!(::Dict{UInt8, Char}, ::Symbol, ::PDFIO.Cos.CosDocImpl, ::ID{CosDict})
Closest candidates are:
  merge_encoding!(::Dict{UInt8, Char}, ::CosName, ::CosDoc, ::PDFIO.Cos.IDDRef{CosDict}) at C:\Users\ngudat\.julia\packages\PDFIO\KxUq6\src\PDFonts.jl:59
  merge_encoding!(::Dict{UInt8, Char}, ::Union{PDFIO.PD.FontMMType1, PDFIO.PD.FontType1}, ::CosDoc, ::PDFIO.Cos.IDDRef{CosDict}) at C:\Users\ngudat\.julia\packages\PDFIO\KxUq6\src\PDFonts.jl:88
  merge_encoding!(::Union{Nothing, Dict{UInt8, Char}, PDFIO.PD.CMap}, ::PDFIO.PD.FontType, ::CosDoc, ::PDFIO.Cos.IDDRef{CosDict}) at C:\Users\ngudat\.julia\packages\PDFIO\KxUq6\src\PDFonts.jl:86
  ...

Stacktrace:
  [1] merge_encoding!(mapping::Dict{UInt8, Char}, encoding::ID{CosName}, cosDoc::PDFIO.Cos.CosDocImpl, font::ID{CosDict})
    @ Main .\In[115]:4
  [2] get_unicode_mapping(doc::PDFIO.Cos.CosDocImpl, font::ID{CosDict})
    @ PDFIO.PD C:\Users\ngudat\.julia\packages\PDFIO\KxUq6\src\PDFonts.jl:146
  [3] PDFont
    @ C:\Users\ngudat\.julia\packages\PDFIO\KxUq6\src\PDFonts.jl:411 [inlined]
  [4] get_pd_font!(doc::PDFIO.PD.PDDocImpl, cosfont::ID{CosDict})
    @ PDFIO.PD C:\Users\ngudat\.julia\packages\PDFIO\KxUq6\src\PDDocImpl.jl:112
  [5] get_font(page::PDFIO.PD.PDPageImpl, fontname::CosName)
    @ PDFIO.PD C:\Users\ngudat\.julia\packages\PDFIO\KxUq6\src\PDPage.jl:313
  [6] evalContent!(pdo::PDPageElement{:Tf}, state::PDFIO.PD.GState{:PDFIO})
    @ PDFIO.PD C:\Users\ngudat\.julia\packages\PDFIO\KxUq6\src\PDPageElement.jl:774
  [7] evalContent!
    @ C:\Users\ngudat\.julia\packages\PDFIO\KxUq6\src\PDPageElement.jl:657 [inlined]
  [8] evalContent!(pdo::PDPageTextObject, state::PDFIO.PD.GState{:PDFIO})
    @ PDFIO.PD C:\Users\ngudat\.julia\packages\PDFIO\KxUq6\src\PDPageElement.jl:719
  [9] evalContent!
    @ C:\Users\ngudat\.julia\packages\PDFIO\KxUq6\src\PDPageElement.jl:657 [inlined]
 [10] pdPageEvalContent(page::PDFIO.PD.PDPageImpl, state::PDFIO.PD.GState{:PDFIO})
    @ PDFIO.PD C:\Users\ngudat\.julia\packages\PDFIO\KxUq6\src\PDPage.jl:145
 [11] pdPageEvalContent
    @ C:\Users\ngudat\.julia\packages\PDFIO\KxUq6\src\PDPage.jl:144 [inlined]
 [12] pdPageExtractText(io::IJulia.IJuliaStdio{Base.PipeEndpoint}, page::PDFIO.PD.PDPageImpl)
    @ PDFIO.PD C:\Users\ngudat\.julia\packages\PDFIO\KxUq6\src\PDPage.jl:178
 [13] top-level scope
    @ In[115]:7
 [14] eval
    @ .\boot.jl:373 [inlined]
 [15] include_string(mapexpr::typeof(REPL.softscope), mod::Module, code::String, filename::String)
    @ Base .\loading.jl:1196

This seems to be the same error, just with an additional merge_encoding! call at the top of the stack?

sambitdash commented 2 years ago

I definitely don't have access to NetWeaver.

Sorry, try this instead:

merge_encoding!(mapping::Dict{UInt8, Char}, encoding::ID{CosName}, 
                             cosDoc::CosDoc, font::PDFIO.Cos.IDDRef{CosDict}) = 
      merge_encoding!(mapping, encoding.obj,  cosDoc, font)
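
As a note for readers on why the two workaround variants behave differently: judging from the second error above, `get(encoding)` appears to unwrap all the way down to a `Symbol`, whereas the existing method expects the name object itself, so only one wrapper layer should be peeled off. The following is a minimal sketch of that forwarding pattern using mock types; `MockName`, `MockID`, `unwrap` and `describe` are hypothetical and are not PDFIO's actual definitions.

```julia
# Mock stand-ins (hypothetical; NOT PDFIO's real types).
struct MockName
    val::Symbol
end

struct MockID{T}
    num::Int   # stand-in for an indirect object number
    obj::T     # the wrapped object
end

# Assumed accessor that unwraps down to the raw value.
unwrap(n::MockName) = n.val
unwrap(id::MockID) = unwrap(id.obj)

# The "library" method only knows about bare names.
describe(name::MockName) = "encoding " * String(name.val)

# Forwarding method: peel off exactly one layer and re-dispatch.
describe(id::MockID{MockName}) = describe(id.obj)

direct  = MockName(:WinAnsiEncoding)
wrapped = MockID(12, direct)

println(describe(wrapped))

# Unwrapping too far yields a Symbol, for which no method exists.
try
    describe(unwrap(wrapped))
catch err
    println(typeof(err))
end
```

The forwarding method leaves the original method untouched and only adds a dispatch path for the wrapped case, which is the same idea as the one-line workaround above.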

nilshg commented 2 years ago

With this change I get:

MethodError: no method matching cosDocGetObject(::PDFIO.Cos.CosDocImpl, ::ID{CosName}, ::CosName)
Closest candidates are:
  cosDocGetObject(::CosDoc, ::ID{CosDict}, ::Union{PDFIO.Cos.CosNullType, CosName}) at C:\Users\ngudat\.julia\packages\PDFIO\KxUq6\src\CosDoc.jl:239
  cosDocGetObject(::CosDoc, ::CosDict, ::CosName) at C:\Users\ngudat\.julia\packages\PDFIO\KxUq6\src\CosDoc.jl:250
  cosDocGetObject(::CosDoc, ::CosIndirectObjectRef, ::Union{PDFIO.Cos.CosNullType, CosName}) at C:\Users\ngudat\.julia\packages\PDFIO\KxUq6\src\CosDoc.jl:242
  ...

Stacktrace:
  [1] get_glyph_id_mapping(cosdoc::PDFIO.Cos.CosDocImpl, cosfont::ID{CosDict})
    @ PDFIO.PD C:\Users\ngudat\.julia\packages\PDFIO\KxUq6\src\PDFonts.jl:189
  [2] PDFont
    @ C:\Users\ngudat\.julia\packages\PDFIO\KxUq6\src\PDFonts.jl:413 [inlined]
  [3] get_pd_font!(doc::PDFIO.PD.PDDocImpl, cosfont::ID{CosDict})
    @ PDFIO.PD C:\Users\ngudat\.julia\packages\PDFIO\KxUq6\src\PDDocImpl.jl:112
  [4] get_font(page::PDFIO.PD.PDPageImpl, fontname::CosName)
    @ PDFIO.PD C:\Users\ngudat\.julia\packages\PDFIO\KxUq6\src\PDPage.jl:313
  [5] evalContent!(pdo::PDPageElement{:Tf}, state::PDFIO.PD.GState{:PDFIO})
    @ PDFIO.PD C:\Users\ngudat\.julia\packages\PDFIO\KxUq6\src\PDPageElement.jl:774
  [6] evalContent!
    @ C:\Users\ngudat\.julia\packages\PDFIO\KxUq6\src\PDPageElement.jl:657 [inlined]
  [7] evalContent!(pdo::PDPageTextObject, state::PDFIO.PD.GState{:PDFIO})
    @ PDFIO.PD C:\Users\ngudat\.julia\packages\PDFIO\KxUq6\src\PDPageElement.jl:719
  [8] evalContent!
    @ C:\Users\ngudat\.julia\packages\PDFIO\KxUq6\src\PDPageElement.jl:657 [inlined]
  [9] pdPageEvalContent(page::PDFIO.PD.PDPageImpl, state::PDFIO.PD.GState{:PDFIO})
    @ PDFIO.PD C:\Users\ngudat\.julia\packages\PDFIO\KxUq6\src\PDPage.jl:145
 [10] pdPageEvalContent
    @ C:\Users\ngudat\.julia\packages\PDFIO\KxUq6\src\PDPage.jl:144 [inlined]
 [11] pdPageExtractText(io::IJulia.IJuliaStdio{Base.PipeEndpoint}, page::PDFIO.PD.PDPageImpl)
    @ PDFIO.PD C:\Users\ngudat\.julia\packages\PDFIO\KxUq6\src\PDPage.jl:178
 [12] top-level scope
    @ In[117]:7
 [13] eval
    @ .\boot.jl:373 [inlined]
 [14] include_string(mapexpr::typeof(REPL.softscope), mod::Module, code::String, filename::String)
    @ Base .\loading.jl:1196

sambitdash commented 2 years ago

In that case, it will be hard for me to guess a solution.

nilshg commented 2 years ago

No worries, I'll just have to use R in that case. Sorry I can't do more, and thanks for trying!