sambitdash / PDFIO.jl

PDF Reader Library for Native Julia.

Bug: no method matching read_internal_stream_data(::IOStream, ::CosDict, ::Base.GenericIOBuffer{Array{UInt8,1}}) #90

jakewilliami closed this issue 4 years ago

jakewilliami commented 4 years ago

Once again I am recursively reading PDFs with this tool. However, when it reaches the attached PDF, it throws this error:

ERROR: LoadError: MethodError: no method matching read_internal_stream_data(::IOStream, ::CosDict, ::Base.GenericIOBuffer{Array{UInt8,1}})
Closest candidates are:
  read_internal_stream_data(::IO, ::CosDict, !Matched::Int64) at /Users/jakeireland/.julia/packages/PDFIO/0uoW0/src/CosReader.jl:256
Stacktrace:
 [1] postprocess_indirect_object(::IOStream, ::Int64, ::CosDict, ::Dict{CosIndirectObjectRef,PDFIO.Cos.CosObjectLoc}) at /Users/jakeireland/.julia/packages/PDFIO/0uoW0/src/CosReader.jl:331
 [2] parse_indirect_obj(::IOStream, ::Int64, ::Dict{CosIndirectObjectRef,PDFIO.Cos.CosObjectLoc}) at /Users/jakeireland/.julia/packages/PDFIO/0uoW0/src/CosReader.jl:359
 [3] cosDocGetObject(::PDFIO.Cos.CosDocImpl, ::PDFIO.Cos.CosNullType, ::CosIndirectObjectRef, ::PDFIO.Cos.CosObjectLoc) at /Users/jakeireland/.julia/packages/PDFIO/0uoW0/src/CosDoc.jl:282
 [4] cosDocGetObject(::PDFIO.Cos.CosDocImpl, ::CosIndirectObjectRef) at /Users/jakeireland/.julia/packages/PDFIO/0uoW0/src/CosDoc.jl:275
 [5] cosDocGetObject(::PDFIO.Cos.CosDocImpl, ::CosDict, ::CosName) at /Users/jakeireland/.julia/packages/PDFIO/0uoW0/src/CosDoc.jl:248
 [6] find_resource(::PDFIO.PD.PDFormXObject, ::CosName, ::CosName) at /Users/jakeireland/.julia/packages/PDFIO/0uoW0/src/PDXObject.jl:54
 [7] get_xobject(::PDFIO.PD.PDFormXObject, ::CosName) at /Users/jakeireland/.julia/packages/PDFIO/0uoW0/src/PDXObject.jl:62
 [8] evalContent!(::PDPageElement{:Do}, ::PDFIO.PD.GState{:PDFIO}) at /Users/jakeireland/.julia/packages/PDFIO/0uoW0/src/PDPageElement.jl:846
 [9] evalContent! at /Users/jakeireland/.julia/packages/PDFIO/0uoW0/src/PDPageElement.jl:657 [inlined]
 [10] Do(::PDFIO.PD.PDFormXObject, ::PDFIO.PD.GState{:PDFIO}) at /Users/jakeireland/.julia/packages/PDFIO/0uoW0/src/PDXObject.jl:92
 [11] evalContent!(::PDPageElement{:Do}, ::PDFIO.PD.GState{:PDFIO}) at /Users/jakeireland/.julia/packages/PDFIO/0uoW0/src/PDPageElement.jl:848
 [12] evalContent! at /Users/jakeireland/.julia/packages/PDFIO/0uoW0/src/PDPageElement.jl:657 [inlined]
 [13] pdPageEvalContent(::PDFIO.PD.PDPageImpl, ::PDFIO.PD.GState{:PDFIO}) at /Users/jakeireland/.julia/packages/PDFIO/0uoW0/src/PDPage.jl:145
 [14] pdPageEvalContent at /Users/jakeireland/.julia/packages/PDFIO/0uoW0/src/PDPage.jl:144 [inlined]
 [15] pdPageExtractText at /Users/jakeireland/.julia/packages/PDFIO/0uoW0/src/PDPage.jl:178 [inlined]
 [16] (::var"#3#4"{PDFIO.PD.PDDocImpl})(::IOStream) at /Users/jakeireland/scripts/pdfsearches/pdfsearch.jl:34
 [17] open(::var"#3#4"{PDFIO.PD.PDDocImpl}, ::String, ::Vararg{String,N} where N; kwargs::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}) at ./io.jl:298
 [18] open at ./io.jl:296 [inlined]
 [19] getPDFText at /Users/jakeireland/scripts/pdfsearches/pdfsearch.jl:23 [inlined]
 [20] scanFiles(::String, ::String) at /Users/jakeireland/scripts/pdfsearches/pdfsearch.jl:67
 [21] top-level scope at /Users/jakeireland/scripts/pdfsearches/pdfsearch.jl:91
 [22] include(::Module, ::String) at ./Base.jl:377
 [23] exec_options(::Base.JLOptions) at ./client.jl:288
 [24] _start() at ./client.jl:484
in expression starting at /Users/jakeireland/scripts/pdfsearches/pdfsearch.jl:91

Any idea why?

Thanks for all the work you do on this! It really is excellent.

1.0 (Limits and Continuity).pdf

sambitdash commented 4 years ago

file.txt

I see no issues in my setup. Can you check whether you are using the current versions?

sambitdash commented 4 years ago

Please follow these steps:

  1. Create a fresh directory and change into that.
  2. $ julia
  3. julia> ]activate .
  4. (dir) pkg> add PDFIO
  5. julia> getPDFText("file.pdf", "file.txt")

Now upload any errors you see. Along with the error, send me the Project.toml and Manifest.toml files, and also the output of versioninfo(). The following is the output from my machine.

julia> versioninfo()
Julia Version 1.4.2
Commit 44fa15b150* (2020-05-23 18:35 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: AMD Ryzen 5 2600X Six-Core Processor
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-8.0.1 (ORCJIT, znver1)
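
One way to gather the requested details in a single run is sketched below. It assumes it is run inside the freshly activated directory; the output file name `versioninfo.txt` is only an illustrative choice.

```julia
using InteractiveUtils  # provides versioninfo outside the REPL
using Pkg

# Write the Julia build information to a file that can be attached to the issue.
# "versioninfo.txt" is a hypothetical file name chosen for this sketch.
open("versioninfo.txt", "w") do io
    versioninfo(io)
end

# Print the packages recorded in the active environment.
Pkg.status()
```

Project.toml and Manifest.toml can then be attached as they are from the same directory.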

jakewilliami commented 4 years ago

I followed your steps and they worked, so I was very confused about why my own script was still failing. It turns out that the file throwing the error was not the one I originally sent. Sorry!

So I followed your steps again with the new file (compressed directory attached):

  1. $ mkdir ~/Desktop/test
  2. $ cd ~/Desktop/test/
  3. $ mv file.pdf ~/Desktop/test/
  4. $ julia
                   _
       _       _ _(_)_     |  Documentation: https://docs.julialang.org
      (_)     | (_) (_)    |
       _ _   _| |_  __ _   |  Type "?" for help, "]?" for Pkg help.
      | | | | | | |/ _` |  |
      | | |_| | | | (_| |  |  Version 1.4.1 (2020-04-14)
     _/ |\__'_|_|_|\__'_|  |  Official https://julialang.org/ release
    |__/                   |
    
  5. (@v1.4) pkg> activate .
     Activating new environment at `~/Desktop/test/Project.toml`
  6. (test) pkg> add PDFIO
    Updating registry at `~/.julia/registries/General`
    Updating git-repo `https://github.com/JuliaRegistries/General.git`
    Resolving package versions...
    Updating `~/Desktop/test/Project.toml`
    [4d0d745f] + PDFIO v0.1.9
    Updating `~/Desktop/test/Manifest.toml`
    [1520ce14] + AbstractTrees v0.2.1
    [715cd884] + AdobeGlyphList v0.1.1
    [9e28174c] + BinDeps v0.8.10
    [34da2185] + Compat v2.2.0
    [2e475f56] + LabelNumerals v0.1.0
    [4d0d745f] + PDFIO v0.1.9
    [27ebfcd6] + Primes v0.4.0
    [9a9db56c] + Rectangle v0.1.2
    [37834d88] + RomanNumerals v0.3.1
    [30578b45] + URIParser v0.4.1
    [2a0f44e3] + Base64
    [ade2ca70] + Dates
    [8bb1440f] + DelimitedFiles
    [8ba89e20] + Distributed
    [b77e0a4c] + InteractiveUtils
    [76f85450] + LibGit2
    [8f399da3] + Libdl
    [37e2e46d] + LinearAlgebra
    [56ddb016] + Logging
    [d6f4376e] + Markdown
    [a63ad114] + Mmap
    [44cfe95a] + Pkg
    [de0858da] + Printf
    [3fa0cd96] + REPL
    [9a3f8284] + Random
    [ea8e919c] + SHA
    [9e88b42a] + Serialization
    [1a1011a3] + SharedArrays
    [6462fe0b] + Sockets
    [2f01184e] + SparseArrays
    [10745b16] + Statistics
    [8dfed614] + Test
    [cf7118a7] + UUIDs
    [4ec0a83e] + Unicode
  7. (test) pkg> ^C (to escape the pkg environment)
  8. julia> using PDFIO
    [ Info: Precompiling PDFIO [4d0d745f-9d9a-592e-8d18-1ad8a0f42b92]
    Updating registry at `~/.julia/registries/General`
    Updating git-repo `https://github.com/JuliaRegistries/General.git`
    Resolving package versions...
    Updating `~/Desktop/test/Project.toml`
    [458c3c95] + OpenSSL_jll v1.1.1+2
    Updating `~/Desktop/test/Manifest.toml`
    [458c3c95] + OpenSSL_jll v1.1.1+2
    ┌ Warning: Package PDFIO does not have OpenSSL_jll in its dependencies:
    │ - If you have PDFIO checked out for development and have
    │   added OpenSSL_jll as a dependency but haven't updated your primary
    │   environment's manifest file, try `Pkg.resolve()`.
    │ - Otherwise you may need to report an issue with PDFIO
    └ Loading OpenSSL_jll into PDFIO from project dependency, future warnings for PDFIO are suppressed.
    Resolving package versions...
    Updating `~/Desktop/test/Project.toml`
    [83775a58] + Zlib_jll v1.2.11+10
    Updating `~/Desktop/test/Manifest.toml`
    [83775a58] + Zlib_jll v1.2.11+10
  9. (need to add the function)

    julia> function getPDFText(src, out)
           # Handle that can be used for subsequent operations on the document.
           doc = pdDocOpen(src)
    
           # Metadata extracted from the PDF document.
           # This value is retained and returned as the return from the function.
           docinfo = pdDocGetInfo(doc)
           open(out, "w") do io
    
               # Returns number of pages in the document
               npage = pdDocGetPageCount(doc)
    
               for i=1:npage
    
                   # handle to the specific page given the number index.
                   page = pdDocGetPage(doc, i)
    
                   # Extract text from the page and write it to the output file.
                   pdPageExtractText(io, page)
    
               end
           end
           # Close the document handle.
           # The doc handle should not be used after this call
           pdDocClose(doc)
           return docinfo
       end
    getPDFText (generic function with 1 method)
  10. julia> getPDFText("file.pdf", "file.txt")
    ERROR: MethodError: no method matching read_internal_stream_data(::IOStream, ::CosDict, ::Base.GenericIOBuffer{Array{UInt8,1}})
    Closest candidates are:
    read_internal_stream_data(::IO, ::CosDict, ::Int64) at /Users/jakeireland/.julia/packages/PDFIO/0uoW0/src/CosReader.jl:256
    Stacktrace:
     [1] postprocess_indirect_object(::IOStream, ::Int64, ::CosDict, ::Dict{CosIndirectObjectRef,PDFIO.Cos.CosObjectLoc}) at /Users/jakeireland/.julia/packages/PDFIO/0uoW0/src/CosReader.jl:331
     [2] parse_indirect_obj(::IOStream, ::Int64, ::Dict{CosIndirectObjectRef,PDFIO.Cos.CosObjectLoc}) at /Users/jakeireland/.julia/packages/PDFIO/0uoW0/src/CosReader.jl:359
     [3] cosDocGetObject(::PDFIO.Cos.CosDocImpl, ::PDFIO.Cos.CosNullType, ::CosIndirectObjectRef, ::PDFIO.Cos.CosObjectLoc) at /Users/jakeireland/.julia/packages/PDFIO/0uoW0/src/CosDoc.jl:282
     [4] cosDocGetObject(::PDFIO.Cos.CosDocImpl, ::CosIndirectObjectRef) at /Users/jakeireland/.julia/packages/PDFIO/0uoW0/src/CosDoc.jl:275
     [5] cosDocGetObject(::PDFIO.Cos.CosDocImpl, ::CosDict, ::CosName) at /Users/jakeireland/.julia/packages/PDFIO/0uoW0/src/CosDoc.jl:248
     [6] find_resource(::PDFIO.PD.PDFormXObject, ::CosName, ::CosName) at /Users/jakeireland/.julia/packages/PDFIO/0uoW0/src/PDXObject.jl:54
     [7] get_xobject(::PDFIO.PD.PDFormXObject, ::CosName) at /Users/jakeireland/.julia/packages/PDFIO/0uoW0/src/PDXObject.jl:62
     [8] evalContent!(::PDPageElement{:Do}, ::PDFIO.PD.GState{:PDFIO}) at /Users/jakeireland/.julia/packages/PDFIO/0uoW0/src/PDPageElement.jl:846
     [9] evalContent! at /Users/jakeireland/.julia/packages/PDFIO/0uoW0/src/PDPageElement.jl:657 [inlined]
     [10] Do(::PDFIO.PD.PDFormXObject, ::PDFIO.PD.GState{:PDFIO}) at /Users/jakeireland/.julia/packages/PDFIO/0uoW0/src/PDXObject.jl:92
     [11] evalContent!(::PDPageElement{:Do}, ::PDFIO.PD.GState{:PDFIO}) at /Users/jakeireland/.julia/packages/PDFIO/0uoW0/src/PDPageElement.jl:848
     [12] evalContent! at /Users/jakeireland/.julia/packages/PDFIO/0uoW0/src/PDPageElement.jl:657 [inlined]
     [13] pdPageEvalContent(::PDFIO.PD.PDPageImpl, ::PDFIO.PD.GState{:PDFIO}) at /Users/jakeireland/.julia/packages/PDFIO/0uoW0/src/PDPage.jl:145
     [14] pdPageEvalContent at /Users/jakeireland/.julia/packages/PDFIO/0uoW0/src/PDPage.jl:144 [inlined]
     [15] pdPageExtractText at /Users/jakeireland/.julia/packages/PDFIO/0uoW0/src/PDPage.jl:178 [inlined]
     [17] open(::var"#3#4"{PDFIO.PD.PDDocImpl}, ::String, ::Vararg{String,N} where N; kwargs::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}) at ./io.jl:298
     [18] open at ./io.jl:296 [inlined]
  11. versioninfo()
    Julia Version 1.4.1 
    Commit 381693d3df* (2020-04-14 17:20 UTC) 
    Platform Info:
      OS: macOS (x86_64-apple-darwin18.7.0)
      CPU: Intel(R) Core(TM) i7-7660U CPU @ 2.50GHz
      WORD_SIZE: 64
      LIBM: libopenlibm
      LLVM: libLLVM-8.0.1 (ORCJIT, skylake)

I'm very sorry about sending the wrong file! I must have misread which file the error came from. Thank you for your help.
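
When scanning a directory tree, it can help to record exactly which file raised an error rather than inferring it from the log. The following is a minimal sketch of that idea; `scan_pdfs` is a hypothetical helper and `extract` stands for any function that takes a PDF path, such as a wrapper around the `getPDFText` function defined in step 9.

```julia
# Hypothetical helper, not part of PDFIO: walk `root`, call `extract` on every
# PDF, and collect the paths that failed together with the error they raised.
function scan_pdfs(extract, root::AbstractString)
    failures = Pair{String,Any}[]
    for (dir, _, files) in walkdir(root)
        for name in files
            # Skip anything that is not a PDF by extension.
            endswith(lowercase(name), ".pdf") || continue
            path = joinpath(dir, name)
            try
                extract(path)
            catch err
                # Record the exact file that failed instead of aborting the scan.
                push!(failures, path => err)
            end
        end
    end
    return failures
end
```

A call such as `scan_pdfs(path -> getPDFText(path, path * ".txt"), "some/dir")` would then return the failing paths paired with their exceptions, which makes it easier to attach the right file to a report.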

jakewilliami commented 4 years ago

test.zip

sambitdash commented 4 years ago

The bug occurs because the length of a stream object is given as an indirect object that is itself embedded in an object stream. The current implementation does not look up the length attribute in object streams.
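
The general shape of this failure can be reproduced with a toy model. The sketch below is not PDFIO's code: `ToyRef`, `OBJECTS`, `read_stream`, and `resolve_length` are hypothetical names used only to show how a method that accepts only an integer length raises a `MethodError` when it receives an unresolved value, and how resolving the reference first avoids that.

```julia
# Toy model only: these names are hypothetical and are not PDFIO's internals.
struct ToyRef
    num::Int
end

# Object table standing in for a document's indirect objects.
const OBJECTS = Dict{ToyRef,Any}(ToyRef(7) => 5)

# Reader that, like the candidate method in the stack trace, only accepts
# an Int as the length argument.
read_stream(io::IO, len::Int) = read(io, len)

# Resolve a length that may be a direct Int or a reference to one.
resolve_length(len::Int) = len
resolve_length(ref::ToyRef) = resolve_length(OBJECTS[ref])

io = IOBuffer("hello world")

# Passing the unresolved reference finds no matching method.
err = try
    read_stream(io, ToyRef(7))
    nothing
catch e
    e
end
println(err isa MethodError)

# Resolving first yields an Int, so dispatch succeeds.
text = String(read_stream(io, resolve_length(ToyRef(7))))
println(text)
```

In the real library the lookup additionally has to reach into object streams, which this sketch does not model.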

sambitdash commented 4 years ago

Fixed in https://github.com/sambitdash/PDFIO.jl/commit/c8c3c5795af6daebb38731aa2ffc202ea3949b19

file.txt