sambitdash / PDFIO.jl

PDF Reader Library for Native Julia.

assert error reading a pdf #111

Closed by manentai 9 months ago

manentai commented 9 months ago

I am trying to extract text from a PDF created with Google Docs, but I am running into an error. Attached file: test.pdf

This is the code I am using:

doc = pdDocOpen(filepath)
print("file opened\n")
docinfo = pdDocGetInfo(doc) 
print("file info\n")
print(docinfo)
print("\n")
npage = pdDocGetPageCount(doc)
print(npage)
page=pdDocGetPage(doc, 1)
page1txt=pdPageExtractText(stdout, page);
print(page1txt)
print("\n")

This is the error message I get

file opened
file info
Dict{String, Union{CDDate, String, CosObject}}("Creator" => "þÿ\0G\0o\0o\0g\0l\0e", "Title" => "þÿ\0g\0i\0a\0n\0o\0 \0p\0i\0t\0c\0h\0 \0l\0o\0n\0d\0o\0n")
9                                             giano.rocksERROR: AssertionError: w > 0.1f0
Stacktrace:
 [1] show_text_layout!(io::Base.TTY, state::PDFIO.PD.GState{:PDFIO})
   @ PDFIO.PD ~/.julia/packages/PDFIO/Fv2i0/src/PDPageElement.jl:580
 [2] pdPageExtractText(io::Base.TTY, page::PDFIO.PD.PDPageImpl)
   @ PDFIO.PD ~/.julia/packages/PDFIO/Fv2i0/src/PDPage.jl:110
 [3] top-level scope
   @ ~/projects/flaskGiano/backend.jl:49

I am not sure what is happening here. Maybe this kind of PDF (a company pitch deck) is not suited to this package?

sambitdash commented 9 months ago

Here is the dump from the run on my machine with Julia 1.9.4. I do not see any errors.

julia> print("file opened\n")
file opened

julia> docinfo = pdDocGetInfo(doc)
Dict{String, Union{CDDate, String, CosObject}} with 2 entries:
  "Creator" => "þÿ\0G\0o\0o\0g\0l\0e"
  "Title"   => "þÿ\0g\0i\0a\0n\0o\0 \0p\0i\0t\0c\0h\0 \0l\0o\0n\0d\0o\0n"

julia> print("file info\n")
file info

julia> print(docinfo)
Dict{String, Union{CDDate, String, CosObject}}("Creator" => "þÿ\0G\0o\0o\0g\0l\0e", "Title" => "þÿ\0g\0i\0a\0n\0o\0 \0p\0i\0t\0c\0h\0 \0l\0o\0n\0d\0o\0n")
julia> print("\n")

julia> npage = pdDocGetPageCount(doc)
9

julia> print(npage)
9
julia> page=pdDocGetPage(doc, 1)
PDFIO.PD.PDPageImpl(
PDDoc ==>

CosDoc ==>
        filepath:               C:\work\test\test.pdf
        size:                   1273694
        hasNativeXRefStm:        false
        Trailer dictionaries:
        <<
        /Size   167
        /Root   4 0 R
        /Info   5 0 R
>>

Catalog:
4 0 obj
<<
        /Pages  1 0 R
        /Type   /Catalog
        /PageLabels     <<
        /Nums   [0 <<
        /St     1
        /S      /D
>> ]
>>
        /Outlines       2 0 R
        /Names  <<
        /JavaScript     3 0 R
>>
>>
endobj

isTagged: none
,
6 0 obj
<<
        /Annots 10 0 R
        /Type   /Page
        /MediaBox       [0 0 720 405 ]
        /Resources      8 0 R
        /Group  <<
        /S      /Transparency
        /CS     /DeviceRGB
>>
        /Parent 1 0 R
        /Contents       7 0 R
>>
endobj

, null, nothing, Dict{CosName, PDFIO.PD.PDFont}(), Dict{CosName, PDFIO.PD.PDXObject}())

julia> page1txt=pdPageExtractText(stdout, page);
                                             giano.rocks
julia> print(page1txt)

julia> print("\n")

julia>

sambitdash commented 9 months ago

Here is the code I used to extract a text dump of the whole file.

julia> function getPDFText(src, out)
           # Handle that can be used for subsequent operations on the document.
           doc = pdDocOpen(src)

           # Metadata extracted from the PDF document.
           # This value is retained and returned from the function.
           docinfo = pdDocGetInfo(doc)
           open(out, "w") do io

               # Number of pages in the document.
               npage = pdDocGetPageCount(doc)

               for i = 1:npage

                   # Handle to the specific page given the number index.
                   page = pdDocGetPage(doc, i)

                   # Extract text from the page and write it to the output file.
                   pdPageExtractText(io, page)

               end
           end
           # Close the document handle.
           # The doc handle should not be used after this call.
           pdDocClose(doc)
           return docinfo
           end
getPDFText (generic function with 1 method)

julia> getPDFText("test.pdf", "test.txt")
Dict{String, Union{CDDate, String, CosObject}} with 2 entries:
  "Creator" => "þÿ\0G\0o\0o\0g\0l\0e"
  "Title"   => "þÿ\0g\0i\0a\0n\0o\0 \0p\0i\0t\0c\0h\0 \0l\0o\0n\0d\0o\0n"

digitaldust commented 9 months ago

Thank you for this. I have reinstalled Julia with juliaup and that error disappeared, but I now get this:

ERROR: Invalid file header
Stacktrace:
 [1] error(s::String)
   @ Base ./error.jl:35
 [2] cosDocOpen(fp::String; access::Function)
   @ PDFIO.Cos ~/.julia/packages/PDFIO/V4YuN/src/CosDoc.jl:136
 [3] cosDocOpen
   @ ~/.julia/packages/PDFIO/V4YuN/src/CosDoc.jl:132 [inlined]
 [4] PDFIO.PD.PDDocImpl(fp::String; access::Function)
   @ PDFIO.PD ~/.julia/packages/PDFIO/V4YuN/src/PDDocImpl.jl:16
 [5] PDDocImpl
   @ ~/.julia/packages/PDFIO/V4YuN/src/PDDocImpl.jl:15 [inlined]
 [6] pdDocOpen(filepath::String; access::Function)
   @ PDFIO.PD ~/.julia/packages/PDFIO/V4YuN/src/PDDoc.jl:77
 [7] pdDocOpen
   @ ~/.julia/packages/PDFIO/V4YuN/src/PDDoc.jl:76 [inlined]

I suspect I need to install or update something on my system, such as Apache Tika. I am on Ubuntu 18.04 with OpenJDK 17.0.7, running Julia 1.9.4.
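
One quick check that can help narrow down an "Invalid file header" error is whether the file at the given path starts with the `%PDF-` marker that PDF files begin with. The sketch below is only a diagnostic aid; `has_pdf_header` is a hypothetical helper and not part of PDFIO, and it assumes the marker sits at the very first byte of the file. It does not reproduce whatever check `cosDocOpen` performs internally.

```julia
# Hypothetical helper (not part of PDFIO): report whether the file at `path`
# begins with the five bytes "%PDF-".
function has_pdf_header(path::AbstractString)
    open(path, "r") do io
        magic = read(io, 5)        # reads at most 5 bytes
        return magic == b"%PDF-"   # element-wise comparison of the bytes
    end
end
```

If this returns `false` for the path passed to `pdDocOpen`, the file being read is probably not the PDF you expect (for example a wrong path or an incomplete download). If it returns `true`, the problem more likely lies in the Julia environment.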

sambitdash commented 9 months ago

Did you see this error with the same file? I do not see such a problem. When I run it, I get this text dump of the PDF file: test.txt

digitaldust commented 9 months ago

yes with the same file...

@sambitdash is there something that must be on my system in order for PDFIO to work properly?

sambitdash commented 9 months ago

Please create a fresh environment.

  1. Create a new folder and make it the current working directory.
  2. Start Julia.
  3. Activate the project with ] activate .
  4. Add PDFIO to your project with add PDFIO
  5. Now use PDFIO in this environment.

I tested in a fresh environment like this and did not see any issue with PDFIO. If you are seeing issues with PDFIO in some other environment, I will assume some other packages are interfering with it. That will be hard to isolate, though.
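
The fresh-environment steps can also be scripted through the Pkg API instead of typing them in Pkg mode. This is a minimal sketch, not an official PDFIO recipe: `fresh_environment` is a hypothetical helper name, and by default it uses a temporary directory rather than a folder you create yourself.

```julia
using Pkg

# Hypothetical helper (not part of PDFIO or Pkg): activate a project in `dir`
# and add the requested packages to it.
function fresh_environment(pkgs::Vector{String}; dir::AbstractString = mktempdir())
    mkpath(dir)                      # step 1: make sure the folder exists
    Pkg.activate(dir)                # step 3: like `] activate .` run inside `dir`
    isempty(pkgs) || Pkg.add(pkgs)   # step 4: like `add PDFIO`; needs network access
    return dir
end

# Example call (commented out because it downloads packages):
# fresh_environment(["PDFIO"])
```

After calling `fresh_environment(["PDFIO"])`, run `using PDFIO` in the same session to work in the new environment.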

manentai commented 9 months ago

A new environment solved the problem. Thanks for your patience!