sambitdash / PDFIO.jl

PDF Reader Library for Native Julia.

Bad PDF files with junk characters before the header and after the EOF marker fail with "Unexpected character" error. #97

Status: Closed (diegozea closed the issue 1 year ago)

diegozea commented 3 years ago

Hi! I hit an "Unexpected character" error while parsing many of my PDFs. Here is one example of a PDF that triggers it: https://drive.google.com/file/d/1YXdN7TfwK87_5ekbUElYRFOkVLifKj1F/view?usp=sharing

julia> pdDocOpen("/home/diego/Downloads/Vernon et al. - 2018 - Pi-Pi contacts are an overlooked protein feature relevant to phase separation.pdf")
ERROR: Unexpected character
Stacktrace:
 [1] error(s::String)
   @ Base ./error.jl:33
 [2] doc_trailer_update(ps::IOStream, doc::PDFIO.Cos.CosDocImpl)
   @ PDFIO.Cos ~/.julia/packages/PDFIO/FcFZB/src/CosDoc.jl:399
 [3] cosDocOpen(fp::String; access::Function)
   @ PDFIO.Cos ~/.julia/packages/PDFIO/FcFZB/src/CosDoc.jl:141
 [4] PDFIO.PD.PDDocImpl(fp::String; access::Function)
   @ PDFIO.PD ~/.julia/packages/PDFIO/FcFZB/src/PDDocImpl.jl:16
 [5] pdDocOpen(filepath::String; access::Function)
   @ PDFIO.PD ~/.julia/packages/PDFIO/FcFZB/src/PDDoc.jl:77
 [6] pdDocOpen(filepath::String)
   @ PDFIO.PD ~/.julia/packages/PDFIO/FcFZB/src/PDDoc.jl:77
 [7] top-level scope
   @ REPL[53]:1

My system is:

  [c27321d9] Glob v1.3.0
  [4d0d745f] PDFIO v0.1.12
  [b8865327] UnicodePlots v1.3.0
julia> versioninfo(verbose=true)
Julia Version 1.6.0
Commit f9720dc2eb (2021-03-24 12:55 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
      Ubuntu 18.04.2 LTS (beaver-osp1-bowen X37)
  uname: Linux 5.4.0-72-generic #80~18.04.1-Ubuntu SMP Mon Apr 12 23:26:25 UTC 2021 x86_64 x86_64
  CPU: Intel(R) Core(TM) i7-10850H CPU @ 2.70GHz: 
                 speed         user         nice          sys         idle          irq
       #1-12  2500 MHz     957939 s       4753 s     202091 s    2197037 s          0 s

  Memory: 15.245685577392578 GB (864.19140625 MB free)
  Uptime: 521353.0 sec
  Load Avg:  1.05  1.38  1.39
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, skylake)
Environment:
  MANDATORY_PATH = /usr/share/gconf/ubuntu.mandatory.path
  DEFAULTS_PATH = /usr/share/gconf/ubuntu.default.path
  HOME = /home/diego
  WINDOWPATH = 2
  TERM = xterm-256color
  PATH = /home/diego/.local/bin:/home/diego/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin:/home/diego/bin:/home/diego/.local/bin

I really appreciate any help you can provide.

Best regards,

sambitdash commented 3 years ago

The file is corrupt. A PDF file must start with %PDF and end with %%EOF. While some readers take a lenient stance on this, one cannot say that is the right approach. Anyway, I fixed the file and am uploading it here for reference: fixed.pdf
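To see whether a given file breaks the rule described above, one can count the bytes that sit before the %PDF header and after the final %%EOF marker. The following is only a minimal sketch: `junk_bytes` is a hypothetical helper, not part of PDFIO.jl, and it assumes Julia 1.6 or later for the byte-pattern methods of `findfirst` and `findlast`.

```julia
# Hypothetical helper, NOT part of PDFIO.jl.
# Reports how many stray bytes surround the PDF body.
# Requires Julia 1.6+ (byte-vector patterns in findfirst/findlast).
function junk_bytes(data::Vector{UInt8})
    head = findfirst(Vector{UInt8}("%PDF-"), data)  # first header occurrence
    tail = findlast(Vector{UInt8}("%%EOF"), data)   # last EOF marker
    (head === nothing || tail === nothing) && return nothing

    before = first(head) - 1                # bytes preceding "%PDF-"
    trailing = data[last(tail)+1:end]       # everything after "%%EOF"
    # A line ending after %%EOF is normal, so ignore CR, LF and spaces.
    after = count(b -> !(b in (0x0a, 0x0d, 0x20)), trailing)
    return (before = before, after = after)
end

# Usage (the file name is a placeholder):
# junk_bytes(read("paper.pdf"))
```

A result of `(before = 0, after = 0)` suggests the file is clean in this respect; `nothing` means no header or no EOF marker was found at all.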

diegozea commented 3 years ago

Thank you so much for the quick answer. I am having this error with 75% of my files. Would it be possible to have a keyword argument that allows parsing this kind of file? Something like permisive=true, but false by default?

sambitdash commented 3 years ago

These files do not conform to the PDF spec. So technically, the behavior of a parser on corrupt files cannot be guaranteed, and a fix should not be rushed. While I will keep in mind to update the parser to handle some bad files, I cannot make it a guaranteed feature of the product. For now, you can remove the bad MIME corruptions from the files manually and then work with them.

This can be done easily with a binary-preserving text editor such as vi or emacs on Unix or Linux.
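For many files, the manual clean-up described above could also be scripted. The sketch below keeps only the bytes from the first %PDF- header through the last %%EOF marker. `strip_pdf_junk` and `clean_pdf` are hypothetical names, not PDFIO.jl functions, and the approach assumes the junk was added around an otherwise valid PDF (for example by a mail or download wrapper), so that the cross-reference offsets inside the file are still correct once the leading bytes are gone. If that assumption does not hold, the cleaned file may still fail to parse.

```julia
# Hypothetical helpers, NOT part of PDFIO.jl.
# Requires Julia 1.6+ (byte-vector patterns in findfirst/findlast).

# Return a copy of `data` trimmed to the span "%PDF-" ... "%%EOF".
function strip_pdf_junk(data::Vector{UInt8})
    head = findfirst(Vector{UInt8}("%PDF-"), data)
    tail = findlast(Vector{UInt8}("%%EOF"), data)
    (head === nothing || tail === nothing) &&
        error("no %PDF- header or %%EOF marker found")
    first(head) < first(tail) ||
        error("%%EOF marker appears before the %PDF- header")
    # Keep header through EOF marker and end with a single newline.
    return vcat(data[first(head):last(tail)], UInt8('\n'))
end

# Read `src`, write the trimmed bytes to `dst`, and leave `src` untouched.
function clean_pdf(src::AbstractString, dst::AbstractString)
    write(dst, strip_pdf_junk(read(src)))
    return dst
end

# Usage (file names are placeholders):
# clean_pdf("paper.pdf", "paper.fixed.pdf")
# doc = pdDocOpen("paper.fixed.pdf")
```

Writing to a separate output file rather than editing in place keeps the original available in case the trimmed copy turns out to be unusable.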

diegozea commented 3 years ago

Thank you very much! There is no hurry at all :D

sambitdash commented 1 year ago

A fix is in https://github.com/sambitdash/PDFIO.jl/commit/4a4f0713d840fa6db74b24a25ae4c35cf792d412