sambitdash / PDFIO.jl

PDF Reader Library for Native Julia.

invalid or incomplete deflate data #88

Closed by FinPl 4 years ago

FinPl commented 4 years ago

Hello, I am encountering an error while trying to extract text from the first page of this document:

https://www.diffusion.transports.gouv.qc.ca/ords/pes/APEX_PES.P_PESB_DSI_AFFCH_RIG?P_VC_NUM_DOSSR=00007

I am rather new to PDF parsing and, as I understand it, there might be a problem with the compression used. Other, similar documents work perfectly fine.

This one works:

https://www.diffusion.transports.gouv.qc.ca/ords/pes/APEX_PES.P_PESB_DSI_AFFCH_RIG?P_VC_NUM_DOSSR=00003

Can you help me solve that issue?

sambitdash commented 4 years ago

The Flate-compressed data of the page's content stream is corrupt, so decompression fails partway through and the extraction cannot complete. Accepting the partially decompressed data would let the file pass through, but some of the extracted content may be missing or garbled because of the bad Flate data.

sambitdash commented 4 years ago

Fix in: https://github.com/sambitdash/PDFIO.jl/commit/208d064bc2f7bc0263fbd02d177a4542e98b9183

There can never be a perfect solution when the data is corrupt; whatever data can be recovered is recovered.
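
For readers who want to see what "recover whatever can be recovered" can look like, here is a minimal sketch in Python using the standard `zlib` module. It is not the code from the commit above and not PDFIO.jl's API. The function name `inflate_best_effort` and the `chunk_size` parameter are hypothetical, chosen only for illustration. The idea is to feed the compressed bytes to the decompressor in small chunks and keep everything produced before the first error.

```python
import zlib
from typing import Tuple


def inflate_best_effort(data: bytes, chunk_size: int = 64) -> Tuple[bytes, bool]:
    """Decompress a zlib/Flate stream, keeping output produced before any error.

    Returns (recovered_bytes, clean). `clean` is True only if the stream
    decoded all the way to its end marker without an error.
    Hypothetical helper for illustration; not part of PDFIO.jl.
    """
    decomp = zlib.decompressobj()
    out = bytearray()
    for start in range(0, len(data), chunk_size):
        try:
            # Feeding small chunks limits how much output is lost when
            # the decompressor hits a bad block.
            out += decomp.decompress(data[start:start + chunk_size])
        except zlib.error:
            # Corrupt data: stop here and return what was decoded so far.
            return bytes(out), False
        if decomp.eof:
            # End of the compressed stream reached; ignore trailing bytes.
            return bytes(out), True
    # Input ran out before the end marker: the stream was truncated.
    return bytes(out), decomp.eof
```

A caller would treat `clean == False` as a warning rather than a failure and continue with the partial bytes. With a truncated stream, the recovered bytes are a prefix of the original data. With bit-level corruption, the bytes decoded just before the error may already be wrong, which is why partially recovered text can still contain garbage.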