jrmuizel / pdf-extract

A Rust library for extracting content from PDFs

Failure to extract text from AMD GPU ISA docs #37

inequation closed 3 months ago

inequation commented 2 years ago

Frankly, I have no clue whether the problem lies in pdf-extract or in one of its dependencies. Please redirect me if this issue is misplaced.

For the public AMD GPU ISA documentation, such as https://developer.amd.com/wordpress/media/2012/12/AMD_Southern_Islands_Instruction_Set_Architecture.pdf, pdf-extract extracts blank pages. Other extractors, such as PyPDF2, extract the text just fine.

jrmuizel commented 2 years ago

That PDF is encrypted. I filed https://github.com/J-F-Liu/lopdf/issues/168 about adding support for RC4 encryption to lopdf.
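
For anyone who wants a quick check before filing a similar report, here is a rough sketch of how one might guess whether a PDF is encrypted. It only scans the raw bytes for the `/Encrypt` key that an encrypted file's trailer (or cross-reference stream dictionary) carries. The function name `looks_encrypted` is made up for illustration and is not part of pdf-extract or lopdf, and the check is a heuristic rather than a real parse, so it can give false positives.

```rust
/// Heuristic: report whether the raw bytes of a PDF mention an `/Encrypt`
/// entry. `looks_encrypted` is a hypothetical helper, not a pdf-extract or
/// lopdf API. It does not parse the file, so a document that merely contains
/// the literal text "/Encrypt" would also match.
fn looks_encrypted(pdf_bytes: &[u8]) -> bool {
    const NEEDLE: &[u8] = b"/Encrypt";
    // Slide a window of the needle's length over the file and compare.
    pdf_bytes
        .windows(NEEDLE.len())
        .any(|window| window == NEEDLE)
}
```

To try it on a real file, read the bytes with `std::fs::read(path)` and pass the resulting buffer as a slice.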

inequation commented 2 years ago

Thank you! I certainly lack the background knowledge to diagnose this. :)

saviour123 commented 1 year ago

@inequation Any update on this?

inequation commented 1 year ago

> @inequation Any update on this?

How should I know? I'm not the developer, just a user. :) I ended up extracting the text manually, as the formatting appears to confuse converters to the point where the output is useless: columns in tables get all mixed up between rows.

jwhear commented 1 year ago

FYI, my implementation of RC4 decryption was just merged into lopdf: https://github.com/J-F-Liu/lopdf/pull/228, so this should now be unblocked.
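
For readers unfamiliar with the cipher itself, below is a minimal, standalone sketch of plain RC4. It is not the code from the lopdf pull request, and the function name `rc4` is chosen only for illustration. It also leaves out the PDF-specific part: the standard security handler derives a separate key for each object, which this sketch does not attempt.

```rust
/// Plain RC4 stream cipher (illustrative sketch, not lopdf's implementation).
/// Encryption and decryption are the same operation: XOR with the keystream.
fn rc4(key: &[u8], data: &[u8]) -> Vec<u8> {
    assert!(
        !key.is_empty() && key.len() <= 256,
        "RC4 key must be between 1 and 256 bytes"
    );

    // Key-scheduling algorithm: start from the identity permutation and
    // shuffle it using the key bytes.
    let mut s = [0u8; 256];
    for (idx, slot) in s.iter_mut().enumerate() {
        *slot = idx as u8;
    }
    let mut j: u8 = 0;
    for i in 0..256 {
        j = j.wrapping_add(s[i]).wrapping_add(key[i % key.len()]);
        s.swap(i, j as usize);
    }

    // Pseudo-random generation: produce one keystream byte per input byte
    // and XOR it into the data.
    let (mut i, mut j) = (0u8, 0u8);
    data.iter()
        .map(|&byte| {
            i = i.wrapping_add(1);
            j = j.wrapping_add(s[i as usize]);
            s.swap(i as usize, j as usize);
            let k = s[s[i as usize].wrapping_add(s[j as usize]) as usize];
            byte ^ k
        })
        .collect()
}
```

Because the cipher is symmetric, calling `rc4(key, &rc4(key, data))` returns the original `data`.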

prscoelho commented 5 months ago

@jrmuizel any chance of picking this up now that lopdf supports decryption? Or would you accept a PR for this?