pdf disarm doesn't work

dealbreaker973 commented 1 year ago

Hi, I ran the provided test file and printed the result of disarmed_pdf_buffers = disarm_pdfs_by_buffer(filenames, file_buffers) to check its contents. This is the output I got:

STARTING DISARM
/JS -> /js
PDFiD 0.2.7 ./Dante.pdf
 PDF Header: %PDF-1.7
 obj                   20
 endobj                20
 stream                 5
 endstream              5
 xref                   1
 trailer                1
 startxref              1
 /Page                  1
 /Encrypt               0
 /ObjStm                0
 /JS                    1
 /JavaScript            0
 /AA                    0
 /OpenAction            0
 /AcroForm              0
 /JBIG2Decode           0
 /RichMedia             0
 /Launch                0
 /EmbeddedFile          0
 /XFA                   0
 /Colors > 2^24         0

{'buffers': []} <-- nothing in the returned buffer
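
For comparison, here is a minimal sketch of the result shape I would expect: one disarmed buffer per input buffer, in the same order. This is not the library's code. disarm_buffer, disarm_all and RISKY_NAMES are hypothetical names used only for illustration, and the only behavior taken from the output above is the /JS -> /js case swap; the rest of the keyword list is an assumption.

```python
import re

# Hypothetical keyword list: only the /JS rewrite is confirmed by the output above.
# The longer name comes first so the alternation matches it before /JS.
RISKY_NAMES = (b"/JavaScript", b"/JS")


def disarm_buffer(data: bytes) -> bytes:
    """Return a copy of data with risky PDF names case-swapped (e.g. /JS -> /js)."""
    pattern = re.compile(b"|".join(re.escape(name) for name in RISKY_NAMES))
    return pattern.sub(lambda match: match.group(0).swapcase(), data)


def disarm_all(file_buffers):
    """Return {'buffers': [...]} holding one disarmed buffer per input buffer."""
    return {"buffers": [disarm_buffer(buf) for buf in file_buffers]}
```

With a single input buffer, the 'buffers' list in this sketch has exactly one entry, which is what I expected from disarm_pdfs_by_buffer instead of an empty list.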

Also, in testing 3.1, analyze_pdfs_by_buffer loaded the file by filename instead of checking the sanitized buffer, which I believe is not the expected behavior.
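
To illustrate the difference, here is a simplified sketch of an analysis step that reads only from the in-memory buffer it is given and never reopens the file from disk. analyze_buffer and NAMES are hypothetical and are not part of pdfid; the plain byte count is also an assumption and is much cruder than what PDFiD really does.

```python
import io

# Hypothetical subset of the names that appear in the PDFiD report.
NAMES = (b"/JS", b"/JavaScript")


def analyze_buffer(data: bytes) -> dict:
    """Count selected PDF names using only the bytes passed in."""
    content = io.BytesIO(data).read()  # read from memory, not from a filename
    return {name.decode("ascii"): content.count(name) for name in NAMES}


original = b"1 0 obj << /JS (app.alert(1)) >> endobj"
sanitized = original.replace(b"/JS", b"/js")  # stand-in for a disarmed buffer

print(analyze_buffer(original))
print(analyze_buffer(sanitized))
```

If the analysis reloads the file by filename, it reports on the original file on disk (the first call) rather than the sanitized buffer (the second call), so the /JS count never drops to 0.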

mlodic commented 1 year ago

Hey, thank you for reporting this. I am only providing the Python image here, so I expect contributors to help solve these bugs by submitting a PR with a solution.

mlodic commented 1 year ago

Released your patch in 1.1.2.

mlodic / pdfid

pdf disarm doesn't work #7