simsong / bulk_extractor

This is the development tree. Production downloads are at:
https://github.com/simsong/bulk_extractor/releases

PDF Scanner Misses Most / All Emails #373

Open mthbrown opened 2 years ago

mthbrown commented 2 years ago

Hi,

I was just testing out bulk_extractor. One of my tests was to create the following text file:

abc@google.com
def@google.com
def@gmail.com
abc@google.com
abc@google.com

When I then run bulk_extractor and point it to the text file, I get the expected output in email.txt:

# BANNER FILE NOT PROVIDED (-b option)
# BULK_EXTRACTOR-Version: 2.0.0
# Feature-Recorder: email
# Filename: /tmp/1.txt
# Feature-File-Version: 1.1
0       abc@google.com  abc@google.com\015\012def@google.com
16      def@google.com  abc@google.com\015\012def@google.com\015\012def@gmail.com\015
32      def@gmail.com   def@google.com\015\012def@gmail.com\015\012abc@google.com
47      abc@google.com  \012def@gmail.com\015\012abc@google.com\015\012abc@google.com
63      abc@google.com  abc@google.com\015\012abc@google.com\015\012

I then converted the text file to a PDF using pandoc (pdflatex) and opened it in a PDF reader, where I can clearly see the email addresses (on a single line with spaces between them), as shown here:

Screenshot from 2022-09-19 07-46-43

and here is the related PDF: pandoc.pdf

Now I only get this when I run bulk_extractor:

# BANNER FILE NOT PROVIDED (-b option)
# BULK_EXTRACTOR-Version: 2.0.0
# Feature-Recorder: email
# Filename: emails.pdf
# Feature-File-Version: 1.1
69-PDF-35       def@gmail.com   e f@go ogle.com def@gmail.com ab c@go ogle.co

Finally, I opened the text file in Firefox, selected Print to PDF, and opened the resulting file in a PDF reader, which showed me the expected text:

Screenshot from 2022-09-19 07-48-46

and here is the related PDF: firefox.pdf

However, now when I run bulk_extractor on the generated PDF, email.txt is empty. Is this expected behavior? Am I missing something? Thanks

simsong commented 2 years ago

Hi. Thank you for submitting the bug report. Would it be possible for you to attach the two PDFs to this ticket?

It turns out that the bulk_extractor PDF to text program does not work the way that most PDF to text programs work, as it is designed to work with fragmented files. Instead of going to the end of the PDF file, reading a table, going to each page, creating the objects, and then interpreting the objects, scan_pdf looks for patterns within the inflated compressed streams and applies some simple heuristics. The heuristics were based on analysis of PDF files in the 2008-2014 time period. But the way the PDFs are created from text changes over time. bulk_extractor was not designed for pdflatex or for Firefox PDF generators. It was designed for Microsoft Word on the Mac and Windows.
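As a rough illustration of that approach (not scan_pdf's actual code), here is a minimal Python sketch that finds `stream ... endstream` bodies in a possibly fragmented buffer, inflates the ones zlib accepts, and concatenates the literal strings it finds. The names `inflate_streams` and `naive_text` are hypothetical, and the sketch assumes FlateDecode streams and unescaped literal strings.

```python
import re
import zlib

# Body of every "stream ... endstream" pair, wherever it sits in the buffer.
STREAM_RE = re.compile(rb"stream\r?\n(.*?)endstream", re.DOTALL)
# Literal strings such as (abc), as used by Tj or inside a TJ array.
STRING_RE = re.compile(rb"\(((?:\\.|[^\\()])*)\)")


def inflate_streams(data: bytes):
    """Yield each stream body that zlib can inflate; skip the rest."""
    for match in STREAM_RE.finditer(data):
        try:
            # decompressobj tolerates trailing bytes and truncated input,
            # which matters when the file is only a fragment.
            yield zlib.decompressobj().decompress(match.group(1))
        except zlib.error:
            continue


def naive_text(content: bytes) -> str:
    """Concatenate literal strings in order, ignoring all positioning."""
    return "".join(
        m.group(1).decode("latin-1") for m in STRING_RE.finditer(content)
    )
```

A generator such as pdflatex can split a single word across several strings in a TJ array with kerning values between them, so any heuristic that treats each string boundary as a word break may produce output like `ab c@go ogle.co`. The sketch above joins strings without any spacing, which has the opposite failure mode: it also merges separate words.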

A better heuristic would be to take all of the (x,y) locations of the text, drop them into a frame buffer, and then run OCR on the frame buffer. You wouldn't need to do full OCR because you already know what the letters are. You would only need to do line and word break detection. You need to find the lines so you know the order in which to send the characters, and you need the word breaks because there are no spaces encoded in PDF files.
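The line and word break step could be sketched in Python as follows. This is only an assumption about how such a recognizer might look: the `Glyph` record, the `glyphs_to_text` function, and the `line_tol` and `gap_ratio` thresholds are all hypothetical, and the sketch assumes that glyph positions and widths have already been recovered from the content stream.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    char: str
    x: float      # left edge of the glyph
    y: float      # baseline; PDF y coordinates grow upward
    width: float  # advance width in the same units as x


def glyphs_to_text(glyphs, line_tol=2.0, gap_ratio=0.3):
    """Group glyphs into lines by baseline, then insert spaces at wide gaps."""
    lines = []
    # Highest baseline first, so lines come out in reading order.
    for g in sorted(glyphs, key=lambda g: -g.y):
        if lines and abs(lines[-1][0].y - g.y) <= line_tol:
            lines[-1].append(g)
        else:
            lines.append([g])

    out = []
    for line in lines:
        line.sort(key=lambda g: g.x)
        text = line[0].char
        for prev, cur in zip(line, line[1:]):
            gap = cur.x - (prev.x + prev.width)
            # A gap much wider than ordinary kerning is taken as a word break.
            if gap > gap_ratio * max(prev.width, cur.width):
                text += " "
            text += cur.char
        out.append(text)
    return "\n".join(out)
```

Both thresholds would need tuning against real documents; a fixed ratio is the simplest choice and will misjudge fonts with unusually tight or loose spacing.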

Do you want to give this a try? bulk_extractor has switches to dump the inflated compressed streams, and then you can write a new recognizer that turns the characters into a text stream.

mthbrown commented 2 years ago

Thanks @simsong. I added the PDFs. Unfortunately, I don't know C++

simsong commented 2 years ago

This is an easy way to learn!

Do you know Python? I have been planning on doing a Python bridge.



mthbrown commented 2 years ago

I know some Python