simsong / bulk_extractor

This is the development tree. Production downloads are at:
https://github.com/simsong/bulk_extractor/releases

bulk_extractor needs a quote printable decoder #364

Open Donovoi opened 2 years ago

Donovoi commented 2 years ago

Hi, apologies if you have already addressed this.

I ran bulk_extractor version 2.0.1 against a Windows 7 raw dd image containing user data as well as the base Windows system.

I noticed that in the email histogram it produced, one of the emails was shown as 3dklaus1@redacted.com (I've removed the domain so I don't get in trouble).

I used X-Ways to search for this email address in the data, and the only hits I could find are .eml files showing <a title=3Dklaus1@redacted.com. I believe this is not the actual address but an artifact of quoted-printable encoding. Here is some more context:

<A title=3Dklaus1@redacted.com =
href=3D"mailto:klaus1@redacted.com">Klaus non=20
  Redacted</A> </DIV>
  <DIV style=3D"FONT: 10pt arial">

And here is where I learnt about it: https://stackoverflow.com/a/4016098
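
For readers unfamiliar with the encoding, here is a minimal sketch of a quoted-printable decoder, assuming the rules from RFC 2045: `=XX` (two hex digits) stands for one byte, and `=` at the end of a line is a soft line break that is dropped. The function names `hex_value` and `decode_quoted_printable` are hypothetical and are not part of bulk_extractor.

```cpp
#include <cstddef>
#include <string>

// Convert one hex digit to its value, or return -1 if it is not a hex digit.
static int hex_value(char c) {
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    return -1;
}

// Hypothetical helper: decode quoted-printable text.
// Malformed escapes are copied through unchanged.
std::string decode_quoted_printable(const std::string& in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        if (in[i] != '=') {
            out += in[i];
            continue;
        }
        // Soft line break: "=\r\n" or "=\n" is removed entirely.
        if (i + 2 < in.size() && in[i + 1] == '\r' && in[i + 2] == '\n') {
            i += 2;
            continue;
        }
        if (i + 1 < in.size() && in[i + 1] == '\n') {
            i += 1;
            continue;
        }
        // Escaped byte: "=XX" with two hex digits.
        if (i + 2 < in.size()) {
            const int hi = hex_value(in[i + 1]);
            const int lo = hex_value(in[i + 2]);
            if (hi >= 0 && lo >= 0) {
                out += static_cast<char>(hi * 16 + lo);
                i += 2;
                continue;
            }
        }
        out += '=';  // not a valid escape, keep the literal '='
    }
    return out;
}
```

Applied to the fragment above, `title=3Dklaus1@redacted.com` decodes to `title=klaus1@redacted.com`, and `non=20` decodes to `non` followed by a space.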

Let me know if this has already been addressed or if you need more info.

Thank you for your work!

simsong commented 2 years ago

Thanks for the posting. I changed the title to reflect what you need.

bulk_extractor doesn't parse files. It looks at bulk data. The problem you have here is that there is quoted-printable material that is not being decoded. In your example, it looks like the program would recover both 3Dklaus1@redacted.com and klaus1@redacted.com. Can you verify that it did?

If we add a quoted-printable decoder, it will then recover three email addresses in your example, because =3Dklaus1@redacted.com will parse as both =3Dklaus1@redacted.com and then, with a longer forensic path, klaus1@redacted.com. You could remove the first with post-processing.
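
As one possible reading of "remove the first with post-processing", here is a rough sketch that drops an address beginning with `3D` or `3d` when the same address without that prefix was also recovered. The function name `drop_qp_artifacts` and the rule itself are assumptions for illustration; bulk_extractor does not ship this filter.

```cpp
#include <set>
#include <string>
#include <vector>

// Hypothetical post-processing step: remove "3D"-prefixed addresses
// that are likely undecoded quoted-printable artifacts.
std::vector<std::string> drop_qp_artifacts(const std::vector<std::string>& addresses) {
    const std::set<std::string> seen(addresses.begin(), addresses.end());
    std::vector<std::string> kept;
    for (const auto& a : addresses) {
        // "=3D" is the quoted-printable escape for '=', so a leading "3D"
        // is suspicious only if the remainder was also seen on its own.
        const bool prefixed =
            a.size() > 2 && a[0] == '3' && (a[1] == 'D' || a[1] == 'd');
        if (prefixed && seen.count(a.substr(2)) > 0) {
            continue;  // treat as an artifact and drop it
        }
        kept.push_back(a);
    }
    return kept;
}
```

Requiring the unprefixed address to be present keeps legitimate addresses that merely start with "3d" from being discarded.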

Question: in your report you say that the program found 3dklaus1@redacted.com. However, given your example, it should have found 3Dklaus1@redacted.com. Can you check this for me?

Donovoi commented 2 years ago

Hi @simsong thanks for your reply. Yes I can confirm it recovered both email addresses.

Sorry, I should have been clearer. It did find 3D, but I believe there is a flag that makes all emails in the histogram lowercase? https://github.com/simsong/bulk_extractor/blob/17c2a0d52d67f3dd9bb46f62ed8678c6e48cf525/src/scan_email_lg.cpp#L232

I could be wrong, I'm not a CPP programmer.

simsong commented 2 years ago

You are correct. In the histogram the emails are lowercased. Do you think that a quoted-printable decoder is worth doing? It's not hard. Do you want to become a C++ programmer?
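
To illustrate why `3Dklaus1@...` shows up as `3dklaus1@...` in the histogram, here is a small sketch of a histogram that lowercases each feature before counting. The names `to_lower_ascii` and `lowercase_histogram` are hypothetical; this is not the code in scan_email_lg.cpp, only the general idea.

```cpp
#include <algorithm>
#include <cctype>
#include <map>
#include <string>
#include <vector>

// Lowercase a string byte by byte (ASCII only).
std::string to_lower_ascii(std::string s) {
    std::transform(s.begin(), s.end(), s.begin(), [](unsigned char c) {
        return static_cast<char>(std::tolower(c));
    });
    return s;
}

// Hypothetical histogram: count features after lowercasing them,
// so "Klaus1@Redacted.com" and "klaus1@redacted.com" share one bucket.
std::map<std::string, int> lowercase_histogram(const std::vector<std::string>& features) {
    std::map<std::string, int> counts;
    for (const auto& f : features) {
        ++counts[to_lower_ascii(f)];
    }
    return counts;
}
```

With this normalization, the feature file can still hold the original `3D` spelling while the histogram reports the lowercased key.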

Donovoi commented 2 years ago

Haha! I would love to! Thank you for the opportunity!

Don't expect anything on par with your work, but I can give it a go :)

simsong commented 2 years ago

It's far easier to develop on Linux or Mac than Windows. Are you okay with that?

Donovoi commented 2 years ago

Yes I can use most well-known distros. Leave it with me and I'll come back with a solution!


simsong commented 2 years ago

Great. Why don't you try to build under the current Fedora? If you can spin up a VM and build it, I can then give you step-by-step instructions on how to develop the quoted-printable decoder. We won't have a method for the decoder to suppress the false positive, but it may be useful for other purposes. And you'll learn something!

Donovoi commented 2 years ago

Woohoo!

Well, I've just created a Fedora 36 Workstation instance inside QEMU/KVM (which is inside WSL2), and it is working well.

I'll wait to hear from you regarding next steps.

Thanks again!

simsong commented 2 years ago

Great. You need to do a git clone --recursive on this repo and then apply the script in the etc directory and then verify that you can build and execute the self tests. If you need help to do this, let me know, and I'll develop a readme with you in the repo. We will then expand the readme so that people learn how to develop new modules. Sound cool?

Donovoi commented 2 years ago

Sounds great!

I have successfully run the tests via regress.py.

It does say that some features were not found:

Now reading features from data_check.txt
b'Data/Base64_files/EmailText/RADIX64\xf4\x80\x80\x9c-0-BASE64-2370-ZIP-0-MSXML-2' not found b'RADIX64@RADIX64.com'
b'Data/Base64_files/EmailText/RFC1421\xf4\x80\x80\x9c-106-BASE64-2322-ZIP-1213' not found b'RFC1421@RFC1421.com'
b'Data/Base64_files/EmailText/RFC1642\xf4\x80\x80\x9c-0-BASE64-2370-ZIP-1213' not found b'RFC1642@RFC1642.com'
b'Data/Base64_files/EmailText/RFC2045\xf4\x80\x80\x9c-0-BASE64-2370-ZIP-0-MSXML-2' not found b'RFC2045@RFC2045.com'
b'Data/Base64_files/EmailText/RFC3548\xf4\x80\x80\x9c-0-BASE64-2423-ZIP-0-MSXML-30' not found b'RFC3548@RFC3548.com'
b'Data/Base64_files/JEPG/RFC 1421\xf4\x80\x80\x9c-0-BASE64-0' not found b'057b7e3d9e7a3a3db3e147a6ce16e786'
Total features found: 66
Total features not found: 6

But everything else seems to work as expected.

simsong commented 2 years ago

Sorry for the delay in getting back to you. I've been dealing with a server-down situation on simson.net.

Anyway, the regress.py is a Version 1.0 system. The test for version 2.0 is bin/test_be which runs all of the unit tests. But it looks like you've got this working.

Congrats!

Now the thing to do is to create a branch with git. Let's call it dev-quote-printable. I can add you as a contributor to this repo, or you can fork and do your own.

Have you read the bulk_extractor programmer's manual? I haven't compiled it in a while. Probably the best way for us to do this would be for you to read the manual and then put questions in it, and I'll answer them. In this way the manual will get better.

So here's what you need to do:

  1. Create a scan_quoteprintable.cpp file based on one of the other scanners and hook it into the autoconf system. Your first version of the scanner should not do anything but init and deinit and register its metadata.
  2. Add scan_quoteprintable to bulk_extractor_scanners.h. (This is new with version 2.0 and the programmer's manual needs to be updated.)
  3. Run bulk_extractor and verify that your scanner appears in the scanner list.
  4. Now you need to make your scanner recognize quoted-printable and unquote it. You will do this by scanning the sbuf, looking for quoted-printable sequences, and writing to a stringstream. Once you catch a certain number of them, you'll make an sbuf with the stringstream and execute a recursive call. I can show you where this happens, and it should be properly documented.
  5. Now you need to create a unit test.
  6. Finally, we need to think about how to suppress false positives. That's more art than science.
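
Step 4 above can be sketched roughly as follows. This sketch deliberately avoids the real bulk_extractor API: a `std::string` stands in for the sbuf, and the `RecurseFn` callback stands in for the recursive call on a new sbuf. The names `qp_hex_value`, `RecurseFn`, `scan_quoted_printable`, and the `min_escapes` threshold are all hypothetical.

```cpp
#include <cstddef>
#include <functional>
#include <sstream>
#include <string>

// Convert one hex digit to its value, or return -1 if it is not a hex digit.
static int qp_hex_value(char c) {
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    return -1;
}

// Stand-in for "make an sbuf from the stringstream and recurse".
using RecurseFn = std::function<void(const std::string& decoded)>;

// Scan buf, decode quoted-printable escapes into a stringstream, and
// hand the decoded text to `recurse` only if at least `min_escapes`
// escapes were seen. Returns the number of escapes found.
std::size_t scan_quoted_printable(const std::string& buf,
                                  std::size_t min_escapes,
                                  const RecurseFn& recurse) {
    std::ostringstream decoded;
    std::size_t escapes = 0;
    for (std::size_t i = 0; i < buf.size(); ++i) {
        if (buf[i] == '=' && i + 2 < buf.size() &&
            buf[i + 1] == '\r' && buf[i + 2] == '\n') {
            i += 2;  // soft line break "=\r\n"
            ++escapes;
        } else if (buf[i] == '=' && i + 1 < buf.size() && buf[i + 1] == '\n') {
            i += 1;  // soft line break "=\n"
            ++escapes;
        } else if (buf[i] == '=' && i + 2 < buf.size() &&
                   qp_hex_value(buf[i + 1]) >= 0 && qp_hex_value(buf[i + 2]) >= 0) {
            decoded << static_cast<char>(qp_hex_value(buf[i + 1]) * 16 +
                                         qp_hex_value(buf[i + 2]));
            i += 2;  // escaped byte "=XX"
            ++escapes;
        } else {
            decoded << buf[i];  // ordinary byte, copied through
        }
    }
    if (escapes >= min_escapes) {
        recurse(decoded.str());
    }
    return escapes;
}
```

The threshold is one simple way to avoid recursing on data that merely contains a stray `=`; choosing its value is part of the false-positive question in step 6.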

We might also want to create a new hook in the feature recorder so that passthrough features are automatically discarded. That is, these two features are probably the same and the second should not be reported:

1234567    user@company.com
1234000-QUOTEPRINTABLE-467 user@company.com
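
A rough sketch of what such passthrough suppression might look like, written as standalone post-processing and not as a feature recorder hook. It assumes a forensic path is either a plain decimal offset or a decimal offset followed by `-QUOTEPRINTABLE-` and more path, and it treats a decoded feature as a passthrough when the same text was also found at a plain offset at or after the start of the decoded region. The `Feature` struct, `is_plain_offset`, `drop_passthrough`, and this matching rule are all hypothetical.

```cpp
#include <algorithm>
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical feature record: forensic path plus the feature text.
struct Feature {
    std::string path;
    std::string text;
};

// True if s is a non-empty string of decimal digits.
static bool is_plain_offset(const std::string& s) {
    return !s.empty() && std::all_of(s.begin(), s.end(), [](unsigned char c) {
        return std::isdigit(c) != 0;
    });
}

// Drop decoded features that duplicate a feature found in the raw data.
std::vector<Feature> drop_passthrough(const std::vector<Feature>& features) {
    const std::string marker = "-QUOTEPRINTABLE-";
    std::vector<Feature> kept;
    for (const auto& f : features) {
        bool passthrough = false;
        const std::size_t pos = f.path.find(marker);
        if (pos != std::string::npos && is_plain_offset(f.path.substr(0, pos))) {
            const unsigned long long base = std::stoull(f.path.substr(0, pos));
            for (const auto& g : features) {
                // Same text seen directly in the raw data, at or after the
                // start of the decoded region: assume it is the same hit.
                if (is_plain_offset(g.path) && g.text == f.text &&
                    std::stoull(g.path) >= base) {
                    passthrough = true;
                    break;
                }
            }
        }
        if (!passthrough) {
            kept.push_back(f);
        }
    }
    return kept;
}
```

With the two example features above, only the one at plain offset 1234567 would be kept. A real hook would probably also need an upper bound on the region, which this sketch omits.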

If this sounds like something you can do, I can create the blank scanner to get you going.

Donovoi commented 2 years ago

Thank you for those instructions! I'll have a read of the manual and fork the repo, just so I can make mistakes without them becoming a problem for someone else's blood, sweat, and tears, ha.

This will take a bit of time for me as I'll need to learn a few things and juggle some assignments. But hopefully I will have something of a draft within the week. Not promising anything as something might come up.

I'll be sure to post any questions on the programmer's manual.

Leave it with me 😁