simsong / bulk_extractor

This is the development tree. Production downloads are at:
https://github.com/simsong/bulk_extractor/releases

Running bulk_extractor with -J uses multiple threads #402

Closed laissezfarrell closed 6 months ago

laissezfarrell commented 1 year ago

Running bulk_extractor 2.02 with command:

bulk_extractor -S ssn_mode=1 -e outlook -x zip -x rar -x winpe -x exif -x pdf -J -d8 -o /home/accessions/b_e2x_errors/debug_mode05 -R /home/accessions/UA2023-0021/objects/OPD/ -F /home/scripts/be_regex/uaregex.txt

reported "going multi-threaded (24)" and eventually reported "All data read; waiting for threads to finish..."

Execution environment:

<execution_environment>
  GenuineIntel 0 0 15 0 0 71 808 110 117 262144
  <os_sysname>Linux</os_sysname>
  <os_release>5.15.49-linuxkit</os_release>
  <os_version>#1 SMP Tue Sep 13 07:51:46 UTC 2022</os_version>
  <host>039a1d8462b0</host>
  <arch>x86_64</arch>
  <command_line>bulk_extractor -S ssn_mode=1 -e outlook -x zip -x rar -x winpe -x exif -x pdf -J -d8 -o /home/accessions/b_e2x_errors/debug_mode05 -R /home/accessions/UA2023-0021/objects/OPD/ -F /home/scripts/be_regex/uaregex.txt</command_line>
  <uid>1000</uid>
  <username>rluser</username>
  <start_time>2023-03-15T17:53:45Z</start_time>
</execution_environment>

simsong commented 1 year ago

Hi. I'm trying to understand the error here. Is it that -J makes it use multiple threads, or is it that it hung?

Can you provide /home/accessions/UA2023-0021/objects/OPD/ and /home/scripts/be_regex/uaregex.txt?

laissezfarrell commented 1 year ago

Hi -

With -J, I expected bulk_extractor to use only the primary thread, not run in multi-threaded mode.

The regex file is here: https://github.com/laissezfarrell/rl-bitcurator-scripts/blob/master/be_regex/uaregex.txt

I'm not able to share the content files because of access restrictions. I can share some summary data about them (filetypes, sizes, aggregate size, etc), if that would be helpful, though I'm not sure that would be.

simsong commented 1 year ago

Okay. Let me check this out over the weekend. Thanks.

simsong commented 1 year ago

This will almost certainly require review in the be20_api.

simsong commented 1 year ago

Linked with https://github.com/simsong/be20_api/issues/92

simsong commented 1 year ago

Here is a command line that may exercise the bug:

src/bulk_extractor -S ssn_mode=1 -e outlook -x zip -x rar -x winpe -x exif -x pdf -J -d8 -o out1 -Z -F tests/patterns.txt tests/Images/nps-2010-emails.E01

Also:

src/bulk_extractor --notify_main_thread -S ssn_mode=1 -e outlook -x zip -x rar -x winpe -x exif -x pdf -J -d8 -o out1 -Z -F tests/patterns.txt tests/Images/nps-2010-emails.E01

jonstewart commented 1 year ago

What's the use case for -J/--no-threads? It's not obvious to me; I don't see the advantage of -J over -j 1 (where processing happens on one background thread while the main thread monitors). The disadvantage is the complexity of supporting a different codepath. If there's no use case, maybe this option could just be deleted?
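
To make the distinction concrete, here is a minimal Python sketch of the two execution shapes being compared: doing all work on the calling thread (the behavior expected from -J) versus handing work to a single background worker while the caller waits (the -j 1 shape described above). This is not bulk_extractor's implementation; `scan`, `run_inline`, and `run_one_worker` are hypothetical names used only for illustration.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def scan(page):
    # Hypothetical stand-in for a scanner: report which thread did the work.
    return (page, threading.current_thread().name)


def run_inline(pages):
    # "-J"-style sketch: every page is processed on the calling thread.
    return [scan(p) for p in pages]


def run_one_worker(pages):
    # "-j 1"-style sketch: one background worker processes the pages while
    # the calling thread only submits work and waits for the results.
    with ThreadPoolExecutor(max_workers=1, thread_name_prefix="worker") as pool:
        return list(pool.map(scan, pages))


if __name__ == "__main__":
    pages = ["page0", "page1", "page2"]
    print(run_inline(pages))
    print(run_one_worker(pages))
```

In the inline variant, a breakpoint or an exception lands directly in the caller's stack; in the one-worker variant, the work still runs sequentially, but on a different thread from the one that started it.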

simsong commented 1 year ago

It's very useful for debugging.

simsong commented 6 months ago

Fixed. -J now works properly:

simsong@Simsons-MacBook-Pro bulk_extractor % src/bulk_extractor -1 -S ssn_mode=1 -e outlook -x zip -x rar -x winpe -x exif -x pdf -J -o out1 -R .
bulk_extractor version: 2.0.6
Input file: "."
Output directory: "out1"
Disk Size: 1577
Scanners: aes base64 elf evtx facebook find gzip httplogs json kml_carved msxml net ntfsindx ntfslogfile ntfsmft ntfsusn outlook sqlite utmp vcard_carved windirs winlnk winprefetch accts email gps
Threading Disabled
running single-threaded (DEBUG)...
20:21:47 Offset 0MB (0.00%) Done in n/a at 2024-01-15 20:21:46
20:21:48 Offset 0MB (2.09%) Done in  0:00:47 at 2024-01-15 20:22:35
20:21:49 Offset 0MB (2.09%) Done in  0:01:34 at 2024-01-15 20:23:23
20:21:50 Offset 0MB (2.09%) Done in  0:02:21 at 2024-01-15 20:24:11
20:21:51 Offset 0MB (2.09%) Done in  0:03:08 at 2024-01-15 20:24:59
...
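
The banner lines quoted in this thread give a quick way to tell which mode a run actually used. Below is a minimal Python sketch, assuming the console messages appear exactly as quoted here ("going multi-threaded", "Threading Disabled", "running single-threaded"); other versions may word them differently, and `threading_mode` is a hypothetical helper, not part of bulk_extractor.

```python
def threading_mode(output: str) -> str:
    """Classify a run from its captured console output.

    Assumes the banner strings quoted in this issue thread.
    """
    if "going multi-threaded" in output:
        return "multi"
    if "Threading Disabled" in output or "running single-threaded" in output:
        return "single"
    return "unknown"


if __name__ == "__main__":
    before_fix = 'going multi-threaded (24)\nAll data read; waiting for threads to finish...'
    after_fix = "Threading Disabled\nrunning single-threaded (DEBUG)..."
    print(threading_mode(before_fix))
    print(threading_mode(after_fix))
```

A check like this could be run over captured output from a -J invocation to confirm that the run stayed single-threaded.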