simsong / bulk_extractor

This is the development tree. Production downloads are at:
https://github.com/simsong/bulk_extractor/releases

bulk_extractor hangs with -F in std::regex #405

Closed laissezfarrell closed 8 months ago

laissezfarrell commented 1 year ago

Original Report

Running bulk_extractor 2.02 with this command: bulk_extractor -S ssn_mode=1 -e outlook -x zip -x rar -x winpe -x exif --no_threads -o /home/accessions/b_e2x_errors/debug_mode02 -R /home/accessions/UA2023-0021/objects/OPD/ -F /home/scripts/be_regex/uaregex.txt

I ran bulk_extractor in multi-threaded mode. After bulk_extractor reported "All data read; waiting for threads to finish...", the process hung for more than three days until I killed it manually.

simsong commented 1 year ago

Thank you so much for this. It's clear that there is a hang on the 2.0 multithreaded dispatch system. Does this happen reliably? Can you give me the data that makes this happen?

What happens when you type ^C and then restart the program?

laissezfarrell commented 1 year ago

This happened reliably with that set of files. Killing the process and restarting resulted in the same behavior.

With other sets of files, bulk_extractor also reported "All data read; waiting for threads to finish...", but only for a while before, presumably, the threads finished and the application completed its run.

laissezfarrell commented 1 year ago

If helpful, I can provide both hung/aborted bulk_extractor 2.0 report XML files and successfully completed bulk_extractor 1.5 report XML files reporting on the same set of files.

simsong commented 1 year ago

Oh, I’m pretty sure the issue is with the multi-threading code. I’ll get you a temp release to try, once I have a chance to work on this.

simsong commented 1 year ago

This seems to be a problem in the find scanner...

command line:

% src/bulk_extractor --notify_main_thread -S ssn_mode=1 -e outlook -x zip -x rar -x winpe -x exif -x pdf -J -d8 -o out1 -Z  -F tests/patterns.txt tests/Images/nps-2010-emails.E01
opening tests/Images/nps-2010-emails.E01

bulk_extractor version: 2.0.3
Input file: "tests/Images/nps-2010-emails.E01"
Output directory: "out1"
Disk Size: 10485760
Scanners: aes base64 elf evtx facebook find gzip httplogs json kml_carved msxml net ntfsindx ntfslogfile ntfsmft ntfsusn outlook sqlite utmp vcard_carved windirs winlnk winprefetch accts email gps
Threading Disabled
running single-threaded (DEBUG)...

Then I typed ^T a few times:

load: 5.37  cmd: bulk_extractor 68295 running 88.72u 18.47s
load: 4.40  cmd: bulk_extractor 68295 running 23046.41u 8449.16s
load: 4.13  cmd: bulk_extractor 68295 running 23048.93u 8450.26s

So I attached in another window:

(base) simsong@Seasons src % lldb bulk_extractor      (402-fix-J)bulk_extractor
(lldb) target create "bulk_extractor"
Current executable set to '/Users/simsong/gits/bulk_extractor/src/bulk_extractor' (arm64).
(lldb) attach 68295
Process 68295 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
    frame #0: 0x0000000188e4fd88 libsystem_kernel.dylib`_kernelrpc_mach_vm_deallocate_trap + 8
libsystem_kernel.dylib`:
->  0x188e4fd88 <+8>: ret

libsystem_kernel.dylib`task_dyld_process_info_notify_get:
    0x188e4fd8c <+0>: mov    x16, #-0xd
    0x188e4fd90 <+4>: svc    #0x80
    0x188e4fd94 <+8>: ret
Target 0: (bulk_extractor) stopped.
(lldb) where
error: 'where' is not a valid command.
(lldb) bt
* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
  * frame #0: 0x0000000188e4fd88 libsystem_kernel.dylib`_kernelrpc_mach_vm_deallocate_trap + 8
    frame #1: 0x0000000188e519dc libsystem_kernel.dylib`mach_vm_deallocate + 88
    frame #2: 0x0000000188cbdfdc libsystem_malloc.dylib`mvm_deallocate_pages + 144
    frame #3: 0x0000000188cbc4ec libsystem_malloc.dylib`free_large + 416
    frame #4: 0x0000000188cc667c libsystem_malloc.dylib`_szone_free + 720
    frame #5: 0x00000001022fbb88 bulk_extractor`bool std::__1::basic_regex<char, std::__1::regex_traits<char> >::__match_at_start_ecma<std::__1::allocator<std::__1::sub_match<char const*> > >(char const*, char const*, std::__1::match_results<char const*, std::__1::allocator<std::__1::sub_match<char const*> > >&, std::__1::regex_constants::match_flag_type, bool) const [inlined] void std::__1::__libcpp_operator_delete[abi:v15006]<void*>(__args=<unavailable>) at new:256:3 [opt]
    frame #6: 0x00000001022fbb84 bulk_extractor`bool std::__1::basic_regex<char, std::__1::regex_traits<char> >::__match_at_start_ecma<std::__1::allocator<std::__1::sub_match<char const*> > >(char const*, char const*, std::__1::match_results<char const*, std::__1::allocator<std::__1::sub_match<char const*> > >&, std::__1::regex_constants::match_flag_type, bool) const [inlined] void std::__1::__do_deallocate_handle_size[abi:v15006]<>(__ptr=<unavailable>, __size=<unavailable>) at new:280:10 [opt]
    frame #7: 0x00000001022fbb84 bulk_extractor`bool std::__1::basic_regex<char, std::__1::regex_traits<char> >::__match_at_start_ecma<std::__1::allocator<std::__1::sub_match<char const*> > >(char const*, char const*, std::__1::match_results<char const*, std::__1::allocator<std::__1::sub_match<char const*> > >&, std::__1::regex_constants::match_flag_type, bool) const [inlined] std::__1::__libcpp_deallocate[abi:v15006](__ptr=<unavailable>, __size=<unavailable>, __align=8) at new:296:14 [opt]
    frame #8: 0x00000001022fbb84 bulk_extractor`bool std::__1::basic_regex<char, std::__1::regex_traits<char> >::__match_at_start_ecma<std::__1::allocator<std::__1::sub_match<char const*> > >(char const*, char const*, std::__1::match_results<char const*, std::__1::allocator<std::__1::sub_match<char const*> > >&, std::__1::regex_constants::match_flag_type, bool) const [inlined] std::__1::allocator<std::__1::__state<char> >::deallocate[abi:v15006](this=0x000000016db17730, __p=<unavailable>, __n=<unavailable>) at allocator.h:128:13 [opt]
    frame #9: 0x00000001022fbb84 bulk_extractor`bool std::__1::basic_regex<char, std::__1::regex_traits<char> >::__match_at_start_ecma<std::__1::allocator<std::__1::sub_match<char const*> > >(char const*, char const*, std::__1::match_results<char const*, std::__1::allocator<std::__1::sub_match<char const*> > >&, std::__1::regex_constants::match_flag_type, bool) const [inlined] std::__1::allocator_traits<std::__1::allocator<std::__1::__state<char> > >::deallocate[abi:v15006](__a=0x000000016db17730, __p=<unavailable>, __n=<unavailable>) at allocator_traits.h:282:13 [opt]
    frame #10: 0x00000001022fbb84 bulk_extractor`bool std::__1::basic_regex<char, std::__1::regex_traits<char> >::__match_at_start_ecma<std::__1::allocator<std::__1::sub_match<char const*> > >(char const*, char const*, std::__1::match_results<char const*, std::__1::allocator<std::__1::sub_match<char const*> > >&, std::__1::regex_constants::match_flag_type, bool) const [inlined] std::__1::vector<std::__1::__state<char>, std::__1::allocator<std::__1::__state<char> > >::~vector[abi:v15006](this=0x000000016db17720 size=0) at vector:437:9 [opt]
    frame #11: 0x00000001022fbb34 bulk_extractor`bool std::__1::basic_regex<char, std::__1::regex_traits<char> >::__match_at_start_ecma<std::__1::allocator<std::__1::sub_match<char const*> > >(char const*, char const*, std::__1::match_results<char const*, std::__1::allocator<std::__1::sub_match<char const*> > >&, std::__1::regex_constants::match_flag_type, bool) const [inlined] std::__1::vector<std::__1::__state<char>, std::__1::allocator<std::__1::__state<char> > >::~vector[abi:v15006](this=0x000000016db17720 size=0) at vector:430:5 [opt]
    frame #12: 0x00000001022fbb34 bulk_extractor`bool std::__1::basic_regex<char, std::__1::regex_traits<char> >::__match_at_start_ecma<std::__1::allocator<std::__1::sub_match<char const*> > >(this=0x0000600003e44000, __firstlast="", __m=0x000000016db17860, __flags=<unavailable>, __at_first=<unavailable>) const at regex:5850:1 [opt]
    frame #13: 0x0000000102301f0c bulk_extractor`bool std::__1::basic_regex<char, std::__1::regex_traits<char> >::__search<std::__1::allocator<std::__1::sub_match<char const*> > >(char const*, char const*, std::__1::match_results<char const*, std::__1::allocator<std::__1::sub_match<char const*> > >&, std::__1::regex_constants::match_flag_type) const [inlined] bool std::__1::basic_regex<char, std::__1::regex_traits<char> >::__match_at_start<std::__1::allocator<std::__1::sub_match<char const*> > >(this=0x0000600003e44000, __firstlast="", __m=0x000000016db17860, __flags=match_prev_avail, __at_first=false) const at regex:6056:16 [opt]
    frame #14: 0x0000000102301ef0 bulk_extractor`bool std::__1::basic_regex<char, std::__1::regex_traits<char> >::__search<std::__1::allocator<std::__1::sub_match<char const*> > >(this=0x0000600003e44000, __firstlast="", __m=0x000000016db17860, __flags=match_prev_avail) const at regex:6090:17 [opt]
    frame #15: 0x0000000102307c74 bulk_extractor`regex_vector::search_all(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >*, unsigned long*, unsigned long*) const [inlined] bool std::__1::regex_search[abi:v15006]<std::__1::char_traits<char>, std::__1::allocator<char>, std::__1::allocator<std::__1::sub_match<std::__1::__wrap_iter<char const*> > >, char, std::__1::regex_traits<char> >(__s="gD\xc9\xf3\x86\xa9f\x9e\x9e\xf5\x8c[Ź\xc99\xb1\xb7in\x9eN\xfb\U00000003\xb7\U00000006\xfb\x86\x8d9\U0000001d]9\xf5\x8c[ŹɞN\xb1\xb7i9\U0000001d\x869\xf5\x8c[Ź\xc99\xf5\x8c;\xc5 Dgɞxdd.\x85GGGGGG\xe6\xe44绤\x8fGѝ|\xf0\x945\xa04\x86\xcb\xe74n;\x879&IF8\xe0y<\ay\xae\xb7\vgi\x8a\xea\f#=\x8a\xe1X_\xfe\xdb@\U00000011\xed\U0000001b\U0000001a=\xd6\U0000007f\xa8\xbd\xa8\b\xc5~I\U00000012\xe0*sf\xf1GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG\x87Md\xe6GG\xec\xec\xe6G\U00000016N_\xebGG\xf1GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG"..., __m=0x000000016db177f0, __e=0x0000600003e44000, __flags=match_default) at regex:6210:20 [opt]
    frame #16: 0x0000000102307c28 bulk_extractor`regex_vector::search_all(this=<unavailable>, probe="gD\xc9\xf3\x86\xa9f\x9e\x9e\xf5\x8c[Ź\xc99\xb1\xb7in\x9eN\xfb\U00000003\xb7\U00000006\xfb\x86\x8d9\U0000001d]9\xf5\x8c[ŹɞN\xb1\xb7i9\U0000001d\x869\xf5\x8c[Ź\xc99\xf5\x8c;\xc5 Dgɞxdd.\x85GGGGGG\xe6\xe44绤\x8fGѝ|\xf0\x945\xa04\x86\xcb\xe74n;\x879&IF8\xe0y<\ay\xae\xb7\vgi\x8a\xea\f#=\x8a\xe1X_\xfe\xdb@\U00000011\xed\U0000001b\U0000001a=\xd6\U0000007f\xa8\xbd\xa8\b\xc5~I\U00000012\xe0*sf\xf1GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG\x87Md\xe6GG\xec\xec\xe6G\U00000016N_\xebGG\xf1GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG"..., found="", offset=0x000000016db17bb0, len=0x000000016db17a78) const at regex_vector.cpp:31:9 [opt]
    frame #17: 0x000000010236a934 bulk_extractor`::scan_find(sp=0x000000016db180a8) at scan_find.cpp:95:28 [opt]
    frame #18: 0x0000000102319b30 bulk_extractor`scanner_set::process_sbuf(this=0x000000016db1a800, sbufp=0x000000011df08f30, scanner=(bulk_extractor`::scan_find(scanner_params &) at scan_find.cpp:48))(scanner_params&)) at scanner_set.cpp:873:9 [opt]
    frame #19: 0x0000000102316e24 bulk_extractor`scanner_set::process_sbuf(this=0x000000016db1a800, sbufp=0x000000011df08f30) at scanner_set.cpp:1022:13 [opt]
    frame #20: 0x00000001023160d0 bulk_extractor`scanner_set::schedule_sbuf(this=0x000000016db1a800, sbufp=0x000000011df08f30) at scanner_set.cpp:684:9 [opt]
    frame #21: 0x000000010237aa30 bulk_extractor`::scan_outlook(sp=0x000000016db18708) at scan_outlook.cpp:83:16 [opt]
    frame #22: 0x0000000102319b30 bulk_extractor`scanner_set::process_sbuf(this=0x000000016db1a800, sbufp=0x000000011df088f0, scanner=(bulk_extractor`::scan_outlook(scanner_params &) at scan_outlook.cpp:61))(scanner_params&)) at scanner_set.cpp:873:9 [opt]
    frame #23: 0x0000000102316e24 bulk_extractor`scanner_set::process_sbuf(this=0x000000016db1a800, sbufp=0x000000011df088f0) at scanner_set.cpp:1022:13 [opt]
    frame #24: 0x00000001023160d0 bulk_extractor`scanner_set::schedule_sbuf(this=0x000000016db1a800, sbufp=0x000000011df088f0) at scanner_set.cpp:684:9 [opt]
    frame #25: 0x000000010234e6ec bulk_extractor`Phase1::read_process_sbufs(this=0x000000016db18e50) at phase1.cpp:207:24 [opt]
    frame #26: 0x000000010234faf0 bulk_extractor`Phase1::phase1_run(this=0x000000016db18e50) at phase1.cpp:290:5 [opt]
    frame #27: 0x0000000102330404 bulk_extractor`bulk_extractor_main(cout=<unavailable>, cerr=<unavailable>, argc=<unavailable>, argv=<unavailable>) at bulk_extractor.cpp:608:16 [opt]
    frame #28: 0x0000000188b37f28 dyld`start + 2236
(lldb)

simsong commented 1 year ago

Run a bit more and see where we get:

(lldb) cont
Process 68295 resuming
(lldb)
error: Process is running.  Use 'process interrupt' to pause execution.
(lldb) bt
error: Command requires a process which is currently stopped.
(lldb) process interrupt
Process 68295 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
    frame #0: 0x0000000188ccc460 libsystem_malloc.dylib`_nanov2_free + 284
libsystem_malloc.dylib`:
->  0x188ccc460 <+284>: cmp    w9, #0x7fc
    0x188ccc464 <+288>: b.eq   0x188ccc494               ; <+336>
    0x188ccc468 <+292>: b      0x188ccc478               ; <+308>
    0x188ccc46c <+296>: sub    w9, w9, #0x7fe
Target 0: (bulk_extractor) stopped.
(lldb) bt
* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
  * frame #0: 0x0000000188ccc460 libsystem_malloc.dylib`_nanov2_free + 284
    frame #1: 0x00000001022fba3c bulk_extractor`bool std::__1::basic_regex<char, std::__1::regex_traits<char> >::__match_at_start_ecma<std::__1::allocator<std::__1::sub_match<char const*> > >(this=0x0000600003e44000, __firstlast="", __m=0x000000016db17860, __flags=match_prev_avail, __at_first=<unavailable>) const at regex:0 [opt]
    frame #2: 0x0000000102301f0c bulk_extractor`bool std::__1::basic_regex<char, std::__1::regex_traits<char> >::__search<std::__1::allocator<std::__1::sub_match<char const*> > >(char const*, char const*, std::__1::match_results<char const*, std::__1::allocator<std::__1::sub_match<char const*> > >&, std::__1::regex_constants::match_flag_type) const [inlined] bool std::__1::basic_regex<char, std::__1::regex_traits<char> >::__match_at_start<std::__1::allocator<std::__1::sub_match<char const*> > >(this=0x0000600003e44000, __firstlast="", __m=0x000000016db17860, __flags=match_prev_avail, __at_first=false) const at regex:6056:16 [opt]
    frame #3: 0x0000000102301ef0 bulk_extractor`bool std::__1::basic_regex<char, std::__1::regex_traits<char> >::__search<std::__1::allocator<std::__1::sub_match<char const*> > >(this=0x0000600003e44000, __firstlast="", __m=0x000000016db17860, __flags=match_prev_avail) const at regex:6090:17 [opt]
    frame #4: 0x0000000102307c74 bulk_extractor`regex_vector::search_all(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> >*, unsigned long*, unsigned long*) const [inlined] bool std::__1::regex_search[abi:v15006]<std::__1::char_traits<char>, std::__1::allocator<char>, std::__1::allocator<std::__1::sub_match<std::__1::__wrap_iter<char const*> > >, char, std::__1::regex_traits<char> >(__s="gD\xc9\xf3\x86\xa9f\x9e\x9e\xf5\x8c[Ź\xc99\xb1\xb7in\x9eN\xfb\U00000003\xb7\U00000006\xfb\x86\x8d9\U0000001d]9\xf5\x8c[ŹɞN\xb1\xb7i9\U0000001d\x869\xf5\x8c[Ź\xc99\xf5\x8c;\xc5 Dgɞxdd.\x85GGGGGG\xe6\xe44绤\x8fGѝ|\xf0\x945\xa04\x86\xcb\xe74n;\x879&IF8\xe0y<\ay\xae\xb7\vgi\x8a\xea\f#=\x8a\xe1X_\xfe\xdb@\U00000011\xed\U0000001b\U0000001a=\xd6\U0000007f\xa8\xbd\xa8\b\xc5~I\U00000012\xe0*sf\xf1GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG\x87Md\xe6GG\xec\xec\xe6G\U00000016N_\xebGG\xf1GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG"..., __m=0x000000016db177f0, __e=0x0000600003e44000, __flags=match_default) at regex:6210:20 [opt]
    frame #5: 0x0000000102307c28 bulk_extractor`regex_vector::search_all(this=<unavailable>, probe="gD\xc9\xf3\x86\xa9f\x9e\x9e\xf5\x8c[Ź\xc99\xb1\xb7in\x9eN\xfb\U00000003\xb7\U00000006\xfb\x86\x8d9\U0000001d]9\xf5\x8c[ŹɞN\xb1\xb7i9\U0000001d\x869\xf5\x8c[Ź\xc99\xf5\x8c;\xc5 Dgɞxdd.\x85GGGGGG\xe6\xe44绤\x8fGѝ|\xf0\x945\xa04\x86\xcb\xe74n;\x879&IF8\xe0y<\ay\xae\xb7\vgi\x8a\xea\f#=\x8a\xe1X_\xfe\xdb@\U00000011\xed\U0000001b\U0000001a=\xd6\U0000007f\xa8\xbd\xa8\b\xc5~I\U00000012\xe0*sf\xf1GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG\x87Md\xe6GG\xec\xec\xe6G\U00000016N_\xebGG\xf1GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG"..., found="", offset=0x000000016db17bb0, len=0x000000016db17a78) const at regex_vector.cpp:31:9 [opt]
    frame #6: 0x000000010236a934 bulk_extractor`::scan_find(sp=0x000000016db180a8) at scan_find.cpp:95:28 [opt]
    frame #7: 0x0000000102319b30 bulk_extractor`scanner_set::process_sbuf(this=0x000000016db1a800, sbufp=0x000000011df08f30, scanner=(bulk_extractor`::scan_find(scanner_params &) at scan_find.cpp:48))(scanner_params&)) at scanner_set.cpp:873:9 [opt]
    frame #8: 0x0000000102316e24 bulk_extractor`scanner_set::process_sbuf(this=0x000000016db1a800, sbufp=0x000000011df08f30) at scanner_set.cpp:1022:13 [opt]
    frame #9: 0x00000001023160d0 bulk_extractor`scanner_set::schedule_sbuf(this=0x000000016db1a800, sbufp=0x000000011df08f30) at scanner_set.cpp:684:9 [opt]
    frame #10: 0x000000010237aa30 bulk_extractor`::scan_outlook(sp=0x000000016db18708) at scan_outlook.cpp:83:16 [opt]
    frame #11: 0x0000000102319b30 bulk_extractor`scanner_set::process_sbuf(this=0x000000016db1a800, sbufp=0x000000011df088f0, scanner=(bulk_extractor`::scan_outlook(scanner_params &) at scan_outlook.cpp:61))(scanner_params&)) at scanner_set.cpp:873:9 [opt]
    frame #12: 0x0000000102316e24 bulk_extractor`scanner_set::process_sbuf(this=0x000000016db1a800, sbufp=0x000000011df088f0) at scanner_set.cpp:1022:13 [opt]
    frame #13: 0x00000001023160d0 bulk_extractor`scanner_set::schedule_sbuf(this=0x000000016db1a800, sbufp=0x000000011df088f0) at scanner_set.cpp:684:9 [opt]
    frame #14: 0x000000010234e6ec bulk_extractor`Phase1::read_process_sbufs(this=0x000000016db18e50) at phase1.cpp:207:24 [opt]
    frame #15: 0x000000010234faf0 bulk_extractor`Phase1::phase1_run(this=0x000000016db18e50) at phase1.cpp:290:5 [opt]
    frame #16: 0x0000000102330404 bulk_extractor`bulk_extractor_main(cout=<unavailable>, cerr=<unavailable>, argc=<unavailable>, argv=<unavailable>) at bulk_extractor.cpp:608:16 [opt]
    frame #17: 0x0000000188b37f28 dyld`start + 2236
(lldb)

simsong commented 1 year ago

It turns out that -S ssn_mode=1 isn't necessary to reproduce the hang, but -e outlook is.

simsong commented 8 months ago

As of 7e2f14c814e86dd6ad30d7156055c2c3390d0cff, this command line no longer causes a hang:

$ src/bulk_extractor --notify_main_thread -S ssn_mode=1 -e outlook -x zip -x rar -x winpe -x exif -x pdf -J -d8 -o out1 -Z -F tests/patterns.txt tests/Images/nps-2010-emails.E01

simsong commented 8 months ago

I've now determined that this is a STL regex issue. Other people have encountered it.

simsong commented 8 months ago

So I think that I need to come up with a new approach for scan_find so that it doesn't send megabytes or gigabytes to std::regex_search.

It's going to have to be a lot smarter.
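One way to avoid handing a whole buffer to std::regex_search is to search it in bounded, overlapping windows. The sketch below is only an illustration of that idea, not the change that went into bulk_extractor: chunked_search, window, and overlap are hypothetical names, and it assumes no match is longer than overlap bytes. Each window reports only the matches that start in its non-overlapping part, so a match straddling a boundary is picked up by the next window with full context.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <regex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper (not bulk_extractor code): run the regex over bounded
// windows so no single search sees more than `window` bytes.
// Assumption: no match is longer than `overlap` bytes. Longer matches that
// cross a window boundary may be missed.
std::vector<std::pair<std::size_t, std::string>>
chunked_search(const std::string& data, const std::regex& re,
               std::size_t window = 4096, std::size_t overlap = 256)
{
    assert(window > overlap);
    std::vector<std::pair<std::size_t, std::string>> hits;
    const std::size_t step = window - overlap;
    std::size_t next_allowed = 0; // end of the last reported match

    for (std::size_t base = 0; base < data.size(); base += step) {
        const std::size_t end = std::min(base + window, data.size());
        const bool is_last = (end == data.size());
        // A window "owns" matches that start before its overlap tail;
        // the final window owns everything that is left.
        const std::size_t own_end = is_last ? end : base + step;

        // Tell the engine that a character precedes this window, so that
        // ^ and \b are not treated as if the window began the input.
        const auto flags = (base > 0) ? std::regex_constants::match_prev_avail
                                      : std::regex_constants::match_default;

        const auto first = data.cbegin() + static_cast<std::ptrdiff_t>(base);
        const auto last = data.cbegin() + static_cast<std::ptrdiff_t>(end);
        for (std::sregex_iterator it(first, last, re, flags), done;
             it != done; ++it) {
            const std::size_t pos =
                base + static_cast<std::size_t>(it->position(0));
            if (pos >= own_end) break;        // leave it to the next window
            if (pos < next_allowed) continue; // inside a reported match
            hits.emplace_back(pos, it->str(0));
            next_allowed = pos + static_cast<std::size_t>(it->length(0));
        }
        if (is_last) break;
    }
    return hits;
}
```

The trade-off is that the window and overlap sizes become tuning parameters: the overlap has to be at least as long as the longest match the patterns can produce, and each overlap region is scanned twice.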

jonstewart commented 8 months ago

Oh! Is scan_find using C++’s std::regex in 2.0? It is notorious for poor performance. Can you describe the issue in more depth for me?

We really aren’t too far away from a new lightgrep release in support of the lightgrep PR. Now that I’m through this weekend, I have crossed many things off my to-do list and have time to think and work.

Jon

simsong commented 8 months ago

My plan is to replace std::regex with RE2, for people who can't get lightgrep to work.

simsong commented 8 months ago

std::regex has been replaced with RE2, and BE no longer hangs.