simsong / bulk_extractor

This is the development tree. Production downloads are at:
https://github.com/simsong/bulk_extractor/releases

Address 2.0 performance issue #321

Closed · simsong closed this issue 2 years ago

simsong commented 2 years ago

BE2.0 is now highly reliable and compiles on AMD. Now it is time to address performance issues.

Something about the handling of data is much slower on 2.0. It may be all of the moderated sbuf access, and if so, we will need a way to disable all of the runtime checks.

Reading just the ubnist1 disk image with -x all, BE1.6 reads in 4.2 seconds on my macMini and BE2.0 reads in 4.0 seconds.

Running with -E json on 6 cores: BE1.6: 5.5, 5.5, 5.4 seconds; BE2.0: 9.0, 9.0, 9.0 seconds. Running with -E json -j1: BE1.6: 18.9, 18.7 seconds; BE2.0: 42, 42 seconds.

So we have an easy test case with a scanner that's easy to work with. This should be relatively straightforward to debug. @jonstewart if you have any suggestions, please let me know. Otherwise I'll probably follow this approach:

jonstewart commented 2 years ago

Two things:

Desperately hoping for schools to reopen soon so I can assist…

On Jan 5, 2022, at 8:29 AM, Simson L. Garfinkel wrote:

  • Add a DEBUG flag for scan_json to have it just return; to have it compute the histogram with the sbuf machinery and return; and to have it access every byte with the sbuf machinery and return.
  • Create a performance benchmark script that can reliably benchmark two executables and produce summary results.
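The DEBUG-flag idea could be sketched roughly as below; DebugMode, FakeSbuf, and scan_json_debug are illustrative names invented for this sketch, not BE2's actual API:

```cpp
// Hypothetical sketch of the proposed scan_json DEBUG modes; DebugMode,
// FakeSbuf, and scan_json_debug are illustrative names, not BE2's API.
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

enum class DebugMode {
    ReturnOnly,   // return immediately: isolates framework overhead
    Histogram,    // compute a byte histogram through the sbuf accessor
    TouchBytes    // read every byte through the sbuf accessor
};

// Stand-in for sbuf: a byte buffer with a bounds-checked accessor.
struct FakeSbuf {
    std::vector<uint8_t> buf;
    size_t bufsize() const { return buf.size(); }
    uint8_t operator[](size_t i) const { assert(i < buf.size()); return buf[i]; }
};

// Returns the number of bytes read, so each mode's cost is visible.
size_t scan_json_debug(const FakeSbuf& sbuf, DebugMode mode) {
    switch (mode) {
    case DebugMode::ReturnOnly:
        return 0;
    case DebugMode::Histogram: {
        size_t hist[256] = {0};
        for (size_t i = 0; i < sbuf.bufsize(); i++) hist[sbuf[i]]++;
        return sbuf.bufsize();
    }
    case DebugMode::TouchBytes: {
        volatile uint8_t sink = 0;
        for (size_t i = 0; i < sbuf.bufsize(); i++) sink = sbuf[i];
        (void)sink;
        return sbuf.bufsize();
    }
    }
    return 0;
}
```

Timing each mode against the same input would separate framework cost, accessor cost, and scanner-logic cost.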

simsong commented 2 years ago

Gprof is the #2 approach if simple code inspection and some #ifdef statements fail. The overhead is a factor of 2 and it shows up when running single-threaded, so I suspect it is one thing and easy to find.

I really suspect it is the super-safe sbuf system I created. Too many runtime checks… But now I can work on efficiency, given that I have correctness.
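If the runtime checks do turn out to be the cost, one way to make them removable is a compile-time switch. This is a minimal sketch assuming a hypothetical SBUF_NO_CHECKS macro and CheckedBuf class, not whatever BE2 actually adopts:

```cpp
// Minimal sketch of making sbuf-style runtime checks removable at
// compile time. SBUF_NO_CHECKS and CheckedBuf are hypothetical names,
// not the macro/class BE2 actually uses.
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

struct CheckedBuf {
    std::vector<uint8_t> data;
    uint8_t get(size_t i) const {
#ifndef SBUF_NO_CHECKS
        // the per-access branch suspected of slowing every read
        if (i >= data.size()) throw std::out_of_range("sbuf: read past end");
#endif
        return data[i];   // unchecked when built with -DSBUF_NO_CHECKS
    }
};
```

Release builds could then add -DSBUF_NO_CHECKS to CXXFLAGS while debug and CI builds keep the checks.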

simsong commented 2 years ago

Well look at this:

configure: *************************************
configure: *************************************
configure:   PACKAGE_NAME:     BULK_EXTRACTOR
configure:   PACKAGE_VERSION:  2.0.0-beta3
configure:   CC:               gcc
configure:   CXX:              g++ -std=c++17
configure:   CPPFLAGS:         -I/opt/local/include  -I/usr/local/include -DUTC_OFFSET=+0000
configure:   CFLAGS:           -g -g -O3 -fPIC
configure:   CXXFLAGS:         -g  -pthread -Wall -MD -Wpointer-arith -Wshadow -Wwrite-strings -Wcast-align -Wredundant-decls -Wdisabled-optimization -Wfloat-equal -Wmultichar -Wmissing-noreturn -Woverloaded-virtual -Wsign-promo -funit-at-a-time -Weffc++ -fPIC
configure:   LIBS:             -lewf -lstdc++ -lexpat -lz -ldl -ltermcap -lsqlite3 -lexpat
configure:   LDFLAGS:          -L/opt/local/lib  -L/usr/local/lib

Turns out that the rewritten autoconf file was not enabling the optimizer for C++ compiles: CXXFLAGS contains no -O flag.

I've now gone through scan_aes and discovered it was scanning to the end of the buffer rather than to the end of the page. I changed that, and it's going much faster now too.
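The shape of that fix can be sketched as follows. In bulk_extractor's sbuf convention the buffer is page plus margin, and a scanner should only start work within the page; scan_limit is an invented helper for illustration, not BE2 code:

```cpp
// Sketch of the scan_aes-style fix. The buffer is page + margin; a
// scanner should only START examining candidates within the page, while
// the margin exists so work begun near the page end can complete.
// scan_limit is an invented helper, not actual BE2 code.
#include <cassert>
#include <cstddef>

// Last position at which a scanner may begin examining a candidate of
// `need` bytes. The bug was the equivalent of using bufsize here: the
// margin got fully re-scanned even though the next page also covers it.
size_t scan_limit(size_t pagesize, size_t bufsize, size_t need) {
    size_t limit = pagesize;                  // fixed: was bufsize
    if (limit + need > bufsize)               // still never read past
        limit = bufsize - need;               // the allocated buffer
    return limit;
}
```

With a 16MiB margin on top of each page, scanning to the buffer end repeats a large fraction of the work on every page, which is consistent with the slowdown observed.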

jonstewart commented 2 years ago

🎉


simsong commented 2 years ago

Discovery #2: margin size got inadvertently increased from 4MiB to 16MiB:

(base) simsong@nimi debug % ~/gits/bulk_extractor-1.6/src/bulk_extractor -h | grep margin
   -g NN        - specify margin (default 4194304)
(base) simsong@nimi debug %

vs.

~/gits/bulk_extractor/src/bulk_extractor -j1 -1 -E email -o    23.51s user 0.44s system 104% cpu 22.889 total
(base) simsong@nimi ~ % ~/gits/bulk_extractor/src/bulk_extractor -h | grep margin
  -g, --marginsize arg          margin size in bytes (default: 16777216)
(base) simsong@nimi ~ %

And, get this --- the flex scanners don't respect the pagesize, and appear to scan to the end of the margin.

jonstewart commented 2 years ago

Hasn't that always been a problem with the flex scanners? They have to keep reading past the pagesize to account for any boundary hits, but there's not a great way of telling them to stop looking for new hits. That's a bit of cleverness in lightgrep of which I'm unduly proud: you can provide more input to lg_closeout_search() to resolve any pending hits from previous input, but it won't start new hits; once anything pending has been resolved, it exits. Typically that just needs a couple of bytes of input to get back to the zero state.

Is it possible to recognize transitioning to the zero state with flex, so you could throw an exception early and not have to wait/hope for a full matching hit in the margin?
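The closeout behavior described above can be illustrated with a toy automaton. This is not lightgrep's API, and a real engine must also avoid counting matches that begin in the margin, which this sketch glosses over:

```cpp
// Toy illustration of the closeout idea (not lightgrep's actual API):
// after the page ends, feed margin bytes only while a match is pending,
// and stop as soon as the automaton falls back to the start state.
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>

// Hand-built DFA for the literal "abc"; state = chars matched so far.
static int step(int s, char c) {
    if (c == 'a') return 1;
    if (s == 1 && c == 'b') return 2;
    if (s == 2 && c == 'c') return 3;   // full match
    return 0;
}

// Scan the page fully, then consume margin bytes only to resolve a
// pending partial match. Returns {matches found, margin bytes consumed}.
std::pair<size_t, size_t> scan(const std::string& data, size_t pagesize) {
    int s = 0;
    size_t matches = 0, i = 0;
    for (; i < pagesize && i < data.size(); i++) {
        s = step(s, data[i]);
        if (s == 3) { matches++; s = 0; }
    }
    size_t margin_used = 0;
    for (; s != 0 && i < data.size(); i++, margin_used++) {
        s = step(s, data[i]);
        if (s == 3) { matches++; s = 0; }
    }
    return {matches, margin_used};
}
```

The second loop is the closeout: it exits at the first return to state zero, so only a few margin bytes are typically touched.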

On Wed, Jan 5, 2022 at 10:27 PM, Simson L. Garfinkel wrote:

  • Modify the flex scanners so that if a match starts inside the margin, it throws an exception that ends the flex scanner.
  • Update the BE2 codebase so that the default margin is 4MiB, not 16MiB.

simsong commented 2 years ago

I just found a way to tell them to stop when the regular expression scanner crosses into the margin. You'll like it.
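A sketch of the check-and-throw mechanism, with a simple tokenizer standing in for a flex-generated lexer; ScanAbort, check_margin, and count_tokens are hypothetical names invented for this illustration:

```cpp
// Sketch of "stop when the scanner crosses into the margin": each rule
// action checks whether the match START lies past the page boundary and,
// if so, aborts the scan. Names here are hypothetical, not BE2's.
#include <cassert>
#include <cstddef>
#include <string>

struct ScanAbort {};   // thrown to unwind out of the generated lexer

inline void check_margin(size_t pos, size_t pagesize) {
    if (pos >= pagesize) throw ScanAbort{};   // match starts in margin: stop
}

// Count space-separated tokens, stopping once one STARTS in the margin.
size_t count_tokens(const std::string& data, size_t pagesize) {
    size_t count = 0, i = 0;
    try {
        while (i < data.size()) {
            if (data[i] == ' ') { i++; continue; }
            check_margin(i, pagesize);        // the new early-exit check
            while (i < data.size() && data[i] != ' ') i++;
            count++;                          // a token may still END in the margin
        }
    } catch (const ScanAbort&) {
        // scanning stopped at the margin boundary
    }
    return count;
}
```

Note that matches beginning inside the page are still allowed to run into the margin; only matches that begin past the page boundary are suppressed.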

simsong commented 2 years ago

Before implementing new margin hack (and before shortening the margin from 16MiB to 4MiB):

(base) simsong@nimi ~ % time ~/gits/bulk_extractor/src/bulk_extractor -j1 -1 -E email -o ~/out/out-be20-xall-$(python -c 'import time;print(int(time.time()))') ~/corp/nps-2009-ubnist1/ubnist1.gen3.raw >/dev/null
~/gits/bulk_extractor/src/bulk_extractor -j1 -1 -E email -o   > /dev/null  138.56s user 3.20s system 102% cpu 2:18.54 total

After:

(base) simsong@nimi ~ % time ~/gits/bulk_extractor/src/bulk_extractor -j1 -1 -E email -o ~/out/out-be20-xall-$(python -c 'import time;print(int(time.time()))') ~/corp/nps-2009-ubnist1/ubnist1.gen3.raw
~/gits/bulk_extractor/src/bulk_extractor -j1 -1 -E email -o    80.65s user 2.69s system 103% cpu 1:20.44 total

Improvement from 138.56s to 80.65s of user time. It's still not as good as BE1.6, probably because of the bigger margin size....

Check out 4007d30

simsong commented 2 years ago

Now with the margin lowered from 16MiB to 4MiB:

~/gits/bulk_extractor/src/bulk_extractor -j1 -1 -E email -o    76.66s user 2.25s system 103% cpu 1:16.38 total

**LOOK AT THAT** -- it's even faster than version 1.6. About time. That's the exception logic. Now to add it to scan_accts.

simsong commented 2 years ago

Ick. The exception causes values allocated by the lexer not to be freed.

https://github.com/simsong/bulk_extractor/runs/4723066087?check_suite_focus=true

Looks like I'll need to just move the scan point to the end instead.
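The leak-free alternative can be sketched with a simple tokenizer standing in for the flex lexer: instead of throwing, the rule action advances the read position to end-of-input, so the lexer hits EOF and winds down normally, freeing its state. The function name is invented for illustration:

```cpp
// Sketch of the leak-free variant: rather than throwing out of the lexer
// (which skips its cleanup of allocated values), move the scan point to
// the end of input so the lexer terminates and cleans up normally.
// count_tokens_no_throw is an illustrative name, not BE2 code.
#include <cassert>
#include <cstddef>
#include <string>

size_t count_tokens_no_throw(const std::string& data, size_t pagesize) {
    size_t count = 0, i = 0;
    while (i < data.size()) {
        if (data[i] == ' ') { i++; continue; }
        if (i >= pagesize) {
            i = data.size();   // jump the scan point to EOF: the lexer
            continue;          // sees end-of-input and exits cleanly
        }
        while (i < data.size() && data[i] != ' ') i++;
        count++;
    }
    return count;
}
```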

simsong commented 2 years ago

I've run many performance tests now with BE16 vs. BE20beta4 (which includes the flex improvements noted above).

The table below was all generated from the domexusers disk image. All runs use -j1, meaning one thread reading data and one thread processing, which allows very good measurements of time. Each run enables a single scanner. The first row, -x all, has no scanners enabled; it shows that the underlying infrastructure is 2x faster in BE20b4. However, it's also clear that some of the scanners are much slower. Really, only three scanners: the net scanner and two of the flex-based scanners (but not the other two?).

| scanner | BE16 (s) | BE20b4 (s) |
|---------|---------:|-----------:|
| none (-x all) | 301 | 141 |
| json | 307 | 274 |
| aes | 375 | 341 |
| gzip | 294 | 276 |
| base64 | 301 | 267 |
| zip | 298 | 275 |
| gps | 239 | 239 |
| accts | 431 | 1510 |
| net | 461 | 1030 |
| email | 829 | 1882 |

With the recursive analysis, the extra time spent in the big scanners (accts, net, email) is multiplied several times over.

So I'm nearly there. The new infrastructure is amazing, and by throwing a lot of cores at the thing, BE2 is already faster than BE1. But once the net, email, and accts scanners get faster, BE2 will be much faster than BE1. So I'm going to close this thread and open separate threads for net, accts, and email.