simsong / bulk_extractor

This is the development tree. Production downloads are at:
https://github.com/simsong/bulk_extractor/releases

Address 2.0 performance issue #321

Closed · simsong closed this issue 2 years ago

simsong commented 2 years ago

BE2.0 is now highly reliable and compiles on AMD. Now it is time to address performance issues.

Something about the handling of data is much slower on 2.0. It may be all of the moderated sbuf access, and if so, we will need a way to disable all of the runtime checks.

Reading just the ubnist1 disk image with -x all, BE1.6 reads in 4.2 seconds on my macMini and BE2.0 reads in 4.0 seconds.

Running with -E json on 6 cores: BE1.6: 5.5, 5.5, 5.4 seconds; BE2.0: 9.0, 9.0, 9.0 seconds. Running with -E json -j1: BE1.6: 18.9, 18.7 seconds; BE2.0: 42, 42 seconds.

So we have an easy test case with a scanner that's easy to work with. This should be relatively straightforward to debug. @jonstewart if you have any suggestions, please let me know. Otherwise I'll probably follow this approach:

jonstewart commented 2 years ago

Two things:

Desperately hoping for schools to reopen soon so I can assist…

On Jan 5, 2022, at 8:29 AM, Simson L. Garfinkel wrote:

  • Add a DEBUG flag for scan_json to have it just return; to have it compute the histogram with the sbuf machinery and return; and to have it access every byte with the sbuf machinery and return.
  • Create a performance benchmark script that can reliably benchmark two executables and produce summary results.
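The DEBUG-flag idea could be sketched roughly as below; DebugMode, FakeSbuf, and scan_json_debug are illustrative names invented for this sketch, not BE2's actual API:

```cpp
// Hypothetical sketch of the proposed scan_json DEBUG modes; DebugMode,
// FakeSbuf, and scan_json_debug are illustrative names, not BE2's API.
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

enum class DebugMode {
    ReturnOnly,   // return immediately: isolates framework overhead
    Histogram,    // compute a byte histogram through the sbuf accessor
    TouchBytes    // read every byte through the sbuf accessor
};

// Stand-in for sbuf: a byte buffer with a bounds-checked accessor.
struct FakeSbuf {
    std::vector<uint8_t> buf;
    size_t bufsize() const { return buf.size(); }
    uint8_t operator[](size_t i) const { assert(i < buf.size()); return buf[i]; }
};

// Returns the number of bytes read, so each mode's cost is visible.
size_t scan_json_debug(const FakeSbuf& sbuf, DebugMode mode) {
    switch (mode) {
    case DebugMode::ReturnOnly:
        return 0;
    case DebugMode::Histogram: {
        size_t hist[256] = {0};
        for (size_t i = 0; i < sbuf.bufsize(); i++) hist[sbuf[i]]++;
        return sbuf.bufsize();
    }
    case DebugMode::TouchBytes: {
        volatile uint8_t sink = 0;
        for (size_t i = 0; i < sbuf.bufsize(); i++) sink = sbuf[i];
        (void)sink;
        return sbuf.bufsize();
    }
    }
    return 0;
}
```

Timing each mode against the same input would separate framework cost, accessor cost, and scanner-logic cost.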

simsong commented 2 years ago

Gprof is the #2 approach if simple code inspection and some #ifdef statements fail. The overhead is a factor of 2 and it shows up when running single-threaded, so I suspect it is one thing and easy to find.

I really suspect it is the super-safe sbuf system I created. Too many runtime checks… But now I can work on efficiency, given that I have correctness.
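If the runtime checks do turn out to be the cost, one way to make them removable is a compile-time switch. This is a minimal sketch assuming a hypothetical SBUF_NO_CHECKS macro and CheckedBuf class, not whatever BE2 actually adopts:

```cpp
// Minimal sketch of making sbuf-style runtime checks removable at
// compile time. SBUF_NO_CHECKS and CheckedBuf are hypothetical names,
// not the macro/class BE2 actually uses.
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

struct CheckedBuf {
    std::vector<uint8_t> data;
    uint8_t get(size_t i) const {
#ifndef SBUF_NO_CHECKS
        // the per-access branch suspected of slowing every read
        if (i >= data.size()) throw std::out_of_range("sbuf: read past end");
#endif
        return data[i];   // unchecked when built with -DSBUF_NO_CHECKS
    }
};
```

Release builds could then add -DSBUF_NO_CHECKS to CXXFLAGS while debug and CI builds keep the checks.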

simsong commented 2 years ago

Well look at this:

configure: *************************************
configure: *************************************
configure:   PACKAGE_NAME:     BULK_EXTRACTOR
configure:   PACKAGE_VERSION:  2.0.0-beta3
configure:   CC:               gcc
configure:   CXX:              g++ -std=c++17
configure:   CPPFLAGS:         -I/opt/local/include  -I/usr/local/include -DUTC_OFFSET=+0000
configure:   CFLAGS:           -g -g -O3 -fPIC
configure:   CXXFLAGS:         -g  -pthread -Wall -MD -Wpointer-arith -Wshadow -Wwrite-strings -Wcast-align -Wredundant-decls -Wdisabled-optimization -Wfloat-equal -Wmultichar -Wmissing-noreturn -Woverloaded-virtual -Wsign-promo -funit-at-a-time -Weffc++ -fPIC
configure:   LIBS:             -lewf -lstdc++ -lexpat -lz -ldl -ltermcap -lsqlite3 -lexpat
configure:   LDFLAGS:          -L/opt/local/lib  -L/usr/local/lib

Turns out that the rewritten autoconf file was not enabling the optimizer for C++ compiles: CXXFLAGS contains no -O flag.

I've now gone through scan_aes and discovered it was scanning to the end of the buffer rather than to the end of the page. I changed that, and it's going much faster now too.
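The shape of that fix can be sketched as follows. In bulk_extractor's sbuf convention the buffer is page plus margin, and a scanner should only start work within the page; scan_limit is an invented helper for illustration, not BE2 code:

```cpp
// Sketch of the scan_aes-style fix. The buffer is page + margin; a
// scanner should only START examining candidates within the page, while
// the margin exists so work begun near the page end can complete.
// scan_limit is an invented helper, not actual BE2 code.
#include <cassert>
#include <cstddef>

// Last position at which a scanner may begin examining a candidate of
// `need` bytes. The bug was the equivalent of using bufsize here: the
// margin got fully re-scanned even though the next page also covers it.
size_t scan_limit(size_t pagesize, size_t bufsize, size_t need) {
    size_t limit = pagesize;                  // fixed: was bufsize
    if (limit + need > bufsize)               // still never read past
        limit = bufsize - need;               // the allocated buffer
    return limit;
}
```

With a 16MiB margin on top of each page, scanning to the buffer end repeats a large fraction of the work on every page, which is consistent with the slowdown observed.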

jonstewart commented 2 years ago

🎉


simsong commented 2 years ago

Discovery #2: margin size got inadvertently increased from 4MiB to 16MiB:

(base) simsong@nimi debug % ~/gits/bulk_extractor-1.6/src/bulk_extractor -h | grep margin
   -g NN        - specify margin (default 4194304)
(base) simsong@nimi debug %

vs.

~/gits/bulk_extractor/src/bulk_extractor -j1 -1 -E email -o    23.51s user 0.44s system 104% cpu 22.889 total
(base) simsong@nimi ~ % ~/gits/bulk_extractor/src/bulk_extractor -h | grep margin
  -g, --marginsize arg          margin size in bytes (default: 16777216)
(base) simsong@nimi ~ %

And, get this --- the flex scanners don't respect the pagesize, and appear to scan to the end of the margin.

jonstewart commented 2 years ago

Hasn't that always been a problem with the flex scanners? They have to keep reading past the pagesize to account for any boundary hits, but there's not a great way of telling them to stop looking for new hits. That's a bit of cleverness in lightgrep of which I'm unduly proud: you can provide more input to lg_closeout_search() to resolve any pending hits from previous input, but it won't start new hits; once anything pending has been resolved, it exits. Typically that just needs a couple of bytes of input to get back to the zero state.

Is it possible to recognize transitioning to the zero state with flex, so you could throw an exception early and not have to wait/hope for a full matching hit in the margin?
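The closeout behavior described above can be illustrated with a toy automaton. This is not lightgrep's API, and a real engine must also avoid counting matches that begin in the margin, which this sketch glosses over:

```cpp
// Toy illustration of the closeout idea (not lightgrep's actual API):
// after the page ends, feed margin bytes only while a match is pending,
// and stop as soon as the automaton falls back to the start state.
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>

// Hand-built DFA for the literal "abc"; state = chars matched so far.
static int step(int s, char c) {
    if (c == 'a') return 1;
    if (s == 1 && c == 'b') return 2;
    if (s == 2 && c == 'c') return 3;   // full match
    return 0;
}

// Scan the page fully, then consume margin bytes only to resolve a
// pending partial match. Returns {matches found, margin bytes consumed}.
std::pair<size_t, size_t> scan(const std::string& data, size_t pagesize) {
    int s = 0;
    size_t matches = 0, i = 0;
    for (; i < pagesize && i < data.size(); i++) {
        s = step(s, data[i]);
        if (s == 3) { matches++; s = 0; }
    }
    size_t margin_used = 0;
    for (; s != 0 && i < data.size(); i++, margin_used++) {
        s = step(s, data[i]);
        if (s == 3) { matches++; s = 0; }
    }
    return {matches, margin_used};
}
```

The second loop is the closeout: it exits at the first return to state zero, so only a few margin bytes are typically touched.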

On Wed, Jan 5, 2022 at 10:27 PM, Simson L. Garfinkel wrote:

  • Modify the flex scanners so that if a match starts inside the margin, it throws an exception that ends the flex scanner.
  • Update the BE2 codebase so that the default margin is 4MiB, not 16MiB.

simsong commented 2 years ago

I just found a way to tell them to stop when the regular expression scanner crosses into the margin. You'll like it.
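A sketch of the check-and-throw mechanism, with a simple tokenizer standing in for a flex-generated lexer; ScanAbort, check_margin, and count_tokens are hypothetical names invented for this illustration:

```cpp
// Sketch of "stop when the scanner crosses into the margin": each rule
// action checks whether the match START lies past the page boundary and,
// if so, aborts the scan. Names here are hypothetical, not BE2's.
#include <cassert>
#include <cstddef>
#include <string>

struct ScanAbort {};   // thrown to unwind out of the generated lexer

inline void check_margin(size_t pos, size_t pagesize) {
    if (pos >= pagesize) throw ScanAbort{};   // match starts in margin: stop
}

// Count space-separated tokens, stopping once one STARTS in the margin.
size_t count_tokens(const std::string& data, size_t pagesize) {
    size_t count = 0, i = 0;
    try {
        while (i < data.size()) {
            if (data[i] == ' ') { i++; continue; }
            check_margin(i, pagesize);        // the new early-exit check
            while (i < data.size() && data[i] != ' ') i++;
            count++;                          // a token may still END in the margin
        }
    } catch (const ScanAbort&) {
        // scanning stopped at the margin boundary
    }
    return count;
}
```

Note that matches beginning inside the page are still allowed to run into the margin; only matches that begin past the page boundary are suppressed.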

simsong commented 2 years ago

Before implementing new margin hack (and before shortening the margin from 16MiB to 4MiB):

(base) simsong@nimi ~ % time ~/gits/bulk_extractor/src/bulk_extractor -j1 -1 -E email -o ~/out/out-be20-xall-$(python -c 'import time;print(int(time.time()))') ~/corp/nps-2009-ubnist1/ubnist1.gen3.raw >/dev/null
~/gits/bulk_extractor/src/bulk_extractor -j1 -1 -E email -o   > /dev/null  138.56s user 3.20s system 102% cpu 2:18.54 total

After:

(base) simsong@nimi ~ % time ~/gits/bulk_extractor/src/bulk_extractor -j1 -1 -E email -o ~/out/out-be20-xall-$(python -c 'import time;print(int(time.time()))') ~/corp/nps-2009-ubnist1/ubnist1.gen3.raw
~/gits/bulk_extractor/src/bulk_extractor -j1 -1 -E email -o    80.65s user 2.69s system 103% cpu 1:20.44 total

Improvement from 138.56s to 80.65s of user time. It's still not as good as BE1.6, probably because of the bigger margin size....

Check out 4007d30

simsong commented 2 years ago

Now with the margin lowered from 16MiB to 4MiB:

~/gits/bulk_extractor/src/bulk_extractor -j1 -1 -E email -o    76.66s user 2.25s system 103% cpu 1:16.38 total

**LOOK AT THAT** -- it's even faster than version 1.6. About time. That's the exception logic. Now to add it to scan_accts.

simsong commented 2 years ago

Ick. The exception causes values allocated by the lexer not to be freed.

https://github.com/simsong/bulk_extractor/runs/4723066087?check_suite_focus=true

Looks like I'll need to just move the scan point to the end instead.
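The leak-free alternative can be sketched with a simple tokenizer standing in for the flex lexer: instead of throwing, the rule action advances the read position to end-of-input, so the lexer hits EOF and winds down normally, freeing its state. The function name is invented for illustration:

```cpp
// Sketch of the leak-free variant: rather than throwing out of the lexer
// (which skips its cleanup of allocated values), move the scan point to
// the end of input so the lexer terminates and cleans up normally.
// count_tokens_no_throw is an illustrative name, not BE2 code.
#include <cassert>
#include <cstddef>
#include <string>

size_t count_tokens_no_throw(const std::string& data, size_t pagesize) {
    size_t count = 0, i = 0;
    while (i < data.size()) {
        if (data[i] == ' ') { i++; continue; }
        if (i >= pagesize) {
            i = data.size();   // jump the scan point to EOF: the lexer
            continue;          // sees end-of-input and exits cleanly
        }
        while (i < data.size() && data[i] != ' ') i++;
        count++;
    }
    return count;
}
```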

simsong commented 2 years ago

I've run many performance tests now with BE16 vs. BE20beta4 (which includes the flex improvements noted above).

The table below was all generated from the domexusers disk image. All runs use -j1, meaning one thread reading data and one thread processing, which allows very good measurements of time. Each run enables a single scanner. The first row, -x all, has no scanners enabled; it shows that the underlying infrastructure is 2x faster in BE20b4. However, it's also clear that some of the scanners are much slower. Really, only three scanners: the net scanner and two of the flex-based scanners (but not the other two?).

| scanner | BE16 (s) | BE20b4 (s) |
|---------|---------:|-----------:|
| none (-x all) | 301 | 141 |
| json | 307 | 274 |
| aes | 375 | 341 |
| gzip | 294 | 276 |
| base64 | 301 | 267 |
| zip | 298 | 275 |
| gps | 239 | 239 |
| accts | 431 | 1510 |
| net | 461 | 1030 |
| email | 829 | 1882 |

With the recursive analysis, the extra time spent in the big scanners (accts, net, email) is multiplied several times over.

So I'm nearly there. The new infrastructure is amazing, and by throwing a lot of cores at the thing, BE2 is already faster than BE1. But once the net, email, and accts scanners get faster, BE2 will be much faster than BE1. So I'm going to close this thread and open separate threads for net, accts, and email.