mpope9 / exor_filter

Erlang nif for xor_filter. 'Faster and Smaller Than Bloom and Cuckoo Filters'.
Apache License 2.0
31 stars 3 forks source link

eunit tests segfault #17

Closed JayKickliter closed 3 years ago

JayKickliter commented 3 years ago

It looks like problem began with 037900f10f2fcf0edd0c9e448a474720b44152f9.

The error:

$ ./rebar3 eunit
    <TRIMMED>
    exor_filter_test:66: basic_test_...ok
    exor_filter_test:67: basic_test_...[0.025 s] ok
    exor_filter_test:68: basic_test_...Segmentation fault (core dumped)

Compiling and running the tests with address santizer (ASan) gives us a little more info.

Modifications:

diff --git a/c_src/Makefile b/c_src/Makefile
index 0c18c08..f833819 100644
--- a/c_src/Makefile
+++ b/c_src/Makefile
@@ -31,11 +31,11 @@ else ifeq ($(UNAME_SYS), Linux)
    CXXFLAGS ?= -O3 -finline-functions -Wall
 endif

-CFLAGS += -fPIC -I $(ERTS_INCLUDE_DIR) -I $(ERL_INTERFACE_INCLUDE_DIR)
-CXXFLAGS += -fPIC -I $(ERTS_INCLUDE_DIR) -I $(ERL_INTERFACE_INCLUDE_DIR)
+CFLAGS += -fPIC -I $(ERTS_INCLUDE_DIR) -I $(ERL_INTERFACE_INCLUDE_DIR) -fsanitize=address -ggdb3 -fno-omit-frame-pointer
+CXXFLAGS += -fPIC -I $(ERTS_INCLUDE_DIR) -I $(ERL_INTERFACE_INCLUDE_DIR) -fsanitize=address -ggdb3 -fno-omit-frame-pointer

 LDLIBS += -L $(ERL_INTERFACE_LIB_DIR) -lei
-LDFLAGS += -shared
+LDFLAGS += -shared -fsanitize=address

 # Verbosity.

Errors (they are non-deterministic and change from run to run):

$ LD_PRELOAD="$(cc --print-file libasan.so)" ASAN_OPTIONS=detect_leaks=0 ./rebar3 eunit
======================== EUnit ========================
file "exor_filter.app"
  application 'exor_filter'
    module 'exor_filter'
    module 'xor16'
    module 'xor8'
    [done in 0.011 s]
  [done in 0.019 s]
module 'exor_filter_test'
  xor8 Test Group
    exor_filter_test:15: basic_test_...ok
    exor_filter_test:16: basic_test_...ok
    exor_filter_test:17: basic_test_...ok
    exor_filter_test:18: basic_test_...ok
    exor_filter_test:19: basic_test_...ok
    exor_filter_test:20: basic_test_...Too many iterations. Are all your keys unique?ok
    exor_filter_test:21: basic_test_...Too many iterations. Are all your keys unique?ok
    exor_filter_test:22: basic_test_...ok
    exor_filter_test:23: basic_test_...ok
    exor_filter_test:24: basic_test_...ok
    exor_filter_test:25: basic_test_...ok
    exor_filter_test:26: basic_test_...ok
    exor_filter_test:27: basic_test_...ok
    exor_filter_test:28: basic_test_...ok
    exor_filter_test:29: basic_test_...ok
    exor_filter_test:30: basic_test_...ok
    exor_filter_test:31: basic_test_...ok
    exor_filter_test:32: basic_test_...ok
    exor_filter_test:33: basic_test_...ok
    exor_filter_test:34: basic_test_...[2.261 s] ok
    exor_filter_test:35: basic_test_...[1.721 s] ok
    exor_filter_test:36: basic_test_...[0.032 s] ok
    exor_filter_test:37: basic_test_...[0.038 s] ok
    exor_filter_test:38: basic_test_...ok
    exor_filter_test:39: basic_test_...[0.025 s] ok
    exor_filter_test:40: basic_test_...[0.025 s] ok
    exor_filter_test:41: basic_test_...[0.004 s] ok
    exor_filter_test:42: basic_test_...[0.022 s] ok
    exor_filter_test:43: basic_test_...[0.053 s] ok
    exor_filter_test:44: basic_test_...[0.159 s] ok
    [done in 4.430 s]
  xor16 Test Group
    exor_filter_test:48: basic_test_...ok
    exor_filter_test:49: basic_test_...ok
    exor_filter_test:50: basic_test_...ok
    exor_filter_test:51: basic_test_...ok
    exor_filter_test:52: basic_test_...ok
    exor_filter_test:53: basic_test_...Too many iterations. Are all your keys unique?ok
    exor_filter_test:54: basic_test_...Too many iterations. Are all your keys unique?ok
    exor_filter_test:55: basic_test_...ok
    exor_filter_test:56: basic_test_...ok
    exor_filter_test:57: basic_test_...ok
    exor_filter_test:58: basic_test_...ok
    exor_filter_test:59: basic_test_...ok
    exor_filter_test:60: basic_test_...ok
    exor_filter_test:61: basic_test_...ok
    exor_filter_test:62: basic_test_...ok
    exor_filter_test:63: basic_test_...ok
    exor_filter_test:64: basic_test_...ok
    exor_filter_test:65: basic_test_...ok
    exor_filter_test:66: basic_test_...ok
    exor_filter_test:67: basic_test_...[0.026 s] ok
    exor_filter_test:68: basic_test_...ASAN:DEADLYSIGNAL
=================================================================
==28642==ERROR: AddressSanitizer: SEGV on unknown address 0x7de708dc0018 (pc 0x5634a64aab55 bp 0x7fd50697e700 sp 0x7fd50697e6b0 T4)
==28642==The signal is caused by a READ memory access.
    #0 0x5634a64aab54 in ddq_enqueue beam/erl_alloc_util.c:1949
    #1 0x5634a64aab54 in enqueue_dealloc_other_instance beam/erl_alloc_util.c:2254
    #2 0x5634a64aab54 in erts_alcu_free_thr_pref beam/erl_alloc_util.c:6068
    #3 0x5634a65aed2e in erts_free beam/erl_alloc.h:281
    #4 0x5634a65aed2e in enif_free beam/erl_nif.c:526
    #5 0x5634a65af3dc in run_resource_dtor beam/erl_nif.c:2529
    #6 0x5634a646bdf1 in handle_misc_aux_work beam/erl_process.c:1853
    #7 0x5634a646bdf1 in handle_aux_work beam/erl_process.c:2638
    #8 0x5634a64677bd in scheduler_wait beam/erl_process.c:3402
    #9 0x5634a6466b6a in erts_schedule beam/erl_process.c:9543
    #10 0x5634a6459409 in process_main beam/beam_emu.c:684
    #11 0x5634a6475a38 in sched_thread_func beam/erl_process.c:8498
    #12 0x5634a666879c in thr_wrapper pthread/ethread.c:118
    #13 0x7fd54c7396da in start_thread (/lib/x86_64-linux-gnu/libpthread.so.0+0x76da)
    #14 0x7fd54c25aa3e in __clone (/lib/x86_64-linux-gnu/libc.so.6+0x121a3e)

AddressSanitizer can not provide additional info.
SUMMARY: AddressSanitizer: SEGV beam/erl_alloc_util.c:1949 in ddq_enqueue
Thread T4 (1_scheduler) created by T0 here:
    #0 0x7fd54d371d2f in __interceptor_pthread_create (/usr/lib/gcc/x86_64-linux-gnu/7/libasan.so+0x37d2f)
    #1 0x5634a6668a12 in ethr_thr_create pthread/ethread.c:400

==28642==ABORTING
    exor_filter_test:68: basic_test_...ASAN:DEADLYSIGNAL
=================================================================
==28708==ERROR: AddressSanitizer: SEGV on unknown address 0x000000000074 (pc 0x55e57b607af9 bp 0x7f4c1797e700 sp 0x7f4c1797e6b0 T4)
==28708==The signal is caused by a READ memory access.
==28708==Hint: address points to the zero page.
    #0 0x55e57b607af8 in erts_alcu_free_thr_pref beam/erl_alloc_util.c:6067
    #1 0x55e57b70bd2e in erts_free beam/erl_alloc.h:281
    #2 0x55e57b70bd2e in enif_free beam/erl_nif.c:526
    #3 0x55e57b70c3dc in run_resource_dtor beam/erl_nif.c:2529
    #4 0x55e57b5c8df1 in handle_misc_aux_work beam/erl_process.c:1853
    #5 0x55e57b5c8df1 in handle_aux_work beam/erl_process.c:2638
    #6 0x55e57b5c47bd in scheduler_wait beam/erl_process.c:3402
    #7 0x55e57b5c3b6a in erts_schedule beam/erl_process.c:9543
    #8 0x55e57b5b6409 in process_main beam/beam_emu.c:684
    #9 0x55e57b5d2a38 in sched_thread_func beam/erl_process.c:8498
    #10 0x55e57b7c579c in thr_wrapper pthread/ethread.c:118
    #11 0x7f4c5d70b6da in start_thread (/lib/x86_64-linux-gnu/libpthread.so.0+0x76da)
    #12 0x7f4c5d22ca3e in __clone (/lib/x86_64-linux-gnu/libc.so.6+0x121a3e)

AddressSanitizer can not provide additional info.
SUMMARY: AddressSanitizer: SEGV beam/erl_alloc_util.c:6067 in erts_alcu_free_thr_pref
Thread T4 (1_scheduler) created by T0 here:
    #0 0x7f4c5e343d2f in __interceptor_pthread_create (/usr/lib/gcc/x86_64-linux-gnu/7/libasan.so+0x37d2f)
    #1 0x55e57b7c5a12 in ethr_thr_create pthread/ethread.c:400

==28708==ABORTING

Commenting out malloc and free macros in xor_filter_nif.c

/* #define malloc(size) enif_alloc(size) */
/* #define free(size) enif_free(size) */

gives more insightful info, but I don't whether or not I introduced a new bug causing the NIF to free memory allocated with enif_alloc or vice versa:

    exor_filter_test:68: basic_test_...=================================================================
==30750==ERROR: AddressSanitizer: attempting free on address which was not malloc()-ed: 0x7fa7d8e486f8 in thread T4 (1_scheduler)
    #0 0x7fa8206d37a8 in __interceptor_free (/usr/lib/gcc/x86_64-linux-gnu/7/libasan.so+0xde7a8)
    #1 0x7fa7d6a1cafe in xor8_free exor_filter/c_src/xorfilter.h:165
    #2 0x7fa7d6a1cafe in destroy_xor8_filter_resource exor_filter/c_src/xor_filter_nif.c:69
    #3 0x55ab855ae3dc in run_resource_dtor beam/erl_nif.c:2529
[0.036 s] ok
    exor_filter_test:69: basic_test_...    #4 0x55ab8546adf1 in handle_misc_aux_work beam/erl_process.c:1853
    #5 0x55ab8546adf1 in handle_aux_work beam/erl_process.c:2638
    #6 0x55ab854667bd in scheduler_wait beam/erl_process.c:3402
    #7 0x55ab85465b6a in erts_schedule beam/erl_process.c:9543
    #8 0x55ab85458409 in process_main beam/beam_emu.c:684
    #9 0x55ab85474a38 in sched_thread_func beam/erl_process.c:8498
    #10 0x55ab8566779c in thr_wrapper pthread/ethread.c:118
    #11 0x7fa81f9f46da in start_thread (/lib/x86_64-linux-gnu/libpthread.so.0+0x76da)
    #12 0x7fa81f515a3e in __clone (/lib/x86_64-linux-gnu/libc.so.6+0x121a3e)

Address 0x7fa7d8e486f8 is a wild pointer.
SUMMARY: AddressSanitizer: bad-free (/usr/lib/gcc/x86_64-linux-gnu/7/libasan.so+0xde7a8) in __interceptor_free
Thread T4 (1_scheduler) created by T0 here:
    #0 0x7fa82062cd2f in __interceptor_pthread_create (/usr/lib/gcc/x86_64-linux-gnu/7/libasan.so+0x37d2f)
    #1 0x55ab85667a12 in ethr_thr_create pthread/ethread.c:400

==30750==ABORTING
JayKickliter commented 3 years ago

On a couple runs with with ASan it reported double free:


=================================================================
==30946==ERROR: AddressSanitizer: attempting double-free on 0x7f77284801d8 in thread T4 (1_scheduler):
    #0 0x7f776cfb97a8 in __interceptor_free (/usr/lib/gcc/x86_64-linux-gnu/7/libasan.so+0xde7a8)
    #1 0x7f77227e6afe in xor8_free /home/jay/repos/exor_filter/c_src/xorfilter.h:165
    #2 0x7f77227e6afe in destroy_xor8_filter_resource exor_filter/c_src/xor_filter_nif.c:69
    #3 0x5612138c53dc in run_resource_dtor beam/erl_nif.c:2529
    #4 0x561213781df1 in handle_misc_aux_work beam/erl_process.c:1853
    #5 0x561213781df1 in handle_aux_work beam/erl_process.c:2638
    #6 0x56121377d7bd in scheduler_wait beam/erl_process.c:3402
    #7 0x56121377cb6a in erts_schedule beam/erl_process.c:9543
    #8 0x56121376f409 in process_main beam/beam_emu.c:684
    #9 0x56121378ba38 in sched_thread_func beam/erl_process.c:8498
    #10 0x56121397e79c in thr_wrapper pthread/ethread.c:118
    #11 0x7f776c2da6da in start_thread (/lib/x86_64-linux-gnu/libpthread.so.0+0x76da)
    #12 0x7f776bdfba3e in __clone (/lib/x86_64-linux-gnu/libc.so.6+0x121a3e)

0x7f77284801d8 is located 472 bytes inside of 32768-byte region [0x7f7728480000,0x7f7728488000)
allocated by thread T0 here:
    #0 0x7f776cfba790 in posix_memalign (/usr/lib/gcc/x86_64-linux-gnu/7/libasan.so+0xdf790)
    #1 0x5612138ecddb in erts_sys_aligned_alloc sys/unix/sys.c:819
    #2 0x5612137b5951 in erts_alcu_sys_alloc beam/erl_alloc_util.c:1111
    #3 0x5612137b8eed in create_carrier beam/erl_alloc_util.c:4262
    #4 0x5612137c1af2 in erts_alcu_start beam/erl_alloc_util.c:6782
    #5 0x5612138d2c78 in erts_aoffalc_start beam/erl_ao_firstfit_alloc.c:343
    #6 0x5612137aad57 in start_au_allocator beam/erl_alloc.c:1130
    #7 0x5612137abbbe in erts_alloc_init beam/erl_alloc.c:888
    #8 0x5612137c6698 in early_init beam/erl_init.c:1181
    #9 0x5612137c6a16 in erl_start beam/erl_init.c:1266
    #10 0x561213766b22 in main sys/unix/erl_main.c:30
    #11 0x7f776bcfbb96 in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x21b96)

Thread T4 (1_scheduler) created by T0 here:
    #0 0x7f776cf12d2f in __interceptor_pthread_create (/usr/lib/gcc/x86_64-linux-gnu/7/libasan.so+0x37d2f)
    #1 0x56121397ea12 in ethr_thr_create pthread/ethread.c:400

SUMMARY: AddressSanitizer: double-free (/usr/lib/gcc/x86_64-linux-gnu/7/libasan.so+0xde7a8) in __interceptor_free
==30946==ABORTING```
mpope9 commented 3 years ago

Interesting, thank you for the detailed bug report. Running it ~10 times seems like it does show up intermittently, xor16 tests only. Will look into this more this eve. Have you ran into this while using the library or just with the test suite?

JayKickliter commented 3 years ago

I have only run this under ASan with the test suite, and not in an erlang application.

Some background on how I discovered this: filter serialization [to,from]_bin returns non-deterministic binaries. They can round trip [de]serialize just fine, but two identical filters serialize to different binaries. I figured it had something to do with uninitialized memory, and it turns out it does. Changing

#define malloc(size) enif_alloc(size)

to

#define malloc(size) calloc(size)

fixes the nondeterministic binary problem.

Don't get hung up on the above. It isn't (obviously? directly?) related to the issue I filed, but it's how I discovered it and may speak to wider memory management bugs.

mpope9 commented 3 years ago

I can see where the double free error can occur. destroy_xor8_filter_resource does not check for null before calling free. should be an easy enough fix. I'll see what I can do tonight and run the suite extensively to check for more segfaults.

JayKickliter commented 3 years ago

Running it ~10 times seems like it does show up intermittently, xor16 tests only.

Strange. It fails about 70% of the time for me.

We run all our c NIF modules CI tests with GH workflows and ASan enabled. If you're interested in setting that up and want some guidance, please feel free to reach out and @ me.

mpope9 commented 3 years ago

We do enif_release_resource(filter) to have the VM own the binary. It could be freed sometimes before we call out to destroy. I think this is the source of the non-determinism.

mpope9 commented 3 years ago

I was always meaning to setup that up. Might be the perfect time. I'll let you know if I run into any issues. Thanks.

JayKickliter commented 3 years ago

Here is the minimum reproduction:

$ ./rebar3 shell
1> Filter0 = xor16:new_empty().
{builder,#Ref<0.136826599.2258763777.81971>}
2> q().
ok
3> Segmentation fault (core dumped)