Open villepeh opened 3 months ago
Similar results with "manual" compressor, trying to compress Matrix HQ room.
# RUST_BACKTRACE=1 RUST_LOG=debug LD_PRELOAD=/usr/lib64/libjemalloc.so.2 ./synapse_compress_state -p "user=postgres dbname=matrix host=/run/postgresql" -r '!OGEhHVWSdvArJzumhm:matrix.org' -o out.sql -t -n 500 -b 170833
[2024-04-09T23:43:16Z INFO synapse_compress_state] Fetching state from DB for room '!OGEhHVWSdvArJzumhm:matrix.org'...
[2024-04-09T23:43:16Z DEBUG tokio_postgres::prepare] preparing query s0: SELECT id FROM (SELECT id FROM state_groups WHERE room_id = $1 AND id > $2 ORDER BY id ASC LIMIT $3) AS ids ORDER BY ids.id DESC LIMIT 1
[2024-04-09T23:43:16Z DEBUG tokio_postgres::query] executing statement s0 with parameters: ["!OGEhHVWSdvArJzumhm:matrix.org", Some(170833), Some(500)]
[2024-04-09T23:43:16Z DEBUG tokio_postgres::prepare] preparing query s1:
SELECT m.id, prev_state_group, type, state_key, s.event_id
FROM state_groups AS m
LEFT JOIN state_groups_state AS s ON (m.id = s.state_group)
LEFT JOIN state_group_edges AS e ON (m.id = e.state_group)
WHERE m.room_id = $1 AND m.id <= $2
AND m.id > $3
[2024-04-09T23:43:16Z DEBUG tokio_postgres::query] executing statement s1 with parameters: ["!OGEhHVWSdvArJzumhm:matrix.org", 173804, 170833]
[2m] 14444321 rows retrieved
[2024-04-09T23:45:33Z DEBUG synapse_compress_state::database] Got initial state from database. Checking for any missing state groups...
[2024-04-09T23:45:33Z INFO synapse_compress_state] Fetched state groups up to 173804
[2024-04-09T23:45:33Z INFO synapse_compress_state] Number of state groups: 500
[2024-04-09T23:45:33Z INFO synapse_compress_state] Number of rows in current table: 14444035
[2024-04-09T23:45:33Z INFO synapse_compress_state] Compressing state...
[00:01:32] ████████████████████ 500/500 state groups
[2024-04-09T23:47:06Z INFO synapse_compress_state] Number of rows after compression: 2943697 (20.38%)
[2024-04-09T23:47:06Z INFO synapse_compress_state] Compression Statistics:
[2024-04-09T23:47:06Z INFO synapse_compress_state] Number of forced resets due to lacking prev: 29
[2024-04-09T23:47:06Z INFO synapse_compress_state] Number of compressed rows caused by the above: 2680484
[2024-04-09T23:47:06Z INFO synapse_compress_state] Number of state groups changed: 161
[2024-04-09T23:47:06Z INFO synapse_compress_state] Checking that state maps match...
[00:00:00] ░░░░░░░░░░░░░░░░░░░░ 0/500 state groups
Segmentation fault (core dumped)
Coredump
# gdb /opt/rust-synapse-compress-state/target/debug/synapse_compress_state --core /root/core.synapse_compres.0.90ce0546c6a44f6ea07a0538e09d8004.1952675.1712706710000000
GNU gdb (GDB) Red Hat Enterprise Linux 10.2-11.1.el9_3
Copyright (C) 2021 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<https://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /opt/rust-synapse-compress-state/target/debug/synapse_compress_state...
[New LWP 1952986]
[New LWP 1952675]
[New LWP 1952977]
[New LWP 1952976]
[New LWP 1952978]
[New LWP 1952979]
[New LWP 1952980]
[New LWP 1952981]
[New LWP 1952985]
[New LWP 1952975]
[New LWP 1952982]
[New LWP 1952983]
[New LWP 1952984]
[New LWP 1952987]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `./synapse_compress_state -p user=postgres dbname=matrix host=/run/postgresql -r'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0 ___pthread_mutex_trylock (mutex=mutex@entry=0x28e8) at pthread_mutex_trylock.c:34
34 switch (__builtin_expect (PTHREAD_MUTEX_TYPE_ELISION (mutex),
[Current thread is 1 (Thread 0x7f8317ff4640 (LWP 1952986))]
warning: Missing auto-load script at offset 0 in section .debug_gdb_scripts
of file /opt/rust-synapse-compress-state/target/debug/synapse_compress_state.
Use `info auto-load python-scripts [REGEXP]' to list them.
(gdb) bt
#0 ___pthread_mutex_trylock (mutex=mutex@entry=0x28e8) at pthread_mutex_trylock.c:34
#1 0x00007f831a080a08 in malloc_mutex_trylock_final (mutex=0x28a8) at include/jemalloc/internal/mutex.h:157
#2 malloc_mutex_lock (mutex=0x28a8, tsdn=0x7f8317ff2f88) at include/jemalloc/internal/mutex.h:216
#3 je_tcache_arena_associate (tsdn=tsdn@entry=0x7f8317ff2f88, tcache_slow=tcache_slow@entry=0x7f8317ff3088, tcache=tcache@entry=0x7f8317ff32e0, arena=arena@entry=0x0) at src/tcache.c:588
#4 0x00007f831a08442b in arena_choose_impl.constprop.1 (tsd=0x7f8317ff2f88, arena=<optimized out>, internal=false) at include/jemalloc/internal/jemalloc_internal_inlines_b.h:60
#5 0x00007f831a01d397 in arena_choose (arena=0x0, tsd=0x7f8317ff2f88) at include/jemalloc/internal/jemalloc_internal_inlines_b.h:88
#6 tcache_alloc_small (slow_path=<optimized out>, zero=true, binind=2, size=32, tcache=0x7f8317ff32e0, arena=0x0, tsd=0x7f8317ff2f88) at include/jemalloc/internal/tcache_inlines.h:56
#7 arena_malloc (slow_path=<optimized out>, tcache=0x7f8317ff32e0, zero=true, ind=2, size=32, arena=0x0, tsdn=0x7f8317ff2f88) at include/jemalloc/internal/arena_inlines_b.h:151
#8 iallocztm (slow_path=<optimized out>, arena=0x0, is_internal=false, tcache=0x7f8317ff32e0, zero=true, ind=2, size=32, tsdn=0x7f8317ff2f88) at include/jemalloc/internal/jemalloc_internal_inlines_c.h:55
#9 imalloc_no_sample (ind=2, usize=32, size=32, tsd=0x7f8317ff2f88, dopts=<synthetic pointer>, sopts=<synthetic pointer>) at src/jemalloc.c:2398
#10 imalloc_body (tsd=0x7f8317ff2f88, dopts=<synthetic pointer>, sopts=<synthetic pointer>) at src/jemalloc.c:2573
#11 imalloc (dopts=<optimized out>, sopts=<optimized out>) at src/jemalloc.c:2687
#12 calloc (num=num@entry=1, size=size@entry=32) at src/jemalloc.c:2852
#13 0x00007f8319657b43 in __cxa_thread_atexit_impl (func=0xff8c83398676680f, obj=0x7f8317ff4460, dso_symbol=0x5620ea26d730 <_rust_extern_with_linkage___dso_handle>) at cxa_thread_atexit_impl.c:107
#14 0x00005620ea009899 in std::sys::unix::stack_overflow::imp::signal_handler ()
#15 <signal handler called>
#16 ___pthread_mutex_trylock (mutex=mutex@entry=0x28e8) at pthread_mutex_trylock.c:34
#17 0x00007f831a080a08 in malloc_mutex_trylock_final (mutex=0x28a8) at include/jemalloc/internal/mutex.h:157
#18 malloc_mutex_lock (mutex=0x28a8, tsdn=0x7f8317ff2f88) at include/jemalloc/internal/mutex.h:216
#19 je_tcache_arena_associate (tsdn=tsdn@entry=0x7f8317ff2f88, tcache_slow=tcache_slow@entry=0x7f8317ff3088, tcache=tcache@entry=0x7f8317ff32e0, arena=arena@entry=0x0) at src/tcache.c:588
#20 0x00007f831a0859a8 in arena_choose_impl (arena=<optimized out>, internal=false, tsd=0x7f8317ff2f88) at include/jemalloc/internal/jemalloc_internal_inlines_b.h:60
#21 arena_choose_impl (arena=0x0, internal=false, tsd=0x7f8317ff2f88) at include/jemalloc/internal/jemalloc_internal_inlines_b.h:32
#22 arena_choose (arena=0x0, tsd=0x7f8317ff2f88) at include/jemalloc/internal/jemalloc_internal_inlines_b.h:88
#23 je_tsd_tcache_data_init.isra.0 (tsd=0x7f8317ff2f88) at src/tcache.c:740
#24 0x00007f831a085df9 in je_tsd_tcache_enabled_data_init (tsd=<optimized out>) at src/tcache.c:644
#25 0x00007f831a085e8c in je_tsd_fetch_slow.constprop.0 (minimal=minimal@entry=false, tsd=<optimized out>) at src/tsd.c:311
#26 0x00007f831a024445 in tsd_fetch_impl (minimal=false, init=true) at include/jemalloc/internal/tsd.h:422
#27 tsd_fetch () at include/jemalloc/internal/tsd.h:448
#28 imalloc (dopts=<synthetic pointer>, sopts=<synthetic pointer>) at src/jemalloc.c:2681
#29 realloc (ptr=ptr@entry=0x0, size=size@entry=32) at src/jemalloc.c:3653
#30 0x00007f83196a0bea in __pthread_getattr_np (thread_id=140201020048960, attr=0x7f8317ff23d0) at pthread_getattr_np.c:181
#31 0x00005620ea00a2a1 in std::sys::unix::thread::guard::current ()
#32 0x00005620e9e56523 in std::thread::{impl#0}::spawn_unchecked_::{closure#1}<rayon_core::registry::{impl#2}::spawn::{closure_env#0}, ()> () at /builddir/build/BUILD/rustc-1.71.1-src/library/std/src/thread/mod.rs:527
#33 0x00005620e9e2316f in core::ops::function::FnOnce::call_once<std::thread::{impl#0}::spawn_unchecked_::{closure_env#1}<rayon_core::registry::{impl#2}::spawn::{closure_env#0}, ()>, ()> ()
at /builddir/build/BUILD/rustc-1.71.1-src/library/core/src/ops/function.rs:250
#34 0x00005620ea00a095 in std::sys::unix::thread::Thread::new::thread_start ()
#35 0x00007f831969f802 in start_thread (arg=<optimized out>) at pthread_create.c:443
#36 0x00007f831963f450 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81
Interesting. After I
...segfault no longer happens. The behavior seems weird regardless, but I have nothing more to add, except that the older version didn't fix the segfaults before I did the other steps.
If it's impossible to investigate and no one else is able to reproduce this, I think this issue can be closed.
From the stack trace, sounds like it could be a problem with jemallocator. So using a different version is probably a reasonable workaround. If it happens on the latest jemallocator version, we should probably look into this, otherwise I'm not sure there's much for us to do here.
What do you mean by 'mock-compiled' one?
> From the stack trace, sounds like it could be a problem with jemallocator. So using a different version is probably a reasonable workaround. If it happens on the latest jemallocator version, we should probably look into this, otherwise I'm not sure there's much for us to do here.
I can't believe I forgot to post about the actual reason this kept happening. It was very likely this sysctl parameter: vm.overcommit_memory=2. It was suggested here, so I just went with it (bad idea). When I set it back to the default, 0, the issue was gone. And the compressor now works with the latest jemalloc as well.
I suppose the compressor crashing with the option enabled isn't intended behavior, but I don't know if it's worth the trouble of fixing either.
0 - Heuristic overcommit handling. Obvious overcommits of
address space are refused. Used for a typical system. It
ensures a seriously wild allocation fails while allowing
overcommit to reduce swap usage. root is allowed to
allocate slightly more memory in this mode. This is the
default.
1 - Always overcommit. Appropriate for some scientific
applications. Classic example is code using sparse arrays
and just relying on the virtual memory consisting almost
entirely of zero pages.
2 - Don't overcommit. The total address space commit
for the system is not permitted to exceed swap + a
configurable amount (default is 50%) of physical RAM.
Depending on the amount you use, in most situations
this means a process will not be killed while accessing
pages but will receive errors on memory allocation as
appropriate.
Useful for applications that want to guarantee their
memory allocations will be available in the future
without having to initialize every page.
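For anyone hitting the same symptom, checking the current overcommit policy and reverting it to the default is quick. This is a sketch only; the sysctl.d filename below is an example, not a convention the project prescribes:

```shell
# Show the current policy: 0 = heuristic (default), 1 = always overcommit, 2 = strict.
cat /proc/sys/vm/overcommit_memory

# Revert to the default at runtime (requires root):
#   sysctl -w vm.overcommit_memory=0
# Persist the setting across reboots (filename is illustrative):
#   echo 'vm.overcommit_memory = 0' > /etc/sysctl.d/90-overcommit.conf
```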
> What do you mean by 'mock-compiled' one?
mock is a handy tool for creating RPMs. In short, some software might not be available for RHEL (and its clones like Oracle Linux and Rocky Linux) or might be quite old. For example, RHEL offers HAProxy 2.4 but I'd much rather run the latest 2.8 LTS.
Instead of working with the ./configure && make && make install hassle, you can just grab a source RPM. Then run mock haproxy-2.8.5-1.fc39.src.rpm and it handles the compilation voodoo itself, creating an installable .rpm for your distro. I did the same for jemalloc, because EPEL offers version 5.2.1 rather than the latest 5.3.0, which is supposed to have several improvements and optimizations.
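For context, the mock workflow described above looks roughly like this. The chroot config name and the exact jemalloc .src.rpm release are illustrative examples, not the precise ones used here:

```shell
# Install mock and allow your user to run it (the package comes from Fedora/EPEL).
sudo dnf install -y mock
sudo usermod -a -G mock "$USER"   # log out and back in for the group to apply

# Rebuild a Fedora source RPM inside a clean chroot for the target distro;
# config name and .src.rpm release below are examples.
mock -r rocky+epel-9-x86_64 --rebuild jemalloc-5.3.0-1.fc39.src.rpm

# Built RPMs land in the chroot's result directory.
ls /var/lib/mock/rocky+epel-9-x86_64/result/
```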
I used the Synapse Admin API to purge some rooms that no longer had local users. After that I started seeing panic messages like the ones in #79.
I tried deleting the compressor entries from the database, but now I'm getting segfaults.
I cloned the repository again and rebuilt the auto compressor with cargo build, but the result is the same. Running the command with sudo -u postgres or as root makes no difference. I tried to get some debug info with GDB: