tokio-rs / tokio

A runtime for writing reliable asynchronous applications with Rust. Provides I/O, networking, scheduling, timers, ...
https://tokio.rs
MIT License

Many people report scheduler panics recently #6424

Open Darksonn opened 5 months ago

Darksonn commented 5 months ago

Over the past few days, we've received an abnormal number of issues reporting a panic in the scheduler that really shouldn't happen. I'm wondering whether a recent change in a dependency or Rust itself is causing memory corruption.

If this is also happening to you, please add a comment to this issue. Please make sure to include your Tokio version and Rust version, and the versions of other relevant dependencies (e.g. include your lock file).

I will be closing the other issues as duplicates of this one, so see the references below for the list.

Noah-Kennedy commented 5 months ago

Getting the rustc versions and lockfiles from reporters would be a good start

Noah-Kennedy commented 5 months ago

Sadly I cannot ping people from the other threads in here.

jcsp commented 5 months ago

Summary of our instance of this issue (https://github.com/neondatabase/neon/issues/7162):

Noah-Kennedy commented 5 months ago

@jcsp what rustc version was this?

VladLazar commented 5 months ago

@Noah-Kennedy rustc version was 1.76.0

Adding to what John said above, we also hit this assertion in the scheduler. Can provide the stack trace, but I doubt it would be of much use.

Noah-Kennedy commented 5 months ago

@VladLazar did you do any recent bumps other than to tokio in your dependency tree?

I'd like to dig in further with you folks; you seem to be the best repro right now.

Noah-Kennedy commented 5 months ago

We are currently suspecting that this originates in a dependency, and are trying to track down what that may be.

Noah-Kennedy commented 5 months ago

Also @jcsp @VladLazar what platform are you guys seeing this on?

Noah-Kennedy commented 5 months ago

So, I am currently trying to narrow down the list of possible external dependencies which this could be.

My current ask to anyone who comes across this issue is that you look through any other recent changes in your dependency tree and bring that information to this thread.

VladLazar commented 5 months ago

@Noah-Kennedy thanks for looking into this.

> what platform are you guys seeing this on?

x86_64, Linux 5.10.0-14-amd64 Debian

> did you do any recent bumps other than to tokio in your dependency tree?

Here's the list of recent changes which predate the panics from our lockfile:

rustls 0.21 -> 0.22
syn 2.0.32 -> 2.0.52
tokio-rustls 0.24 -> 0.25
bytes 1.4 -> 1.5
mio 0.8.10 -> 0.8.11
tokio 1.34 -> 1.36 (Mar 4)

Let us know if there's anything else we can help with.

Noah-Kennedy commented 5 months ago

> @Noah-Kennedy thanks for looking into this.
>
> > what platform are you guys seeing this on?
>
> x86_64, Linux 5.10.0-14-amd64 Debian
>
> > did you do any recent bumps other than to tokio in your dependency tree?
>
> Here's the list of recent changes which predate the panics from our lockfile:
>
> rustls 0.21 -> 0.22
> syn 2.0.32 -> 2.0.52
> tokio-rustls 0.24 -> 0.25
> bytes 1.4 -> 1.5
> mio 0.8.10 -> 0.8.11
> tokio 1.34 -> 1.36 (Mar 4)
>
> Let us know if there's anything else we can help with.

tysm!

Next steps here are to go through any other changes in your codebase that stand out to confirm we aren't conflating things, and to go through the changes in this list.

I'm about to grab lunch, but will go through these crates after.

Noah-Kennedy commented 5 months ago

@VladLazar looking at https://github.com/neondatabase/neon/pull/7048 I suspect there may be more to go through.

Noah-Kennedy commented 5 months ago

The listed deps are all pretty old for the most part; I feel like we would have heard about issues with them by now.

Noah-Kennedy commented 5 months ago

@VladLazar you or someone else with Neon will need to dig in more, possibly using tools like valgrind. I am not seeing anything noteworthy.

hallmaxw commented 5 months ago

For #6416, we were on Rust version 1.76. I'm still working out any recent changes in dependencies. We deploy frequently and this issue is rare, so who knows how far back a change was deployed that could have resulted in this.

VladLazar commented 5 months ago

> @VladLazar looking at https://github.com/neondatabase/neon/pull/7048 I suspect there may be more to go through.

@Noah-Kennedy Yeah, I did that initial check manually. I've since scripted it and gone through all neondatabase/neon dependency updates since the 1st of Feb 2024. The results are attached to this comment.

common_deps.json: list of common dependencies between neondatabase/neon @ first_panic_commit_hash and tokio-rs/tokio @ 1.36.0
recent_updates_of_common_deps.json: updates of common dependencies in neondatabase/neon since the 1st of Feb 2024 up to first_panic_commit_hash
recent_updates.json: updates of all dependencies in neondatabase/neon since the 1st of Feb 2024 up to first_panic_commit_hash

The Rust compiler had also been updated fairly recently, on the 8th of Feb: 1.75.0 -> 1.76.0.

Noah-Kennedy commented 5 months ago

@hallmaxw have you seen this again since then, and how frequently has this occurred?

hallmaxw commented 5 months ago

It only happened once. We haven't seen it since then.

huyufeifei commented 5 months ago

Same situation here. It panicked three days ago and has been running without problems since.

VladLazar commented 4 months ago

@Noah-Kennedy don't mean to impose, but have you looked into this further by any chance?

We hit the same class of life-cycle assertion again yesterday on tokio 1.37.0 (this one, more specifically). I can provide the backtrace, but it's not very interesting.

dd-dreams commented 4 months ago

What is the message ("current: {}, sub: {}")?

robert-sjoblom commented 4 months ago

I don't know if this is relevant, but I'm getting an illegal opcode that might be of interest?

Context: I'm spawning ~400 connections to various database clusters, like so:

pub(crate) async fn scan(
    tx: Sender<Message<Executioner>>,
) -> Result<(), crate::errors::Error> {
    let hosts = constants::Cluster::get_hosts();

    let mut join_handles = JoinSet::new();

    hosts
        .into_iter()
        .flat_map(|v: Cluster<'_>| {
            let tx = tx.clone();
            v.databases.iter().map(move |db| (v, db, tx.clone()))
        })
        .for_each(|(cluster, db, tx)| {
            join_handles.spawn(F3Connection::new(cluster, db, tx));
        });

    let mut connection_results = Vec::new();
    while let Some(res) = join_handles.join_next().await {
        let conn_res = match res {
            Ok(conn_res) => conn_res,
            Err(e) => return Err(e.into()),
        };

        connection_results.push(conn_res);
    }

    let connections = connection_results
        .into_iter()
        .filter(|v| {
            // Ignore the following error, but keep everything else
            if let Err(e) = v {
                let msg = format!("{e}");
                !(msg.contains("\"schemas_cust\" does not exist")
                    || msg.contains("\"schemas_tech\" does not exist"))
            } else {
                true
            }
        })
        .collect::<Result<Vec<_>, sqlx::Error>>()?;

    let mut join_handles = JoinSet::new();

    for conn in connections {
        join_handles.spawn(async move { conn.scan_f3_tenants().await });
    }

    while let Some(res) = join_handles.join_next().await {
        match res {
            Ok(scan_res) => match scan_res {
                Ok(_) => continue,
                Err(e) => return Err(e.into()),
            },
            Err(e) => return Err(e.into()),
        };
    }

    Ok(())
}

What's interesting is that when I run the binary in debug mode, it works, but if I run it through valgrind (or on k8s), it doesn't:

valgrind -v --leak-check=full ./target/debug/f3-exterminator  > out.txt
==1744221== Memcheck, a memory error detector
==1744221== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==1744221== Using Valgrind-3.18.1-42b08ed5bd-20211015 and LibVEX; rerun with -h for copyright info
==1744221== Command: ./target/debug/f3-exterminator
==1744221== 
--1744221-- Valgrind options:
--1744221--    -v
--1744221--    --leak-check=full
--1744221-- Contents of /proc/version:
--1744221--   Linux version 6.5.0-26-generic (buildd@lcy02-amd64-051) (x86_64-linux-gnu-gcc-12 (Ubuntu 12.3.0-1ubuntu1~22.04) 12.3.0, GNU ld (GNU Binutils for Ubuntu) 2.38) #26~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Tue Mar 12 10:22:43 UTC 2
--1744221-- 
--1744221-- Arch and hwcaps: AMD64, LittleEndian, amd64-cx16-lzcnt-rdtscp-sse3-ssse3-avx-avx2-bmi-f16c-rdrand-rdseed
--1744221-- Page sizes: currently 4096, max supported 4096
--1744221-- Valgrind library directory: /usr/libexec/valgrind
--1744221-- Reading syms from /home/robert.sjoblom@fnox.it/work/f3-exterminator/target/debug/f3-exterminator
--1744221-- Reading syms from /usr/lib/x86_64-linux-gnu/ld-linux-x86-64.so.2
--1744221--   Considering /usr/lib/debug/.build-id/24/6ac0d8deba5a40c63e9a1a87d4d779d8eb589f.debug ..
--1744221--   .. build-id is valid
--1744221-- Reading syms from /usr/libexec/valgrind/memcheck-amd64-linux
--1744221--    object doesn't have a symbol table
--1744221--    object doesn't have a dynamic symbol table
--1744221-- Scheduler: using generic scheduler lock implementation.
--1744221-- Reading suppressions file: /usr/libexec/valgrind/default.supp
==1744221== embedded gdbserver: reading from /tmp/vgdb-pipe-from-vgdb-to-1744221-by-robert.sjoblom@fnox.it-on-???
==1744221== embedded gdbserver: writing to   /tmp/vgdb-pipe-to-vgdb-from-1744221-by-robert.sjoblom@fnox.it-on-???
==1744221== embedded gdbserver: shared mem   /tmp/vgdb-pipe-shared-mem-vgdb-1744221-by-robert.sjoblom@fnox.it-on-???
==1744221== 
==1744221== TO CONTROL THIS PROCESS USING vgdb (which you probably
==1744221== don't want to do, unless you know exactly what you're doing,
==1744221== or are doing some strange experiment):
==1744221==   /usr/bin/vgdb --pid=1744221 ...command...
==1744221== 
==1744221== TO DEBUG THIS PROCESS USING GDB: start GDB like this
==1744221==   /path/to/gdb ./target/debug/f3-exterminator
==1744221== and then give GDB the following command
==1744221==   target remote | /usr/bin/vgdb --pid=1744221
==1744221== --pid is optional if only one valgrind process is running
==1744221== 
--1744221-- REDIR: 0x402aa40 (ld-linux-x86-64.so.2:strlen) redirected to 0x580bcec2 (???)
--1744221-- REDIR: 0x402a810 (ld-linux-x86-64.so.2:index) redirected to 0x580bcedc (???)
--1744221-- Reading syms from /usr/libexec/valgrind/vgpreload_core-amd64-linux.so
--1744221--    object doesn't have a symbol table
--1744221-- Reading syms from /usr/libexec/valgrind/vgpreload_memcheck-amd64-linux.so
--1744221--    object doesn't have a symbol table
==1744221== WARNING: new redirection conflicts with existing -- ignoring it
--1744221--     old: 0x0402aa40 (strlen              ) R-> (0000.0) 0x580bcec2 ???
--1744221--     new: 0x0402aa40 (strlen              ) R-> (2007.0) 0x0484ee30 strlen
--1744221-- REDIR: 0x4027220 (ld-linux-x86-64.so.2:strcmp) redirected to 0x484fcd0 (strcmp)
--1744221-- REDIR: 0x402afa0 (ld-linux-x86-64.so.2:mempcpy) redirected to 0x4853840 (mempcpy)
--1744221-- Reading syms from /usr/lib/x86_64-linux-gnu/libssl.so.3
--1744221--    object doesn't have a symbol table
--1744221-- Reading syms from /usr/lib/x86_64-linux-gnu/libcrypto.so.3
--1744221--    object doesn't have a symbol table
--1744221-- Reading syms from /usr/lib/x86_64-linux-gnu/libgcc_s.so.1
--1744221--    object doesn't have a symbol table
--1744221-- Reading syms from /usr/lib/x86_64-linux-gnu/libm.so.6
--1744221--   Considering /usr/lib/debug/.build-id/b5/5ba6d8b5954b479fab2a69a54b9f56451fbee0.debug ..
--1744221--   .. build-id is valid
--1744221-- Reading syms from /usr/lib/x86_64-linux-gnu/libc.so.6
--1744221--   Considering /usr/lib/debug/.build-id/96/2015aa9d133c6cbcfb31ec300596d7f44d3348.debug ..
--1744221--   .. build-id is valid
==1744221== WARNING: new redirection conflicts with existing -- ignoring it
--1744221--     old: 0x04f0ac60 (memalign            ) R-> (1011.0) 0x0484e080 memalign
--1744221--     new: 0x04f0ac60 (memalign            ) R-> (1017.0) 0x0484e050 aligned_alloc
==1744221== WARNING: new redirection conflicts with existing -- ignoring it
--1744221--     old: 0x04f0ac60 (memalign            ) R-> (1011.0) 0x0484e080 memalign
--1744221--     new: 0x04f0ac60 (memalign            ) R-> (1017.0) 0x0484e020 aligned_alloc
==1744221== WARNING: new redirection conflicts with existing -- ignoring it
--1744221--     old: 0x04f0ac60 (memalign            ) R-> (1011.0) 0x0484e080 memalign
--1744221--     new: 0x04f0ac60 (memalign            ) R-> (1017.0) 0x0484e050 aligned_alloc
==1744221== WARNING: new redirection conflicts with existing -- ignoring it
--1744221--     old: 0x04f0ac60 (memalign            ) R-> (1011.0) 0x0484e080 memalign
--1744221--     new: 0x04f0ac60 (memalign            ) R-> (1017.0) 0x0484e020 aligned_alloc
--1744221-- REDIR: 0x4f0d720 (libc.so.6:strnlen) redirected to 0x483f220 (_vgnU_ifunc_wrapper)
--1744221-- REDIR: 0x4f0d9e0 (libc.so.6:strpbrk) redirected to 0x483f220 (_vgnU_ifunc_wrapper)
--1744221-- REDIR: 0x4f0d2e0 (libc.so.6:strcmp) redirected to 0x483f220 (_vgnU_ifunc_wrapper)
--1744221-- REDIR: 0x4f2bcd0 (libc.so.6:wcsnlen) redirected to 0x483f220 (_vgnU_ifunc_wrapper)
--1744221-- REDIR: 0x4f0e6d0 (libc.so.6:memset) redirected to 0x483f220 (_vgnU_ifunc_wrapper)
--1744221-- REDIR: 0x4f2a540 (libc.so.6:wcslen) redirected to 0x483f220 (_vgnU_ifunc_wrapper)
--1744221-- REDIR: 0x4f0ec10 (libc.so.6:memcpy@@GLIBC_2.14) redirected to 0x483f220 (_vgnU_ifunc_wrapper)
--1744221-- REDIR: 0x4f2a370 (libc.so.6:wcschr) redirected to 0x483f220 (_vgnU_ifunc_wrapper)
--1744221-- REDIR: 0x4f0d260 (libc.so.6:index) redirected to 0x483f220 (_vgnU_ifunc_wrapper)
--1744221-- REDIR: 0x4f0d960 (libc.so.6:rindex) redirected to 0x483f220 (_vgnU_ifunc_wrapper)
--1744221-- REDIR: 0x4f0e5a0 (libc.so.6:memmove) redirected to 0x483f220 (_vgnU_ifunc_wrapper)
==1744221== Preferring higher priority redirection:
--1744221--     old: 0x050057c0 (__memcpy_avx_unalign) R-> (2018.0) 0x04850f90 __memcpy_avx_unaligned_erms
--1744221--     new: 0x050057c0 (__memcpy_avx_unalign) R-> (2018.1) 0x04852880 memmove
--1744221-- REDIR: 0x4f2a3f0 (libc.so.6:wcscmp) redirected to 0x483f220 (_vgnU_ifunc_wrapper)
--1744221-- REDIR: 0x4f0e9e0 (libc.so.6:stpncpy) redirected to 0x483f220 (_vgnU_ifunc_wrapper)
--1744221-- REDIR: 0x4f2a9b0 (libc.so.6:wmemchr) redirected to 0x483f220 (_vgnU_ifunc_wrapper)
--1744221-- REDIR: 0x4f0d830 (libc.so.6:strncmp) redirected to 0x483f220 (_vgnU_ifunc_wrapper)
--1744221-- REDIR: 0x4f0ea70 (libc.so.6:strcasecmp) redirected to 0x483f220 (_vgnU_ifunc_wrapper)
--1744221-- REDIR: 0x4f0d420 (libc.so.6:strcspn) redirected to 0x483f220 (_vgnU_ifunc_wrapper)
--1744221-- REDIR: 0x4f2a470 (libc.so.6:wcscpy) redirected to 0x483f220 (_vgnU_ifunc_wrapper)
--1744221-- REDIR: 0x4f0d1d0 (libc.so.6:strcat) redirected to 0x483f220 (_vgnU_ifunc_wrapper)
--1744221-- REDIR: 0x4f0eb60 (libc.so.6:strncasecmp_l) redirected to 0x483f220 (_vgnU_ifunc_wrapper)
--1744221-- REDIR: 0x4f0e480 (libc.so.6:bcmp) redirected to 0x483f220 (_vgnU_ifunc_wrapper)
--1744221-- REDIR: 0x4f16570 (libc.so.6:memrchr) redirected to 0x483f220 (_vgnU_ifunc_wrapper)
--1744221-- REDIR: 0x4f10010 (libc.so.6:strchrnul) redirected to 0x483f220 (_vgnU_ifunc_wrapper)
--1744221-- REDIR: 0x4f0d390 (libc.so.6:strcpy) redirected to 0x483f220 (_vgnU_ifunc_wrapper)
--1744221-- REDIR: 0x4f0eb10 (libc.so.6:strcasecmp_l) redirected to 0x483f220 (_vgnU_ifunc_wrapper)
--1744221-- REDIR: 0x4f0d6a0 (libc.so.6:strlen) redirected to 0x483f220 (_vgnU_ifunc_wrapper)
--1744221-- REDIR: 0x4f0d8d0 (libc.so.6:strncpy) redirected to 0x483f220 (_vgnU_ifunc_wrapper)
--1744221-- REDIR: 0x4f0e950 (libc.so.6:stpcpy) redirected to 0x483f220 (_vgnU_ifunc_wrapper)
--1744221-- REDIR: 0x4f0e3f0 (libc.so.6:memchr) redirected to 0x483f220 (_vgnU_ifunc_wrapper)
--1744221-- REDIR: 0x4f0db00 (libc.so.6:strspn) redirected to 0x483f220 (_vgnU_ifunc_wrapper)
--1744221-- REDIR: 0x4f0e7d0 (libc.so.6:mempcpy) redirected to 0x483f220 (_vgnU_ifunc_wrapper)
--1744221-- REDIR: 0x4f0eac0 (libc.so.6:strncasecmp) redirected to 0x483f220 (_vgnU_ifunc_wrapper)
--1744221-- REDIR: 0x4f0ff80 (libc.so.6:rawmemchr) redirected to 0x483f220 (_vgnU_ifunc_wrapper)
--1744221-- REDIR: 0x4f99280 (libc.so.6:__memcpy_chk) redirected to 0x483f220 (_vgnU_ifunc_wrapper)
--1744221-- REDIR: 0x4f0e310 (libc.so.6:strstr) redirected to 0x483f220 (_vgnU_ifunc_wrapper)
--1744221-- REDIR: 0x5002610 (libc.so.6:__strrchr_avx2) redirected to 0x484e810 (rindex)
--1744221-- REDIR: 0x50027e0 (libc.so.6:__strlen_avx2) redirected to 0x484ed10 (strlen)
--1744221-- REDIR: 0x4f0a0a0 (libc.so.6:malloc) redirected to 0x4848820 (malloc)
--1744221-- REDIR: 0x4f0dfc0 (libc.so.6:__GI_strstr) redirected to 0x4853ae0 (__strstr_sse2)
--1744221-- REDIR: 0x4ffe2c0 (libc.so.6:__memchr_avx2) redirected to 0x484fd50 (memchr)
--1744221-- REDIR: 0x4f0a740 (libc.so.6:realloc) redirected to 0x484dc50 (realloc)
--1744221-- REDIR: 0x50057c0 (libc.so.6:__memcpy_avx_unaligned_erms) redirected to 0x4852880 (memmove)
--1744221-- REDIR: 0x4ffe590 (libc.so.6:__rawmemchr_avx2) redirected to 0x4853360 (rawmemchr)
--1744221-- REDIR: 0x4f0a3e0 (libc.so.6:free) redirected to 0x484b210 (free)
--1744221-- REDIR: 0x5005f80 (libc.so.6:__memset_avx2_unaligned_erms) redirected to 0x4852770 (memset)
--1744221-- REDIR: 0x4f0b520 (libc.so.6:calloc) redirected to 0x484d9d0 (calloc)
--1744221-- REDIR: 0x4ffea80 (libc.so.6:__memcmp_avx2_movbe) redirected to 0x4852010 (bcmp)
--1744221-- REDIR: 0x4f0c0e0 (libc.so.6:posix_memalign) redirected to 0x484df60 (posix_memalign)
--1744221-- REDIR: 0x4ffd940 (libc.so.6:__strcmp_avx2) redirected to 0x484fbd0 (strcmp)
--1744221-- REDIR: 0x4ffdd80 (libc.so.6:__strncmp_avx2) redirected to 0x484f3e0 (strncmp)
--1744221-- REDIR: 0x50007b0 (libc.so.6:__strncasecmp_avx) redirected to 0x484f640 (strncasecmp)
--1744221-- REDIR: 0x5002180 (libc.so.6:__strchr_avx2) redirected to 0x484e9f0 (index)
--1744221-- REDIR: 0x5002400 (libc.so.6:__strchrnul_avx2) redirected to 0x4853330 (strchrnul)
--1744221-- Reading syms from /usr/lib/x86_64-linux-gnu/libnss_systemd.so.2
--1744221--    object doesn't have a symbol table
--1744221-- REDIR: 0x4f0bb10 (libc.so.6:malloc_usable_size) redirected to 0x484e1b0 (malloc_usable_size)
--1744221-- REDIR: 0x5002960 (libc.so.6:__strnlen_avx2) redirected to 0x484ecb0 (strnlen)
--1744221-- REDIR: 0x5003cb0 (libc.so.6:__strcpy_avx2) redirected to 0x484ee60 (strcpy)
--1744221-- REDIR: 0x50049a0 (libc.so.6:__stpcpy_avx2) redirected to 0x4852130 (stpcpy)
--1744221-- REDIR: 0x4ffd800 (libc.so.6:__strspn_sse42) redirected to 0x4853d50 (strspn)
--1744221-- Reading syms from /usr/lib/x86_64-linux-gnu/libnss_sss.so.2
--1744221--    object doesn't have a symbol table
--1744221-- REDIR: 0x4ffd6e0 (libc.so.6:__strpbrk_sse42) redirected to 0x4853c00 (strpbrk)
--1744221-- REDIR: 0x4fff130 (libc.so.6:__strcasecmp_avx) redirected to 0x484f560 (strcasecmp)
--1744221-- Reading syms from /usr/lib/x86_64-linux-gnu/libnss_mdns4_minimal.so.2
--1744221--    object doesn't have a symbol table
--1744221-- REDIR: 0x4eaa150 (libc.so.6:setenv) redirected to 0x48542d0 (setenv)
--1744221-- REDIR: 0x5005780 (libc.so.6:__mempcpy_avx_unaligned_erms) redirected to 0x4853440 (mempcpy)
--1744221-- REDIR: 0x5005790 (libc.so.6:__memcpy_chk_avx_unaligned_erms) redirected to 0x4853940 (__memcpy_chk)
--1744221-- REDIR: 0x5004040 (libc.so.6:__strncpy_avx2) redirected to 0x484efe0 (strncpy)
vex amd64->IR: unhandled instruction bytes: 0xF 0x38 0xCB 0xCA 0xC5 0xF8 0x28 0xC1 0xC5 0xF9
vex amd64->IR:   REX=0 REX.W=0 REX.R=0 REX.X=0 REX.B=0
vex amd64->IR:   VEX=0 VEX.L=0 VEX.nVVVV=0x0 ESC=0F38
vex amd64->IR:   PFX.66=0 PFX.F2=0 PFX.F3=0
==1744221== valgrind: Unrecognised instruction at address 0x639471.
==1744221==    at 0x639471: core::core_arch::x86::sha::_mm_sha256rnds2_epu32 (sha.rs:118)
==1744221==    by 0x63A4FE: sha2::sha256::x86::digest_blocks (x86.rs:69)
==1744221==    by 0x63C6E4: sha2::sha256::x86::compress (x86.rs:107)
==1744221==    by 0x63D2CB: sha2::sha256::compress256 (sha256.rs:39)
==1744221==    by 0x52B0DD: <sha2::core_api::Sha256VarCore as digest::core_api::UpdateCore>::update_blocks (core_api.rs:37)
==1744221==    by 0x58FB16: <digest::core_api::ct_variable::CtVariableCoreWrapper<T,OutSize,O> as digest::core_api::UpdateCore>::update_blocks (ct_variable.rs:79)
==1744221==    by 0x52BE29: <digest::core_api::wrapper::CoreWrapper<T> as crypto_common::KeyInit>::new_from_slice (optim.rs:158)
==1744221==    by 0x53AAD5: sqlx_postgres::connection::sasl::hi (mac.rs:111)
==1744221==    by 0x538727: sqlx_postgres::connection::sasl::authenticate::{{closure}} (sasl.rs:91)
==1744221==    by 0x52F515: sqlx_postgres::connection::establish::<impl sqlx_postgres::connection::PgConnection>::establish::{{closure}} (establish.rs:100)
==1744221==    by 0x50EB03: <core::pin::Pin<P> as core::future::future::Future>::poll (future.rs:123)
==1744221==    by 0x5B3358: <tokio::time::timeout::Timeout<T> as core::future::future::Future>::poll (timeout.rs:202)
==1744221== Your program just tried to execute an instruction that Valgrind
==1744221== did not recognise.  There are two possible reasons for this.
==1744221== 1. Your program has a bug and erroneously jumped to a non-code
==1744221==    location.  If you are running Memcheck and you just saw a
==1744221==    warning about a bad jump, it's probably your program's fault.
==1744221== 2. The instruction is legitimate but Valgrind doesn't handle it,
==1744221==    i.e. it's Valgrind's fault.  If you think this is the case or
==1744221==    you are not sure, please let us know and we'll try to fix it.
==1744221== Either way, Valgrind will now raise a SIGILL signal which will
==1744221== probably kill your program.
==1744221== 
==1744221== Process terminating with default action of signal 4 (SIGILL): dumping core
==1744221==  Illegal opcode at address 0x639471
==1744221==    at 0x639471: core::core_arch::x86::sha::_mm_sha256rnds2_epu32 (sha.rs:118)
==1744221==    by 0x63A4FE: sha2::sha256::x86::digest_blocks (x86.rs:69)
==1744221==    by 0x63C6E4: sha2::sha256::x86::compress (x86.rs:107)
==1744221==    by 0x63D2CB: sha2::sha256::compress256 (sha256.rs:39)
==1744221==    by 0x52B0DD: <sha2::core_api::Sha256VarCore as digest::core_api::UpdateCore>::update_blocks (core_api.rs:37)
==1744221==    by 0x58FB16: <digest::core_api::ct_variable::CtVariableCoreWrapper<T,OutSize,O> as digest::core_api::UpdateCore>::update_blocks (ct_variable.rs:79)
==1744221==    by 0x52BE29: <digest::core_api::wrapper::CoreWrapper<T> as crypto_common::KeyInit>::new_from_slice (optim.rs:158)
==1744221==    by 0x53AAD5: sqlx_postgres::connection::sasl::hi (mac.rs:111)
==1744221==    by 0x538727: sqlx_postgres::connection::sasl::authenticate::{{closure}} (sasl.rs:91)
==1744221==    by 0x52F515: sqlx_postgres::connection::establish::<impl sqlx_postgres::connection::PgConnection>::establish::{{closure}} (establish.rs:100)
==1744221==    by 0x50EB03: <core::pin::Pin<P> as core::future::future::Future>::poll (future.rs:123)
==1744221==    by 0x5B3358: <tokio::time::timeout::Timeout<T> as core::future::future::Future>::poll (timeout.rs:202)
==1744221== 
==1744221== HEAP SUMMARY:
==1744221==     in use at exit: 1,858,055 bytes in 17,954 blocks
==1744221==   total heap usage: 153,156 allocs, 135,202 frees, 8,212,941 bytes allocated
==1744221== 
==1744221== Searching for pointers to 17,954 not-freed blocks
==1744221== Checked 37,630,096 bytes
==1744221== 
==1744221== 8 bytes in 1 blocks are possibly lost in loss record 24 of 1,330
==1744221==    at 0x4848899: malloc (in /usr/libexec/valgrind/vgpreload_memcheck-amd64-linux.so)
==1744221==    by 0x7BCD2A: alloc::alloc::alloc (alloc.rs:100)
==1744221==    by 0x7BCE8C: alloc::alloc::Global::alloc_impl (alloc.rs:183)
==1744221==    by 0x7BDE28: <alloc::alloc::Global as core::alloc::Allocator>::allocate (alloc.rs:243)
==1744221==    by 0x7BA8FB: alloc::raw_vec::finish_grow (raw_vec.rs:562)
==1744221==    by 0x7BAE39: alloc::raw_vec::RawVec<T,A>::grow_amortized (raw_vec.rs:474)
==1744221==    by 0x7BBE58: alloc::raw_vec::RawVec<T,A>::reserve::do_reserve_and_handle (raw_vec.rs:338)
==1744221==    by 0x7C0A98: reserve<u8, alloc::alloc::Global> (raw_vec.rs:344)
==1744221==    by 0x7C0A98: alloc::vec::Vec<T,A>::reserve (mod.rs:945)
==1744221==    by 0x7C05E7: alloc::vec::Vec<T,A>::append_elements (mod.rs:2116)
==1744221==    by 0x7C0008: <alloc::vec::Vec<T,A> as alloc::vec::spec_extend::SpecExtend<&T,core::slice::iter::Iter<T>>>::spec_extend (spec_extend.rs:55)
==1744221==    by 0x7C0815: alloc::vec::Vec<T,A>::extend_from_slice (mod.rs:2562)
==1744221==    by 0x773D07: <alloc::string::String as core::fmt::Write>::write_str (string.rs:1067)
==1744221== 
==1744221== 32 bytes in 1 blocks are possibly lost in loss record 733 of 1,330
==1744221==    at 0x4848899: malloc (in /usr/libexec/valgrind/vgpreload_memcheck-amd64-linux.so)
==1744221==    by 0x4C7D3A: alloc::alloc::alloc (alloc.rs:100)
==1744221==    by 0x4C7E9C: alloc::alloc::Global::alloc_impl (alloc.rs:183)
==1744221==    by 0x4C7C6B: alloc::alloc::exchange_malloc (alloc.rs:243)
==1744221==    by 0x49F0CD: new<tracing_subscriber::fmt::fmt_layer::FormattedFields<tracing_subscriber::fmt::format::json::JsonFields>> (boxed.rs:218)
==1744221==    by 0x49F0CD: tracing_subscriber::registry::extensions::ExtensionsInner::insert (extensions.rs:138)
==1744221==    by 0x49EF82: tracing_subscriber::registry::extensions::ExtensionsMut::replace (extensions.rs:95)
==1744221==    by 0x49EE5C: tracing_subscriber::registry::extensions::ExtensionsMut::insert (extensions.rs:88)
==1744221==    by 0x45ECFF: <tracing_subscriber::fmt::fmt_layer::Layer<S,N,E,W> as tracing_subscriber::layer::Layer<S>>::on_new_span (fmt_layer.rs:823)
==1744221==    by 0x46176C: <tracing_subscriber::filter::layer_filters::Filtered<L,F,S> as tracing_subscriber::layer::Layer<S>>::on_new_span::{{closure}} (mod.rs:782)
==1744221==    by 0x462F2B: tracing_subscriber::filter::layer_filters::FilterState::did_enable (mod.rs:1167)
==1744221==    by 0x4634C3: tracing_subscriber::filter::layer_filters::Filtered<L,F,S>::did_enable::{{closure}} (mod.rs:627)
==1744221==    by 0x494844: std::thread::local::LocalKey<T>::try_with (local.rs:284)
==1744221== 
==1744221== 148 bytes in 1 blocks are possibly lost in loss record 962 of 1,330
==1744221==    at 0x4848899: malloc (in /usr/libexec/valgrind/vgpreload_memcheck-amd64-linux.so)
==1744221==    by 0x8B2A8A: alloc::alloc::alloc (alloc.rs:100)
==1744221==    by 0x8B2BEC: alloc::alloc::Global::alloc_impl (alloc.rs:183)
==1744221==    by 0x8B34D8: <alloc::alloc::Global as core::alloc::Allocator>::allocate (alloc.rs:243)
==1744221==    by 0x906FD7: hashbrown::raw::alloc::inner::do_alloc (alloc.rs:15)
==1744221==    by 0x8ABE52: hashbrown::raw::RawTableInner::new_uninitialized (mod.rs:1752)
==1744221==    by 0x8AC26D: hashbrown::raw::RawTableInner::fallible_with_capacity (mod.rs:1790)
==1744221==    by 0x8AAC83: hashbrown::raw::RawTableInner::prepare_resize (mod.rs:2869)
==1744221==    by 0x735CC4: resize_inner<alloc::alloc::Global> (mod.rs:3065)
==1744221==    by 0x735CC4: reserve_rehash_inner<alloc::alloc::Global> (mod.rs:2955)
==1744221==    by 0x735CC4: hashbrown::raw::RawTable<T,A>::reserve_rehash (mod.rs:1233)
==1744221==    by 0x73C124: hashbrown::raw::RawTable<T,A>::reserve (mod.rs:1181)
==1744221==    by 0x739839: hashbrown::raw::RawTable<T,A>::find_or_find_insert_slot (mod.rs:1415)
==1744221==    by 0x75C57C: hashbrown::map::HashMap<K,V,S,A>::insert (map.rs:1754)
==1744221== 
==1744221== 216 bytes in 1 blocks are possibly lost in loss record 1,017 of 1,330
==1744221==    at 0x4848899: malloc (in /usr/libexec/valgrind/vgpreload_memcheck-amd64-linux.so)
==1744221==    by 0x8B2A8A: alloc::alloc::alloc (alloc.rs:100)
==1744221==    by 0x8B2BEC: alloc::alloc::Global::alloc_impl (alloc.rs:183)
==1744221==    by 0x8B34D8: <alloc::alloc::Global as core::alloc::Allocator>::allocate (alloc.rs:243)
==1744221==    by 0x906FD7: hashbrown::raw::alloc::inner::do_alloc (alloc.rs:15)
==1744221==    by 0x8ABE52: hashbrown::raw::RawTableInner::new_uninitialized (mod.rs:1752)
==1744221==    by 0x8AC26D: hashbrown::raw::RawTableInner::fallible_with_capacity (mod.rs:1790)
==1744221==    by 0x8AAC83: hashbrown::raw::RawTableInner::prepare_resize (mod.rs:2869)
==1744221==    by 0x8AD8FE: resize_inner<alloc::alloc::Global> (mod.rs:3065)
==1744221==    by 0x8AD8FE: reserve_rehash_inner<alloc::alloc::Global> (mod.rs:2955)
==1744221==    by 0x8AD8FE: hashbrown::raw::RawTable<T,A>::reserve_rehash (mod.rs:1233)
==1744221==    by 0x8AEAE4: hashbrown::raw::RawTable<T,A>::reserve (mod.rs:1181)
==1744221==    by 0x8AE289: hashbrown::raw::RawTable<T,A>::find_or_find_insert_slot (mod.rs:1415)
==1744221==    by 0x8B8187: hashbrown::map::HashMap<K,V,S,A>::insert (map.rs:1754)
==1744221== 
==1744221== 304 bytes in 1 blocks are possibly lost in loss record 1,050 of 1,330
==1744221==    at 0x484DA83: calloc (in /usr/libexec/valgrind/vgpreload_memcheck-amd64-linux.so)
==1744221==    by 0x40147D9: calloc (rtld-malloc.h:44)
==1744221==    by 0x40147D9: allocate_dtv (dl-tls.c:375)
==1744221==    by 0x40147D9: _dl_allocate_tls (dl-tls.c:634)
==1744221==    by 0x4EFA7B4: allocate_stack (allocatestack.c:430)
==1744221==    by 0x4EFA7B4: pthread_create@@GLIBC_2.34 (pthread_create.c:647)
==1744221==    by 0xAC93C1: std::sys::pal::unix::thread::Thread::new (thread.rs:87)
==1744221==    by 0xA10B52: std::thread::Builder::spawn_unchecked_ (mod.rs:561)
==1744221==    by 0xA1001E: std::thread::Builder::spawn_unchecked (mod.rs:455)
==1744221==    by 0xA111DD: std::thread::Builder::spawn (mod.rs:387)
==1744221==    by 0xA002CB: tokio::runtime::blocking::pool::Spawner::spawn_thread (pool.rs:468)
==1744221==    by 0x9FFB98: tokio::runtime::blocking::pool::Spawner::spawn_task (pool.rs:418)
==1744221==    by 0x9FEE92: tokio::runtime::blocking::pool::Spawner::spawn_blocking_inner (pool.rs:386)
==1744221==    by 0x9FE565: tokio::runtime::blocking::pool::Spawner::spawn_blocking (pool.rs:301)
==1744221==    by 0xA298BE: tokio::runtime::handle::Handle::spawn_blocking (handle.rs:219)
==1744221== 
==1744221== 1,072 bytes in 1 blocks are possibly lost in loss record 1,182 of 1,330
==1744221==    at 0x4848899: malloc (in /usr/libexec/valgrind/vgpreload_memcheck-amd64-linux.so)
==1744221==    by 0xA6FE1A: alloc::alloc::alloc (alloc.rs:100)
==1744221==    by 0xA6FF7C: alloc::alloc::Global::alloc_impl (alloc.rs:183)
==1744221==    by 0xA70B28: <alloc::alloc::Global as core::alloc::Allocator>::allocate (alloc.rs:243)
==1744221==    by 0xA77467: hashbrown::raw::alloc::inner::do_alloc (alloc.rs:15)
==1744221==    by 0xA6A8F2: hashbrown::raw::RawTableInner::new_uninitialized (mod.rs:1752)
==1744221==    by 0xA6AC4D: hashbrown::raw::RawTableInner::fallible_with_capacity (mod.rs:1790)
==1744221==    by 0xA6A1B3: hashbrown::raw::RawTableInner::prepare_resize (mod.rs:2869)
==1744221==    by 0x9F5204: resize_inner<alloc::alloc::Global> (mod.rs:3065)
==1744221==    by 0x9F5204: reserve_rehash_inner<alloc::alloc::Global> (mod.rs:2955)
==1744221==    by 0x9F5204: hashbrown::raw::RawTable<T,A>::reserve_rehash (mod.rs:1233)
==1744221==    by 0x9F7944: hashbrown::raw::RawTable<T,A>::reserve (mod.rs:1181)
==1744221==    by 0x9F69D9: hashbrown::raw::RawTable<T,A>::find_or_find_insert_slot (mod.rs:1415)
==1744221==    by 0xA13A8E: hashbrown::map::HashMap<K,V,S,A>::insert (map.rs:1754)
==1744221== 
==1744221== 4,608 bytes in 2 blocks are possibly lost in loss record 1,273 of 1,330
==1744221==    at 0x4848899: malloc (in /usr/libexec/valgrind/vgpreload_memcheck-amd64-linux.so)
==1744221==    by 0x4005D97: malloc (rtld-malloc.h:56)
==1744221==    by 0x4005D97: _dlfo_mappings_segment_allocate (dl-find_object.c:217)
==1744221==    by 0x4005D97: _dl_find_object_update_1 (dl-find_object.c:671)
==1744221==    by 0x4005D97: _dl_find_object_update (dl-find_object.c:804)
==1744221==    by 0x400ECCF: dl_open_worker_begin (dl-open.c:735)
==1744221==    by 0x4FD9A97: _dl_catch_exception (dl-error-skeleton.c:208)
==1744221==    by 0x400DF99: dl_open_worker (dl-open.c:782)
==1744221==    by 0x4FD9A97: _dl_catch_exception (dl-error-skeleton.c:208)
==1744221==    by 0x400E34D: _dl_open (dl-open.c:883)
==1744221==    by 0x4FD9E00: do_dlopen (dl-libc.c:95)
==1744221==    by 0x4FD9A97: _dl_catch_exception (dl-error-skeleton.c:208)
==1744221==    by 0x4FD9B62: _dl_catch_error (dl-error-skeleton.c:227)
==1744221==    by 0x4FD9F36: dlerror_run (dl-libc.c:45)
==1744221==    by 0x4FD9F36: __libc_dlopen_mode (dl-libc.c:162)
==1744221==    by 0x4FBA0C9: module_load (nss_module.c:191)
==1744221== 
==1744221== 4,608 bytes in 16 blocks are possibly lost in loss record 1,274 of 1,330
==1744221==    at 0x484DA83: calloc (in /usr/libexec/valgrind/vgpreload_memcheck-amd64-linux.so)
==1744221==    by 0x40147D9: calloc (rtld-malloc.h:44)
==1744221==    by 0x40147D9: allocate_dtv (dl-tls.c:375)
==1744221==    by 0x40147D9: _dl_allocate_tls (dl-tls.c:634)
==1744221==    by 0x4EFA7B4: allocate_stack (allocatestack.c:430)
==1744221==    by 0x4EFA7B4: pthread_create@@GLIBC_2.34 (pthread_create.c:647)
==1744221==    by 0xAC93C1: std::sys::pal::unix::thread::Thread::new (thread.rs:87)
==1744221==    by 0xA10B52: std::thread::Builder::spawn_unchecked_ (mod.rs:561)
==1744221==    by 0xA1001E: std::thread::Builder::spawn_unchecked (mod.rs:455)
==1744221==    by 0xA111DD: std::thread::Builder::spawn (mod.rs:387)
==1744221==    by 0xA002CB: tokio::runtime::blocking::pool::Spawner::spawn_thread (pool.rs:468)
==1744221==    by 0x9FFB98: tokio::runtime::blocking::pool::Spawner::spawn_task (pool.rs:418)
==1744221==    by 0x9FF69F: tokio::runtime::blocking::pool::Spawner::spawn_blocking_inner (pool.rs:386)
==1744221==    by 0x9FE999: tokio::runtime::blocking::pool::Spawner::spawn_blocking (pool.rs:301)
==1744221==    by 0xA2980C: tokio::runtime::handle::Handle::spawn_blocking (handle.rs:219)
==1744221== 
==1744221== LEAK SUMMARY:
==1744221==    definitely lost: 0 bytes in 0 blocks
==1744221==    indirectly lost: 0 bytes in 0 blocks
==1744221==      possibly lost: 10,996 bytes in 24 blocks
==1744221==    still reachable: 1,847,059 bytes in 17,930 blocks
==1744221==         suppressed: 0 bytes in 0 blocks
==1744221== Reachable blocks (those to which a pointer was found) are not shown.
==1744221== To see them, rerun with: --leak-check=full --show-leak-kinds=all
==1744221== 
==1744221== ERROR SUMMARY: 8 errors from 8 contexts (suppressed: 0 from 0)
Illegal instruction

It's the same error (although I've not checked the core dump from k8s). I initially suspected rustls, but I've since changed my dependencies to use native-tls, to narrow down the suspects. Here's my complete Cargo.toml:

[package]
name = "f3-exterminator"
version = "0.1.0"
edition = "2021"

# See more keys and their definitions at https://doc.rust-lang.org/cargo/reference/manifest.html

[dependencies]
hex = "0.4.3"
sqlx = { version = "0.7", features = [
  "postgres",
  "runtime-tokio",
  "tls-native-tls",
] }
thiserror = "1.0.58"
tokio = { version = "1.32.0", features = ["full"] }
tracing = { version = "0.1.40" }
tracing-subscriber = { version = "0.3.18", features = ["env-filter", "json"] }

Let me know if there's anything else I can do to help. Or, indeed, if I'm in the wrong thread/issue!

Edit: I forgot that you wanted the rustc version too (and the lockfile). The above was compiled with rustc 1.79.0-nightly (8b2459c1f 2024-04-09) and the following .cargo/config.toml:

[build]
rustflags = ["-Clink-arg=-fuse-ld=mold", "-Zthreads=16"]

The above was built in debug mode. Building it in release mode with the same settings yields a binary that doesn't panic, but eventually times out:

{"timestamp":"2024-04-26T20:33:49.547408Z","level":"INFO","message":"starting f3 scan and execute","target":"f3_exterminator"}
{"timestamp":"2024-04-26T20:33:49.594806Z","level":"DEBUG","message":"establishing ia connection...","target":"f3_exterminator::ia","span":{"name":"IA::new"},"spans":[{"name":"IA::new"}]}
{"timestamp":"2024-04-26T20:33:51.887225Z","level":"DEBUG","message":"ia connection established successfully","target":"f3_exterminator::ia","span":{"name":"IA::new"},"spans":[{"name":"IA::new"}]}
{"timestamp":"2024-04-26T20:33:54.020785Z","level":"WARN","message":"slow statement: execution time exceeded alert threshold","summary":"REDACTED","rows_affected":89042,"rows_returned":89042,"elapsed":"1.99529552s","elapsed_secs":1.99529552,"slow_threshold":"1s","target":"sqlx::query","span":{"name":"IA::fetch_rows"},"spans":[{"name":"IA::fetch_rows"}]}
{"timestamp":"2024-04-26T20:33:54.492169Z","level":"INFO","message":"","target":"f3_exterminator::ia","span":{"dumped_tenant_length":40387,"duration":"2586","name":"IA::fetch_rows"},"spans":[{"dumped_tenant_length":40387,"duration":"2586","name":"IA::fetch_rows"}]}
{"timestamp":"2024-04-26T20:33:55.922100Z","level":"DEBUG","summary":"REDACTED","rows_affected":2806,"rows_returned":2806,"elapsed":"381.549211ms","elapsed_secs":0.381549211,"target":"sqlx::query","span":{"name":"B::fetch_rows"},"spans":[{"name":"B::fetch_rows"}]}
{"timestamp":"2024-04-26T20:33:55.939275Z","level":"INFO","message":"","target":"f3_exterminator::B","span":{"duration":"484","tenant_candidates_length":2806,"name":"B::fetch_rows"},"spans":[{"duration":"484","tenant_candidates_length":2806,"name":"B::fetch_rows"}]}
{"timestamp":"2024-04-26T20:33:56.870303Z","level":"DEBUG","summary":"REDACTED","rows_affected":62283,"rows_returned":62283,"elapsed":"850.407666ms","elapsed_secs":0.850407666,"target":"sqlx::query","span":{"name":"B::fetch_all_rows"},"spans":[{"name":"B::fetch_all_rows"}]}
{"timestamp":"2024-04-26T20:34:00.292812Z","level":"INFO","message":"gathered all tenants","target":"f3_exterminator::B","span":{"duration":"4348","fetch_all_rows_count":62283,"name":"B::fetch_all_rows"},"spans":[{"duration":"4348","fetch_all_rows_count":62283,"name":"B::fetch_all_rows"}]}
{"timestamp":"2024-04-26T20:34:25.410271Z","level":"INFO","message":"created connection","target":"f3_exterminator::f3","span":{REDACTED,"duration":"24711","name":"F3Connection::new"},"spans":[{REDACTED,"duration":"24711","name":"F3Connection::new"}]}
...
{"timestamp":"2024-04-26T20:37:00.765590Z","level":"ERROR","message":"failed to scan f3 properly, exiting pool timed out while waiting for an open connection","target":"f3_exterminator"}

Building it with rustc 1.77.2 (25ef9e3d8 2024-04-09) and the same .cargo/config.toml (with the exception of -Zthreads=16, of course):

Debug: yields the same slow-query behaviour as release mode, with an eventual timeout.

Release: slightly faster, but still extremely slow. Eventual timeout.

It should be noted that the code is a refactor of a simpler approach that looked like this:

pub(crate) async fn scan_f3(tx: Sender<Message<F3Executioner>>) {
    let clusters = constants::Cluster::get_hosts();

    for cluster in clusters {
        let tx = tx.clone();
        tokio::task::spawn(async move {
            for db in cluster.databases {
                tracing::info!("queue scans for {db} in {cluster}");

                let conn = match F3Connection::new(cluster, db, tx).await {
                    Ok(conn) => conn,
                    Err(e) => {
                        let msg: String = format!("{e}");
                        tracing::debug!("tried to connect to 'schemas_cust' or 'schemas_tech'");

                        // this is not an error, they simply do not exist in every cluster
                        if msg.contains("\"schemas_cust\" does not exist")
                            || msg.contains("\"schemas_tech\" does not exist")
                        {
                            continue;
                        }

                        tracing::error!("failed to connec to {cluster}:{db} -> {e}");
                        continue;
                    }
                };

                match conn.scan_f3_tenants().await {
                    Ok(_) => tracing::info!("scans for {db} queued"),
                    Err(e) => tracing::error!("failed to scan {db} in {cluster}: {e}"),
                }
            }
        });
    }
}

Rebuilding the above in rustc 1.77.2 (release) yields a slow but working binary. It's fairly clear that valgrind affects performance, but everything else being equal, the refactored solution is 10x slower.

I don't know how much use this data is to you, but I figured I could at least leave it here.

Cargo.lock.gz

lifeRobot commented 1 month ago

Hello, I encountered this situation as well during a stress test; the error message is `prev.ref_count() >= 1`.
Rust version: 1.80.1
tokio version: 1.39.2
OS version: Windows 11, "10.0.22621.2506"

The business logic is similar to the following:

use std::sync::Arc;

use tokio::io::{AsyncReadExt, AsyncWriteExt};
use tokio::net::TcpListener;
use tokio::sync::Mutex;

#[tokio::main]
pub async fn main() {
    let listener = TcpListener::bind("0.0.0.0:8080").await.unwrap();

    loop {
        let (tcp_stream, _) = listener.accept().await.unwrap();

        tokio::spawn(async move {
            let (mut read, write) = tcp_stream.into_split();
            let write = Arc::new(Mutex::new(write));
            let mut buf = [0; 1024];
            loop {
                let len = read.read(&mut buf).await.unwrap();
                if len == 0 {
                    eprintln!("read zero error");
                    break;
                }

                // handle a single request in its own task so the read loop can keep reading
                let write = write.clone();
                tokio::spawn(async move {
                    // business logic
                    // (the real code handles more complex work here, such as querying databases, etc.)
                    // ...

                    // logic done, reply to the client
                    write.lock().await.write_all(b"business success/fail").await.unwrap();
                });
            }
        });
    }
}

error msg:

thread 'tokio-runtime-worker' panicked at E:\dev\rust\cargo\registry\src\index.crates.io-6f17d22bba15001f\tokio-1.39.2\src\runtime\task\state.rs:459:9:
assertion failed: prev.ref_count() >= 1
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace


Other notes:

  1. The stress test uses a test client that continuously sends business-interface data to the service (a minimal sketch of such a client follows below).
  2. The program ran normally at first, so I left the stress test running overnight. When I got to work the next day, I found that the service had crashed around 3am; the stress-test client was still running, but kept reporting that TCP was not connected.
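To make point 1 concrete, here is a minimal sketch of the kind of stress-test client described; it is an editorial illustration, not the reporter's actual client, and the address, payload, and loop structure are assumptions:

```rust
// Minimal sketch of a stress-test client (illustrative only; the address,
// payload, and pacing are assumptions, not the reporter's actual client).
use tokio::io::{AsyncReadExt, AsyncWriteExt};
use tokio::net::TcpStream;

#[tokio::main]
async fn main() -> std::io::Result<()> {
    let mut conn = TcpStream::connect("127.0.0.1:8080").await?;
    let mut buf = [0u8; 1024];

    loop {
        // Continuously send a request and wait for the server's reply.
        conn.write_all(b"business request").await?;
        let n = conn.read(&mut buf).await?;
        if n == 0 {
            eprintln!("server closed the connection");
            break;
        }
    }
    Ok(())
}
```

Running many such clients in parallel against the server above would approximate the described load.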
lifeRobot commented 1 month ago

> Hello, I encountered this situation as well during a stress test; the error message is `prev.ref_count() >= 1`. [...] (quoting the previous comment in full)

I have checked the Windows system logs, and the crash log shows 0xc0000409, which may be due to a program heap or stack overflow.

Darksonn commented 1 month ago

@lifeRobot Does the example you shared reproduce the error, or is it only similar to what triggered the error?

lifeRobot commented 1 month ago

> @lifeRobot Does the example you shared reproduce the error, or is it only similar to what triggered the error?

It's just a similar code example; I replaced the business logic with comments, and Tokio is also used for the database queries, so at the moment I cannot rule out the possibility that this exception is caused by other business logic.

Also, this anomaly only occurred during the stress test. If there is no data interaction, the program is very stable.

lifeRobot commented 3 weeks ago

> @lifeRobot Does the example you shared reproduce the error, or is it only similar to what triggered the error?

Hello, I have added some logs in tokio::runtime::task::state.rs and found that when the `prev.ref_count() >= 1` panic was triggered, the drop_join_handle_fast and ref_dec methods had each been triggered twice. Can you take a look at these logs and see if they are helpful to you?

[Screenshot of the log that triggered the panic]

These are the changes I made. In tokio's Cargo.toml, add uuid:

```toml
uuid = "1.10.0"
```

In the tokio::runtime::task::state.rs file, add the uuid and the logging:

```rust
pub(super) struct State {
    id: String,
    val: AtomicUsize,
}

impl State {
    pub(super) fn new() -> State {
        // The raw task returned by this method has a ref-count of three. See
        // the comment on INITIAL_STATE for more.
        let id = uuid::Uuid::new_v4().to_string();
        log::info!("id[{id}] new");
        State {
            id,
            val: AtomicUsize::new(INITIAL_STATE),
        }
    }

    pub(super) fn transition_to_complete(&self) -> Snapshot {
        const DELTA: usize = RUNNING | COMPLETE;

        let prev = Snapshot(self.val.fetch_xor(DELTA, AcqRel));
        log::info!(
            "id[{}] transition_to_complete running is {}, complete is {}",
            self.id,
            prev.is_running(),
            prev.is_complete()
        );
        assert!(prev.is_running());
        assert!(!prev.is_complete());

        Snapshot(prev.0 ^ DELTA)
    }

    pub(super) fn transition_to_terminal(&self, count: usize) -> bool {
        let prev = Snapshot(self.val.fetch_sub(count * REF_ONE, AcqRel));
        log::info!(
            "id[{}] transition_to_terminal ref_count is {}",
            self.id,
            prev.ref_count()
        );
        assert!(
            prev.ref_count() >= count,
            "current: {}, sub: {}",
            prev.ref_count(),
            count
        );
        prev.ref_count() == count
    }

    pub(super) fn drop_join_handle_fast(&self) -> Result<(), ()> {
        use std::sync::atomic::Ordering::Relaxed;

        // Relaxed is acceptable as if this function is called and succeeds,
        // then nothing has been done w/ the join handle.
        //
        // The moment the join handle is used (polled), the `JOIN_WAKER` flag is
        // set, at which point the CAS will fail.
        //
        // Given this, there is no risk if this operation is reordered.
        log::info!("id[{}] drop_join_handle_fast", self.id);
        self.val
            .compare_exchange_weak(
                INITIAL_STATE,
                (INITIAL_STATE - REF_ONE) & !JOIN_INTEREST,
                Release,
                Relaxed,
            )
            .map(|_| ())
            .map_err(|_| ())
    }

    pub(super) fn ref_inc(&self) {
        use std::process;
        use std::sync::atomic::Ordering::Relaxed;

        // Using a relaxed ordering is alright here, as knowledge of the
        // original reference prevents other threads from erroneously deleting
        // the object.
        //
        // As explained in the [Boost documentation][1], Increasing the
        // reference counter can always be done with memory_order_relaxed: New
        // references to an object can only be formed from an existing
        // reference, and passing an existing reference from one thread to
        // another must already provide any required synchronization.
        //
        // [1]: (www.boost.org/doc/libs/1_55_0/doc/html/atomic/usage_examples.html)
        let prev = self.val.fetch_add(REF_ONE, Relaxed);
        log::info!("id[{}] ref_inc prev is {prev}", self.id);

        // If the reference count overflowed, abort.
        if prev > isize::MAX as usize {
            process::abort();
        }
    }

    pub(super) fn ref_dec(&self) -> bool {
        let prev = Snapshot(self.val.fetch_sub(REF_ONE, AcqRel));
        log::info!("id[{}] ref_dec ref_count is {}", self.id, prev.ref_count());
        assert!(prev.ref_count() >= 1);
        prev.ref_count() == 1
    }

    fn fetch_update<F>(&self, mut f: F) -> Result<Snapshot, Snapshot>
    where
        F: FnMut(Snapshot) -> Option<Snapshot>,
    {
        let mut curr = self.load();

        loop {
            let next = match f(curr) {
                Some(next) => next,
                None => return Err(curr),
            };

            let res = self.val.compare_exchange(curr.0, next.0, AcqRel, Acquire);
            log::info!("id[{}] fetch_update res is {res:?}", self.id);

            match res {
                Ok(_) => return Ok(next),
                Err(actual) => curr = Snapshot(actual),
            }
        }
    }
}
```

The complete state.rs file after modification: state.rs

Log when triggering a panic (partial): painc_log.txt
This log contains only the entries for the uuid that triggered the panic.

Log when triggering a panic (full): painc_all_log.txt
This is the complete log from the creation of the uuid that triggered the panic up to the panic itself.
The panic was triggered by d684d95-7481-4816-a007-7fc3a10782aa.

Darksonn commented 3 weeks ago

That sounds like a bug in your code or another dependency. The only place calling drop_join_handle_fast is the destructor of JoinHandle. That doesn't happen twice unless you're incorrectly destroying the JoinHandle twice, which is not allowed.
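For readers hitting the same assertion, here is a hypothetical sketch (not taken from any report in this thread) of the only kind of code that can make the JoinHandle destructor run twice, namely unsafe duplication of the handle:

```rust
use std::ptr;

#[tokio::main]
async fn main() {
    let handle = tokio::spawn(async { 1 + 1 });

    // UNSOUND: bitwise-copies the JoinHandle without touching the task's
    // reference count. Both values now run JoinHandle's destructor, so the
    // task's drop_join_handle_fast / ref_dec paths execute twice and the
    // `prev.ref_count() >= 1` assertion can fire later.
    let duplicate = unsafe { ptr::read(&handle) };

    drop(duplicate);
    drop(handle); // second drop of the same logical handle: undefined behavior
}
```

In safe Rust this cannot happen, which is why duplicated drop_join_handle_fast/ref_dec calls in the logs above generally point at unsafe code or memory corruption somewhere in the application or its dependencies.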