storaged-project / udisks

The UDisks project provides a daemon, tools and libraries to access and manipulate disks, storage devices and technologies.
https://storaged.org/doc/udisks2-api/latest/

stack smashing detected in bd_nvme_get_sanitize_log() #1152

Closed marcbres closed 1 year ago

marcbres commented 1 year ago

Good Morning,

After upgrading to 2.10.0-3, the udisks2 service on Debian testing does not start. This delays sddm's greeter by about 15 seconds and breaks the Plasma desktop (I can log in, but none of the Plasma features load, so there is no task bar and I cannot open any program).

Reverting manually to 2.9.4 fixes the problems.

I reported this to the Debian maintainers and, after reviewing the logs, they asked me to open an issue here. I had added my traces to another bug, but it does not seem to be the same one, so I am opening this issue now. Here are the logs and the gdb trace:

systemctl status

de jul. 22 22:53:20 helm systemd[1]: Starting udisks2.service - Disk Manager...
de jul. 22 22:53:20 helm udisksd[1286]: udisks daemon version 2.10.0 starting
de jul. 22 22:53:20 helm udisksd[1286]: Error getting 'loop6' information: Failed to get status of the device loop6: No such device or address (g-bd-loop-error-quark, 1)
de jul. 22 22:53:20 helm udisksd[1286]: Error getting 'loop7' information: Failed to get status of the device loop7: No such device or address (g-bd-loop-error-quark, 1)
de jul. 22 22:53:20 helm udisksd[1286]: Error getting 'loop6' information: Failed to get status of the device loop6: No such device or address (g-bd-loop-error-quark, 1)
de jul. 22 22:53:20 helm udisksd[1286]: Error getting 'loop7' information: Failed to get status of the device loop7: No such device or address (g-bd-loop-error-quark, 1)
de jul. 22 22:53:20 helm udisksd[1286]: *** stack smashing detected ***: terminated
de jul. 22 22:53:20 helm systemd[1]: udisks2.service: Main process exited, code=killed, status=6/ABRT
de jul. 22 22:53:20 helm systemd[1]: udisks2.service: Failed with result 'signal'.
de jul. 22 22:53:20 helm systemd[1]: Failed to start udisks2.service - Disk Manager.

gdb backtrace

[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7ffff6bd76c0 (LWP 3418)]
[New Thread 0x7ffff63d66c0 (LWP 3419)]
[New Thread 0x7ffff5bd56c0 (LWP 3420)]
[New Thread 0x7ffff53d46c0 (LWP 3421)]
[New Thread 0x7fffe7fff6c0 (LWP 3422)]

Thread 1 "udisksd" received signal SIGABRT, Aborted.
__pthread_kill_implementation (threadid=<optimized out>, signo=signo@entry=6, no_tid=no_tid@entry=0) at ./nptl/pthread_kill.c:44
#0  __pthread_kill_implementation (threadid=<optimized out>, signo=signo@entry=6, no_tid=no_tid@entry=0) at ./nptl/pthread_kill.c:44
#1  0x00007ffff758115f in __pthread_kill_internal (signo=6, threadid=<optimized out>) at ./nptl/pthread_kill.c:78
#2  0x00007ffff7533472 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#3  0x00007ffff751d4b2 in __GI_abort () at ./stdlib/abort.c:79
#4  0x00007ffff751e1ed in __libc_message (fmt=fmt@entry=0x7ffff7690543 "*** %s ***: terminated\n") at ../sysdeps/posix/libc_fatal.c:150
#5  0x00007ffff760f2c5 in __GI___fortify_fail (msg=msg@entry=0x7ffff769052b "stack smashing detected") at ./debug/fortify_fail.c:24
#6  0x00007ffff760f2b0 in __stack_chk_fail () at ./debug/stack_chk_fail.c:24
#7  0x00007ffff45ba732 in bd_nvme_get_sanitize_log (device=0x555555634950 "/dev/nvme0", error=0x0) at ./src/plugins/nvme/nvme-info.c:1084
#8  0x0000000000000000 in  ()
#0  __pthread_kill_implementation (threadid=<optimized out>, signo=signo@entry=6, no_tid=no_tid@entry=0) at ./nptl/pthread_kill.c:44
        tid = <optimized out>
        ret = 0
        pd = <optimized out>
        old_mask = {__val = {0}}
        ret = <optimized out>
#1  0x00007ffff758115f in __pthread_kill_internal (signo=6, threadid=<optimized out>) at ./nptl/pthread_kill.c:78
#2  0x00007ffff7533472 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
        ret = <optimized out>
#3  0x00007ffff751d4b2 in __GI_abort () at ./stdlib/abort.c:79
        save_stage = 1
        act = {__sigaction_handler = {sa_handler = 0x20, sa_sigaction = 0x20}, sa_mask = {__val = {0 <repeats 16 times>}}, sa_flags = 0, sa_restorer = 0x0}
#4  0x00007ffff751e1ed in __libc_message (fmt=fmt@entry=0x7ffff7690543 "*** %s ***: terminated\n") at ../sysdeps/posix/libc_fatal.c:150
        ap = {{gp_offset = 16, fp_offset = 0, overflow_arg_area = 0x7fffffffd5d0, reg_save_area = 0x7fffffffd560}}
        fd = <optimized out>
        list = <optimized out>
        nlist = <optimized out>
        cp = <optimized out>
#5  0x00007ffff760f2c5 in __GI___fortify_fail (msg=msg@entry=0x7ffff769052b "stack smashing detected") at ./debug/fortify_fail.c:24
#6  0x00007ffff760f2b0 in __stack_chk_fail () at ./debug/stack_chk_fail.c:24
#7  0x00007ffff45ba732 in bd_nvme_get_sanitize_log (device=0x555555634950 "/dev/nvme0", error=0x0) at ./src/plugins/nvme/nvme-info.c:1084
        ret = <optimized out>
        fd = <optimized out>
        sanitize_log = {sprog = 65535, sstat = 1, scdw10 = 0, eto = 0, etbe = 0, etce = 0, etond = 0, etbend = 0, etcend = 0, rsvd32 = '\000' <repeats 479 times>}
        log = <optimized out>
        sstat = <optimized out>
#8  0x0000000000000000 in  ()

Thread 6 (Thread 0x7fffe7fff6c0 (LWP 3422) "probing-thread"):
#0  syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
#1  0x00007ffff778c260 in g_cond_wait () at /lib/x86_64-linux-gnu/libglib-2.0.so.0
#2  0x00007ffff76fcefb in  () at /lib/x86_64-linux-gnu/libglib-2.0.so.0
#3  0x00007ffff76fd387 in g_async_queue_pop () at /lib/x86_64-linux-gnu/libglib-2.0.so.0
#4  0x0000555555577fb9 in probe_request_thread_func (user_data=<optimized out>) at ./src/udiskslinuxprovider.c:292
        provider = 0x555555612180
        request = <optimized out>
        dev_initialized = <optimized out>
        n_tries = <optimized out>
#5  0x00007ffff775ee9d in  () at /lib/x86_64-linux-gnu/libglib-2.0.so.0
#6  0x00007ffff757f3ec in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:444
        ret = <optimized out>
        pd = <optimized out>
        unwind_buf = {cancel_jmp_buf = {{jmp_buf = {140737343123760, -4673221325080522273, -432, 0, 140737488343648, 140737077309440, 4673238917700096479, 4673205176068444639}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}}
        not_first_call = <optimized out>
#7  0x00007ffff75ffa1c in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81

Thread 5 (Thread 0x7ffff53d46c0 (LWP 3421) "gdbus"):
#0  0x00007ffff75f29ef in __GI___poll (fds=0x7fffe0000b90, nfds=2, timeout=-1) at ../sysdeps/unix/sysv/linux/poll.c:29
        sc_ret = -516
        sc_cancel_oldtype = 0
#1  0x00007ffff773285e in  () at /lib/x86_64-linux-gnu/libglib-2.0.so.0
#2  0x00007ffff7732b9f in g_main_loop_run () at /lib/x86_64-linux-gnu/libglib-2.0.so.0
#3  0x00007ffff79982c6 in  () at /lib/x86_64-linux-gnu/libgio-2.0.so.0
#4  0x00007ffff775ee9d in  () at /lib/x86_64-linux-gnu/libglib-2.0.so.0
#5  0x00007ffff757f3ec in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:444
        ret = <optimized out>
        pd = <optimized out>
        unwind_buf = {cancel_jmp_buf = {{jmp_buf = {140737343123760, -4673221325080522273, -432, 11, 140737316210928, 140737299431424, 4673200999044452831, 4673205176068444639}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}}
        not_first_call = <optimized out>
#6  0x00007ffff75ffa1c in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81

Thread 4 (Thread 0x7ffff5bd56c0 (LWP 3420) "pool-udisksd"):
#0  syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
#1  0x00007ffff778c40c in g_cond_wait_until () at /lib/x86_64-linux-gnu/libglib-2.0.so.0
#2  0x00007ffff76fced3 in  () at /lib/x86_64-linux-gnu/libglib-2.0.so.0
#3  0x00007ffff775f85a in  () at /lib/x86_64-linux-gnu/libglib-2.0.so.0
#4  0x00007ffff775ee9d in  () at /lib/x86_64-linux-gnu/libglib-2.0.so.0
#5  0x00007ffff757f3ec in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:444
        ret = <optimized out>
        pd = <optimized out>
        unwind_buf = {cancel_jmp_buf = {{jmp_buf = {140737343123760, -4673221325080522273, -432, 0, 140737324603888, 140737307824128, 4673199898995954143, 4673205176068444639}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}}
        not_first_call = <optimized out>
#6  0x00007ffff75ffa1c in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81

Thread 3 (Thread 0x7ffff63d66c0 (LWP 3419) "pool-spawner"):
#0  syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
#1  0x00007ffff778c260 in g_cond_wait () at /lib/x86_64-linux-gnu/libglib-2.0.so.0
#2  0x00007ffff76fcefb in  () at /lib/x86_64-linux-gnu/libglib-2.0.so.0
#3  0x00007ffff775f4e2 in  () at /lib/x86_64-linux-gnu/libglib-2.0.so.0
#4  0x00007ffff775ee9d in  () at /lib/x86_64-linux-gnu/libglib-2.0.so.0
#5  0x00007ffff757f3ec in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:444
        ret = <optimized out>
        pd = <optimized out>
        unwind_buf = {cancel_jmp_buf = {{jmp_buf = {140737343123760, -4673221325080522273, -432, 2, 140737488346848, 140737316216832, 4673203196993966559, 4673205176068444639}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}}
        not_first_call = <optimized out>
#6  0x00007ffff75ffa1c in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81

Thread 2 (Thread 0x7ffff6bd76c0 (LWP 3418) "gmain"):
#0  0x00007ffff75f29ef in __GI___poll (fds=0x5555555e44d0, nfds=2, timeout=4630) at ../sysdeps/unix/sysv/linux/poll.c:29
        sc_ret = -516
        sc_cancel_oldtype = 0
#1  0x00007ffff773285e in  () at /lib/x86_64-linux-gnu/libglib-2.0.so.0
#2  0x00007ffff773297c in g_main_context_iteration () at /lib/x86_64-linux-gnu/libglib-2.0.so.0
#3  0x00007ffff77329c1 in  () at /lib/x86_64-linux-gnu/libglib-2.0.so.0
#4  0x00007ffff775ee9d in  () at /lib/x86_64-linux-gnu/libglib-2.0.so.0
#5  0x00007ffff757f3ec in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:444
        ret = <optimized out>
        pd = <optimized out>
        unwind_buf = {cancel_jmp_buf = {{jmp_buf = {140737343123760, -4673221325080522273, -432, 2, 140737488346832, 140737324609536, 4673202096945467871, 4673205176068444639}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 0x0, 0x0}, data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}}
        not_first_call = <optimized out>
#6  0x00007ffff75ffa1c in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81

Thread 1 (Thread 0x7ffff709f940 (LWP 3415) "udisksd"):
#0  __pthread_kill_implementation (threadid=<optimized out>, signo=signo@entry=6, no_tid=no_tid@entry=0) at ./nptl/pthread_kill.c:44
        tid = <optimized out>
        ret = 0
        pd = <optimized out>
        old_mask = {__val = {0}}
        ret = <optimized out>
#1  0x00007ffff758115f in __pthread_kill_internal (signo=6, threadid=<optimized out>) at ./nptl/pthread_kill.c:78
#2  0x00007ffff7533472 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
        ret = <optimized out>
#3  0x00007ffff751d4b2 in __GI_abort () at ./stdlib/abort.c:79
        save_stage = 1
        act = {__sigaction_handler = {sa_handler = 0x20, sa_sigaction = 0x20}, sa_mask = {__val = {0 <repeats 16 times>}}, sa_flags = 0, sa_restorer = 0x0}
#4  0x00007ffff751e1ed in __libc_message (fmt=fmt@entry=0x7ffff7690543 "*** %s ***: terminated\n") at ../sysdeps/posix/libc_fatal.c:150
        ap = {{gp_offset = 16, fp_offset = 0, overflow_arg_area = 0x7fffffffd5d0, reg_save_area = 0x7fffffffd560}}
        fd = <optimized out>
        list = <optimized out>
        nlist = <optimized out>
        cp = <optimized out>
#5  0x00007ffff760f2c5 in __GI___fortify_fail (msg=msg@entry=0x7ffff769052b "stack smashing detected") at ./debug/fortify_fail.c:24
#6  0x00007ffff760f2b0 in __stack_chk_fail () at ./debug/stack_chk_fail.c:24
#7  0x00007ffff45ba732 in bd_nvme_get_sanitize_log (device=0x555555634950 "/dev/nvme0", error=0x0) at ./src/plugins/nvme/nvme-info.c:1084
        ret = <optimized out>
        fd = <optimized out>
        sanitize_log = {sprog = 65535, sstat = 1, scdw10 = 0, eto = 0, etbe = 0, etce = 0, etond = 0, etbend = 0, etcend = 0, rsvd32 = '\000' <repeats 479 times>}
        log = <optimized out>
        sstat = <optimized out>
#8  0x0000000000000000 in  ()

I hope this helps to find the problem. This PC has 2 NVMe drives and 2 SATA SSDs; since the error seems to be in an NVMe-related function, I thought I would mention that in case it is useful.

I've been asked to provide the libnvme version and the output of nvme sanitize-log -H /dev/nvme0, so here they are:

libnvme version: 1.5-1

nvme sanitize-log -H /dev/nvme0:

Progress                      (SPROG) :  65535
Sanitize Status                        (SSTAT) :  0x1
        [2:0]   Most Recent Sanitize Command Completed Successfully.
        [7:3]   Number of completed passes if most recent operation was overwrite:      0
          [8]   Global Data Erased cleared: a NS LB in the NVM subsystem has been written to or a PMR in the NVM subsystem has been enabled
Sanitize Command Dword 10 Information (SCDW10) :  0
Estimated Time For Overwrite                   :  0
Estimated Time For Block Erase                 :  0
Estimated Time For Crypto Erase                :  0
Estimated Time For Overwrite (No-Deallocate)   :  0
Estimated Time For Block Erase (No-Deallocate) :  0
Estimated Time For Crypto Erase (No-Deallocate):  0
*** stack smashing detected ***: terminated
Aborted

From this log it seems clear that the stack smashing originates in some part of the NVMe stack and propagates up to udisks2, but I don't know how to proceed. If I revert to udisks 2.9.4-4 the service starts, but nvme sanitize-log still gives the same error. From what I understand, 2.10.0 relies more on the surrounding libraries, while 2.9.4 is more forgiving of errors.

tbzatek commented 1 year ago

Thanks, this is useful. The issue is in libnvme then; udisks 2.10.0 is the first release that makes use of it.

Can you also provide full uname -a and your hardware architecture please?

@mbiebl is it possible to find out exact CFLAGS used (or a build log) for the libnvme-1.5-1 build? The nvme_get_log_sanitize() function is a simple ioctl() wrapper with the returned memory chunk mapped to struct nvme_sanitize_log_page. I sense a misalignment or size mismatch somewhere...

marcbres commented 1 year ago

uname -a: Linux helm 6.3.0-1-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.3.7-1 (2023-06-12) x86_64 GNU/Linux

By hardware architecture I think you mean what devices are in it, so here are the CPU and disks; if you need anything else, please tell me.

CPU: AMD Ryzen 5 5600X

NVMe disks:
- 1TB Gigabyte - Linux root and /home partitions
- 1TB NE-1TB - Windows NTFS partition for games <<-- This is the disk which causes the stack smashing when doing the nvme sanitize-log

SATA SSDs:
- 240GB P4-240 - Windows C: disk and EFI partition
- 960GB Kingston - Linux additional disk /media/hdd (for downloads and game installations)

mbiebl commented 1 year ago

@mbiebl is it possible to find out exact CFLAGS used (or a build log) for the libnvme-1.5-1 build? The nvme_get_log_sanitize() function is a simple ioctl() wrapper with the returned memory chunk mapped to struct nvme_sanitize_log_page. I sense a misalignment or size mismatch somewhere...

build logs can be found at https://buildd.debian.org/status/package.php?p=libnvme , e.g. https://buildd.debian.org/status/fetch.php?pkg=libnvme&arch=amd64&ver=1.5-1&stamp=1688172456&raw=0

mbiebl commented 1 year ago

Looking at https://git.progress-linux.org/users/daniel.baumann/debian/packages/libnvme/tree/debian/rules it seems to use the default dpkg provided build flags

$ dpkg-buildflags 
ASFLAGS=
CFLAGS=-g -O2 -ffile-prefix-map=/tmp/libnvme-1.5=. -fstack-protector-strong -Wformat -Werror=format-security
CPPFLAGS=-Wdate-time -D_FORTIFY_SOURCE=2
CXXFLAGS=-g -O2 -ffile-prefix-map=/tmp/libnvme-1.5=. -fstack-protector-strong -Wformat -Werror=format-security
DFLAGS=-frelease
FCFLAGS=-g -O2 -ffile-prefix-map=/tmp/libnvme-1.5=. -fstack-protector-strong
FFLAGS=-g -O2 -ffile-prefix-map=/tmp/libnvme-1.5=. -fstack-protector-strong
GCJFLAGS=-g -O2 -ffile-prefix-map=/tmp/libnvme-1.5=. -fstack-protector-strong
LDFLAGS=-Wl,-z,relro
OBJCFLAGS=-g -O2 -ffile-prefix-map=/tmp/libnvme-1.5=. -fstack-protector-strong -Wformat -Werror=format-security
OBJCXXFLAGS=-g -O2 -ffile-prefix-map=/tmp/libnvme-1.5=. -fstack-protector-strong -Wformat -Werror=format-security

mbiebl commented 1 year ago

https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1041689

Adding the downstream bug report as cross reference

tbzatek commented 1 year ago

Thanks, this is no different from our Fedora builds with -fstack-protector-strong -fstack-clash-protection -fcf-protection:

CFLAGS='-O2 -flto=auto -ffat-lto-objects -fexceptions -g -grecord-gcc-switches -pipe -Wall -Werror=format-security -Wp,-D_FORTIFY_SOURCE=2 -Wp,-D_GLIBCXX_ASSERTIONS -specs=/usr/lib/rpm/redhat/redhat-hardened-cc1 -fstack-protector-strong -specs=/usr/lib/rpm/redhat/redhat-annobin-cc1  -m64 -march=x86-64-v2 -mtune=generic -fasynchronous-unwind-tables -fstack-clash-protection -fcf-protection'

I wonder what's so different on Debian builds.

mbiebl commented 1 year ago

-fstack-clash-protection

Could this option make a difference? The Debian build seems to be missing it.

mbiebl commented 1 year ago

Looping in @igaw

igaw commented 1 year ago

Does nvme sanitize-log result in a stack smash? It's basically the same code as in https://github.com/storaged-project/libblockdev/blob/c79f84780dd73fe16d8a52550f74d8059af6c2e0/src/plugins/nvme/nvme-info.c#L1026

I checked the alignment of struct nvme_sanitize_log_page and it looks reasonable:

        __le16                     sprog;                /*     0     2 */
        __le16                     sstat;                /*     2     2 */
        __le32                     scdw10;               /*     4     4 */
        __le32                     eto;                  /*     8     4 */
        __le32                     etbe;                 /*    12     4 */
        __le32                     etce;                 /*    16     4 */
        __le32                     etond;                /*    20     4 */
        __le32                     etbend;               /*    24     4 */
        __le32                     etcend;               /*    28     4 */
        __u8                       rsvd32[480];          /*    32   480 */

        /* size: 512, cachelines: 8, members: 10 */
};

-fstack-clash-protection

still works for me, but I see that valgrind complains when I execute nvme sanitize-log, though all the complaints are in the printf code of nvme-cli. Let me check what's going on there.

igaw commented 1 year ago

It looks like valgrind is not able to figure out that the buffer will be written by the kernel, so it reports that all accesses to the buffer rely on uninitialized memory. When I pre-initialize the struct with struct nvme_sanitize_log_page sanitize_log = { 0, }; everything is fine. And looking at the nvme-info.c file, the buffer is initialized there too, so that can't be the problem.

igaw commented 1 year ago

Does nvme sanitize-log result in a stack smash?

Never mind, I see in the Debian bug report that this was already done, and indeed it smashes the stack as well. Let me install a Debian system for testing.

igaw commented 1 year ago

I can't reproduce it. I installed Debian sid in a VM and used the qemu NVMe emulation for testing, compiled with the flags from above. Unfortunately, the emulation doesn't implement the sanitize feature. I even tried hacking around that, still no luck.

marcbres commented 1 year ago

I can still reproduce it; if you tell me what to do, I will.

allfoxwy commented 1 year ago

Does nvme sanitize-log result in a stack smash? It's basically the same code as in https://github.com/storaged-project/libblockdev/blob/c79f84780dd73fe16d8a52550f74d8059af6c2e0/src/plugins/nvme/nvme-info.c#L1026

Greetings. I think I'm hitting the same bug here, and yes, nvme sanitize-log smashes the stack as well.

# nvme sanitize-log /dev/nvme0
Sanitize Progress                      (SPROG) :  65535
Sanitize Status                        (SSTAT) :  0x1
Sanitize Command Dword 10 Information (SCDW10) :  0
Estimated Time For Overwrite                   :  4294967295 (No time period reported)
Estimated Time For Block Erase                 :  4294967295 (No time period reported)
Estimated Time For Crypto Erase                :  4294967295 (No time period reported)
Estimated Time For Overwrite (No-Deallocate)   :  0
Estimated Time For Block Erase (No-Deallocate) :  0
Estimated Time For Crypto Erase (No-Deallocate):  0
*** stack smashing detected ***: terminated
Aborted

allfoxwy commented 1 year ago

I'm starting to suspect that my SSD itself might be the buggy part. Not sure; just for info:

# lspci | grep NVMe
01:00.0 Non-Volatile memory controller: MAXIO Technology (Hangzhou) Ltd. NVMe SSD Controller MAP1202 (rev 01)

igaw commented 1 year ago

@marcbres or @allfoxwy could you post the output of nvme id-ctrl -H /dev/nvme0?

Maybe the devices do not support SANICAP and thus we should not issue the command in the first place.

allfoxwy commented 1 year ago

could you post the output of nvme id-ctrl -H /dev/nvme0?

# nvme id-ctrl -H /dev/nvme0
NVME Identify Controller:
vid       : 0x1e4b
ssvid     : 0x1e4b
sn        : (hide from public)
mn        : J.ZAO QL SERIES 1TB SSD
fr        : 9139
rab       : 0
ieee      : 3a5a27
cmic      : 0
  [3:3] : 0     ANA not supported
  [2:2] : 0     PCI
  [1:1] : 0     Single Controller
  [0:0] : 0     Single Port

mdts      : 9
cntlid    : 0
ver       : 0x10400
rtd3r     : 0x186a0
rtd3e     : 0x1e8480
oaes      : 0x200
  [31:31] : 0   Discovery Log Change Notice Not Supported
  [27:27] : 0   Zone Descriptor Changed Notices Not Supported
  [15:15] : 0   Normal NSS Shutdown Event Not Supported
  [14:14] : 0   Endurance Group Event Aggregate Log Page Change Notice Not Supported
  [13:13] : 0   LBA Status Information Notices Not Supported
  [12:12] : 0   Predictable Latency Event Aggregate Log Change Notices Not Supported
  [11:11] : 0   Asymmetric Namespace Access Change Notices Not Supported
  [9:9] : 0x1   Firmware Activation Notices Supported
  [8:8] : 0     Namespace Attribute Changed Event Not Supported

ctratt    : 0
  [19:19] : 0   Flexible Data Placement Not Supported
  [15:15] : 0   Extended LBA Formats Not Supported
  [14:14] : 0   Delete NVM Set Not Supported
  [13:13] : 0   Delete Endurance Group Not Supported
  [12:12] : 0   Variable Capacity Management Not Supported
  [11:11] : 0   Fixed Capacity Management Not Supported
  [10:10] : 0   Multi Domain Subsystem Not Supported
  [9:9] : 0     UUID List Not Supported
  [8:8] : 0     SQ Associations Not Supported
  [7:7] : 0     Namespace Granularity Not Supported
  [6:6] : 0     Traffic Based Keep Alive Not Supported
  [5:5] : 0     Predictable Latency Mode Not Supported
  [4:4] : 0     Endurance Groups Not Supported
  [3:3] : 0     Read Recovery Levels Not Supported
  [2:2] : 0     NVM Sets Not Supported
  [1:1] : 0     Non-Operational Power State Permissive Not Supported
  [0:0] : 0     128-bit Host Identifier Not Supported

rrls      : 0
cntrltype : 1
  [7:2] : 0     Reserved
  [1:0] : 0x1   I/O Controller
fguid     : 00000000-0000-0000-0000-000000000000
crdt1     : 0
crdt2     : 0
crdt3     : 0
nvmsr     : 0
  [1:1] : 0     NVM subsystem Not part of an Enclosure
  [0:0] : 0     NVM subsystem Not part of a Storage Device

vwci      : 0
  [7:7] : 0     VPD Write Cycles Remaining field is Not valid.
  [6:0] : 0     VPD Write Cycles Remaining 

mec       : 0
  [1:1] : 0     NVM subsystem Not contains a Management Endpoint on a PCIe port
  [0:0] : 0     NVM subsystem Not contains a Management Endpoint on an SMBus/I2C port

oacs      : 0x17
  [10:10] : 0   Lockdown Command and Feature Not Supported
  [9:9] : 0     Get LBA Status Capability Not Supported
  [8:8] : 0     Doorbell Buffer Config Not Supported
  [7:7] : 0     Virtualization Management Not Supported
  [6:6] : 0     NVMe-MI Send and Receive Not Supported
  [5:5] : 0     Directives Not Supported
  [4:4] : 0x1   Device Self-test Supported
  [3:3] : 0     NS Management and Attachment Not Supported
  [2:2] : 0x1   FW Commit and Download Supported
  [1:1] : 0x1   Format NVM Supported
  [0:0] : 0x1   Security Send and Receive Supported

acl       : 2
aerl      : 3
frmw      : 0x14
  [5:5] : 0     Multiple FW or Boot Update Detection Not Supported
  [4:4] : 0x1   Firmware Activate Without Reset Supported
  [3:1] : 0x2   Number of Firmware Slots
  [0:0] : 0     Firmware Slot 1 Read/Write

lpa       : 0x2
  [6:6] : 0     Telemetry Log Data Area 4 Not Supported
  [5:5] : 0     LID 0x0, Scope of each command in LID 0x5, 0x12, 0x13 Not Supported
  [4:4] : 0     Persistent Event log Not Supported
  [3:3] : 0     Telemetry host/controller initiated log page Not Supported
  [2:2] : 0     Extended data for Get Log Page Not Supported
  [1:1] : 0x1   Command Effects Log Page Supported
  [0:0] : 0     SMART/Health Log Page per NS Not Supported

elpe      : 63
  [7:0] : 63 (0's based)        Error Log Page Entries (ELPE)

npss      : 4
  [7:0] : 4 (0's based) Number of Power States Support (NPSS)

avscc     : 0x1
  [0:0] : 0x1   Admin Vendor Specific Commands uses NVMe Format

apsta     : 0x1
  [0:0] : 0x1   Autonomous Power State Transitions Supported

wctemp    : 363
 [15:0] : 90 °C (363 K) Warning Composite Temperature Threshold (WCTEMP)

cctemp    : 368
 [15:0] : 95 °C (368 K) Critical Composite Temperature Threshold (CCTEMP)

mtfa      : 100
hmpre     : 4096
hmmin     : 4096
tnvmcap   : 1,024,209,543,168
[127:0] : 1,024,209,543,168
        Total NVM Capacity (TNVMCAP)

unvmcap   : 0
[127:0] : 0
        Unallocated NVM Capacity (UNVMCAP)

rpmbs     : 0
 [31:24]: 0     Access Size
 [23:16]: 0     Total Size
  [5:3] : 0     Authentication Method
  [2:0] : 0     Number of RPMB Units

edstt     : 5
dsto      : 1
fwug      : 1
kas       : 0
hctma     : 0x1
  [0:0] : 0x1   Host Controlled Thermal Management Supported

mntmt     : 373
 [15:0] : 100 °C (373 K)        Minimum Thermal Management Temperature (MNTMT)

mxtmt     : 388
 [15:0] : 115 °C (388 K)        Maximum Thermal Management Temperature (MXTMT)

sanicap   : 0x2
  [31:30] : 0   Additional media modification after sanitize operation completes successfully is not defined
  [29:29] : 0   No-Deallocate After Sanitize bit in Sanitize command Supported
    [2:2] : 0   Overwrite Sanitize Operation Not Supported
    [1:1] : 0x1 Block Erase Sanitize Operation Supported
    [0:0] : 0   Crypto Erase Sanitize Operation Not Supported

hmminds   : 4
hmmaxd    : 32
nsetidmax : 0
endgidmax : 0
anatt     : 0
anacap    : 0
  [7:7] : 0     Non-zero group ID Not Supported
  [6:6] : 0     Group ID does change
  [4:4] : 0     ANA Change state Not Supported
  [3:3] : 0     ANA Persistent Loss state Not Supported
  [2:2] : 0     ANA Inaccessible state Not Supported
  [1:1] : 0     ANA Non-optimized state Not Supported
  [0:0] : 0     ANA Optimized state Not Supported

anagrpmax : 0
nanagrpid : 0
pels      : 0
domainid  : 0
megcap    : 0
sqes      : 0x66
  [7:4] : 0x6   Max SQ Entry Size (64)
  [3:0] : 0x6   Min SQ Entry Size (64)

cqes      : 0x44
  [7:4] : 0x4   Max CQ Entry Size (16)
  [3:0] : 0x4   Min CQ Entry Size (16)

maxcmd    : 0
nn        : 1
oncs      : 0x1f
  [8:8] : 0     Copy Not Supported
  [7:7] : 0     Verify Not Supported
  [6:6] : 0     Timestamp Not Supported
  [5:5] : 0     Reservations Not Supported
  [4:4] : 0x1   Save and Select Supported
  [3:3] : 0x1   Write Zeroes Supported
  [2:2] : 0x1   Data Set Management Supported
  [1:1] : 0x1   Write Uncorrectable Supported
  [0:0] : 0x1   Compare Supported

fuses     : 0
  [0:0] : 0     Fused Compare and Write Not Supported

fna       : 0
  [3:3] : 0     Format NVM Broadcast NSID (FFFFFFFFh) Supported
  [2:2] : 0     Crypto Erase Not Supported as part of Secure Erase
  [1:1] : 0     Crypto Erase Applies to Single Namespace(s)
  [0:0] : 0     Format Applies to Single Namespace(s)

vwc       : 0x7
  [2:1] : 0x3   The Flush command supports NSID set to FFFFFFFFh
  [0:0] : 0x1   Volatile Write Cache Present

awun      : 0
awupf     : 0
icsvscc   : 0
  [0:0] : 0     NVM Vendor Specific Commands uses Vendor Specific Format

nwpc      : 0
  [2:2] : 0     Permanent Write Protect Not Supported
  [1:1] : 0     Write Protect Until Power Supply Not Supported
  [0:0] : 0     No Write Protect and Write Protect Namespace Not Supported

acwu      : 0
ocfs      : 0
  [1:1] : 0     Controller Copy Format 1h Not Supported
  [0:0] : 0     Controller Copy Format 0h Not Supported

sgls      : 0
 [15:8] : 0     SGL Descriptor Threshold
 [1:0]  : 0     Scatter-Gather Lists Not Supported

mnan      : 0
maxdna    : 0
maxcna    : 0
subnqn    : 
ioccsz    : 0
iorcsz    : 0
icdoff    : 0
fcatt     : 0
  [0:0] : 0     Dynamic Controller Model

msdbd     : 0
ofcs      : 0
  [0:0] : 0     Disconnect command Not Supported

ps      0 : mp:6.50W operational enlat:0 exlat:0 rrt:0 rrl:0
            rwt:0 rwl:0 idle_power:- active_power:-
            active_power_workload:-
ps      1 : mp:5.80W operational enlat:0 exlat:0 rrt:1 rrl:1
            rwt:1 rwl:1 idle_power:- active_power:-
            active_power_workload:-
ps      2 : mp:3.60W operational enlat:0 exlat:0 rrt:2 rrl:2
            rwt:2 rwl:2 idle_power:- active_power:-
            active_power_workload:-
ps      3 : mp:0.7460W non-operational enlat:5000 exlat:10000 rrt:3 rrl:3
            rwt:3 rwl:3 idle_power:- active_power:-
            active_power_workload:-
ps      4 : mp:0.7260W non-operational enlat:8000 exlat:45000 rrt:4 rrl:4
            rwt:4 rwl:4 idle_power:- active_power:-
            active_power_workload:-

marcbres commented 1 year ago

I will run nvme id-ctrl this afternoon when I get home. The NVMe disk that causes the stack smashing on my PC is a KingSpec-branded one bought from AliExpress, while the other NVMe disk, a Gigabyte-branded one, runs sanitize-log without problems.

From allfoxwy's log it seems this is also happening on a Chinese-brand disk (J.ZAO). Maybe these disks do not follow the NVMe specification exactly, and that is what causes the problems.

Anyway, this afternoon I will send the id-ctrl output to see if it helps.

igaw commented 1 year ago

Okay, while at it, could you check if this commit is part of the kernel you are running:

4aedb705437f ("nvme-pci: split metadata handling from nvme_map_data / nvme_unmap_data")

There is a fix for this one which might be necessary: b8f6446b6853 ("nvme-pci: fix DMA direction of unmapping integrity data")

(btw, this is just wild guessing on my side)

allfoxwy commented 1 year ago

I'm running Debian testing, with kernel 6.3.0-1-amd64. I will need some time to check that commit.

allfoxwy commented 1 year ago

I can't find 4aedb705437f in https://github.com/torvalds/linux

For b8f6446b6853, I could see it is NOT in my kernel: https://sources.debian.org/src/linux/6.3.7-1/drivers/nvme/host/pci.c/#L972

igaw commented 1 year ago

4aedb705437f is part of v5.2

and b8f6446b6853 is part of the current merge window

but as I said, not necessarily relevant. I just stumbled across it while trying to understand what's going on at the lower levels.

Well, one experiment, if you have the time and the possibility, is to check whether v5.1 also shows this problem. The v5.2 kernel had a bunch of refactoring in this area.

allfoxwy commented 1 year ago

I think there is some good news here.

I tried building linux-4.19.289 (the last LTS before v5.2) and it seems to work with udisks2. However, linux-6.5-rc3 still triggers the stack smashing.

linux-4.19.289 still logs an error in its syslog, but without the termination:

udisksd[810]: Error performing housekeeping for drive /org/freedesktop/UDisks2/drives/J_2eZAO_QL_SERIES_1TB_SSD_SNhideFromPublic: Error updating Health Information: No probed controller info available (udisks-error-quark, 0)
allfoxwy commented 1 year ago

So I thought that maybe I could git bisect to find out which commit causes the problem.

But I failed to do so. There was a GCC update last year: Linux kernels between 4.20 and v5.17-rc5 will not build with Debian GCC 10, 11, or 12. They need another patch: https://github.com/torvalds/linux/commit/52a9dab6d892763b2a8334a568bd4e2c1a6fde66

I tried and failed to cherry-pick this patch onto linux-v4.20; GCC then complained about another problem involving some table address or something, and I didn't dig in.

As 4.19.289 works, it might be a kernel bug. However, I'm still not convinced. Maybe my SSD just has some strange problem with it. It is a cheap but branded one. I'll report back if I have more news.

igaw commented 1 year ago

Thanks for doing the testing. Just to clarify, Linux 4.19.289 works, but newer kernels break with the exact same userspace?

allfoxwy commented 1 year ago

On my end, linux-4.19.289 works. No userspace change.

About "newer": linux-4.19.289 is an LTS release, so it is newer than linux-5.1. So I can't pin down the bug with "newer" or "older" alone.

All I tried:

- linux-4.19.289: works
- linux-6.5-rc3: stack smashing
- linux-5.17.0-rc4+: stack smashing (this is https://github.com/torvalds/linux/commit/52a9dab6d892763b2a8334a568bd4e2c1a6fde66, the oldest kernel I could build with stock Debian GCC)

igaw commented 1 year ago

Alright, I'll post this info on the nvme kernel mailing list. Maybe someone spots the problem.

allfoxwy commented 1 year ago

Thanks for your concern.

I think it would be better to wait for some input from marcbres, as a single sample from me could just be something wrong on my end.

igaw commented 1 year ago

No worries, if an older kernel works and a newer one doesn't, it's a strong indication that we have to look at the kernel.

https://lore.kernel.org/linux-nvme/dckvq3ohgqdkxg7wz4vslf36gaivgldl742vew67oyh2buj5nr@7ypn5ydt6kss/

marcbres commented 1 year ago

Here is the nvme id-ctrl -H /dev/nvme1 output for the drive that causes the smashing:

NVME Identify Controller:
vid       : 0x1e4b
ssvid     : 0x1e4b
sn        : 9110129M00467       
mn        : NE-1TB                                  
fr        : SN02486 
rab       : 6
ieee      : 000000
cmic      : 0
  [3:3] : 0     ANA not supported
  [2:2] : 0     PCI
  [1:1] : 0     Single Controller
  [0:0] : 0     Single Port

mdts      : 9
cntlid    : 0
ver       : 0x10300
rtd3r     : 0x7a120
rtd3e     : 0x1e8480
oaes      : 0x200
  [31:31] : 0   Discovery Log Change Notice Not Supported
  [27:27] : 0   Zone Descriptor Changed Notices Not Supported
  [15:15] : 0   Normal NSS Shutdown Event Not Supported
  [14:14] : 0   Endurance Group Event Aggregate Log Page Change Notice Not Supported
  [13:13] : 0   LBA Status Information Notices Not Supported
  [12:12] : 0   Predictable Latency Event Aggregate Log Change Notices Not Supported
  [11:11] : 0   Asymmetric Namespace Access Change Notices Not Supported
  [9:9] : 0x1   Firmware Activation Notices Supported
  [8:8] : 0     Namespace Attribute Changed Event Not Supported

ctratt    : 0
  [19:19] : 0   Flexible Data Placement Not Supported
  [15:15] : 0   Extended LBA Formats Not Supported
  [14:14] : 0   Delete NVM Set Not Supported
  [13:13] : 0   Delete Endurance Group Not Supported
  [12:12] : 0   Variable Capacity Management Not Supported
  [11:11] : 0   Fixed Capacity Management Not Supported
  [10:10] : 0   Multi Domain Subsystem Not Supported
  [9:9] : 0     UUID List Not Supported
  [8:8] : 0     SQ Associations Not Supported
  [7:7] : 0     Namespace Granularity Not Supported
  [6:6] : 0     Traffic Based Keep Alive Not Supported
  [5:5] : 0     Predictable Latency Mode Not Supported
  [4:4] : 0     Endurance Groups Not Supported
  [3:3] : 0     Read Recovery Levels Not Supported
  [2:2] : 0     NVM Sets Not Supported
  [1:1] : 0     Non-Operational Power State Permissive Not Supported
  [0:0] : 0     128-bit Host Identifier Not Supported

rrls      : 0
cntrltype : 0
  [7:2] : 0     Reserved
  [1:0] : 0     Controller type not reported
fguid     : 20202020-2020-2020-0000-000000000000
crdt1     : 0
crdt2     : 0
crdt3     : 0
nvmsr     : 0
  [1:1] : 0     NVM subsystem Not part of an Enclosure
  [0:0] : 0     NVM subsystem Not part of a Storage Device

vwci      : 0
  [7:7] : 0     VPD Write Cycles Remaining field is Not valid.
  [6:0] : 0     VPD Write Cycles Remaining 

mec       : 0
  [1:1] : 0     NVM subsystem Not contains a Management Endpoint on a PCIe port
  [0:0] : 0     NVM subsystem Not contains a Management Endpoint on an SMBus/I2C port

oacs      : 0x6
  [10:10] : 0   Lockdown Command and Feature Not Supported
  [9:9] : 0     Get LBA Status Capability Not Supported
  [8:8] : 0     Doorbell Buffer Config Not Supported
  [7:7] : 0     Virtualization Management Not Supported
  [6:6] : 0     NVMe-MI Send and Receive Not Supported
  [5:5] : 0     Directives Not Supported
  [4:4] : 0     Device Self-test Not Supported
  [3:3] : 0     NS Management and Attachment Not Supported
  [2:2] : 0x1   FW Commit and Download Supported
  [1:1] : 0x1   Format NVM Supported
  [0:0] : 0     Security Send and Receive Not Supported

acl       : 2
aerl      : 3
frmw      : 0x4
  [5:5] : 0     Multiple FW or Boot Update Detection Not Supported
  [4:4] : 0     Firmware Activate Without Reset Not Supported
  [3:1] : 0x2   Number of Firmware Slots
  [0:0] : 0     Firmware Slot 1 Read/Write

lpa       : 0x3
  [6:6] : 0     Telemetry Log Data Area 4 Not Supported
  [5:5] : 0     LID 0x0, Scope of each command in LID 0x5, 0x12, 0x13 Not Supported
  [4:4] : 0     Persistent Event log Not Supported
  [3:3] : 0     Telemetry host/controller initiated log page Not Supported
  [2:2] : 0     Extended data for Get Log Page Not Supported
  [1:1] : 0x1   Command Effects Log Page Supported
  [0:0] : 0x1   SMART/Health Log Page per NS Supported

elpe      : 63
  [7:0] : 63 (0's based)        Error Log Page Entries (ELPE)

npss      : 4
  [7:0] : 4 (0's based) Number of Power States Support (NPSS)

avscc     : 0x1
  [0:0] : 0x1   Admin Vendor Specific Commands uses NVMe Format

apsta     : 0x1
  [0:0] : 0x1   Autonomous Power State Transitions Supported

wctemp    : 393
 [15:0] : 120 °C (393 K)        Warning Composite Temperature Threshold (WCTEMP)

cctemp    : 403
 [15:0] : 130 °C (403 K)        Critical Composite Temperature Threshold (CCTEMP)

mtfa      : 100
hmpre     : 4096
hmmin     : 4096
tnvmcap   : 1.024.209.543.168
[127:0] : 1.024.209.543.168
        Total NVM Capacity (TNVMCAP)

unvmcap   : 0
[127:0] : 0
        Unallocated NVM Capacity (UNVMCAP)

rpmbs     : 0
 [31:24]: 0     Access Size
 [23:16]: 0     Total Size
  [5:3] : 0     Authentication Method
  [2:0] : 0     Number of RPMB Units

edstt     : 0
dsto      : 1
fwug      : 1
kas       : 0
hctma     : 0x1
  [0:0] : 0x1   Host Controlled Thermal Management Supported

mntmt     : 373
 [15:0] : 100 °C (373 K)        Minimum Thermal Management Temperature (MNTMT)

mxtmt     : 388
 [15:0] : 115 °C (388 K)        Maximum Thermal Management Temperature (MXTMT)

sanicap   : 0x2
  [31:30] : 0   Additional media modification after sanitize operation completes successfully is not defined
  [29:29] : 0   No-Deallocate After Sanitize bit in Sanitize command Supported
    [2:2] : 0   Overwrite Sanitize Operation Not Supported
    [1:1] : 0x1 Block Erase Sanitize Operation Supported
    [0:0] : 0   Crypto Erase Sanitize Operation Not Supported

hmminds   : 0
hmmaxd    : 0
nsetidmax : 0
endgidmax : 0
anatt     : 0
anacap    : 0
  [7:7] : 0     Non-zero group ID Not Supported
  [6:6] : 0     Group ID does change
  [4:4] : 0     ANA Change state Not Supported
  [3:3] : 0     ANA Persistent Loss state Not Supported
  [2:2] : 0     ANA Inaccessible state Not Supported
  [1:1] : 0     ANA Non-optimized state Not Supported
  [0:0] : 0     ANA Optimized state Not Supported

anagrpmax : 0
nanagrpid : 0
pels      : 0
domainid  : 0
megcap    : 0
sqes      : 0x66
  [7:4] : 0x6   Max SQ Entry Size (64)
  [3:0] : 0x6   Min SQ Entry Size (64)

cqes      : 0x44
  [7:4] : 0x4   Max CQ Entry Size (16)
  [3:0] : 0x4   Min CQ Entry Size (16)

maxcmd    : 0
nn        : 1
oncs      : 0xf
  [8:8] : 0     Copy Not Supported
  [7:7] : 0     Verify Not Supported
  [6:6] : 0     Timestamp Not Supported
  [5:5] : 0     Reservations Not Supported
  [4:4] : 0     Save and Select Not Supported
  [3:3] : 0x1   Write Zeroes Supported
  [2:2] : 0x1   Data Set Management Supported
  [1:1] : 0x1   Write Uncorrectable Supported
  [0:0] : 0x1   Compare Supported

fuses     : 0
  [0:0] : 0     Fused Compare and Write Not Supported

fna       : 0x3
  [3:3] : 0     Format NVM Broadcast NSID (FFFFFFFFh) Supported
  [2:2] : 0     Crypto Erase Not Supported as part of Secure Erase
  [1:1] : 0x1   Crypto Erase Applies to All Namespace(s)
  [0:0] : 0x1   Format Applies to All Namespace(s)

vwc       : 0x1
  [2:1] : 0     Support for the NSID field set to FFFFFFFFh is not indicated
  [0:0] : 0x1   Volatile Write Cache Present

awun      : 0
awupf     : 0
icsvscc   : 0
  [0:0] : 0     NVM Vendor Specific Commands uses Vendor Specific Format

nwpc      : 0
  [2:2] : 0     Permanent Write Protect Not Supported
  [1:1] : 0     Write Protect Until Power Supply Not Supported
  [0:0] : 0     No Write Protect and Write Protect Namespace Not Supported

acwu      : 0
ocfs      : 0
  [1:1] : 0     Controller Copy Format 1h Not Supported
  [0:0] : 0     Controller Copy Format 0h Not Supported

sgls      : 0
 [15:8] : 0     SGL Descriptor Threshold
 [1:0]  : 0     Scatter-Gather Lists Not Supported

mnan      : 0
maxdna    : 0
maxcna    : 0
subnqn    : 
ioccsz    : 0
iorcsz    : 0
icdoff    : 0
fcatt     : 0
  [0:0] : 0     Dynamic Controller Model

msdbd     : 0
ofcs      : 0
  [0:0] : 0     Disconnect command Not Supported

ps      0 : mp:6.50W operational enlat:0 exlat:0 rrt:0 rrl:0
            rwt:0 rwl:0 idle_power:- active_power:-
            active_power_workload:-
ps      1 : mp:5.80W operational enlat:0 exlat:0 rrt:1 rrl:1
            rwt:1 rwl:1 idle_power:- active_power:-
            active_power_workload:-
ps      2 : mp:3.60W operational enlat:0 exlat:0 rrt:2 rrl:2
            rwt:2 rwl:2 idle_power:- active_power:-
            active_power_workload:-
ps      3 : mp:0.0500W non-operational enlat:5000 exlat:10000 rrt:3 rrl:3
            rwt:3 rwl:3 idle_power:- active_power:-
            active_power_workload:-
ps      4 : mp:0.0025W non-operational enlat:5000 exlat:45000 rrt:4 rrl:4
            rwt:4 rwl:4 idle_power:- active_power:-
            active_power_workload:-

As you mentioned earlier, igaw, it seems that this drive does not support some sanitize options:

  [31:30] : 0   Additional media modification after sanitize operation completes successfully is not defined
  [29:29] : 0   No-Deallocate After Sanitize bit in Sanitize command Supported
    [2:2] : 0   Overwrite Sanitize Operation Not Supported
    [1:1] : 0x1 Block Erase Sanitize Operation Supported
    [0:0] : 0   Crypto Erase Sanitize Operation Not Supported

I'm also running Debian Testing with kernel 6.3.0-1-amd64.

igaw commented 1 year ago

As you mentioned earlier, igaw, it seems that this drive does not support some sanitize options:

That is okay; if the sanicap register as a whole is not 0, then sanitize is supported. In your case the 'Block Erase Sanitize Operation' is supported, the others are not.

I got some feedback from the mailing list and need to figure out what to do with it. The current assumption is that the direct mapping of the DMA buffers to userspace is the culprit and the devices are not behaving correctly. That means we tell a device to transfer 512 bytes but it transfers more, e.g. 4k. That would smash the stack.
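The failure mode described here can be sketched in a few lines of C (names and sizes below are illustrative assumptions, not udisks/libblockdev code): with direct mappings there is no kernel bounce buffer to absorb extra bytes, so anything a misbehaving device writes past the requested length lands immediately after the caller's stack buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Illustrative only: the NVMe Sanitize Status log page is 512 bytes, and a
 * caller such as bd_nvme_get_sanitize_log() holds a buffer of exactly that
 * size. The names here are hypothetical, not libblockdev's actual code. */
enum { SANITIZE_LOG_LEN = 512 };

/* Model of a misbehaving device: asked for req_len bytes, it DMAs back
 * dev_len bytes (e.g. a full 4 KiB). Returns how many bytes would land
 * past the end of a buf_len-sized buffer, i.e. the size of the overrun.
 * (The demo clamps the actual write so it stays memory-safe itself.) */
static size_t broken_device_transfer(uint8_t *buf, size_t buf_len,
                                     size_t req_len, size_t dev_len)
{
    (void)req_len;                       /* a broken device ignores it */
    size_t n = dev_len < buf_len ? dev_len : buf_len;
    memset(buf, 0xA5, n);                /* the part that fits */
    return dev_len > buf_len ? dev_len - buf_len : 0;
}
```

With a 512-byte buffer and a device that transfers 4 KiB, 3584 bytes would be written past the buffer, which is exactly the kind of overrun the stack protector reports as "stack smashing detected".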

tbzatek commented 1 year ago
udisksd[810]: Error performing housekeeping for drive /org/freedesktop/UDisks2/drives/J_2eZAO_QL_SERIES_1TB_SSD_SNhideFromPublic: Error updating Health Information: No probed controller info available (udisks-error-quark, 0)

This indicates that nvme_identify_ctrl() failed earlier in the process. In such a case udisks doesn't submit any nvme_get_log_sanitize() calls. Are you sure that nvme id-ctrl /dev/nvme0 actually works for you on 4.19.289?

Maybe the devices do not support SANICAP and thus we should not issue the command in the first place.

Good point. We actually check for controller capabilities (struct nvme_id_ctrl.sanicap) first, same for the self-test log.
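The capability check described here can be sketched as follows (a minimal sketch with hypothetical names; the real field lives in libnvme's struct nvme_id_ctrl):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Reduced, hypothetical view of the Identify Controller data; the real
 * struct nvme_id_ctrl comes from libnvme's headers. */
struct id_ctrl_view {
    uint32_t sanicap;   /* Sanitize Capabilities */
};

/* Per the discussion above: if SANICAP is zero the controller reports no
 * sanitize support at all, so the Sanitize Status log should not be
 * requested in the first place. A nonzero value (e.g. 0x2, Block Erase
 * supported, as on the drive in this report) means the log may be fetched. */
static bool should_fetch_sanitize_log(const struct id_ctrl_view *id)
{
    return id->sanicap != 0;
}
```

Note that the drive in this report advertises sanicap 0x2, so such a check would not have skipped the command here; the overrun comes from the device's response, not from an unsupported capability.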

allfoxwy commented 1 year ago

Are you sure that nvme id-ctrl /dev/nvme0 actually works for you on 4.19.289?

Yes, on 4.19.289 both nvme sanitize-log /dev/nvme0 and nvme id-ctrl /dev/nvme0 return results with exit code 0 and no stack smashing message.

And I could see /usr/libexec/udisks2/udisksd is running.

If I plug in a USB flash drive, it would be automatically mounted.

marcbres commented 1 year ago

Good Morning,

I've moved the "faulting" disk to another PC running Debian stable. It mounts fine, as on the original PC with Debian Testing.

# uname -a
Linux gondor.bres.cat 6.1.0-10-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.38-1 (2023-07-14) x86_64 GNU/Linux

With libnvme 1.3-1 there, executing the sanitize-log function returns an error, but does not produce the stack smashing:

# nvme sanitize-log -H /dev/nvme0
NVMe status: Invalid Field in Command: A reserved coded value or an unsupported value in a defined field(0x2)

I'm posting this in case it's somewhat useful.

igaw commented 1 year ago

Thanks for this test. I think we can safely say it's the device that is causing the issue. Still, with kernel 6.1 and libnvme 1.3 it works, which is strange. I have to check the kernel configs for stable and sid.

[edit] Were there any additional kernel logs?

Anyway, I got some feedback from the mailing list. The old behavior of the kernel was to copy the buffers from kernel space to user space, enforcing all the size limits. Newer kernels support direct DMA mappings to user space. Given this, it's possible that we request 512 bytes but the device writes back more than those 512 bytes and thus overwrites the stack.

To verify this theory we can play with IOMMU mappings and/or some hacks so that we never do the direct mappings. But this is effectively what you did by switching to Debian stable.

Anyway, I won't have time to work on this right now. I am off for a week.

marcbres commented 1 year ago

I've switched back to my main PC with Debian Testing, and doing the nvme sanitize-log to cause the stack smashing error creates the following entries on kern.log:

2023-07-29T07:48:08.655812+02:00 helm kernel: [  489.963214] nvme 0000:04:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0010 address=0x0 flags=0x0020]
2023-07-29T07:48:08.655814+02:00 helm kernel: [  489.963220] nvme 0000:04:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0010 address=0x80 flags=0x0020]
2023-07-29T07:48:08.655815+02:00 helm kernel: [  489.963223] nvme 0000:04:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0010 address=0x100 flags=0x0020]
2023-07-29T07:48:08.655816+02:00 helm kernel: [  489.963229] nvme 0000:04:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0010 address=0x180 flags=0x0020]
2023-07-29T07:48:08.655817+02:00 helm kernel: [  489.963234] nvme 0000:04:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0010 address=0x200 flags=0x0020]
2023-07-29T07:48:08.655817+02:00 helm kernel: [  489.963239] nvme 0000:04:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0010 address=0x280 flags=0x0020]
2023-07-29T07:48:08.655818+02:00 helm kernel: [  489.963242] nvme 0000:04:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0010 address=0x300 flags=0x0020]
2023-07-29T07:48:08.655819+02:00 helm kernel: [  489.963245] nvme 0000:04:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0010 address=0x380 flags=0x0020]
2023-07-29T07:48:08.655826+02:00 helm kernel: [  489.963248] nvme 0000:04:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0010 address=0x400 flags=0x0020]
2023-07-29T07:48:08.655827+02:00 helm kernel: [  489.963251] nvme 0000:04:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0010 address=0x480 flags=0x0020]

Hope this helps.

Anyway, I wont have time to work on this right now. I am off for a week.

No worries, I can still work with the PC if I don't upgrade the udisks2 package, so for me it is not urgent. Happy holidays!

phreer commented 1 year ago

I'm starting to suspect that it might be my SSD itself that is the bug. Not sure, just for info:

# lspci | grep NVMe
01:00.0 Non-Volatile memory controller: MAXIO Technology (Hangzhou) Ltd. NVMe SSD Controller MAP1202 (rev 01)

Hi folks, I have exactly the same NVMe from J.ZAO and have been facing this issue since upgrading my machine running Debian Sid a few weeks ago. So I suppose it is highly related to the hardware.

tbzatek commented 1 year ago

Any chance to test this? https://github.com/storaged-project/libblockdev/pull/951

Wanted to avoid any kind of blacklist, and this is a straight, dumb approach with the hope that 64k ought to be enough :tm:.
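The approach described here can be sketched roughly as follows (a minimal sketch under assumed names and sizes, not the actual PR #951 code): rather than handing the device a buffer sized exactly to the log page, hand it a generous 64 KiB heap buffer and copy out only the bytes that were requested, so a device that writes back too much corrupts heap slack instead of the caller's stack.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Sketch of the oversized-buffer workaround; names and sizes are
 * assumptions for illustration, not libblockdev's actual code. */
enum {
    SAFE_BUF_LEN     = 64 * 1024,  /* "64k ought to be enough" */
    SANITIZE_LOG_LEN = 512         /* size the spec defines for the page */
};

/* issue_cmd stands in for the admin-command ioctl that fills `buf`. */
typedef void (*issue_cmd_fn)(uint8_t *buf, size_t requested_len);

static int fetch_log_safely(uint8_t *out, size_t out_len, issue_cmd_fn issue_cmd)
{
    uint8_t *buf = calloc(1, SAFE_BUF_LEN);
    if (buf == NULL)
        return -1;
    /* Even if the device ignores requested_len and writes more back,
     * the extra bytes land in heap slack, not on the caller's stack. */
    issue_cmd(buf, SANITIZE_LOG_LEN);
    memcpy(out, buf, out_len < SANITIZE_LOG_LEN ? out_len : SANITIZE_LOG_LEN);
    free(buf);
    return 0;
}

/* A fake "broken" device for demonstration: writes 4 KiB regardless of
 * how many bytes were requested. */
static void fake_broken_device(uint8_t *buf, size_t requested_len)
{
    (void)requested_len;
    memset(buf, 0x5A, 4096);
}
```

This doesn't make the device behave, but it contains the damage: the caller's frame can no longer be overwritten, which matches the reports below that udisksd stays up with the patch applied.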

allfoxwy commented 1 year ago

Any chance to test this? storaged-project/libblockdev#951

Wanted to avoid any kind of blacklist, and this is a straight, dumb approach with the hope that 64k ought to be enough ™️.

Greetings. I think this patch is working on my end. Now I can see the udisks2 service is up and running, and it mounts a USB flash drive automatically.

It would be better to have someone else test the patch, so I'll write down what I tried:

- Downloaded the https://github.com/tbzatek/libblockdev/tree/nvme-27 source zip package.
- Ran "apt install autoconf-archive", as Debian's autoconf does not depend on autoconf-archive.
- Ran ./autogen.sh
- Ran ./configure
- Ran "apt install" for all those lib***-dev packages.
- Ran ./configure again.
- Ran make and make install.
- Edited /lib/systemd/system/udisks2.service, adding a line before ExecStart= so that the homebrew libblockdev is used:

Environment=LD_LIBRARY_PATH=/usr/local/lib

Reboot the system.

Thanks for your help! I didn't know one could write a working patch without having the bug at hand.

Mc-Pain commented 1 year ago

All I tried:

- linux-4.19.289: works
- linux-6.5-rc3: stack smashing
- linux-5.17.0-rc4+: stack smashing (this is torvalds/linux@52a9dab, the oldest kernel I could build with stock Debian GCC)

It stopped working on 5.8 (5.7 is fine), bisected to commit:

But it looks like reverting it is not enough. Removing all blk_queue_dma_alignment() calls from drivers/nvme doesn't fix it though.

UPD Re-checked: removing blk_queue_dma_alignment() from nvme_set_queue_limits() helps

igaw commented 1 year ago

Thanks for bisecting. From looking into what the effects are, I'd say 3b2a1ebceba3 and 52fde2c07da just enable zero-copy I/O instead of using a bounce buffer (see https://github.com/torvalds/linux/blob/4853c74bd7ab7fdb83f319bd9ace8a08c031e9b6/block/blk-map.c#L626).

So you should get it working by hacking blk_rq_map_user_iov to always use bounce buffers.

Could you try this?

diff --git a/block/blk-map.c b/block/blk-map.c
index 44d74a30ddac..7a13e0fcba7f 100644
--- a/block/blk-map.c
+++ b/block/blk-map.c
@@ -627,7 +627,7 @@ int blk_rq_map_user_iov(struct request_queue *q, struct request *rq,
                        struct rq_map_data *map_data,
                        const struct iov_iter *iter, gfp_t gfp_mask)
 {
-       bool copy = false, map_bvec = false;
+       bool copy = true, map_bvec = false;
        unsigned long align = q->dma_pad_mask | queue_dma_alignment(q);
        struct bio *bio = NULL;
        struct iov_iter i;
Mc-Pain commented 1 year ago

Could you try this?

Nope, it doesn't help.

UPD: it happens on certain disks; swapping in the working one helps. And the issue has started happening on another PC as well.

igaw commented 1 year ago

D'oh, the copy/map_bvec hack only works if we assign the values after the setter block:

diff --git a/block/blk-map.c b/block/blk-map.c
index 44d74a30ddac..8f1dd793e825 100644
--- a/block/blk-map.c
+++ b/block/blk-map.c
@@ -646,6 +646,9 @@ int blk_rq_map_user_iov(struct request_queue *q, struct request *rq,
        else if (queue_virt_boundary(q))
                copy = queue_virt_boundary(q) & iov_iter_gap_alignment(iter);

+       map_bvec = false;
+       copy = true;
+
        if (map_bvec) {
                ret = blk_rq_map_user_bvec(rq, iter);
                if (!ret)
Mc-Pain commented 1 year ago

With this hack, the if (map_bvec) condition will always be false.

UPD: it worked.

igaw commented 1 year ago

Great to hear. So nvme list doesn't crash and there are no kernel messages?

Mc-Pain commented 1 year ago

I never tested with nvme list; I did nvme sanitize-log /dev/nvme0n1.

igaw commented 1 year ago

Typo, I meant to write nvme sanitize-log. No entries in the kernel logs, right?

If you have some more test cycles you could enable all the memory debug options, especially KASAN. The kernel will then likely report the overwrite too.

allfoxwy commented 1 year ago

I tried the new HACK on linux-6.4.11, and it worked.

Both 'sudo nvme sanitize-log /dev/nvme0' and udisks2 work without stack smashing.

I built the kernel with CONFIG_KASAN=y, however there is no new info in /var/log/syslog when I run 'nvme sanitize-log'. Maybe some more setup is required beyond the kernel build option.

A few days ago I sent an e-mail to MAXIO about this but have received no reply. I think it would be best if the upstream controller chip maker fixed this on their side. And the Linux kernel could use some defenses against this kind of problem, like the Kernel DMA Protection in Windows. Just a wish.

igaw commented 1 year ago

I built the kernel with CONFIG_KASAN=y, however there is no new info in /var/log/syslog when I run 'nvme sanitize-log'. Maybe some more setup is required beyond the kernel build option.

I need to check how this part of the code works; it might just allocate a bigger buffer and thus avoid the problem. But thanks for testing. Highly appreciated.

A few days ago I sent an e-mail to MAXIO about this but have received no reply. I think it would be best if the upstream controller chip maker fixed this on their side.

This is obviously the preferred solution, but I'm not sure you will get any feedback. Thanks for trying, though!

And the Linux kernel could use some defenses against this kind of problem, like the Kernel DMA Protection in Windows. Just a wish.

Indeed. We now have all the data to discuss a solution. Let's see what the outcome on the nvme mailing list is.

cvpsmith commented 1 year ago

Do "apt install" for all those lib***-dev packages. Run ./configure again

Man, you weren't EVEN joking there! ;)
Thanks for the VERY helpful how-to. Worked like a charm!

EDIT: For the record, udisks2.service now runs and everything works, except that "nvme sanitize-log /dev/nvme0n1" still produces a stack smash. This is on Ubuntu 23.10 mantic with a Fanxiang S500PRO 1TB (SN07939) NVMe drive.

EDIT2: I also noticed that, not only does this bug cause gnome-disks to stop working (which makes sense), but it ALSO causes Gnome=>Settings=>About to freeze up the Settings app entirely (that section reports disk capacity). Once udisks2 is running properly, the issue goes away...