Thanks, this is useful. The issue is in libnvme then; udisks-2.10.0 made first use of it in this release. Can you also provide the full uname -a output and your hardware architecture, please?
@mbiebl is it possible to find out the exact CFLAGS used (or a build log) for the libnvme-1.5-1 build? The nvme_get_log_sanitize() function is a simple ioctl() wrapper with the returned memory chunk mapped to struct nvme_sanitize_log_page. I sense a misalignment or size mismatch somewhere...
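For illustration, a minimal sketch of this call pattern (my own example, assuming the libnvme 1.x signature int nvme_get_log_sanitize(int fd, bool rae, struct nvme_sanitize_log_page *log) and its le16_to_cpu helper, not the actual udisks code); the 512-byte log page lives on the stack, which is exactly what -fstack-protector-strong guards:

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <libnvme.h>

int main(void)
{
	/* the 512-byte log page is allocated on the stack */
	struct nvme_sanitize_log_page log = { 0 };
	int fd = open("/dev/nvme0", O_RDONLY);

	if (fd < 0)
		return 1;

	/* thin ioctl() wrapper: the kernel (or the device, via DMA) fills
	 * the buffer; if more than sizeof(log) bytes come back, the canary
	 * placed by -fstack-protector-strong gets clobbered */
	if (nvme_get_log_sanitize(fd, false, &log) == 0)
		printf("sstat: 0x%x\n", le16_to_cpu(log.sstat));

	close(fd);
	return 0;
}
```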
uname -a: Linux helm 6.3.0-1-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.3.7-1 (2023-06-12) x86_64 GNU/Linux
By hardware architecture, I think you mean what devices it has; I'll tell you the CPU and disks, and if you need something else, please tell me.
CPU: AMD Ryzen 5 5600X
NVMe disks:
1TB Gigabyte - Linux root and /home partitions
1TB NE-1TB - Windows NTFS partition for games <<-- This is the disk which causes the stack smashing when doing nvme sanitize-log
SSD disks:
240GB P4-240 - Windows C: disk and EFI partition
960GB Kingston - Linux additional disk /media/hdd (for downloads and game installations)
@mbiebl is it possible to find out the exact CFLAGS used (or a build log) for the libnvme-1.5-1 build? The nvme_get_log_sanitize() function is a simple ioctl() wrapper with the returned memory chunk mapped to struct nvme_sanitize_log_page. I sense a misalignment or size mismatch somewhere...
Build logs can be found at https://buildd.debian.org/status/package.php?p=libnvme , e.g. https://buildd.debian.org/status/fetch.php?pkg=libnvme&arch=amd64&ver=1.5-1&stamp=1688172456&raw=0
Looking at https://git.progress-linux.org/users/daniel.baumann/debian/packages/libnvme/tree/debian/rules it seems to use the default dpkg-provided build flags:
$ dpkg-buildflags
ASFLAGS=
CFLAGS=-g -O2 -ffile-prefix-map=/tmp/libnvme-1.5=. -fstack-protector-strong -Wformat -Werror=format-security
CPPFLAGS=-Wdate-time -D_FORTIFY_SOURCE=2
CXXFLAGS=-g -O2 -ffile-prefix-map=/tmp/libnvme-1.5=. -fstack-protector-strong -Wformat -Werror=format-security
DFLAGS=-frelease
FCFLAGS=-g -O2 -ffile-prefix-map=/tmp/libnvme-1.5=. -fstack-protector-strong
FFLAGS=-g -O2 -ffile-prefix-map=/tmp/libnvme-1.5=. -fstack-protector-strong
GCJFLAGS=-g -O2 -ffile-prefix-map=/tmp/libnvme-1.5=. -fstack-protector-strong
LDFLAGS=-Wl,-z,relro
OBJCFLAGS=-g -O2 -ffile-prefix-map=/tmp/libnvme-1.5=. -fstack-protector-strong -Wformat -Werror=format-security
OBJCXXFLAGS=-g -O2 -ffile-prefix-map=/tmp/libnvme-1.5=. -fstack-protector-strong -Wformat -Werror=format-security
Adding the downstream bug report as a cross-reference: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1041689
Thanks, this is no different from our Fedora builds with -fstack-protector-strong -fstack-clash-protection -fcf-protection:
CFLAGS='-O2 -flto=auto -ffat-lto-objects -fexceptions -g -grecord-gcc-switches -pipe -Wall -Werror=format-security -Wp,-D_FORTIFY_SOURCE=2 -Wp,-D_GLIBCXX_ASSERTIONS -specs=/usr/lib/rpm/redhat/redhat-hardened-cc1 -fstack-protector-strong -specs=/usr/lib/rpm/redhat/redhat-annobin-cc1 -m64 -march=x86-64-v2 -mtune=generic -fasynchronous-unwind-tables -fstack-clash-protection -fcf-protection'
I wonder what's so different on Debian builds.
-fstack-clash-protection
Could this option make a difference? The Debian build seems to be missing it.
Looping in @igaw
Does nvme sanitize-log result in a stack smash? It's basically the same code as in https://github.com/storaged-project/libblockdev/blob/c79f84780dd73fe16d8a52550f74d8059af6c2e0/src/plugins/nvme/nvme-info.c#L1026
I checked the alignment of struct nvme_sanitize_log_page and it looks reasonable:
__le16 sprog; /* 0 2 */
__le16 sstat; /* 2 2 */
__le32 scdw10; /* 4 4 */
__le32 eto; /* 8 4 */
__le32 etbe; /* 12 4 */
__le32 etce; /* 16 4 */
__le32 etond; /* 20 4 */
__le32 etbend; /* 24 4 */
__le32 etcend; /* 28 4 */
__u8 rsvd32[480]; /* 32 480 */
/* size: 512, cachelines: 8, members: 10 */
};
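As a quick cross-check of that layout, a compile-time sketch using C11 static_assert (my addition, not code from the library):

```c
#include <assert.h>
#include <libnvme.h>

/* the sanitize log page is defined as exactly 512 bytes; any padding
 * or alignment surprise in the struct would fail this at compile time */
static_assert(sizeof(struct nvme_sanitize_log_page) == 512,
              "unexpected nvme_sanitize_log_page size");
```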
-fstack-clash-protection
It still works for me, but I see that valgrind complains when I execute nvme sanitize-log, though all the complaints are in the printf code of nvme-cli. Let me check what's going on there.
It looks like valgrind is not able to figure out that the buffer will be written by the kernel, thus it reports that all accesses to the buffer rely on uninitialized memory. When I pre-initialize the struct, all is good: struct nvme_sanitize_log_page sanitize_log = { 0, };. And looking at the nvme-info.c file, the buffer is also initialized there, so that can't be the problem.
Does nvme sanitize-log result in a stack smash?
Never mind, I saw that in the Debian bugzilla this was already done and indeed it smashes the stack as well. Let me install a Debian system for testing.
I can't reproduce it. I have Debian sid installed in a VM, using the NVMe emulation of QEMU for testing, compiled with the flags from above. Unfortunately, it doesn't implement the sanitize feature. I even tried hacking around, still no 'luck'.
I can still reproduce it; if you tell me what I have to do, I will.
Does nvme sanitize-log result in a stack smash? It's basically the same code as in https://github.com/storaged-project/libblockdev/blob/c79f84780dd73fe16d8a52550f74d8059af6c2e0/src/plugins/nvme/nvme-info.c#L1026
Greetings. I think I'm facing the same bug here, and yes, nvme sanitize-log smashes the stack as well.
# nvme sanitize-log /dev/nvme0
Sanitize Progress (SPROG) : 65535
Sanitize Status (SSTAT) : 0x1
Sanitize Command Dword 10 Information (SCDW10) : 0
Estimated Time For Overwrite : 4294967295 (No time period reported)
Estimated Time For Block Erase : 4294967295 (No time period reported)
Estimated Time For Crypto Erase : 4294967295 (No time period reported)
Estimated Time For Overwrite (No-Deallocate) : 0
Estimated Time For Block Erase (No-Deallocate) : 0
Estimated Time For Crypto Erase (No-Deallocate): 0
*** stack smashing detected ***: terminated
Aborted
I'm starting to suspect it might be my SSD that is the bug. Not sure; just for info:
# lspci | grep NVMe
01:00.0 Non-Volatile memory controller: MAXIO Technology (Hangzhou) Ltd. NVMe SSD Controller MAP1202 (rev 01)
@marcbres or @allfoxwy could you post the output of nvme id-ctrl -H /dev/nvme0? Maybe the devices do not support SANICAP and thus we should not issue the command in the first place.
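A minimal sketch of such a guard, assuming the libnvme 1.x identify API int nvme_identify_ctrl(int fd, struct nvme_id_ctrl *id): skip the sanitize log request when the controller reports SANICAP as zero.

```c
#include <stdbool.h>
#include <libnvme.h>

/* sketch: only query the sanitize log when the controller advertises
 * sanitize support via a non-zero SANICAP field in Identify Controller */
static bool sanitize_log_supported(int fd)
{
	struct nvme_id_ctrl id = { 0 };

	if (nvme_identify_ctrl(fd, &id) != 0)
		return false;

	/* zero means no sanitize operation is supported at all, so a
	 * Get Log Page for the sanitize status log should not be issued */
	return id.sanicap != 0;
}
```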
could you post the output of nvme id-ctrl -H /dev/nvme0?
# nvme id-ctrl -H /dev/nvme0
NVME Identify Controller:
vid : 0x1e4b
ssvid : 0x1e4b
sn : (hide from public)
mn : J.ZAO QL SERIES 1TB SSD
fr : 9139
rab : 0
ieee : 3a5a27
cmic : 0
[3:3] : 0 ANA not supported
[2:2] : 0 PCI
[1:1] : 0 Single Controller
[0:0] : 0 Single Port
mdts : 9
cntlid : 0
ver : 0x10400
rtd3r : 0x186a0
rtd3e : 0x1e8480
oaes : 0x200
[31:31] : 0 Discovery Log Change Notice Not Supported
[27:27] : 0 Zone Descriptor Changed Notices Not Supported
[15:15] : 0 Normal NSS Shutdown Event Not Supported
[14:14] : 0 Endurance Group Event Aggregate Log Page Change Notice Not Supported
[13:13] : 0 LBA Status Information Notices Not Supported
[12:12] : 0 Predictable Latency Event Aggregate Log Change Notices Not Supported
[11:11] : 0 Asymmetric Namespace Access Change Notices Not Supported
[9:9] : 0x1 Firmware Activation Notices Supported
[8:8] : 0 Namespace Attribute Changed Event Not Supported
ctratt : 0
[19:19] : 0 Flexible Data Placement Not Supported
[15:15] : 0 Extended LBA Formats Not Supported
[14:14] : 0 Delete NVM Set Not Supported
[13:13] : 0 Delete Endurance Group Not Supported
[12:12] : 0 Variable Capacity Management Not Supported
[11:11] : 0 Fixed Capacity Management Not Supported
[10:10] : 0 Multi Domain Subsystem Not Supported
[9:9] : 0 UUID List Not Supported
[8:8] : 0 SQ Associations Not Supported
[7:7] : 0 Namespace Granularity Not Supported
[6:6] : 0 Traffic Based Keep Alive Not Supported
[5:5] : 0 Predictable Latency Mode Not Supported
[4:4] : 0 Endurance Groups Not Supported
[3:3] : 0 Read Recovery Levels Not Supported
[2:2] : 0 NVM Sets Not Supported
[1:1] : 0 Non-Operational Power State Permissive Not Supported
[0:0] : 0 128-bit Host Identifier Not Supported
rrls : 0
cntrltype : 1
[7:2] : 0 Reserved
[1:0] : 0x1 I/O Controller
fguid : 00000000-0000-0000-0000-000000000000
crdt1 : 0
crdt2 : 0
crdt3 : 0
nvmsr : 0
[1:1] : 0 NVM subsystem Not part of an Enclosure
[0:0] : 0 NVM subsystem Not part of a Storage Device
vwci : 0
[7:7] : 0 VPD Write Cycles Remaining field is Not valid.
[6:0] : 0 VPD Write Cycles Remaining
mec : 0
[1:1] : 0 NVM subsystem Not contains a Management Endpoint on a PCIe port
[0:0] : 0 NVM subsystem Not contains a Management Endpoint on an SMBus/I2C port
oacs : 0x17
[10:10] : 0 Lockdown Command and Feature Not Supported
[9:9] : 0 Get LBA Status Capability Not Supported
[8:8] : 0 Doorbell Buffer Config Not Supported
[7:7] : 0 Virtualization Management Not Supported
[6:6] : 0 NVMe-MI Send and Receive Not Supported
[5:5] : 0 Directives Not Supported
[4:4] : 0x1 Device Self-test Supported
[3:3] : 0 NS Management and Attachment Not Supported
[2:2] : 0x1 FW Commit and Download Supported
[1:1] : 0x1 Format NVM Supported
[0:0] : 0x1 Security Send and Receive Supported
acl : 2
aerl : 3
frmw : 0x14
[5:5] : 0 Multiple FW or Boot Update Detection Not Supported
[4:4] : 0x1 Firmware Activate Without Reset Supported
[3:1] : 0x2 Number of Firmware Slots
[0:0] : 0 Firmware Slot 1 Read/Write
lpa : 0x2
[6:6] : 0 Telemetry Log Data Area 4 Not Supported
[5:5] : 0 LID 0x0, Scope of each command in LID 0x5, 0x12, 0x13 Not Supported
[4:4] : 0 Persistent Event log Not Supported
[3:3] : 0 Telemetry host/controller initiated log page Not Supported
[2:2] : 0 Extended data for Get Log Page Not Supported
[1:1] : 0x1 Command Effects Log Page Supported
[0:0] : 0 SMART/Health Log Page per NS Not Supported
elpe : 63
[7:0] : 63 (0's based) Error Log Page Entries (ELPE)
npss : 4
[7:0] : 4 (0's based) Number of Power States Support (NPSS)
avscc : 0x1
[0:0] : 0x1 Admin Vendor Specific Commands uses NVMe Format
apsta : 0x1
[0:0] : 0x1 Autonomous Power State Transitions Supported
wctemp : 363
[15:0] : 90 °C (363 K) Warning Composite Temperature Threshold (WCTEMP)
cctemp : 368
[15:0] : 95 °C (368 K) Critical Composite Temperature Threshold (CCTEMP)
mtfa : 100
hmpre : 4096
hmmin : 4096
tnvmcap : 1,024,209,543,168
[127:0] : 1,024,209,543,168
Total NVM Capacity (TNVMCAP)
unvmcap : 0
[127:0] : 0
Unallocated NVM Capacity (UNVMCAP)
rpmbs : 0
[31:24]: 0 Access Size
[23:16]: 0 Total Size
[5:3] : 0 Authentication Method
[2:0] : 0 Number of RPMB Units
edstt : 5
dsto : 1
fwug : 1
kas : 0
hctma : 0x1
[0:0] : 0x1 Host Controlled Thermal Management Supported
mntmt : 373
[15:0] : 100 °C (373 K) Minimum Thermal Management Temperature (MNTMT)
mxtmt : 388
[15:0] : 115 °C (388 K) Maximum Thermal Management Temperature (MXTMT)
sanicap : 0x2
[31:30] : 0 Additional media modification after sanitize operation completes successfully is not defined
[29:29] : 0 No-Deallocate After Sanitize bit in Sanitize command Supported
[2:2] : 0 Overwrite Sanitize Operation Not Supported
[1:1] : 0x1 Block Erase Sanitize Operation Supported
[0:0] : 0 Crypto Erase Sanitize Operation Not Supported
hmminds : 4
hmmaxd : 32
nsetidmax : 0
endgidmax : 0
anatt : 0
anacap : 0
[7:7] : 0 Non-zero group ID Not Supported
[6:6] : 0 Group ID does change
[4:4] : 0 ANA Change state Not Supported
[3:3] : 0 ANA Persistent Loss state Not Supported
[2:2] : 0 ANA Inaccessible state Not Supported
[1:1] : 0 ANA Non-optimized state Not Supported
[0:0] : 0 ANA Optimized state Not Supported
anagrpmax : 0
nanagrpid : 0
pels : 0
domainid : 0
megcap : 0
sqes : 0x66
[7:4] : 0x6 Max SQ Entry Size (64)
[3:0] : 0x6 Min SQ Entry Size (64)
cqes : 0x44
[7:4] : 0x4 Max CQ Entry Size (16)
[3:0] : 0x4 Min CQ Entry Size (16)
maxcmd : 0
nn : 1
oncs : 0x1f
[8:8] : 0 Copy Not Supported
[7:7] : 0 Verify Not Supported
[6:6] : 0 Timestamp Not Supported
[5:5] : 0 Reservations Not Supported
[4:4] : 0x1 Save and Select Supported
[3:3] : 0x1 Write Zeroes Supported
[2:2] : 0x1 Data Set Management Supported
[1:1] : 0x1 Write Uncorrectable Supported
[0:0] : 0x1 Compare Supported
fuses : 0
[0:0] : 0 Fused Compare and Write Not Supported
fna : 0
[3:3] : 0 Format NVM Broadcast NSID (FFFFFFFFh) Supported
[2:2] : 0 Crypto Erase Not Supported as part of Secure Erase
[1:1] : 0 Crypto Erase Applies to Single Namespace(s)
[0:0] : 0 Format Applies to Single Namespace(s)
vwc : 0x7
[2:1] : 0x3 The Flush command supports NSID set to FFFFFFFFh
[0:0] : 0x1 Volatile Write Cache Present
awun : 0
awupf : 0
icsvscc : 0
[0:0] : 0 NVM Vendor Specific Commands uses Vendor Specific Format
nwpc : 0
[2:2] : 0 Permanent Write Protect Not Supported
[1:1] : 0 Write Protect Until Power Supply Not Supported
[0:0] : 0 No Write Protect and Write Protect Namespace Not Supported
acwu : 0
ocfs : 0
[1:1] : 0 Controller Copy Format 1h Not Supported
[0:0] : 0 Controller Copy Format 0h Not Supported
sgls : 0
[15:8] : 0 SGL Descriptor Threshold
[1:0] : 0 Scatter-Gather Lists Not Supported
mnan : 0
maxdna : 0
maxcna : 0
subnqn :
ioccsz : 0
iorcsz : 0
icdoff : 0
fcatt : 0
[0:0] : 0 Dynamic Controller Model
msdbd : 0
ofcs : 0
[0:0] : 0 Disconnect command Not Supported
ps 0 : mp:6.50W operational enlat:0 exlat:0 rrt:0 rrl:0
rwt:0 rwl:0 idle_power:- active_power:-
active_power_workload:-
ps 1 : mp:5.80W operational enlat:0 exlat:0 rrt:1 rrl:1
rwt:1 rwl:1 idle_power:- active_power:-
active_power_workload:-
ps 2 : mp:3.60W operational enlat:0 exlat:0 rrt:2 rrl:2
rwt:2 rwl:2 idle_power:- active_power:-
active_power_workload:-
ps 3 : mp:0.7460W non-operational enlat:5000 exlat:10000 rrt:3 rrl:3
rwt:3 rwl:3 idle_power:- active_power:-
active_power_workload:-
ps 4 : mp:0.7260W non-operational enlat:8000 exlat:45000 rrt:4 rrl:4
rwt:4 rwl:4 idle_power:- active_power:-
active_power_workload:-
I will run nvme id-ctrl this afternoon when I get home. The NVMe disk that causes the stack smashing on my PC is a KingSpec-brand one bought from AliExpress, while the other NVMe disk, a Gigabyte brand, executes sanitize-log without problems.
From allfoxwy's log, it seems it is also happening on a Chinese-brand disk (J.ZAO). Maybe these disks are not following the NVMe standard specifications and this is causing the problems.
Anyway, this afternoon I will send the id-ctrl output to see if it helps.
Okay, while at it, could you check if this commit is part of the kernel you are running:
4aedb705437f ("nvme-pci: split metadata handling from nvme_map_data / nvme_unmap_data")
There is a fix for this one which might be necessary: b8f6446b6853 ("nvme-pci: fix DMA direction of unmapping integrity data")
(btw, this is just wild guessing on my side)
I'm running Debian testing with kernel 6.3.0-1-amd64. I would need some time to check that commit.
I can't find 4aedb705437f in https://github.com/torvalds/linux
As for b8f6446b6853, I can see it is NOT in my kernel: https://sources.debian.org/src/linux/6.3.7-1/drivers/nvme/host/pci.c/#L972
4aedb705437f is part of v5.2 and b8f6446b6853 is part of the current merge window, but as I said, not necessarily relevant. I just stumbled across them while trying to understand what's going on at the lower levels.
Well, one experiment is to check whether v5.1 also shows this problem, if you have the time and the means. The v5.2 kernel had a bunch of refactoring in this area.
I think there is some good news here.
I tried building linux-4.19.289 (the last LTS before v5.2) and it seems to work with udisks2. However, linux-6.5-rc3 still smashes the stack.
linux-4.19.289 still gets an error in its syslog, but without termination:
udisksd[810]: Error performing housekeeping for drive /org/freedesktop/UDisks2/drives/J_2eZAO_QL_SERIES_1TB_SSD_SNhideFromPublic: Error updating Health Information: No probed controller info available (udisks-error-quark, 0)
So I thought maybe I could git bisect to find the commit that causes the problem.
But I failed to do so. There was a GCC update last year; Linux kernels between 4.20 and v5.17-rc5 will not build with Debian GCC 10, 11, or 12. They need another patch: https://github.com/torvalds/linux/commit/52a9dab6d892763b2a8334a568bd4e2c1a6fde66
I tried and failed to cherry-pick this patch onto linux-v4.20; GCC complained about another problem, something about a table address, and I didn't dig in.
As 4.19.289 works, it might be a kernel bug. However, I'm still not convinced; maybe my SSD just has some strange problem. It is a cheap but branded one. I'll report back if I have more news.
Thanks for doing the testing. Just to clarify: Linux 4.19.289 works, but newer kernels break with the exact same userspace?
On my end, linux-4.19.289 works. No userspace change.
About "newer": linux-4.19.289 is an LTS release, so by release date it is newer than linux-5.1, so I can't pin down the bug with "newer" or "older".
All I tried:
linux-4.19.289 works
linux-6.5-rc3 stack smashing
linux-5.17.0-rc4+ stack smashing <-- this is https://github.com/torvalds/linux/commit/52a9dab6d892763b2a8334a568bd4e2c1a6fde66 , the oldest kernel I could build with stock Debian GCC
Alright, I'll post this info on the nvme kernel mailing list. Maybe someone spots the problem.
Thanks for the concern.
I think it would be better to wait for some input from marcbres, as a single sample from me could just be something wrong on my end.
No worries. If an older kernel works and a newer one doesn't, it's a strong indication that we have to look at the kernel.
https://lore.kernel.org/linux-nvme/dckvq3ohgqdkxg7wz4vslf36gaivgldl742vew67oyh2buj5nr@7ypn5ydt6kss/
Here is the nvme id-ctrl -H /dev/nvme1 output for the drive that causes the smashing:
NVME Identify Controller:
vid : 0x1e4b
ssvid : 0x1e4b
sn : 9110129M00467
mn : NE-1TB
fr : SN02486
rab : 6
ieee : 000000
cmic : 0
[3:3] : 0 ANA not supported
[2:2] : 0 PCI
[1:1] : 0 Single Controller
[0:0] : 0 Single Port
mdts : 9
cntlid : 0
ver : 0x10300
rtd3r : 0x7a120
rtd3e : 0x1e8480
oaes : 0x200
[31:31] : 0 Discovery Log Change Notice Not Supported
[27:27] : 0 Zone Descriptor Changed Notices Not Supported
[15:15] : 0 Normal NSS Shutdown Event Not Supported
[14:14] : 0 Endurance Group Event Aggregate Log Page Change Notice Not Supported
[13:13] : 0 LBA Status Information Notices Not Supported
[12:12] : 0 Predictable Latency Event Aggregate Log Change Notices Not Supported
[11:11] : 0 Asymmetric Namespace Access Change Notices Not Supported
[9:9] : 0x1 Firmware Activation Notices Supported
[8:8] : 0 Namespace Attribute Changed Event Not Supported
ctratt : 0
[19:19] : 0 Flexible Data Placement Not Supported
[15:15] : 0 Extended LBA Formats Not Supported
[14:14] : 0 Delete NVM Set Not Supported
[13:13] : 0 Delete Endurance Group Not Supported
[12:12] : 0 Variable Capacity Management Not Supported
[11:11] : 0 Fixed Capacity Management Not Supported
[10:10] : 0 Multi Domain Subsystem Not Supported
[9:9] : 0 UUID List Not Supported
[8:8] : 0 SQ Associations Not Supported
[7:7] : 0 Namespace Granularity Not Supported
[6:6] : 0 Traffic Based Keep Alive Not Supported
[5:5] : 0 Predictable Latency Mode Not Supported
[4:4] : 0 Endurance Groups Not Supported
[3:3] : 0 Read Recovery Levels Not Supported
[2:2] : 0 NVM Sets Not Supported
[1:1] : 0 Non-Operational Power State Permissive Not Supported
[0:0] : 0 128-bit Host Identifier Not Supported
rrls : 0
cntrltype : 0
[7:2] : 0 Reserved
[1:0] : 0 Controller type not reported
fguid : 20202020-2020-2020-0000-000000000000
crdt1 : 0
crdt2 : 0
crdt3 : 0
nvmsr : 0
[1:1] : 0 NVM subsystem Not part of an Enclosure
[0:0] : 0 NVM subsystem Not part of a Storage Device
vwci : 0
[7:7] : 0 VPD Write Cycles Remaining field is Not valid.
[6:0] : 0 VPD Write Cycles Remaining
mec : 0
[1:1] : 0 NVM subsystem Not contains a Management Endpoint on a PCIe port
[0:0] : 0 NVM subsystem Not contains a Management Endpoint on an SMBus/I2C port
oacs : 0x6
[10:10] : 0 Lockdown Command and Feature Not Supported
[9:9] : 0 Get LBA Status Capability Not Supported
[8:8] : 0 Doorbell Buffer Config Not Supported
[7:7] : 0 Virtualization Management Not Supported
[6:6] : 0 NVMe-MI Send and Receive Not Supported
[5:5] : 0 Directives Not Supported
[4:4] : 0 Device Self-test Not Supported
[3:3] : 0 NS Management and Attachment Not Supported
[2:2] : 0x1 FW Commit and Download Supported
[1:1] : 0x1 Format NVM Supported
[0:0] : 0 Security Send and Receive Not Supported
acl : 2
aerl : 3
frmw : 0x4
[5:5] : 0 Multiple FW or Boot Update Detection Not Supported
[4:4] : 0 Firmware Activate Without Reset Not Supported
[3:1] : 0x2 Number of Firmware Slots
[0:0] : 0 Firmware Slot 1 Read/Write
lpa : 0x3
[6:6] : 0 Telemetry Log Data Area 4 Not Supported
[5:5] : 0 LID 0x0, Scope of each command in LID 0x5, 0x12, 0x13 Not Supported
[4:4] : 0 Persistent Event log Not Supported
[3:3] : 0 Telemetry host/controller initiated log page Not Supported
[2:2] : 0 Extended data for Get Log Page Not Supported
[1:1] : 0x1 Command Effects Log Page Supported
[0:0] : 0x1 SMART/Health Log Page per NS Supported
elpe : 63
[7:0] : 63 (0's based) Error Log Page Entries (ELPE)
npss : 4
[7:0] : 4 (0's based) Number of Power States Support (NPSS)
avscc : 0x1
[0:0] : 0x1 Admin Vendor Specific Commands uses NVMe Format
apsta : 0x1
[0:0] : 0x1 Autonomous Power State Transitions Supported
wctemp : 393
[15:0] : 120 °C (393 K) Warning Composite Temperature Threshold (WCTEMP)
cctemp : 403
[15:0] : 130 °C (403 K) Critical Composite Temperature Threshold (CCTEMP)
mtfa : 100
hmpre : 4096
hmmin : 4096
tnvmcap : 1.024.209.543.168
[127:0] : 1.024.209.543.168
Total NVM Capacity (TNVMCAP)
unvmcap : 0
[127:0] : 0
Unallocated NVM Capacity (UNVMCAP)
rpmbs : 0
[31:24]: 0 Access Size
[23:16]: 0 Total Size
[5:3] : 0 Authentication Method
[2:0] : 0 Number of RPMB Units
edstt : 0
dsto : 1
fwug : 1
kas : 0
hctma : 0x1
[0:0] : 0x1 Host Controlled Thermal Management Supported
mntmt : 373
[15:0] : 100 °C (373 K) Minimum Thermal Management Temperature (MNTMT)
mxtmt : 388
[15:0] : 115 °C (388 K) Maximum Thermal Management Temperature (MXTMT)
sanicap : 0x2
[31:30] : 0 Additional media modification after sanitize operation completes successfully is not defined
[29:29] : 0 No-Deallocate After Sanitize bit in Sanitize command Supported
[2:2] : 0 Overwrite Sanitize Operation Not Supported
[1:1] : 0x1 Block Erase Sanitize Operation Supported
[0:0] : 0 Crypto Erase Sanitize Operation Not Supported
hmminds : 0
hmmaxd : 0
nsetidmax : 0
endgidmax : 0
anatt : 0
anacap : 0
[7:7] : 0 Non-zero group ID Not Supported
[6:6] : 0 Group ID does change
[4:4] : 0 ANA Change state Not Supported
[3:3] : 0 ANA Persistent Loss state Not Supported
[2:2] : 0 ANA Inaccessible state Not Supported
[1:1] : 0 ANA Non-optimized state Not Supported
[0:0] : 0 ANA Optimized state Not Supported
anagrpmax : 0
nanagrpid : 0
pels : 0
domainid : 0
megcap : 0
sqes : 0x66
[7:4] : 0x6 Max SQ Entry Size (64)
[3:0] : 0x6 Min SQ Entry Size (64)
cqes : 0x44
[7:4] : 0x4 Max CQ Entry Size (16)
[3:0] : 0x4 Min CQ Entry Size (16)
maxcmd : 0
nn : 1
oncs : 0xf
[8:8] : 0 Copy Not Supported
[7:7] : 0 Verify Not Supported
[6:6] : 0 Timestamp Not Supported
[5:5] : 0 Reservations Not Supported
[4:4] : 0 Save and Select Not Supported
[3:3] : 0x1 Write Zeroes Supported
[2:2] : 0x1 Data Set Management Supported
[1:1] : 0x1 Write Uncorrectable Supported
[0:0] : 0x1 Compare Supported
fuses : 0
[0:0] : 0 Fused Compare and Write Not Supported
fna : 0x3
[3:3] : 0 Format NVM Broadcast NSID (FFFFFFFFh) Supported
[2:2] : 0 Crypto Erase Not Supported as part of Secure Erase
[1:1] : 0x1 Crypto Erase Applies to All Namespace(s)
[0:0] : 0x1 Format Applies to All Namespace(s)
vwc : 0x1
[2:1] : 0 Support for the NSID field set to FFFFFFFFh is not indicated
[0:0] : 0x1 Volatile Write Cache Present
awun : 0
awupf : 0
icsvscc : 0
[0:0] : 0 NVM Vendor Specific Commands uses Vendor Specific Format
nwpc : 0
[2:2] : 0 Permanent Write Protect Not Supported
[1:1] : 0 Write Protect Until Power Supply Not Supported
[0:0] : 0 No Write Protect and Write Protect Namespace Not Supported
acwu : 0
ocfs : 0
[1:1] : 0 Controller Copy Format 1h Not Supported
[0:0] : 0 Controller Copy Format 0h Not Supported
sgls : 0
[15:8] : 0 SGL Descriptor Threshold
[1:0] : 0 Scatter-Gather Lists Not Supported
mnan : 0
maxdna : 0
maxcna : 0
subnqn :
ioccsz : 0
iorcsz : 0
icdoff : 0
fcatt : 0
[0:0] : 0 Dynamic Controller Model
msdbd : 0
ofcs : 0
[0:0] : 0 Disconnect command Not Supported
ps 0 : mp:6.50W operational enlat:0 exlat:0 rrt:0 rrl:0
rwt:0 rwl:0 idle_power:- active_power:-
active_power_workload:-
ps 1 : mp:5.80W operational enlat:0 exlat:0 rrt:1 rrl:1
rwt:1 rwl:1 idle_power:- active_power:-
active_power_workload:-
ps 2 : mp:3.60W operational enlat:0 exlat:0 rrt:2 rrl:2
rwt:2 rwl:2 idle_power:- active_power:-
active_power_workload:-
ps 3 : mp:0.0500W non-operational enlat:5000 exlat:10000 rrt:3 rrl:3
rwt:3 rwl:3 idle_power:- active_power:-
active_power_workload:-
ps 4 : mp:0.0025W non-operational enlat:5000 exlat:45000 rrt:4 rrl:4
rwt:4 rwl:4 idle_power:- active_power:-
active_power_workload:-
As you mentioned earlier, igaw, it seems that this drive does not support some sanitize options:
[31:30] : 0 Additional media modification after sanitize operation completes successfully is not defined
[29:29] : 0 No-Deallocate After Sanitize bit in Sanitize command Supported
[2:2] : 0 Overwrite Sanitize Operation Not Supported
[1:1] : 0x1 Block Erase Sanitize Operation Supported
[0:0] : 0 Crypto Erase Sanitize Operation Not Supported
I'm also running Debian testing with kernel 6.3.0-1-amd64.
As you mentioned earlier, igaw, it seems that this drive does not support some sanitize options:
That is okay; if the whole SANICAP register is not 0, then sanitize is supported. In your case the Block Erase sanitize operation is supported, the others are not.
I got some feedback from the mailing list; I need to figure out what to do with it. The current assumption is that the direct mapping of the DMA buffers to userspace is the culprit and the devices are not behaving correctly. That means we tell a device to transfer 512 bytes but it transfers more, e.g. 4k. That would smash the stack.
udisksd[810]: Error performing housekeeping for drive /org/freedesktop/UDisks2/drives/J_2eZAO_QL_SERIES_1TB_SSD_SNhideFromPublic: Error updating Health Information: No probed controller info available (udisks-error-quark, 0)
This indicates that nvme_identify_ctrl() failed earlier in the process. In that case udisks doesn't submit any nvme_get_log_sanitize() calls. Are you sure that nvme id-ctrl /dev/nvme0 actually works for you on 4.19.289?
Maybe the devices do not support SANICAP and thus we should not issue the command in the first place.
Good point. We actually check the controller capabilities (struct nvme_id_ctrl.sanicap) first, same for the self-test log.
Are you sure that nvme id-ctrl /dev/nvme0 actually works for you on 4.19.289?
Yes, on 4.19.289 both nvme sanitize-log /dev/nvme0 and nvme id-ctrl /dev/nvme0 show results with exit code 0 and no stack smashing message.
And I can see /usr/libexec/udisks2/udisksd running. If I plug in a USB flash drive, it is automatically mounted.
Good morning,
I've moved the "faulting" disk to another PC running Debian stable. It mounts fine, as on the original PC with Debian testing.
# uname -a
Linux gondor.bres.cat 6.1.0-10-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.38-1 (2023-07-14) x86_64 GNU/Linux
That machine has libnvme 1.3-1, and there, executing the sanitize-log function returns an error but does not produce the stack smashing:
# nvme sanitize-log -H /dev/nvme0
NVMe status: Invalid Field in Command: A reserved coded value or an unsupported value in a defined field(0x2)
I'm posting this in case it's somewhat useful.
Thanks for this test. I think we can safely say it's the device that is causing the issue. Though with kernel 6.1 and libnvme 1.3 it works; that is strange. I have to check the kernel configs for stable and sid.
[edit] Were there any additional kernel logs?
Anyway, I got some feedback from the mailing list. The old behavior of the kernel was to copy the buffers from kernel space to user space, enforcing all the size limits. Newer kernels support direct DMA mappings to user space. Given this, it's possible that we request 512 bytes but the device writes back more than those 512 bytes and thus overwrites the stack.
To verify this theory we can play with IOMMU mappings and/or some hacks so that we never do the direct mappings. But this is effectively what you did by switching to Debian stable.
Anyway, I won't have time to work on this right now. I am off for a week.
I've switched back to my main PC with Debian testing, and running nvme sanitize-log to trigger the stack smashing error creates the following entries in kern.log:
2023-07-29T07:48:08.655812+02:00 helm kernel: [ 489.963214] nvme 0000:04:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0010 address=0x0 flags=0x0020]
2023-07-29T07:48:08.655814+02:00 helm kernel: [ 489.963220] nvme 0000:04:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0010 address=0x80 flags=0x0020]
2023-07-29T07:48:08.655815+02:00 helm kernel: [ 489.963223] nvme 0000:04:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0010 address=0x100 flags=0x0020]
2023-07-29T07:48:08.655816+02:00 helm kernel: [ 489.963229] nvme 0000:04:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0010 address=0x180 flags=0x0020]
2023-07-29T07:48:08.655817+02:00 helm kernel: [ 489.963234] nvme 0000:04:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0010 address=0x200 flags=0x0020]
2023-07-29T07:48:08.655817+02:00 helm kernel: [ 489.963239] nvme 0000:04:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0010 address=0x280 flags=0x0020]
2023-07-29T07:48:08.655818+02:00 helm kernel: [ 489.963242] nvme 0000:04:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0010 address=0x300 flags=0x0020]
2023-07-29T07:48:08.655819+02:00 helm kernel: [ 489.963245] nvme 0000:04:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0010 address=0x380 flags=0x0020]
2023-07-29T07:48:08.655826+02:00 helm kernel: [ 489.963248] nvme 0000:04:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0010 address=0x400 flags=0x0020]
2023-07-29T07:48:08.655827+02:00 helm kernel: [ 489.963251] nvme 0000:04:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0010 address=0x480 flags=0x0020]
Hope this helps.
Anyway, I won't have time to work on this right now. I am off for a week.
No worries, I can still work with the PC if I don't upgrade the udisks2 package, so for me it is not urgent. Happy holidays!
I'm starting to suspect it might be my SSD that is the bug. Not sure; just for info:
# lspci | grep NVMe
01:00.0 Non-Volatile memory controller: MAXIO Technology (Hangzhou) Ltd. NVMe SSD Controller MAP1202 (rev 01)
Hi folks, I have exactly the same NVMe drive from J.ZAO and have been facing this issue since upgrading my machine running Debian sid a few weeks ago. So I suppose it is highly related to the hardware.
Any chance to test this? https://github.com/storaged-project/libblockdev/pull/951
I wanted to avoid any kind of blacklist, and this was a straightforwardly dumb way, with the hope that 64k ought to be enough :tm:.
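For context, a minimal sketch of the idea in that PR (my paraphrase, not the actual libblockdev code): read the log page into a deliberately oversized heap buffer so a device that writes back more than the requested 512 bytes cannot clobber the caller's stack, then copy out only the defined struct.

```c
#include <stdlib.h>
#include <string.h>
#include <libnvme.h>

/* hypothetical helper illustrating the oversized-buffer workaround */
#define SANITIZE_LOG_BUF_SIZE (64 * 1024)

static int get_sanitize_log_safe(int fd, struct nvme_sanitize_log_page *out)
{
	struct nvme_sanitize_log_page *buf;
	int ret;

	/* give the device far more room than the 512 bytes it was asked
	 * for; an overrun lands in heap slack instead of on the stack */
	buf = calloc(1, SANITIZE_LOG_BUF_SIZE);
	if (!buf)
		return -1;

	ret = nvme_get_log_sanitize(fd, false, buf);
	if (ret == 0)
		memcpy(out, buf, sizeof(*out));

	free(buf);
	return ret;
}
```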
Greetings. I think this patch is working on my end. Now I can see the udisks2 service is up and running, and it mounts USB flash drives automatically.
It would be better to have someone else test the patch too, so I'll write down what I tried:
I downloaded the https://github.com/tbzatek/libblockdev/tree/nvme-27 source zip package.
Ran "apt install autoconf-archive", as Debian's autoconf does not depend on autoconf-archive.
Ran ./autogen.sh
Ran ./configure
Ran "apt install" for all those lib***-dev packages.
Ran ./configure again.
Ran make.
Ran make install.
Edited /lib/systemd/system/udisks2.service, adding a line before ExecStart= so that the homebrew libblockdev is used:
Environment=LD_LIBRARY_PATH=/usr/local/lib
Rebooted the system.
Thanks for your help! I didn't know one could write a working patch without having the bug at hand.
All I tried:
linux-4.19.289 works
linux-6.5-rc3 stack smashing
linux-5.17.0-rc4+ stack smashing <-- this is torvalds/linux@52a9dab , the oldest kernel I could build with stock Debian GCC
It stopped working on 5.8 (5.7 is fine), bisected to commit:
But it looks like reverting it is not enough.
Removing all blk_queue_dma_alignment() calls from drivers/nvme doesn't fix it, though.
UPD: Re-checked: removing blk_queue_dma_alignment() from nvme_set_queue_limits() helps.
Thanks for bisecting. From looking into the effects, I'd say 3b2a1ebceba3 and 52fde2c07da just enable zero-copy I/O instead of using a bounce buffer (see https://github.com/torvalds/linux/blob/4853c74bd7ab7fdb83f319bd9ace8a08c031e9b6/block/blk-map.c#L626).
So you should get it working by just hacking blk_rq_map_user_iov to always use bounce buffers.
Could you try this?
diff --git a/block/blk-map.c b/block/blk-map.c
index 44d74a30ddac..7a13e0fcba7f 100644
--- a/block/blk-map.c
+++ b/block/blk-map.c
@@ -627,7 +627,7 @@ int blk_rq_map_user_iov(struct request_queue *q, struct request *rq,
struct rq_map_data *map_data,
const struct iov_iter *iter, gfp_t gfp_mask)
{
- bool copy = false, map_bvec = false;
+ bool copy = true, map_bvec = false;
unsigned long align = q->dma_pad_mask | queue_dma_alignment(q);
struct bio *bio = NULL;
struct iov_iter i;
Could you try this?
Nope, it doesn't help.
UPD: it happens on certain disks; swapping in the working one helps, and the issue started happening on another PC.
D'oh, the copy/map_bvec hack works only if we assign the values after the setter block:
diff --git a/block/blk-map.c b/block/blk-map.c
index 44d74a30ddac..8f1dd793e825 100644
--- a/block/blk-map.c
+++ b/block/blk-map.c
@@ -646,6 +646,9 @@ int blk_rq_map_user_iov(struct request_queue *q, struct request *rq,
else if (queue_virt_boundary(q))
copy = queue_virt_boundary(q) & iov_iter_gap_alignment(iter);
+ map_bvec = false;
+ copy = true;
+
if (map_bvec) {
ret = blk_rq_map_user_bvec(rq, iter);
if (!ret)
With that change the if (map_bvec) condition will always be false.
UPD: it worked.
Great to hear. So nvme list doesn't crash and there are no kernel messages?
I never tested with nvme list; I did nvme sanitize-log /dev/nvme0n1.
Typo, I wanted to write nvme sanitize-log. No entries in the kernel logs, right?
If you have some more test cycles, you could enable all memory debug options, especially KASAN. The kernel will likely report the overwrite then, too.
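For reference, a typical generic-KASAN option set (a sketch; the exact options vary by kernel version and architecture):

```
CONFIG_KASAN=y
CONFIG_KASAN_GENERIC=y
CONFIG_KASAN_INLINE=y
CONFIG_STACKTRACE=y
```

Note that KASAN instruments CPU memory accesses; an overwrite performed by the device itself via DMA may go unnoticed, so a lack of reports would not be conclusive.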
I tried the new HACK on linux-6.4.11; it worked.
Both 'sudo nvme sanitize-log /dev/nvme0' and udisks2 work without stack smashing.
I built the kernel with CONFIG_KASAN=y, but there is no new info in /var/log/syslog when I run 'nvme sanitize-log'. Maybe some more setup is required beyond the kernel build option.
A few days earlier, I sent an e-mail to MAXIO about this, but received no reply. I think it would be best if the upstream controller chip maker could fix this on their side. And the Linux kernel could use some defense against this kind of problem, like the Kernel DMA Protection in Windows. Just a wish.
I built the kernel with CONFIG_KASAN=y, but there is no new info in /var/log/syslog when I run 'nvme sanitize-log'. Maybe some more setup is required beyond the kernel build option.
I need to check how this part of the code works; it might just allocate a bigger buffer and thus avoid the problem. But thanks for testing, highly appreciated.
A few days earlier, I sent an e-mail to MAXIO about this, but received no reply. I think it would be best if the upstream controller chip maker could fix this on their side.
This is obviously the preferred solution, but I'm not sure you will get any feedback. Thanks for trying, though!
And the Linux kernel could use some defense against this kind of problem, like the Kernel DMA Protection in Windows. Just a wish.
Indeed. We now have all the data to discuss a solution; let's see what the outcome is on the nvme mailing list.
Do "apt install" for all those lib***-dev packages. Run ./configure again
Man, you weren't EVEN joking there! ;)
Thanks for the VERY helpful how-to. Worked like a charm!
EDIT: For the record, udisks2.service runs now and everything works except "nvme sanitize-log /dev/nvme0n1" still produces a stack smash. This is on Ubuntu 23.10 mantic with a Fanxiang S500PRO 1TB (SN07939) NVMe drive.
EDIT2: I also noticed that, not only does this bug cause gnome-disks to stop working (which makes sense), but it ALSO causes Gnome=>Settings=>About to freeze up the Settings app entirely (that section reports disk capacity). Once udisks2 is running properly, the issue goes away...
Good morning,
After upgrading to 2.10.0-3, the udisks2 service on Debian testing does not start. This makes sddm delay its presentation for about 15 seconds and the Plasma desktop not work properly (I can log in, but it does not load any of the Plasma features, so no task bar, and I am unable to open any program).
Reverting manually to 2.9.4 fixes the problems.
I've opened a report with the Debian maintainers, and after reviewing the logs they asked me to open it here. I had added the traces to another bug, but it seems it is not the same one, therefore I am opening it here now. Here are the logs and the gdb trace:
systemctl status
gdb backtrace
Hope it helps to find the problem. This PC has 2 NVMe units and 2 SATA SSDs. As the error seems to be in some NVMe-related function, I thought I would mention that in case it is useful.
I've been asked to provide the libnvme version and the output of nvme sanitize-log -H /dev/nvme0, so here they are:
libnvme version: 1.5-1
nvme sanitize-log -H /dev/nvme0:
From this log it seems clear that the stack smashing detected is caused by some part of the NVMe stack and propagates upwards to udisks2, but I don't know how to proceed. If I revert to udisks 2.9.4-4 it starts, but nvme sanitize-log gives the same error. From what I understand, 2.10.0 relies more on the surrounding libraries, while 2.9.4 is more "error-ignoring friendly".