Closed LefterisJP closed 2 years ago
The sqid is the admin queue, so the error is likely indicating the driver or some tool attempted a harmless optional command that your controller doesn't support. The device vendor is known to be overly pedantic about logging these errors. The spec allows this behavior, but it isn't required to do that, nor is it helpful to anyone.
Hey @keithbusch thank you for your very swift response!
So I see this is not something you guys could do on the nvme-cli side then. Would I need to update the firmware from the vendor?
Did you see my question on the self-test part? I don't want to switch this to IT support, but is there any advice on how to test this drive for errors and read/verify if there are any? I am trying to decide where errors lie to see if I need to change hardware and things point to the disk and not the RAM.
I'd need to consult the spec, but I don't have it handy at the moment (I'm on my phone). The spec is free to download from nvmexpress.org if you want to view the source that defines this log page.
Seems this issue got stale. Closing it.
Exactly the same issue with the same device, just the 1TB size and nvme version 1.16.
Any news about that and what we can do about it? smartd continuously reports via email:
... number of Error Log entries increased from 117 to 118
As Keith said, it's likely caused by a harmless optional command which gets logged by the firmware. Do you happen to see anything in the kernel logs?
Nothing in the kernel logs.
Let's try to figure out which command it could be. Maybe we see it in the nvme trace:
cd /sys/kernel/debug/tracing/
echo 'status!=0' > events/nvme/nvme_complete_rq/filter
echo 1 > events/nvme/nvme_complete_rq/enable
echo 1 > events/nvme/nvme_async_event/enable
echo 1 > tracing_on
[wait for the failure]
echo 0 > tracing_on
cat trace > ~/nvme_complete_rq-trace.txt
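If you would rather script those steps, here is a rough Python sketch of the same sequence. The function names and the `root` parameter are made up for illustration; the files and values are exactly the ones from the shell commands above. It has to run as root with tracefs available at that path.

```python
from pathlib import Path

# Control files and values, in the same order as the shell steps above.
TRACE_SETUP = [
    ("events/nvme/nvme_complete_rq/filter", "status!=0"),
    ("events/nvme/nvme_complete_rq/enable", "1"),
    ("events/nvme/nvme_async_event/enable", "1"),
    ("tracing_on", "1"),
]


def enable_nvme_error_tracing(root="/sys/kernel/debug/tracing"):
    """Write the control files so that only failed NVMe completions are traced."""
    base = Path(root)
    for rel_path, value in TRACE_SETUP:
        # echo appends a newline, so do the same here.
        (base / rel_path).write_text(value + "\n")


def stop_and_save_trace(out_file, root="/sys/kernel/debug/tracing"):
    """Turn tracing off, copy the trace buffer to out_file and return its text."""
    base = Path(root)
    (base / "tracing_on").write_text("0\n")
    text = (base / "trace").read_text()
    Path(out_file).write_text(text)
    return text
```

Call `enable_nvme_error_tracing()`, wait for the error count to go up, then call `stop_and_save_trace("nvme_complete_rq-trace.txt")`.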
Which kernel option do I need to have /sys/kernel/debug/tracing/ ?
I think CONFIG_FTRACE is enough, but if in doubt you can enable most of the tracer options (except the self-tests).
I see that INVALID-FIELD error in the error-log whenever I suspend/resume the laptop, fwiw.
Nothing in the trace file after I completed the steps from https://github.com/linux-nvme/nvme-cli/issues/1224#issuecomment-1196488941
No clue, though it sounds more like a firmware issue. I see that there are a bunch of Samsung devices which rely on a quirk to delay the check before the device is ready (NVME_QUIRK_DELAY_AMOUNT). It might be worth an experiment to add the quirk to the driver for this device.
If I am not completely mistaken, the fix should be:
https://git.infradead.org/nvme.git/commitdiff/e6487833182a8a0187f0292aca542fc163ccd03e
Hm, I seem to have that patch applied. I'm running the ubuntu jammy kernel, https://git.launchpad.net/~ubuntu-kernel/ubuntu/+source/linux/+git/jammy/tree/drivers/nvme/host/core.c#n2501
5.15.0-48-generic #54-Ubuntu
my disk is a SAMSUNG MZVLW1T0HMLH-000L7 with firmware 7L7QCXY7 (PCI id 144d:a804 in lspci)
The matching logic is not just on the PCI id; it also matches against mn and fr (model, firmware rev). The disk you seem to have is not identical to the one mentioned in the quirk.
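To illustrate the matching rule described here, a simplified Python model follows. This is not the kernel's C code: `QuirkEntry`, `quirk_matches` and the "SOME OTHER MODEL" entry are hypothetical, and the only assumption about the real logic is what is stated above (vendor ID, model and firmware revision must all agree, with an unset string acting as a wildcard). The disk values are the ones reported earlier in this thread.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class QuirkEntry:
    vid: int
    mn: Optional[str] = None  # None means "any model"
    fr: Optional[str] = None  # None means "any firmware revision"


def quirk_matches(entry, vid, mn, fr):
    """Return True only if vendor ID, model and firmware all match the entry."""
    def same(want, have):
        # Identify strings are space padded, so ignore trailing spaces.
        return want is None or want == have.rstrip(" ")
    return entry.vid == vid and same(entry.mn, mn) and same(entry.fr, fr)


# The disk from this thread, as reported by its owner.
disk = dict(vid=0x144D, mn="SAMSUNG MZVLW1T0HMLH-000L7", fr="7L7QCXY7")

# Hypothetical entry for a different model from the same vendor.
other_model = QuirkEntry(vid=0x144D, mn="SOME OTHER MODEL")
# Hypothetical entry naming exactly this model.
this_model = QuirkEntry(vid=0x144D, mn="SAMSUNG MZVLW1T0HMLH-000L7")

print(quirk_matches(other_model, **disk))  # False
print(quirk_matches(this_model, **disk))   # True
```

So having the same PCI vendor ID is not enough: an entry written for another model does not apply to this disk.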
What is the output from nvme id-ctrl? E.g.
# nvme id-ctrl /dev/nvme0
NVME Identify Controller:
vid : 0x1179
ssvid : 0x1179
sn : 49MA23BZK03N
mn : KXG60ZNV512G NVMe TOSHIBA 512GB
fr : 10604107
What is the output from nvme id-ctrl
# nvme id-ctrl /dev/nvme0
NVME Identify Controller:
vid : 0x144d
ssvid : 0x144d
sn : S35ANX0J711457
mn : SAMSUNG MZVLW1T0HMLH-000L7
fr : 7L7QCXY7
rab : 2
ieee : 002538
cmic : 0
mdts : 0
cntlid : 0x2
ver : 0x10200
rtd3r : 0x186a0
rtd3e : 0x4c4b40
oaes : 0
ctratt : 0
rrls : 0
cntrltype : 0
fguid :
crdt1 : 0
crdt2 : 0
crdt3 : 0
nvmsr : 0
vwci : 0
mec : 0
oacs : 0x17
acl : 7
aerl : 3
frmw : 0x16
lpa : 0x3
elpe : 63
npss : 4
avscc : 0x1
apsta : 0x1
wctemp : 342
cctemp : 345
mtfa : 0
hmpre : 0
hmmin : 0
tnvmcap : 1024209543168
unvmcap : 0
rpmbs : 0
edstt : 35
dsto : 0
fwug : 0
kas : 0
hctma : 0
mntmt : 0
mxtmt : 0
sanicap : 0
hmminds : 0
hmmaxd : 0
nsetidmax : 0
endgidmax : 0
anatt : 0
anacap : 0
anagrpmax : 0
nanagrpid : 0
pels : 0
domainid : 0
megcap : 0
sqes : 0x66
cqes : 0x44
maxcmd : 0
nn : 1
oncs : 0x1f
fuses : 0
fna : 0x4
vwc : 0x1
awun : 255
awupf : 0
icsvscc : 1
nwpc : 0
acwu : 0
ocfs : 0
sgls : 0
mnan : 0
maxdna : 0
maxcna : 0
subnqn :
ioccsz : 0
iorcsz : 0
icdoff : 0
fcatt : 0
msdbd : 0
ofcs : 0
ps 0 : mp:7.60W operational enlat:0 exlat:0 rrt:0 rrl:0
rwt:0 rwl:0 idle_power:- active_power:-
ps 1 : mp:6.00W operational enlat:0 exlat:0 rrt:1 rrl:1
rwt:1 rwl:1 idle_power:- active_power:-
ps 2 : mp:5.10W operational enlat:0 exlat:0 rrt:2 rrl:2
rwt:2 rwl:2 idle_power:- active_power:-
ps 3 : mp:0.0400W non-operational enlat:210 exlat:1500 rrt:3 rrl:3
rwt:3 rwl:3 idle_power:- active_power:-
ps 4 : mp:0.0050W non-operational enlat:2200 exlat:6000 rrt:4 rrl:4
rwt:4 rwl:4 idle_power:- active_power:-
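A few of the raw values in this output are easier to read once converted. The sketch below only does the arithmetic; the helper names are invented, and it assumes the usual NVMe encodings: ver packs major/minor/tertiary into one value, the temperature thresholds are in Kelvin, and rtd3r/rtd3e are in microseconds.

```python
def decode_ver(ver):
    """Split the ver value into (major, minor, tertiary)."""
    return (ver >> 16) & 0xFFFF, (ver >> 8) & 0xFF, ver & 0xFF


def kelvin_to_celsius(kelvin):
    """Convert a threshold reported in Kelvin to whole degrees Celsius."""
    return kelvin - 273


print(decode_ver(0x10200))     # (1, 2, 0)
print(kelvin_to_celsius(342))  # 69  (wctemp)
print(kelvin_to_celsius(345))  # 72  (cctemp)
print(0x186a0, 0x4c4b40)       # 100000 5000000  (rtd3r, rtd3e in microseconds)
```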
Right, the corresponding quirk entry would be:
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 1a57b6392ee3..87fce7c46bd6 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -2666,7 +2666,12 @@ static const struct nvme_core_quirk_entry core_quirks[] = {
 		.quirks = NVME_QUIRK_DELAY_BEFORE_CHK_RDY |
 			  NVME_QUIRK_NO_DEEPEST_PS |
 			  NVME_QUIRK_IGNORE_DEV_SUBNQN,
-	}
+	},
+	{
+		.vid = 0x144d,
+		.mn = "SAMSUNG MZVLW1T0HMLH-000L7",
+		.quirks = NVME_QUIRK_DELAY_BEFORE_CHK_RDY,
+	},
 };
Could you try this patch on your kernel?
Sorry, I don't think that worked. Granted, patching a distribution kernel and booting off its packages is not as trivial as I thought, and I had to disable secure boot, but I do think I'm running the patched kernel and its modules, and the error count still increases by one after a suspend/resume cycle.
Thanks for trying to help me, though! :)
Sadly, 'security' makes everything really complex. Anyway, I would suggest posting these results on the nvme mailing list, CCing the people who added the entry for the Samsung X5 model. Maybe they know more about it.
This may need a quick side discussion at an alpine hut. Are you free on Tuesday? :)
On Sun, Oct 09, 2022 at 06:54:36AM -0700, Keith Busch wrote:
This may need a quick side discussion at an alpine hut. Are you free on Tuesday? :)
Sure, equipped with a beer :)
I'm having the same issue but with Crucial P3 CT4000P3SSD8. I have two same ones running in the ZFS mirror pool on Ubuntu 22.04 (kernel 5.15.0-67-generic) and every single system restart adds 2 errors to the SMART error log for each drive - 0x2002(INVALID_FIELD: A reserved coded value or an unsupported value in a defined field).
Annoying.
Ubuntu 22.04.3, Kernel: 5.15.0-79-generic.
$ sudo nvme list
Node SN Model Namespace Usage Format FW Rev
--------------------- -------------------- ---------------------------------------- --------- -------------------------- ---------------- --------
/dev/nvme0n1 S59CNM0W436651E Samsung SSD 970 EVO Plus 2TB 1 340.03 GB / 2.00 TB 512 B + 0 B 2B2QEXM7
$ sudo nvme error-log -e 1 /dev/nvme0
Error Log Entries for device:nvme0 entries:1
.................
Entry[ 0]
.................
error_count : 28
sqid : 0
cmdid : 0x4018
status_field : 0x2002(INVALID_FIELD: A reserved coded value or an unsupported value in a defined field)
phase_tag : 0
parm_err_loc : 0xffff
lba : 0
nsid : 0
vs : 0
trtype : The transport type is not indicated or the error is not transport related.
cs : 0
trtype_spec_info: 0
.................
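For anyone wondering what 0x2002 actually encodes, here is a small decoding sketch. It assumes the value printed by nvme error-log is the 15-bit status with the phase tag already split off (which matches the separate phase_tag line), laid out as in the NVMe completion status: status code in bits 7:0, status code type in bits 10:8, CRD in bits 12:11, More in bit 13, Do Not Retry in bit 14. The function name is made up.

```python
def decode_status(status):
    """Split a 15-bit NVMe status value into its named fields."""
    return {
        "sc": status & 0xFF,           # status code
        "sct": (status >> 8) & 0x7,    # status code type (0 = generic)
        "crd": (status >> 11) & 0x3,   # command retry delay
        "more": bool(status & (1 << 13)),
        "dnr": bool(status & (1 << 14)),
    }


print(decode_status(0x2002))
# {'sc': 2, 'sct': 0, 'crd': 0, 'more': True, 'dnr': False}
```

So 0x2002 is generic status code 0x02 (Invalid Field in Command) with the More bit set, meaning the controller recorded extra detail in the error log. It says nothing about media problems.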
$ sudo nvme self-test-log --dst-entries 1 -v /dev/nvme0
Device Self Test Log for NVME device:nvme0
Current operation : 0
Current Completion : 0%
Self Test Result[0]:
Operation Result : 0 Operation completed without error
Self Test Code : 1 Short device self-test operation
Valid Diagnostic Information : 0
Power on hours (POH) : 0x2e
Vendor Specific : 0 0
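On reading the self-test log: each result carries a self-test code and an operation result, which nvme-cli prints as the two numbers above. A hedged decoding sketch follows; it assumes the NVMe layout where one status byte holds the code in the upper four bits and the result in the lower four. The function name is invented and the lookup tables are deliberately incomplete, so check the spec for values not listed.

```python
SELF_TEST_CODES = {
    0x1: "Short device self-test operation",
    0x2: "Extended device self-test operation",
    0xE: "Vendor specific",
}

# Only a few result values are listed here; the spec defines more.
RESULTS = {
    0x0: "Operation completed without error",
    0x1: "Operation was aborted by a Device Self-test command",
    0x2: "Operation was aborted by a Controller Level Reset",
    0xF: "Entry not used",
}


def decode_dst_status(status_byte):
    """Return (self-test code text, result text) for one log entry status byte."""
    code = (status_byte >> 4) & 0xF
    result = status_byte & 0xF
    return (
        SELF_TEST_CODES.get(code, "unknown or reserved code"),
        RESULTS.get(result, "not in this table, see the NVMe specification"),
    )


print(decode_dst_status(0x10))
# ('Short device self-test operation', 'Operation completed without error')
print(int("0x2e", 16))  # 46 power-on hours when the test ran
```

In other words, the entry above is a short test that passed, run at 46 power-on hours.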
zangetsu X10SRA ~ sudo nvme list
Node SN Model Namespace Usage Format FW Rev
--------------------- -------------------- ---------------------------------------- --------- -------------------------- ---------------- --------
/dev/nvme0n1 S3ETNX0J101327H Samsung SSD 960 EVO 1TB 1 272,62 GB / 1,00 TB 512 B + 0 B 1B7QCXE7
/dev/nvme1n1 S649NJ0R208640Z Samsung SSD 980 1TB 1 473,03 GB / 1,00 TB 512 B + 0 B 1B4QFXO7
/dev/nvme2n1 50026B728283D8FC KINGSTON SKC2500M81000G 1 472,69 GB / 1,00 TB 512 B + 0 B S7780101
/dev/nvme3n1 S649NJ0R210760L Samsung SSD 980 1TB 1 472,92 GB / 1,00 TB 512 B + 0 B 1B4QFXO7
I got errors logged only on my boot disc which is Samsung SSD 960 EVO 1TB
zangetsu X10SRA ~ sudo smartctl -a /dev/nvme0
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-6.2.0-39-generic] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Model Number: Samsung SSD 960 EVO 1TB
Serial Number: S3ETNX0J101327H
Firmware Version: 1B7QCXE7
PCI Vendor/Subsystem ID: 0x144d
IEEE OUI Identifier: 0x002538
Total NVM Capacity: 1 000 204 886 016 [1,00 TB]
Unallocated NVM Capacity: 0
Controller ID: 2
NVMe Version: 1.2
Number of Namespaces: 1
Namespace 1 Size/Capacity: 1 000 204 886 016 [1,00 TB]
Namespace 1 Utilization: 272 623 296 512 [272 GB]
Namespace 1 Formatted LBA Size: 512
Namespace 1 IEEE EUI-64: 002538 5171b064ee
Local Time is: Mon Dec 18 22:15:19 2023 CET
Firmware Updates (0x16): 3 Slots, no Reset required
Optional Admin Commands (0x0007): Security Format Frmw_DL
Optional NVM Commands (0x001f): Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat
Log Page Attributes (0x03): S/H_per_NS Cmd_Eff_Lg
Maximum Data Transfer Size: 512 Pages
Warning Comp. Temp. Threshold: 77 Celsius
Critical Comp. Temp. Threshold: 79 Celsius
Supported Power States
St Op Max Active Idle RL RT WL WT Ent_Lat Ex_Lat
0 + 6.04W - - 0 0 0 0 0 0
1 + 5.09W - - 1 1 1 1 0 0
2 + 4.08W - - 2 2 2 2 0 0
3 - 0.0400W - - 3 3 3 3 210 1500
4 - 0.0050W - - 4 4 4 4 2200 6000
Supported LBA Sizes (NSID 0x1)
Id Fmt Data Metadt Rel_Perf
0 + 512 0 0
=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
SMART/Health Information (NVMe Log 0x02)
Critical Warning: 0x00
Temperature: 28 Celsius
Available Spare: 100%
Available Spare Threshold: 10%
Percentage Used: 1%
Data Units Read: 15 030 470 [7,69 TB]
Data Units Written: 43 606 695 [22,3 TB]
Host Read Commands: 330 002 480
Host Write Commands: 552 071 418
Controller Busy Time: 2 315
Power Cycles: 751
Power On Hours: 2 161
Unsafe Shutdowns: 514
Media and Data Integrity Errors: 0
Error Information Log Entries: 409
Warning Comp. Temperature Time: 0
Critical Comp. Temperature Time: 0
Temperature Sensor 1: 28 Celsius
Temperature Sensor 2: 38 Celsius
Error Information (NVMe Log 0x01, 16 of 64 entries)
Num ErrCount SQId CmdId Status PELoc LBA NSID VS
0 409 0 0x000c 0x4004 - 0 0 -
1 408 0 0x0008 0x4004 0x028 0 1 -
2 407 0 0x0007 0x4004 0x028 0 1 -
3 406 0 0x0006 0x4004 0x028 0 1 -
4 405 0 0x0008 0x4004 - 0 0 -
5 404 0 0x0008 0x4004 0x028 0 1 -
6 403 0 0x0007 0x4004 0x028 0 1 -
7 402 0 0x0006 0x4004 0x028 0 1 -
8 401 0 0x5000 0x4004 - 0 0 -
9 400 0 0x0008 0x4004 0x028 0 1 -
10 399 0 0x0007 0x4004 0x028 0 1 -
11 398 0 0x0006 0x4004 0x028 0 1 -
12 397 0 0x0000 0x4004 - 0 0 -
13 396 0 0x0008 0x4004 0x028 0 1 -
14 395 0 0x0007 0x4004 0x028 0 1 -
15 394 0 0x0006 0x4004 0x028 0 1 -
... (48 entries not read)
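Two columns in this table can be decoded further. The sketch below assumes that this smartctl version prints the raw 16-bit status word, with the phase tag in bit 0, and that PELoc follows the NVMe parameter error location layout (byte offset in bits 7:0, bit offset in bits 10:8). Both helper names are made up.

```python
def split_raw_status(raw):
    """Split a raw 16-bit status word into (status, phase_tag)."""
    return raw >> 1, raw & 0x1


def decode_parm_err_loc(loc):
    """Return (byte offset, bit offset, command dword index) of the bad field."""
    byte = loc & 0xFF
    bit = (loc >> 8) & 0x7
    return byte, bit, byte // 4


status, phase = split_raw_status(0x4004)
print(hex(status), phase)          # 0x2002 0
print(decode_parm_err_loc(0x028))  # (40, 0, 10)
```

Under those assumptions 0x4004 here is the same status that nvme error-log prints as 0x2002, and PELoc 0x028 points at byte 40 of the command, i.e. command dword 10.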
It's a repetitive error; here is a sample:
Entry[63]
.................
error_count : 346
sqid : 0
cmdid : 0xb001
status_field : 0x2002(INVALID_FIELD: A reserved coded value or an unsupported value in a defined field)
phase_tag : 0
parm_err_loc : 0xffff
lba : 0
nsid : 0
vs : 0
trtype : The transport type is not indicated or the error is not transport related.
cs : 0
trtype_spec_info: 0
.................
I will try to send it also to the nvme mailing list devs....
This should be addressed with the upcoming kernel 6.8 and the next release of libnvme (see https://github.com/linux-nvme/libnvme/issues/681)
Hi guys, I know this thread was closed a year ago, but I want to say that I still get this error on most distros I use, such as Debian, openSUSE Tumbleweed, and Arch. I'm currently on Arch and didn't notice the error until last week, by which point num_err_log_entries had risen to 5526. The number goes up every time I reboot the system. I have a Samsung 970 EVO Plus 500 GB.
Below is my error log:
Error Log Entries for device:nvme0n1 entries:64
.................
Entry[ 0]
.................
error_count : 5526
sqid : 0
cmdid : 0x2019
status_field : 0x2002(Invalid Field in Command: A reserved coded value or an unsupported value in a defined field)
phase_tag : 0
parm_err_loc : 0xffff
lba : 0
nsid : 0
vs : 0
trtype : The transport type is not indicated or the error is not transport related.
csi : 0
opcode : 0
cs : 0
trtype_spec_info: 0
log_page_version: 0
This is my smart-log:
Smart Log for NVME device:nvme0n1 namespace-id:ffffffff
critical_warning : 0
temperature : 44 °C (317 K)
available_spare : 100%
available_spare_threshold : 10%
percentage_used : 1%
endurance group critical warning summary: 0
Data Units Read : 61521857 (31.50 TB)
Data Units Written : 38671369 (19.80 TB)
host_read_commands : 818138247
host_write_commands : 1091584779
controller_busy_time : 2208
power_cycles : 4966
power_on_hours : 2003
unsafe_shutdowns : 647
media_errors : 0
num_err_log_entries : 5526
Warning Temperature Time : 0
Critical Composite Temperature Time : 0
Temperature Sensor 1 : 44 °C (317 K)
Temperature Sensor 2 : 42 °C (315 K)
Thermal Management T1 Trans Count : 0
Thermal Management T2 Trans Count : 0
Thermal Management T1 Total Time : 0
Thermal Management T2 Total Time : 0
Sorry if my English is poor.
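A quick sanity check on the smart-log numbers above, as a Python sketch. It assumes the NVMe convention that one data unit is 1000 blocks of 512 bytes; the helper name is invented.

```python
def data_units_to_tb(units):
    """Convert NVMe data units (1000 x 512 bytes each) to decimal terabytes."""
    return units * 512 * 1000 / 1e12


print(round(data_units_to_tb(61521857), 2))  # 31.5
print(round(data_units_to_tb(38671369), 2))  # 19.8
print(round(5526 / 4966, 2))                 # 1.11 error log entries per power cycle
```

Roughly one logged entry per power cycle fits the observation that the counter goes up on every reboot, and media_errors is still 0.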
number goes up every time i reboot the system
The kernel issues a few commands to identify the device. Although these commands are valid, this particular firmware can't handle them and logs them. nvme-cli/libnvme doesn't issue these anymore with recent kernels (your distro kernel might be too old). There was some discussion on how to handle this and, if I remember correctly, the outcome was 'buggy firmware'.
In short, try to convince your vendor to fix the firmware and/or report this to the nvme linux mailing list.
your distro kernel might be too old
I use the latest zen kernel (Linux 6.9.4-zen1-1-zen). I don't think it's a kernel issue, because the number is too large for me to believe the kernel alone could cause it. Since the error is only logged at boot time, it might be a very old bug.
try to convince your vendor to fix the firmware
I did that, but I don't think Samsung will give me any response, because the device is a bit old and Samsung is not good at support.
number goes up every time i reboot the system
If this is not the only source, you need to figure out what is calling nvme-cli/libnvme and which commands it issues, e.g. check libudisk2. nvme list will not issue any commands directly. The kernel still might on behalf of libnvme, though.
check libudisk2.
Sorry, how can I do that?
nvme list
sudo nvme list
Node Generic SN Model Namespace Usage Format FW Rev
/dev/nvme0n1 /dev/ng0n1 S4EVNF0M885724F Samsung SSD 970 EVO Plus 500GB 0x1 84.17 GB / 500.11 GB 512 B + 0 B 2B2QEXM7
If there is something wrong with this, please tell me in detail.
I'm having the same issue but with Crucial P3 CT4000P3SSD8. I have two same ones running in the ZFS mirror pool on Ubuntu 22.04 (kernel 5.15.0-67-generic) and every single system restart adds 2 errors to the SMART error log for each drive - 0x2002(INVALID_FIELD: A reserved coded value or an unsupported value in a defined field).
Annoying.
@Rychu-Pawel I have the same SSD, and every time I execute sudo smartctl -a /dev/nvme0n1, one log entry is added to the list :D. It seems some other things are also adding log entries, but I haven't identified them yet.
System: Fedora 40 (default BTRFS install on that SSD) Kernel: 6.9.9-200.fc40
If I understand correctly what @keithbusch wrote:
the error is likely indicating the driver or some tool attempted a harmless optional command that your controller doesn't support
this shouldn't be something to worry about?
Hello!
Problem Definition
I have been trying to check my NVMe drive for errors, as I suspect something is badly off: some software is failing in completely unpredictable ways, which points to disk problems.
I see 12 errors in the smart report and would like to inspect them with
$nvme error-log /dev/nvme0n1
But then this gives me the following:
And I have no idea how to read it or what it is.
Details
$nvme list
Addendum
And this is another question. I ran the short smart test with
nvme device-self-test /dev/nvme0n1 -s 1
and then asked the result with this:
$nvme self-test-log /dev/nvme0n1
How do the results read? Where can I find more information on how to read them?