linux-nvme / nvme-cli

NVMe management command line interface.
https://nvmexpress.org
GNU General Public License v2.0

nvme error-log 0x2002 INVALID_FIELD #1224

Closed LefterisJP closed 2 years ago

LefterisJP commented 2 years ago

Hello!

Problem Definition

I have been trying to check my NVMe drive for errors, as I suspect something is badly off: some software is failing in completely unpredictable ways that point to disk problems.

I see 12 errors in the SMART report and would like to inspect them with $ nvme error-log /dev/nvme0n1

But then this gives me the following:


Error Log Entries for device:nvme0n1 entries:64                                                                                         
.................                                                                                                                                                                                                                                                                
 Entry[ 0]                                                                                                                                                                                                                                                                       
.................                                                                                                                       
error_count     : 12                                                                                                                    
sqid            : 0                                                                                                                     
cmdid           : 0x1002                                                                                                                
status_field    : 0x2002(INVALID_FIELD: A reserved coded value or an unsupported value in a defined field)                              
phase_tag       : 0                                                                                                                     
parm_err_loc    : 0xffff                                                                                                                
lba             : 0                                                                                                                     
nsid            : 0                                                                                                                     
vs              : 0                                                                                                                     
trtype          : The transport type is not indicated or the error is not transport related.                                            
cs              : 0                                                                                                                     
trtype_spec_info: 0                                                                                                                     
.................   

And I have no idea how to read it or what it means.

Details

$nvme list

Node                  SN                   Model                                    Namespace Usage                      Format           FW Rev  
--------------------- -------------------- ---------------------------------------- --------- -------------------------- ---------------- --------
/dev/nvme0n1          S4J4NZFN902470P      Samsung SSD 970 EVO Plus 2TB             1           1.89  TB /   2.00  TB    512   B +  0 B   2B2QEXM7
nvme version 1.15
5.14.16-arch1-1 #1 SMP PREEMPT Tue, 02 Nov 2021 22:22:59 +0000 x86_64 GNU/Linux

Addendum

And this is another question. I ran the short self-test with nvme device-self-test /dev/nvme0n1 -s 1

and then queried the result with: $ nvme self-test-log /dev/nvme0n1

Device Self Test Log for NVME device:nvme0n1
Current operation  : 0
Current Completion : 0%
Self Test Result[0]:
  Operation Result             : 0
  Self Test Code               : 1
  Valid Diagnostic Information : 0
  Power on hours (POH)         : 0x1391
  Vendor Specific              : 0 0

How do the results read? Where can I find more information on how to read them?
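
As a rough aid for reading the fields above, here is a small sketch. It assumes the meanings that the verbose output later in this thread prints (result 0 is "Operation completed without error", code 1 is the short test); code 2 as the extended test is taken from the NVMe specification, and other values are deliberately left undecoded. The function names are hypothetical helpers, not part of nvme-cli.

```shell
#!/bin/sh
# Hypothetical helpers for reading `nvme self-test-log` fields.

# Translate the "Operation Result" value. Only 0 is decoded here; for any
# other value, consult the Device Self-test log page in the NVMe spec.
dst_result() {
    case "$1" in
        0) echo "Operation completed without error" ;;
        *) echo "non-zero result $1: see the NVMe specification" ;;
    esac
}

# Translate the "Self Test Code" value.
dst_code() {
    case "$1" in
        1) echo "Short device self-test operation" ;;
        2) echo "Extended device self-test operation" ;;
        *) echo "other self-test code $1" ;;
    esac
}

# Power on hours is printed in hex; convert it to decimal.
poh_hours() {
    printf '%d\n' "$1"
}

dst_result 0
dst_code 1
poh_hours 0x1391
```

Under these assumptions, the log above reads as a short self-test that completed without error at 5009 power-on hours.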

keithbusch commented 2 years ago

The sqid of 0 is the admin queue, so the error likely indicates that the driver or some tool attempted a harmless optional command that your controller doesn't support. This device vendor is known to be overly pedantic about logging these errors. The spec allows this behavior but doesn't require it, and it isn't helpful to anyone.
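
To make the status value concrete, the 0x2002 that nvme-cli prints can be split into its bit fields. The sketch below assumes the layout used by the Linux driver's status constants (status code in bits 7:0, status code type in bits 10:8, command retry delay in bits 12:11, "More" at 0x2000, "Do Not Retry" at 0x4000) and that nvme-cli has already removed the phase tag bit. `decode_status` is a hypothetical helper, not an nvme-cli command.

```shell
#!/bin/sh
# Hypothetical helper: split a status value as printed by `nvme error-log`
# (phase tag already removed) into its bit fields.
decode_status() {
    status=$(( $1 ))
    sc=$(( status & 0xff ))            # status code
    sct=$(( (status >> 8) & 0x7 ))     # status code type (0 = generic)
    crd=$(( (status >> 11) & 0x3 ))    # command retry delay
    more=$(( (status >> 13) & 0x1 ))   # more details are in the error log
    dnr=$(( (status >> 14) & 0x1 ))    # do not retry
    printf 'sct=%d sc=0x%02x crd=%d more=%d dnr=%d\n' \
        "$sct" "$sc" "$crd" "$more" "$dnr"
}

decode_status 0x2002
```

With this layout, 0x2002 is generic status code 0x02 (the invalid-field status nvme-cli names) with the "more" bit set, which is consistent with the controller having written an error log entry for the command.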

LefterisJP commented 2 years ago

Hey @keithbusch thank you for your very swift response!

So I see this is not something you guys could fix on the nvme-cli side then. Would I need to update the vendor's firmware?

Did you see my question on the self-test part? I don't want to switch this to IT support, but is there any advice on how to test this drive for errors and read/verify if there are any? I am trying to decide where errors lie to see if I need to change hardware and things point to the disk and not the RAM.

keithbusch commented 2 years ago

I'd need to consult the spec, but I don't have it handy right now (I'm on my phone). The spec is free to download from nvmexpress.org if you want to view the source that defines this log page.

igaw commented 2 years ago

Seems this issue got stale. Closing it.

Massimo-B commented 2 years ago

Exactly the same issue with the same device, just the 1TB size, and nvme version 1.16. Any news on this, and is there anything we can do about it? smartd continuously reports via email: ... number of Error Log entries increased from 117 to 118

igaw commented 2 years ago

As Keith said, it's likely caused by a harmless optional command that gets logged by the firmware. Do you happen to see anything in the kernel logs?

Massimo-B commented 2 years ago

Nothing in the kernel logs.

igaw commented 2 years ago

Let's try to figure out which command it could be. Maybe we see it in the nvme trace:

cd /sys/kernel/debug/tracing/
echo 'status!=0' > events/nvme/nvme_complete_rq/filter
echo 1 > events/nvme/nvme_complete_rq/enable
echo 1 > events/nvme/nvme_async_event/enable
echo 1 > tracing_on

[wait for the failure]

echo 0 > tracing_on
cat trace > ~/nvme_complete_rq-trace.txt

Massimo-B commented 2 years ago

Which kernel option do I need to have /sys/kernel/debug/tracing/ ?

igaw commented 2 years ago

I think CONFIG_FTRACE is enough, but if in doubt you can enable most of the tracer options (except the self tests).
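
Before running the trace steps, it may help to check where tracefs is actually available. This sketch assumes the two usual locations, /sys/kernel/tracing and the older /sys/kernel/debug/tracing. `find_tracefs` is a hypothetical helper; passing directories as arguments overrides the default candidates.

```shell
#!/bin/sh
# Hypothetical helper: print the first directory that looks like a tracefs
# mount (it contains a tracing_on file). Arguments override the defaults.
find_tracefs() {
    [ "$#" -gt 0 ] || set -- /sys/kernel/tracing /sys/kernel/debug/tracing
    for dir in "$@"; do
        if [ -e "$dir/tracing_on" ]; then
            echo "$dir"
            return 0
        fi
    done
    return 1
}

# These directories are normally readable by root only.
if dir=$(find_tracefs); then
    echo "tracefs found at $dir"
else
    echo "tracefs not found: not mounted, not readable, or no CONFIG_FTRACE" >&2
fi
```

If the kernel has tracing support but nothing is mounted, `mount -t tracefs nodev /sys/kernel/tracing` is the usual way to make it available.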

panlinux commented 2 years ago

I see that INVALID-FIELD error in the error-log whenever I suspend/resume the laptop, fwiw.

panlinux commented 2 years ago

Nothing in the trace file after I completed the steps from https://github.com/linux-nvme/nvme-cli/issues/1224#issuecomment-1196488941

igaw commented 2 years ago

No clue, though it sounds more like a firmware issue. I see that there are a bunch of Samsung devices which rely on a quirk to delay the check before the device is ready (NVME_QUIRK_DELAY_AMOUNT). It might be worth an experiment to add the quirk to the driver for this device.

igaw commented 2 years ago

If I am not completely mistaken, the fix should be:

https://git.infradead.org/nvme.git/commitdiff/e6487833182a8a0187f0292aca542fc163ccd03e

panlinux commented 2 years ago

Hm, I seem to have that patch applied. I'm running the ubuntu jammy kernel, https://git.launchpad.net/~ubuntu-kernel/ubuntu/+source/linux/+git/jammy/tree/drivers/nvme/host/core.c#n2501

5.15.0-48-generic #54-Ubuntu

panlinux commented 2 years ago

my disk is a SAMSUNG MZVLW1T0HMLH-000L7 with firmware 7L7QCXY7 (PCI id 144d:a804 in lspci)

igaw commented 2 years ago

The matching logic is not just on the PCI ID; it also matches against mn and fr (model, firmware rev). The disk you seem to have is not identical to the one mentioned in the quirk.

What is the output from nvme id-ctrl? E.g.

# nvme id-ctrl /dev/nvme0
NVME Identify Controller:
vid       : 0x1179
ssvid     : 0x1179
sn        : 49MA23BZK03N        
mn        : KXG60ZNV512G NVMe TOSHIBA 512GB         
fr        : 10604107

panlinux commented 2 years ago

> What is the output from nvme id-ctrl

# nvme id-ctrl /dev/nvme0
NVME Identify Controller:
vid       : 0x144d
ssvid     : 0x144d
sn        : S35ANX0J711457      
mn        : SAMSUNG MZVLW1T0HMLH-000L7              
fr        : 7L7QCXY7
rab       : 2
ieee      : 002538
cmic      : 0
mdts      : 0
cntlid    : 0x2
ver       : 0x10200
rtd3r     : 0x186a0
rtd3e     : 0x4c4b40
oaes      : 0
ctratt    : 0
rrls      : 0
cntrltype : 0
fguid     : 
crdt1     : 0
crdt2     : 0
crdt3     : 0
nvmsr     : 0
vwci      : 0
mec       : 0
oacs      : 0x17
acl       : 7
aerl      : 3
frmw      : 0x16
lpa       : 0x3
elpe      : 63
npss      : 4
avscc     : 0x1
apsta     : 0x1
wctemp    : 342
cctemp    : 345
mtfa      : 0
hmpre     : 0
hmmin     : 0
tnvmcap   : 1024209543168
unvmcap   : 0
rpmbs     : 0
edstt     : 35
dsto      : 0
fwug      : 0
kas       : 0
hctma     : 0
mntmt     : 0
mxtmt     : 0
sanicap   : 0
hmminds   : 0
hmmaxd    : 0
nsetidmax : 0
endgidmax : 0
anatt     : 0
anacap    : 0
anagrpmax : 0
nanagrpid : 0
pels      : 0
domainid  : 0
megcap    : 0
sqes      : 0x66
cqes      : 0x44
maxcmd    : 0
nn        : 1
oncs      : 0x1f
fuses     : 0
fna       : 0x4
vwc       : 0x1
awun      : 255
awupf     : 0
icsvscc     : 1
nwpc      : 0
acwu      : 0
ocfs      : 0
sgls      : 0
mnan      : 0
maxdna    : 0
maxcna    : 0
subnqn    : 
ioccsz    : 0
iorcsz    : 0
icdoff    : 0
fcatt     : 0
msdbd     : 0
ofcs      : 0
ps    0 : mp:7.60W operational enlat:0 exlat:0 rrt:0 rrl:0
          rwt:0 rwl:0 idle_power:- active_power:-
ps    1 : mp:6.00W operational enlat:0 exlat:0 rrt:1 rrl:1
          rwt:1 rwl:1 idle_power:- active_power:-
ps    2 : mp:5.10W operational enlat:0 exlat:0 rrt:2 rrl:2
          rwt:2 rwl:2 idle_power:- active_power:-
ps    3 : mp:0.0400W non-operational enlat:210 exlat:1500 rrt:3 rrl:3
          rwt:3 rwl:3 idle_power:- active_power:-
ps    4 : mp:0.0050W non-operational enlat:2200 exlat:6000 rrt:4 rrl:4
          rwt:4 rwl:4 idle_power:- active_power:-

igaw commented 2 years ago

Right, the corresponding quirk entry would be:

diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 1a57b6392ee3..87fce7c46bd6 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -2666,7 +2666,12 @@ static const struct nvme_core_quirk_entry core_quirks[] = {
                .quirks = NVME_QUIRK_DELAY_BEFORE_CHK_RDY |
                          NVME_QUIRK_NO_DEEPEST_PS |
                          NVME_QUIRK_IGNORE_DEV_SUBNQN,
-       }
+       },
+       {
+               .vid = 0x144d,
+               .mn = "SAMSUNG MZVLW1T0HMLH-000L7",
+               .quirks = NVME_QUIRK_DELAY_BEFORE_CHK_RDY,
+       },
 };

Could you try this patch on your kernel?
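
The matching behind a quirk entry like the one above (vendor ID plus model and firmware strings) can be sketched roughly as follows. This is a simplified illustration, not the kernel's code: it assumes an entry matches when the vendor ID is equal and each of mn and fr, if the entry sets it, equals the identify string once trailing space padding is ignored. `rtrim`, `quirk_matches`, and the argument order are hypothetical.

```shell
#!/bin/sh
# Strip trailing spaces; identify strings are space padded.
rtrim() {
    s=$1
    while [ "${s% }" != "$s" ]; do s=${s% }; done
    printf '%s\n' "$s"
}

# quirk_matches ID_VID ID_MN ID_FR QUIRK_VID [QUIRK_MN [QUIRK_FR]]
# An empty or missing QUIRK_MN / QUIRK_FR matches any value.
quirk_matches() {
    id_mn=$(rtrim "$2")
    id_fr=$(rtrim "$3")
    q_mn=${5:-}
    q_fr=${6:-}
    [ $(( $1 )) -eq $(( $4 )) ] || return 1
    [ -z "$q_mn" ] || [ "$id_mn" = "$q_mn" ] || return 1
    [ -z "$q_fr" ] || [ "$id_fr" = "$q_fr" ] || return 1
    return 0
}

# Values from the id-ctrl output and the proposed entry in this thread.
if quirk_matches 0x144d "SAMSUNG MZVLW1T0HMLH-000L7              " "7L7QCXY7" \
                 0x144d "SAMSUNG MZVLW1T0HMLH-000L7"; then
    echo "quirk entry applies"
fi
```

Because the proposed entry sets no fr, any firmware revision of that model would match under this sketch.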

panlinux commented 2 years ago

Sorry, I don't think that worked. Granted, patching a distribution kernel and booting off its packages is not as trivial as I thought, and I had to disable secure boot, but I do think I'm running the patched kernel and its modules, and the error count still increases by one after a suspend/resume cycle.

Thanks for trying to help me, though! :)

igaw commented 2 years ago

Sadly, 'security' makes everything really complex. Anyway, I would suggest posting these results on the nvme mailing list, CCing those guys who did the entry for the Samsung X5 model. Maybe they know more about it.

keithbusch commented 2 years ago

> Sadly, 'security' makes everything really complex. Anyway, I would suggest posting these results on the nvme mailing list, CCing those guys who did the entry for the Samsung X5 model. Maybe they know more about it.

This may need a quick side discussion at an alpine hut. Are you free on Tuesday? :)

igaw commented 2 years ago

On Sun, Oct 09, 2022 at 06:54:36AM -0700, Keith Busch wrote:

> > Sadly, 'security' makes everything really complex. Anyway, I would suggest posting these results on the nvme mailing list, CCing those guys who did the entry for the Samsung X5 model. Maybe they know more about it.
>
> This may need a quick side discussion at an alpine hut. Are you free on Tuesday? :)

Sure, equipped with a beer :)

Rychu-Pawel commented 1 year ago

I'm having the same issue, but with a Crucial P3 CT4000P3SSD8. I have two identical ones running in a ZFS mirror pool on Ubuntu 22.04 (kernel 5.15.0-67-generic), and every single system restart adds 2 errors to the SMART error log for each drive - 0x2002(INVALID_FIELD: A reserved coded value or an unsupported value in a defined field).

Annoying.

reefland commented 1 year ago

Ubuntu 22.04.3, Kernel: 5.15.0-79-generic.

$ sudo nvme list

Node                  SN                   Model                                    Namespace Usage                      Format           FW Rev  
--------------------- -------------------- ---------------------------------------- --------- -------------------------- ---------------- --------
/dev/nvme0n1          S59CNM0W436651E      Samsung SSD 970 EVO Plus 2TB             1         340.03  GB /   2.00  TB    512   B +  0 B   2B2QEXM7

$ sudo nvme error-log -e 1 /dev/nvme0

Error Log Entries for device:nvme0 entries:1
.................
 Entry[ 0]   
.................
error_count     : 28
sqid            : 0
cmdid           : 0x4018
status_field    : 0x2002(INVALID_FIELD: A reserved coded value or an unsupported value in a defined field)
phase_tag       : 0
parm_err_loc    : 0xffff
lba             : 0
nsid            : 0
vs              : 0
trtype          : The transport type is not indicated or the error is not transport related.
cs              : 0
trtype_spec_info: 0
.................

$ sudo nvme self-test-log --dst-entries 1 -v /dev/nvme0

Device Self Test Log for NVME device:nvme0
Current operation  : 0
Current Completion : 0%
Self Test Result[0]:
  Operation Result             : 0 Operation completed without error
  Self Test Code               : 1 Short device self-test operation
  Valid Diagnostic Information : 0
  Power on hours (POH)         : 0x2e
  Vendor Specific              : 0 0

archenroot commented 9 months ago

$ sudo nvme list
Node                  SN                   Model                                    Namespace Usage                      Format           FW Rev  
--------------------- -------------------- ---------------------------------------- --------- -------------------------- ---------------- --------
/dev/nvme0n1          S3ETNX0J101327H      Samsung SSD 960 EVO 1TB                  1         272,62  GB /   1,00  TB    512   B +  0 B   1B7QCXE7
/dev/nvme1n1          S649NJ0R208640Z      Samsung SSD 980 1TB                      1         473,03  GB /   1,00  TB    512   B +  0 B   1B4QFXO7
/dev/nvme2n1          50026B728283D8FC     KINGSTON SKC2500M81000G                  1         472,69  GB /   1,00  TB    512   B +  0 B   S7780101
/dev/nvme3n1          S649NJ0R210760L      Samsung SSD 980 1TB                      1         472,92  GB /   1,00  TB    512   B +  0 B   1B4QFXO7

Errors are logged only on my boot disk, the Samsung SSD 960 EVO 1TB:

$ sudo smartctl -a /dev/nvme0
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-6.2.0-39-generic] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number:                       Samsung SSD 960 EVO 1TB
Serial Number:                      S3ETNX0J101327H
Firmware Version:                   1B7QCXE7
PCI Vendor/Subsystem ID:            0x144d
IEEE OUI Identifier:                0x002538
Total NVM Capacity:                 1 000 204 886 016 [1,00 TB]
Unallocated NVM Capacity:           0
Controller ID:                      2
NVMe Version:                       1.2
Number of Namespaces:               1
Namespace 1 Size/Capacity:          1 000 204 886 016 [1,00 TB]
Namespace 1 Utilization:            272 623 296 512 [272 GB]
Namespace 1 Formatted LBA Size:     512
Namespace 1 IEEE EUI-64:            002538 5171b064ee
Local Time is:                      Mon Dec 18 22:15:19 2023 CET
Firmware Updates (0x16):            3 Slots, no Reset required
Optional Admin Commands (0x0007):   Security Format Frmw_DL
Optional NVM Commands (0x001f):     Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat
Log Page Attributes (0x03):         S/H_per_NS Cmd_Eff_Lg
Maximum Data Transfer Size:         512 Pages
Warning  Comp. Temp. Threshold:     77 Celsius
Critical Comp. Temp. Threshold:     79 Celsius

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     6.04W       -        -    0  0  0  0        0       0
 1 +     5.09W       -        -    1  1  1  1        0       0
 2 +     4.08W       -        -    2  2  2  2        0       0
 3 -   0.0400W       -        -    3  3  3  3      210    1500
 4 -   0.0050W       -        -    4  4  4  4     2200    6000

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         0

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        28 Celsius
Available Spare:                    100%
Available Spare Threshold:          10%
Percentage Used:                    1%
Data Units Read:                    15 030 470 [7,69 TB]
Data Units Written:                 43 606 695 [22,3 TB]
Host Read Commands:                 330 002 480
Host Write Commands:                552 071 418
Controller Busy Time:               2 315
Power Cycles:                       751
Power On Hours:                     2 161
Unsafe Shutdowns:                   514
Media and Data Integrity Errors:    0
Error Information Log Entries:      409
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0
Temperature Sensor 1:               28 Celsius
Temperature Sensor 2:               38 Celsius

Error Information (NVMe Log 0x01, 16 of 64 entries)
Num   ErrCount  SQId   CmdId  Status  PELoc          LBA  NSID    VS
  0        409     0  0x000c  0x4004      -            0     0     -
  1        408     0  0x0008  0x4004  0x028            0     1     -
  2        407     0  0x0007  0x4004  0x028            0     1     -
  3        406     0  0x0006  0x4004  0x028            0     1     -
  4        405     0  0x0008  0x4004      -            0     0     -
  5        404     0  0x0008  0x4004  0x028            0     1     -
  6        403     0  0x0007  0x4004  0x028            0     1     -
  7        402     0  0x0006  0x4004  0x028            0     1     -
  8        401     0  0x5000  0x4004      -            0     0     -
  9        400     0  0x0008  0x4004  0x028            0     1     -
 10        399     0  0x0007  0x4004  0x028            0     1     -
 11        398     0  0x0006  0x4004  0x028            0     1     -
 12        397     0  0x0000  0x4004      -            0     0     -
 13        396     0  0x0008  0x4004  0x028            0     1     -
 14        395     0  0x0007  0x4004  0x028            0     1     -
 15        394     0  0x0006  0x4004  0x028            0     1     -
... (48 entries not read)

It's a repetitive error; sample here:

 Entry[63]
.................
error_count     : 346
sqid            : 0
cmdid           : 0xb001
status_field    : 0x2002(INVALID_FIELD: A reserved coded value or an unsupported value in a defined field)
phase_tag       : 0
parm_err_loc    : 0xffff
lba             : 0
nsid            : 0
vs              : 0
trtype          : The transport type is not indicated or the error is not transport related.
cs              : 0
trtype_spec_info: 0
.................

I will also try to send it to the nvme mailing list devs.
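
The smartctl table and the nvme-cli entry above appear to describe the same kind of error in two encodings. The sketch below assumes that this smartctl version prints the raw 16-bit status field, with the phase tag in bit 0, while nvme-cli prints the value shifted right by one. It also assumes the Parameter Error Location layout from the NVMe specification (byte offset in bits 7:0, bit offset in bits 10:8) and treats 0xffff as "not reported". Both function names are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: convert a raw status field (phase tag in bit 0), as
# assumed for smartctl 7.2, to the form nvme-cli prints.
raw_to_cli_status() {
    printf '0x%04x\n' $(( $1 >> 1 ))
}

# Hypothetical helper: split a Parameter Error Location value into the byte
# and bit offset within the command. 0xffff is assumed to mean that no
# location was reported.
decode_peloc() {
    if [ $(( $1 )) -eq $(( 0xffff )) ]; then
        echo "location not reported"
        return 0
    fi
    printf 'byte=%d bit=%d\n' $(( $1 & 0xff )) $(( ($1 >> 8) & 0x7 ))
}

raw_to_cli_status 0x4004
decode_peloc 0x028
```

Under these assumptions, 0x4004 becomes 0x2002, the same value nvme-cli reports, and PELoc 0x028 points at byte 40 of the command, which is the start of command dword 10.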

igaw commented 9 months ago

This should be addressed with the upcoming kernel 6.8 and the next release of libnvme (see https://github.com/linux-nvme/libnvme/issues/681)

erfan-star-1999 commented 3 months ago

Hi guys, I know this thread was closed a year ago, but I want to say that I still get this error on most distros I use, such as Debian, openSUSE Tumbleweed, and Arch. I'm currently on Arch and didn't notice the error until last week, when num_err_log_entries had risen to 5526. The number goes up every time I reboot the system. I have a Samsung 970 EVO Plus 500 GB.

Below is my error log:

Error Log Entries for device:nvme0n1 entries:64
.................
 Entry[ 0]
.................
error_count : 5526
sqid : 0
cmdid : 0x2019
status_field : 0x2002(Invalid Field in Command: A reserved coded value or an unsupported value in a defined field)
phase_tag : 0
parm_err_loc : 0xffff
lba : 0
nsid : 0
vs : 0
trtype : The transport type is not indicated or the error is not transport related.
csi : 0
opcode : 0
cs : 0
trtype_spec_info: 0
log_page_version: 0

This is my smart-log:

Smart Log for NVME device:nvme0n1 namespace-id:ffffffff
critical_warning : 0
temperature : 44 °C (317 K)
available_spare : 100%
available_spare_threshold : 10%
percentage_used : 1%
endurance group critical warning summary: 0
Data Units Read : 61521857 (31.50 TB)
Data Units Written : 38671369 (19.80 TB)
host_read_commands : 818138247
host_write_commands : 1091584779
controller_busy_time : 2208
power_cycles : 4966
power_on_hours : 2003
unsafe_shutdowns : 647
media_errors : 0
num_err_log_entries : 5526
Warning Temperature Time : 0
Critical Composite Temperature Time : 0
Temperature Sensor 1 : 44 °C (317 K)
Temperature Sensor 2 : 42 °C (315 K)
Thermal Management T1 Trans Count : 0
Thermal Management T2 Trans Count : 0
Thermal Management T1 Total Time : 0
Thermal Management T2 Total Time : 0

Sorry if my English is poor.

igaw commented 3 months ago

> The number goes up every time I reboot the system

The kernel issues a few commands to identify the device. Although these commands are valid, this particular firmware can't handle them and logs them. nvme-cli/libnvme doesn't issue these anymore with recent kernels (your distro kernel might be too old). There was some discussion on how to handle this and, if I remember correctly, the outcome was 'buggy firmware'.

In short, try to convince your vendor to fix the firmware and/or report this to the nvme linux mailing list.

erfan-star-1999 commented 3 months ago

> your distro kernel might be too old

I use the latest zen kernel (Linux 6.9.4-zen1-1-zen). I don't think it's a kernel issue: the number is too large for the kernel to be the cause, since an error is only logged at boot time. It might be a very old bug.

> try to convince your vendor to fix the firmware

I did that, but I don't think Samsung will give me any response, because the device is a little bit old and Samsung is not good at support.

igaw commented 3 months ago

> The number goes up every time I reboot the system

If this is not the only source, you need to figure out what is calling nvme-cli/libnvme and which commands it issues, e.g. check libudisk2. nvme list will not issue any commands directly. The kernel still might on behalf of libnvme, though.
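
One rough way to narrow down which tool triggers a new entry is to read the counter before and after running a suspect command. The sketch assumes the `num_err_log_entries : N` line format shown in the smart-log output earlier in this thread. `err_count` is a hypothetical helper that parses such text from standard input, so it can be tried without a device.

```shell
#!/bin/sh
# Hypothetical helper: extract num_err_log_entries from `nvme smart-log`
# text on standard input.
err_count() {
    awk -F: '/^num_err_log_entries/ { gsub(/[ \t]/, "", $2); print $2; exit }'
}

# Sample text in the format shown in this thread.
printf 'media_errors : 0\nnum_err_log_entries : 5526\n' | err_count

# Possible use on a real system (needs root; the device path and the
# suspect command are examples only):
#   before=$(nvme smart-log /dev/nvme0 | err_count)
#   smartctl -a /dev/nvme0 >/dev/null
#   after=$(nvme smart-log /dev/nvme0 | err_count)
#   echo "new entries: $((after - before))"
```

Repeating the before/after comparison around each candidate (a monitoring daemon, a desktop disk service, a suspend/resume cycle) should show which one increments the counter.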

erfan-star-1999 commented 3 months ago

> check libudisk2.

Sorry, how can I do that?

> nvme list

$ sudo nvme list
Node          Generic     SN               Model                           Namespace  Usage                 Format       FW Rev
/dev/nvme0n1  /dev/ng0n1  S4EVNF0M885724F  Samsung SSD 970 EVO Plus 500GB  0x1        84.17 GB / 500.11 GB  512 B + 0 B  2B2QEXM7

If there is something wrong with this, please tell me in detail.

dr460r commented 2 months ago

> I'm having the same issue, but with a Crucial P3 CT4000P3SSD8. I have two identical ones running in a ZFS mirror pool on Ubuntu 22.04 (kernel 5.15.0-67-generic), and every single system restart adds 2 errors to the SMART error log for each drive - 0x2002(INVALID_FIELD: A reserved coded value or an unsupported value in a defined field).
>
> Annoying.

@Rychu-Pawel I have the same SSD, and every time I execute sudo smartctl -a /dev/nvme0n1, one log entry is added to the list :D. It seems some other things are also adding log entries, but I haven't identified them yet.

System: Fedora 40 (default BTRFS install on that SSD)
Kernel: 6.9.9-200.fc40

If I understand correctly what @keithbusch wrote:

> the error is likely indicating the driver or some tool attempted a harmless optional command that your controller doesn't support

this shouldn't be something to worry about?