truenas / py-SMART

Wrapper for smartctl (smartmontools)
GNU Lesser General Public License v2.1
79 stars 35 forks source link

NVMe Windows self-test seems a bit off #64

Closed csm10495 closed 1 year ago

csm10495 commented 1 year ago

Describe the bug When trying to do a self-test with NVMe on Windows, it seems to act a bit weird. For example:

smartctl can see device self test runs:

smartctl -a /dev/sda -i
...
Self-test Log (NVMe Log 0x06)
Self-test status: Extended self-test in progress (16% completed)
Num  Test_Description  Status                       Power_on_Hours  Failing_LBA  NSID SCT Code
 0   Extended          Completed without error                3441            -     -   -    -
 1   Short             Completed without error                3441            -     -   -    -
 2   Short             Completed without error                1361            -     -   -    -
 3   Extended          Completed without error                 866            -     -   -    -
 4   Short             Aborted: Self-test command              865            -     -   -    -
 5   Short             Completed without error                 340            -     -   -    -

pySMART can't see the same

In [15]: d.dev_reference
Out[15]: '/dev/sda'

In [16]: d.update()

In [17]: d.tests
Out[17]: []

In [18]: d.get_selftest_result()
Out[18]: (2, 'No new self-test results found.', None)

In [19]: import subprocess; subprocess.call('smartctl -l selftest /dev/sda -d nvme', shell=True)
smartctl pre-7.4 2023-03-21 r5470 [x86_64-w64-mingw32-w10-22H2] (CircleCI)
Copyright (C) 2002-23, Bruce Allen, Christian Franke, www.smartmontools.org

Self-test Log (NVMe Log 0x06)
Self-test status: Extended self-test in progress (59% completed)
Num  Test_Description  Status                       Power_on_Hours  Failing_LBA  NSID SCT Code
 0   Extended          Completed without error                3441            -     -   -    -
 1   Short             Completed without error                3441            -     -   -    -
 2   Short             Completed without error                1361            -     -   -    -
 3   Extended          Completed without error                 866            -     -   -    -
 4   Short             Aborted: Self-test command              865            -     -   -    -
 5   Short             Completed without error                 340            -     -   -    -

Out[19]: 0

In [20]: d.get_selftest_result()
Out[20]: (2, 'No new self-test results found.', None)

In [21]: d.update()

In [22]: d.tests
Out[22]: []

In [23]: d
Out[23]: <NVME device on /dev/sda mod:Samsung SSD 980 PRO 1TB sn:S5P2NG0NB05964V>

Raw outputs Please, provide the following raw command outputs to be added to our test-base. Notify explicitly if you don't want them to be incorporated. In any case, they are "mostly required" to check and verify your issue/bug.

C:\WINDOWS\system32>smartctl --scan-open
/dev/sda -d nvme # /dev/sda, NVMe device
/dev/sdb -d nvme # /dev/sdb, NVMe device
/dev/sdc -d scsi # /dev/sdc, SCSI device
C:\WINDOWS\system32>smartctl --all /dev/sda -d nvme
smartctl pre-7.4 2023-03-21 r5470 [x86_64-w64-mingw32-w10-22H2] (CircleCI)
Copyright (C) 2002-23, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number:                       Samsung SSD 980 PRO 1TB
Serial Number:                      S5P2NG0NB05964V
Firmware Version:                   5B2QGXA7
PCI Vendor/Subsystem ID:            0x144d
IEEE OUI Identifier:                0x002538
Total NVM Capacity:                 1,000,204,886,016 [1.00 TB]
Unallocated NVM Capacity:           0
Controller ID:                      6
NVMe Version:                       1.3
Number of Namespaces:               1
Namespace 1 Size/Capacity:          1,000,204,886,016 [1.00 TB]
Namespace 1 Utilization:            471,137,140,736 [471 GB]
Namespace 1 Formatted LBA Size:     512
Namespace 1 IEEE EUI-64:            002538 bb01501f70
Local Time is:                      Sat Mar 25 16:14:42 2023 PDT
Firmware Updates (0x16):            3 Slots, no Reset required
Optional Admin Commands (0x0017):   Security Format Frmw_DL Self_Test
Optional NVM Commands (0x0057):     Comp Wr_Unc DS_Mngmt Sav/Sel_Feat Timestmp
Log Page Attributes (0x0f):         S/H_per_NS Cmd_Eff_Lg Ext_Get_Lg Telmtry_Lg
Maximum Data Transfer Size:         128 Pages
Warning  Comp. Temp. Threshold:     82 Celsius
Critical Comp. Temp. Threshold:     85 Celsius

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     8.49W       -        -    0  0  0  0        0       0
 1 +     4.48W       -        -    1  1  1  1        0     200
 2 +     3.18W       -        -    2  2  2  2        0    1000
 3 -   0.0400W       -        -    3  3  3  3     2000    1200
 4 -   0.0050W       -        -    4  4  4  4      500    9500

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         0

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        62 Celsius
Available Spare:                    100%
Available Spare Threshold:          10%
Percentage Used:                    13%
Data Units Read:                    151,288,508 [77.4 TB]
Data Units Written:                 264,489,347 [135 TB]
Host Read Commands:                 5,047,031,068
Host Write Commands:                7,842,130,574
Controller Busy Time:               19,134
Power Cycles:                       662
Power On Hours:                     3,441
Unsafe Shutdowns:                   36
Media and Data Integrity Errors:    0
Error Information Log Entries:      0
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0
Temperature Sensor 1:               62 Celsius
Temperature Sensor 2:               72 Celsius

Error Information (NVMe Log 0x01, 16 of 64 entries)
No Errors Logged

Self-test Log (NVMe Log 0x06)
Self-test status: Extended self-test in progress (28% completed)
Num  Test_Description  Status                       Power_on_Hours  Failing_LBA  NSID SCT Code
 0   Extended          Completed without error                3441            -     -   -    -
 1   Short             Completed without error                3441            -     -   -    -
 2   Short             Completed without error                1361            -     -   -    -
 3   Extended          Completed without error                 866            -     -   -    -
 4   Short             Aborted: Self-test command              865            -     -   -    -
 5   Short             Completed without error                 340            -     -   -    -

Environmental setup: Note i'm using a nightly build of smartctl since the latest released doesn't work properly with nvme self-tests.

C:\WINDOWS\system32>smartctl --version
smartctl pre-7.4 2023-03-21 r5470 [x86_64-w64-mingw32-w10-22H2] (CircleCI)
Copyright (C) 2002-23, Bruce Allen, Christian Franke, www.smartmontools.org

smartctl comes with ABSOLUTELY NO WARRANTY. This is free
software, and you are welcome to redistribute it under
the terms of the GNU General Public License; either
version 2, or (at your option) any later version.
See https://www.gnu.org for further details.

smartmontools pre-release 7.4
smartmontools SVN rev 5470 dated 2023-03-21 at 19:47:46
smartmontools build host: x86_64-w64-mingw32
smartmontools build with: C++11, GCC 7.3-win32 20180312, MinGW-w64 5.0
smartmontools configure arguments: 'build_alias=x86_64-pc-linux-gnu' 'host_alias=x86_64-w64-mingw32'

C:\WINDOWS\system32>pip freeze | findstr -i smart
pySMART==1.2.3

Additional context Optimally it would be nice to be able to run and query self tests from py-SMART

ralequi commented 1 year ago

Hi Charles @csm10495,

I'm going pick your raw data and check what is going wrong.

In any case, to your comment

Optimally it would be nice to be able to run and query self tests from py-SMART

Note that there are methods test_start & test_stop exactly to do what you are requesting on the pysmart object and run_selftest on the device object

ralequi commented 1 year ago

Just to take into account:

ralequi commented 1 year ago

Hi again @csm10495

Please, check the develop branch and tell me if everything works fine now.

csm10495 commented 1 year ago

Cool. Getting closer: though i noticed printing a test doesn't work:

In [13]: t = d.tests[0]

In [14]: t
Out[14]: <SMART Self-test [Extended|Completed without error] hrs:3442 LBA:None>

In [15]: print(t)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In [15], line 1
----> 1 print(t)

File ~\Desktop\py-SMART\pySMART\testentry.py:170, in TestEntry.__str__(self)
    153     return ("{0:>2} {1:17}{2:23}{3:7}{4:14}[{5:4}{6:5}{7:4}]".format(
    154         self.num,
    155         self.type,
   (...)
    161         self.ASCQ
    162     ))
    163 elif self._format == 'nvme':
    164     ## NVME FORMAT ##
    165     # Example smartctl output
   (...)
    168     # Num  Test_Description  Status                       Power_on_Hours  Failing_LBA  NSID SCT Code
    169     #  0   Extended          Completed without error                3441            -     -   -    -
--> 170     return ("{0:>2} {1:18}{2:29}{3:16}{4:13}{5:5}{6:4}{7:4}".format(
    171         self.num,
    172         self.type,
    173         self.status,
    174         self.hours,
    175         self.LBA,
    176         self.nsid,
    177         self.sct,
    178         self.code
    179     ))

TypeError: unsupported format string passed to NoneType.__format__

In [16]: d
Out[16]: <NVME device on /dev/sda mod:Samsung SSD 980 PRO 1TB sn:S5P2NG0NB05964V>

In [17]: d.tests
Out[17]:
[<SMART Self-test [Extended|Completed without error] hrs:3442 LBA:None>,
 <SMART Self-test [Extended|Completed without error] hrs:3441 LBA:None>,
 <SMART Self-test [Short|Completed without error] hrs:3441 LBA:None>,
 <SMART Self-test [Short|Completed without error] hrs:1361 LBA:None>,
 <SMART Self-test [Extended|Completed without error] hrs:866 LBA:None>,
 <SMART Self-test [Short|Aborted: Self-test command] hrs:865 LBA:None>,
 <SMART Self-test [Short|Completed without error] hrs:340 LBA:None>]

I also tried doing a run_selftest_and_wait() and it looks like it returned the last run test immediately instead of waiting for the new test to start. Running it again gave back 'self test already in progress'.:

In [18]: d.run_selftest_and_wait('short')
Out[18]: (0, <SMART Self-test [Extended|Completed without error] hrs:3442 LBA:None>)

In [19]: d.run_selftest_and_wait('short')
Out[19]: (1, 'Self-test already in progress. Please wait.')

When running a long test, i'm not quite sure it did anything, though it should have worked:

In [44]: d.run_selftest_and_wait('long')
Out[44]: (0, <SMART Self-test [Short|Completed without error] hrs:3473 LBA:None>)

In [45]: d.run_selftest_and_wait('long')
Out[45]: (0, <SMART Self-test [Short|Completed without error] hrs:3473 LBA:None>)

In [46]: d.test_capabilities
Out[46]:
{'offline': False,
 'short': True,
 'long': True,
 'conveyance': False,
 'selective': False}
ralequi commented 1 year ago

Hi @csm10495

I made a couple of fixes. I think it should work now, or at least everything I've tested does. Please, check them when you have time.

csm10495 commented 1 year ago

Getting closer. Now it seems like it ran a test, but only once. Now if I try to run another, it seems to immediately pass but not add to d.tests which makes me think its not kicking the test again.

In [33]: d.run_selftest_and_wait('short')
Out[33]: (0, <SMART Self-test [Short|Completed without error] hrs:3481 LBA:None>)

In [34]: len(d.tests)
Out[34]: 9

In [35]: d.run_selftest_and_wait('short')
Out[35]: (0, <SMART Self-test [Short|Completed without error] hrs:3481 LBA:None>)

In [36]: len(d.tests)
Out[36]: 9

In [37]: d.run_selftest_and_wait('short')
Out[37]: (0, <SMART Self-test [Short|Completed without error] hrs:3481 LBA:None>)

In [38]: len(d.tests)
Out[38]: 9

In [39]: print(d.tests[0])
 0   Short             Completed without error                3481            -     -   -    -

In [40]: d.update()

In [41]: len(d.tests)
Out[41]: 9

...

In [44]: d.run_selftest_and_wait('long')
Out[44]: (0, <SMART Self-test [Short|Completed without error] hrs:3481 LBA:None>)

In [45]: d.update()

In [46]: len(d.tests)
Out[46]: 9
ralequi commented 1 year ago

Hi @csm10495

Locally, when I've tested it, it worked perfectly. 9 may be a limit in how many tests are reported on smartctl.

Could you please check that's correct? I don't see any logic path on the code that may it work only once. You can also test "short" and "long" alternatively so it's easier to view if new tests are appearing.

csm10495 commented 1 year ago

I think smartctl itself is messing something up here. It doesn't seem to want to actually kick the tests, but when i use Samsung Magician to kick the test, it can report the test status just fine. Similarly aborting the test doesn't do anything with smartctl but works fine with Magician. Looks like the tests appear in smartctl's listing after completion/abortion still.

I'm guessing that the issues I'm seeing are directly because smartctl (and not necessarily py-SMART) is acting funky.

Here's a video to show the full behavior and testing:

https://user-images.githubusercontent.com/5749838/228715935-66c36cce-36f0-4913-80bb-f5416aa38e3a.mp4

csm10495 commented 1 year ago

I think from py-SMART's perspective ya did what you could. I've filed: https://www.smartmontools.org/ticket/1708 with smartmontools. I think this is ready to be closed out.

Thanks!

ralequi commented 1 year ago

Thank you Charles!

Don't hesitate to open any new issue you found out there!