wallacebrf / Synology-SMART-test-scheduler

Scheduler of Synology SMART tests

Script thinks manually SMART tests are still running #19

Closed 007revad closed 5 hours ago

007revad commented 1 day ago

I used the webui to manually start a short SMART test on each drive. Each test finished within 2 minutes.

20 minutes later the webui still thinks they are running: [image]

The history logs show the tests as started:

Disk: /dev/sata1
Model: WD10JUCT-63CYNY0
Serial: redacted
User Capacity:    1,000,204,886,016 bytes [1.00 TB]
Test Started: 13/11/2024 12:31:01:313

Disk: /dev/sata2
Model: SSDSC2BF180A4H
Serial: redacted
User Capacity:    180,045,766,656 bytes [180 GB]
Test Started: 13/11/2024 12:31:03:933

Disk: /dev/sata3
Model: ST3500312CS
Serial: redacted
User Capacity:    500,107,862,016 bytes [500 GB]
Test Started: 13/11/2024 12:31:06:413

Disk: /dev/usb1
Model: STT_FTM64GX25H
Serial: redacted
User Capacity:    64,023,257,088 bytes [64.0 GB]
Test Started: 13/11/2024 12:31:08:833

I received an email for each drive saying the test had started, like:

12:30:59 - 
Synology Drive Slot: 1 [Main Unit]
Disk: /dev/sata1
Model: WD10JUCT-63CYNY0
Serial: redacted
User Capacity:    1,000,204,886,016 bytes [1.00 TB]

Extended SMART test was MANUALLY started.

I did not receive any emails saying the tests had stopped. Storage Manager shows that the tests have finished.

wallacebrf commented 1 day ago

Interesting, I will take a look at that, not sure why that is occurring. I obviously need to test the short tests more as I have focused more on the long tests

007revad commented 1 day ago

I've been running short tests because long tests take too long.

wallacebrf commented 1 day ago

yea i was just doing (outside of the script in SSH) cancel commands to stop the long tests

wallacebrf commented 1 day ago

also, i noticed the email says "extended" even if the test is short, i will have to fix that

wallacebrf commented 1 day ago

just tested the short tests on my spare DS920. manually started the short tests using only the web interface. all tests started, and stopped normally. i received the emails on both start and finish

wallacebrf commented 1 day ago

wonder if it is this line

#extract the status, IE is a SMART test active or not?
disk_smart_status=$(echo "$raw_data" | grep -A 1 "Self-test execution status:" | tr '\n' ' ') #get SMART status for the disk

perhaps your SMART version is different and/or reporting something else?
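
A minimal sketch of how a flattened status string like the one that line produces could be checked for a running test. The helper name `smart_test_in_progress` and the two sample strings are assumptions for illustration (the samples follow smartctl's usual wording); this is not the script's actual logic.

```shell
#!/bin/bash
# Hypothetical helper (not from the script): succeeds when the status text
# reports a running self-test, fails otherwise.
smart_test_in_progress() {
    local status_text="$1"
    case "$status_text" in
        *"Self-test routine in progress"*) return 0 ;;
        *) return 1 ;;
    esac
}

# Sample strings shaped like the output of: grep -A 1 ... | tr '\n' ' '
sample_running='Self-test execution status:      ( 249) Self-test routine in progress...  90% of test remaining. '
sample_idle='Self-test execution status:      (   0) The previous self-test routine completed  without error or no self-test has ever '

if smart_test_in_progress "$sample_running"; then echo "running"; else echo "idle"; fi
if smart_test_in_progress "$sample_idle"; then echo "running"; else echo "idle"; fi
```

Matching on the message text rather than on column positions should tolerate differences in spacing between drives, though a drive that reports a different message would still need its own case.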

007revad commented 1 day ago

When do the start_short_sata1.txt files get deleted?

I just noticed that when I ran the script an hour later it started the short tests again, and the start_short_sata1.txt files were still in temp. But when I ran the script just now it deleted those files.


On a Synology I use grep " Self-test routine in progress" which gets the percentage remaining.

percentleft=$(smartctl -a -d sat -T permissive /dev/"$drive" | grep "  Self-test routine in progress" | cut -d " " -f9-13)

On an Asustor I use grep "ScanStatus" which gets the percentage done.

percentdone=$(smartctl -a /dev/sd$1 | grep "ScanStatus" | cut -d " " -f3-4)

Though now I think the difference may not be DSM vs ADM but different drive brands.

My drives are showing differences:

Synology Drive Slot: 1 [Main Unit]
Disk: /dev/sata1
Testing is already in progress.
Percent complete: 0%

Synology Drive Slot: 2 [Main Unit]
Disk: /dev/sata2
Testing is already in progress.
Percent complete: 0%

Synology Drive Slot: 1 [Expansion Unit 1]
Disk: /dev/sata3
Testing is already in progress.
Percent complete: 0%

Synology USB Disk 3
Disk: /dev/usb1
Testing is already in progress.
Percent complete: 100%

I'm running some tests and will report back shortly...

wallacebrf commented 1 day ago

The start_short file gets deleted right when the short and long tests are started, or if the tests are already in progress and the user submits them in the web interface

Starting at line 568

#do some cleanup in case the web-interface is trying to command the disk to do something when SMART is disabled
if [ -r "$temp_dir/cancel_$disk_temp_file_name" ]; then
    rm "$temp_dir/cancel_$disk_temp_file_name"
fi

if [ -r "$temp_dir/start_long_$disk_temp_file_name" ]; then
    rm "$temp_dir/start_long_$disk_temp_file_name"
fi

if [ -r "$temp_dir/start_short_$disk_temp_file_name" ]; then
    rm "$temp_dir/start_short_$disk_temp_file_name"
fi

Or line 662

#perform some temp file clean up
if [ -r "$temp_dir/start_long_$disk_temp_file_name" ]; then
    rm "$temp_dir/start_long_$disk_temp_file_name"
elif [ -r "$temp_dir/start_short_$disk_temp_file_name" ]; then
    rm "$temp_dir/start_short_$disk_temp_file_name"
fi

On line 674

#command the test to start
if [ -r "$temp_dir/start_long_$disk_temp_file_name" ]; then
    smartctl -d sat -a -t long "${disk_names[$xx]}" 2>/dev/null
    manual_test_refresh_tracker=1
    rm "$temp_dir/start_long_$disk_temp_file_name"
elif [ -r "$temp_dir/start_short_$disk_temp_file_name" ]; then
    smartctl -d sat -a -t short "${disk_names[$xx]}" 2>/dev/null
    manual_test_refresh_tracker=1
    rm "$temp_dir/start_short_$disk_temp_file_name"
fi
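
The snippets above treat the start_long/start_short files as one-shot requests: the web interface creates the file and the script deletes it once it has acted on it. A rough sketch of that pattern in isolation follows. The function name `consume_start_request` and the value `sata1.txt` for the per-disk file name are assumptions for illustration, and no smartctl command is run here.

```shell
#!/bin/bash
# Hypothetical helper: consume a one-shot request file left by the web UI.
# Prints the requested test type ("long" or "short") and deletes the file,
# so the same request is not acted on twice.
consume_start_request() {
    local temp_dir="$1" disk_file="$2" kind
    for kind in long short; do
        if [ -r "$temp_dir/start_${kind}_$disk_file" ]; then
            rm -f "$temp_dir/start_${kind}_$disk_file"
            echo "$kind"
            return 0
        fi
    done
    return 1
}

# Demo in a throwaway directory
demo_dir=$(mktemp -d)
touch "$demo_dir/start_short_sata1.txt"
first=$(consume_start_request "$demo_dir" "sata1.txt")
second=$(consume_start_request "$demo_dir" "sata1.txt") || second="none"
echo "first run: $first, second run: $second"
rm -rf "$demo_dir"
```

With this shape a request only fires again if something recreates the file, for example a form being resubmitted.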

007revad commented 1 day ago

Okay, they are being recreated when I refresh the browser. Chrome actually warned me but I just clicked Continue without reading the warning. [image]

It would be nice if you could add a Reload button that repopulates the text boxes and tables without resubmitting.


I just started short tests on my DS720+'s 4 drives and the following worked as it should:

smartctl -a -d sat "$drive" | grep -A 1 "Self-test execution status:"

While the test is running that command returns:

Self-test execution status:      ( 249) Self-test routine in progress...
                                        90% of test remaining.

And when the test has finished:

Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever


Except for my dodgy USB drive which shows 00% remaining:

Self-test execution status:      ( 240) Self-test routine in progress...
                                        00% of test remaining.

So all my refreshing of the web page, then running the script, was causing the short tests to run again.
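
For the status output quoted above, the remaining percentage could be pulled out by pattern rather than by field position. This is only a sketch: `percent_remaining` is a hypothetical helper, and it assumes the "NN% of test remaining" wording shown in this thread.

```shell
#!/bin/bash
# Hypothetical helper: read smartctl status text on stdin and print the NN
# from a line of the form "NN% of test remaining." (prints nothing if absent).
percent_remaining() {
    sed -n 's/^[[:space:]]*\([0-9][0-9]*\)% of test remaining.*/\1/p'
}

# Sample text copied from the output earlier in this thread
sample='Self-test execution status:      ( 249) Self-test routine in progress...
                                        90% of test remaining.'
left=$(printf '%s\n' "$sample" | percent_remaining)
echo "remaining: ${left}%"

# On a NAS this would be fed from smartctl instead, e.g.:
#   smartctl -a -d sat "$drive" | grep -A 1 "Self-test execution status:" | percent_remaining
```

An empty result means no test is reported as running, which avoids depending on how many spaces a particular drive or smartctl build puts before the text.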

wallacebrf commented 16 hours ago

Ok, that makes sense.

I will see about adding a reload button, should be easy to do to prevent this issue again

wallacebrf commented 16 hours ago

for your dodgy USB disk, can you send me the RAW smartctl results? i would like to compare its results to everything else, try to get to the bottom of that drive...

007revad commented 7 hours ago

That dodgy drive, which is in a USB dock, is a Super Talent 64GB SSD from 2008.

Dodgy USB drive's RAW smartctl results:

root@SENNA:~# smartctl -a -d sat /dev/usb1
smartctl 6.5 (build date Sep 26 2022) [x86_64-linux-4.4.302+] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Indilinx Barefoot based SSDs
Device Model:     STT_FTM64GX25H
Serial Number:    redacted (not sure why I redacted the serial as it's 16 years old)
Firmware Version: 1819
User Capacity:    64,023,257,088 bytes [64.0 GB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    Solid State Device
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS (minor revision not indicated)
Local Time is:    Thu Nov 14 07:52:09 2024 AEDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART Status not supported: Incomplete response, ATA output registers missing
SMART overall-health self-assessment test result: PASSED
Warning: This result is based on an Attribute check.

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (  16) The self-test routine was aborted by
                                        the host.
Total time to complete Offline
data collection:                (    0) seconds.
Offline data collection
capabilities:                    (0x1d) SMART execute Offline immediate.
                                        No Auto Offline data collection support.
                                        Abort Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        No Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x00) Error logging NOT supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   0) minutes.
Extended self-test routine
recommended polling time:        (   0) minutes.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME                                                   FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate                                              0x0000   ---   ---   ---    Old_age   Offline      -       7
  9 Power_On_Hours                                                   0x0000   ---   ---   ---    Old_age   Offline      -       73903
 12 Power_Cycle_Count                                                0x0000   ---   ---   ---    Old_age   Offline      -       0
184 Initial_Bad_Block_Count                                          0x0000   ---   ---   ---    Old_age   Offline      -       22
195 Program_Failure_Blk_Ct                                           0x0000   ---   ---   ---    Old_age   Offline      -       17
196 Erase_Failure_Blk_Ct                                             0x0000   ---   ---   ---    Old_age   Offline      -       4
197 Read_Failure_Blk_Ct                                              0x0000   ---   ---   ---    Old_age   Offline      -       0
198 Read_Sectors_Tot_Ct                                              0x0000   ---   ---   ---    Old_age   Offline      -       21523009830
199 Write_Sectors_Tot_Ct                                             0x0000   ---   ---   ---    Old_age   Offline      -       34484549447
200 Read_Commands_Tot_Ct                                             0x0000   ---   ---   ---    Old_age   Offline      -       393521368
201 Write_Commands_Tot_Ct                                            0x0000   ---   ---   ---    Old_age   Offline      -       906905574
202 Error_Bits_Flash_Tot_Ct                                          0x0000   ---   ---   ---    Old_age   Offline      -       4455533
203 Corr_Read_Errors_Tot_Ct                                          0x0000   ---   ---   ---    Old_age   Offline      -       3487764
204 Bad_Block_Full_Flag                                              0x0000   ---   ---   ---    Old_age   Offline      -       0
205 Max_PE_Count_Spec                                                0x0000   ---   ---   ---    Old_age   Offline      -       10000
206 Min_Erase_Count                                                  0x0000   ---   ---   ---    Old_age   Offline      -       1
207 Max_Erase_Count                                                  0x0000   ---   ---   ---    Old_age   Offline      -       200783
208 Average_Erase_Count                                              0x0000   ---   ---   ---    Old_age   Offline      -       30706
209 Remaining_Lifetime_Perc                                          0x0000   ---   ---   ---    Old_age   Offline      -       81
211 SATA_Error_Ct_CRC                                                0x0000   ---   ---   ---    Old_age   Offline      -       0
212 SATA_Error_Ct_Handshake                                          0x0000   ---   ---   ---    Old_age   Offline      -       0
213 Indilinx_Internal                                                0x0000   ---   ---   ---    Old_age   Offline      -       0

Warning! SMART ATA Error Log Structure error: invalid SMART checksum.
SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Aborted by host               00%       255         -
# 2  Extended offline    Aborted by host               00%       255         -

Selective Self-tests/Logging not supported

wallacebrf commented 6 hours ago

when it gets stuck at "00%", do you know what the value was for this line?

Self-test execution status:      (  16) The self-test routine was aborted by
                                        the host.

that is what the script uses to know if a test is in progress, so i am wondering if it never changes state? it is possible the test is at 100% but the test is stuck. i have seen server drives get stuck for days and even weeks at either the 90% or 100% mark.... you mentioned that this drive already has corruption.

007revad commented 5 hours ago

That line was

Self-test execution status:      ( 240) Self-test routine in progress...
                                        00% of test remaining.

This SSD drive was in a media player and contained some text config files that started refusing to load, showing weird characters when viewed from Windows, so I suspect it may have bad sectors (worn cells).
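
The number in brackets on the status line is the raw ATA self-test execution status byte. As far as I know, the upper four bits hold the state (15 means a test is running) and the lower four bits hold the remaining work in tenths, which matches the 249, 240, 16 and 0 values seen in this thread. A small sketch for decoding it, where `decode_selftest_status` is a hypothetical helper and only the states seen here are named:

```shell
#!/bin/bash
# Hypothetical helper: decode the NNN from "Self-test execution status: ( NNN)".
# Assumption: upper 4 bits = state, lower 4 bits = tenths of the test remaining.
decode_selftest_status() {
    local code="$1"
    local state=$(( code >> 4 ))
    local tenths=$(( code & 15 ))
    if [ "$state" -eq 15 ]; then
        echo "in progress, $(( tenths * 10 ))% remaining"
    elif [ "$state" -eq 0 ]; then
        echo "completed or never run"
    elif [ "$state" -eq 1 ]; then
        echo "aborted by host"
    else
        echo "other state $state"
    fi
}

# The four values that appear in this thread
for code in 249 240 16 0; do
    echo "$code: $(decode_selftest_status "$code")"
done
```

Read this way, 240 means the drive still reports a test as running with 0% remaining, so a check on "in progress" alone would keep treating it as active until the state changes.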

wallacebrf commented 5 hours ago

I think this issue can be closed then?

007revad commented 5 hours ago

I ticked the stop test box and clicked submit in the webui, then refreshed the UI, which cleared the "test running" state.