007revad closed this issue 5 hours ago.
Interesting, I will take a look at that; I'm not sure why that is occurring. I obviously need to test the short tests more, as I have focused mostly on the long tests.
I've been running short tests because long tests take too long.
Yeah, I was just running cancel commands over SSH (outside of the script) to stop the long tests.
Also, I noticed the email says "extended" even when the test is short; I will have to fix that.
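One possible way to get the email wording right is to read the newest entry ("# 1") of the drive's own self-test log, which smartctl prints as "Short offline" or "Extended offline" (as in the USB drive output further down). This is a minimal sketch, assuming the drive logs the test it is running; the function name test_type_label is hypothetical and not part of the script.

```shell
#!/usr/bin/env bash
# Hypothetical helper: pick the email wording from the newest entry ("# 1")
# of smartctl's self-test log instead of always saying "extended".
# Reads `smartctl -a` (or `smartctl -l selftest`) output on stdin.
test_type_label() {
    local line
    # First self-test log line, e.g. "# 1  Short offline  Completed ..."
    line=$(grep -m 1 '^# *1 ') || true
    case "$line" in
        *"Short offline"*)    echo "short" ;;
        *"Extended offline"*) echo "extended" ;;
        *)                    echo "unknown" ;;  # no log entry found
    esac
}
```

Usage would be along the lines of label=$(smartctl -a -d sat "${disk_names[$xx]}" | test_type_label). The script also knows which start_short/start_long file it acted on, so passing that along would work too; reading the log has the advantage of also covering tests started from Storage Manager.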
I just tested the short tests on my spare DS920, starting them manually using only the web interface. All tests started and stopped normally, and I received the emails on both start and finish.
I wonder if it is this line:
#extract the status, i.e. is a SMART test active or not?
disk_smart_status=$(echo "$raw_data" | grep -A 1 "Self-test execution status:" | tr '\n' ' ') #get SMART status for the disk
Perhaps your smartctl version is different and/or your drives are reporting something else?
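For comparison, here is a sketch of how the flattened disk_smart_status string could be checked for a running test and the "NN% of test remaining" figure extracted. It assumes the two-line format shown later in this thread ("Self-test routine in progress..." followed by "90% of test remaining."); the function name test_percent_remaining is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: given the flattened text the script builds with
#   grep -A 1 "Self-test execution status:" | tr '\n' ' '
# print the percent of the test remaining and return 0 if a self-test is
# running; return 1 if no test is in progress.
test_percent_remaining() {
    local status=$1 pct
    case "$status" in
        *"Self-test routine in progress"*) ;;
        *) return 1 ;;
    esac
    # "90% of test remaining" -> "90"; some drives may omit the figure.
    pct=$(printf '%s\n' "$status" | grep -o '[0-9]\{1,3\}% of test remaining' | grep -o '^[0-9]*') || true
    if [ -n "$pct" ]; then
        echo $(( 10#$pct ))   # 10# keeps "00"/"08" from being read as octal
    else
        echo "unknown"
    fi
}
```

Percent complete would then be 100 minus the printed value. Matching on the text rather than on fixed field positions is intended to be less sensitive to differences between smartctl versions or drive models, though that has not been verified across drives.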
When do the start_short_sata1.txt files get deleted?
I just noticed that when I ran the script an hour later, it started the short tests again, and the start_short_sata1.txt files were still in temp. But when I ran the script just now, it deleted those files.
On a Synology I use grep " Self-test routine in progress", which gets the percentage remaining.
percentleft=$(smartctl -a -d sat -T permissive /dev/"$drive" | grep " Self-test routine in progress" | cut -d " " -f9-13)
On an Asustor I use grep "ScanStatus", which gets the percentage done.
percentdone=$(smartctl -a /dev/sd$1 | grep "ScanStatus" | cut -d " " -f3-4)
Though now I think the difference may not be DSM vs ADM but different drive brands.
My drives are showing differences:
Synology Drive Slot: 1 [Main Unit]
Disk: /dev/sata1
Testing is already in progress.
Percent complete: 0%
Synology Drive Slot: 2 [Main Unit]
Disk: /dev/sata2
Testing is already in progress.
Percent complete: 0%
Synology Drive Slot: 1 [Expansion Unit 1]
Disk: /dev/sata3
Testing is already in progress.
Percent complete: 0%
Synology USB Disk 3
Disk: /dev/usb1
Testing is already in progress.
Percent complete: 100%
I'm running some tests and will report back shortly...
The start_short file gets deleted as soon as the short or long tests are started, or if the tests are already in progress when the user submits them in the web interface.
Starting at line 568:
#do some cleanup in case the web-interface is trying to command the disk to do something when SMART is disabled
if [ -r "$temp_dir/cancel_$disk_temp_file_name" ]; then
rm "$temp_dir/cancel_$disk_temp_file_name"
fi
if [ -r "$temp_dir/start_long_$disk_temp_file_name" ]; then
rm "$temp_dir/start_long_$disk_temp_file_name"
fi
if [ -r "$temp_dir/start_short_$disk_temp_file_name" ]; then
rm "$temp_dir/start_short_$disk_temp_file_name"
fi
Or line 662:
#perform some temp file clean up
if [ -r "$temp_dir/start_long_$disk_temp_file_name" ]; then
rm "$temp_dir/start_long_$disk_temp_file_name"
elif [ -r "$temp_dir/start_short_$disk_temp_file_name" ]; then
rm "$temp_dir/start_short_$disk_temp_file_name"
fi
On line 674:
#command the test to start
if [ -r "$temp_dir/start_long_$disk_temp_file_name" ]; then
smartctl -d sat -a -t long "${disk_names[$xx]}" 2>/dev/null
manual_test_refresh_tracker=1
rm "$temp_dir/start_long_$disk_temp_file_name"
elif [ -r "$temp_dir/start_short_$disk_temp_file_name" ]; then
smartctl -d sat -a -t short "${disk_names[$xx]}" 2>/dev/null
manual_test_refresh_tracker=1
rm "$temp_dir/start_short_$disk_temp_file_name"
fi
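The clean-up block at line 568 repeats the same test-then-delete pattern for each request file. A small shared helper could cover it; this is a sketch only, assuming disk_temp_file_name holds something like sata1.txt (as the start_short_sata1.txt name suggests). The function name remove_request_files is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical clean-up helper for the per-disk request files
# (cancel_*, start_long_*, start_short_*). rm -f is silent when a file
# is absent, so the separate [ -r ] checks are not needed.
remove_request_files() {
    local temp_dir=$1 disk_temp_file_name=$2 kind
    for kind in cancel start_long start_short; do
        rm -f -- "$temp_dir/${kind}_$disk_temp_file_name"
    done
}
```

It would replace the three if-blocks at line 568 with remove_request_files "$temp_dir" "$disk_temp_file_name". The blocks at lines 662 and 674 intentionally use elif to handle only one request, so they are left out of this sketch.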
Okay, they are being recreated when I refresh the browser. Chrome actually warned me, but I just clicked Continue without reading the warning. It would be nice if you could add a Reload button that repopulates the text boxes and tables without resubmitting the form.
I just started short tests on my DS720+'s 4 drives and the following worked as it should:
smartctl -a -d sat "$drive" | grep -A 1 "Self-test execution status:"
While the test is running that command returns:
Self-test execution status: ( 249) Self-test routine in progress...
90% of test remaining.
And when the test has finished:
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
Except for my dodgy USB drive, which shows 00% remaining:
Self-test execution status: ( 240) Self-test routine in progress...
00% of test remaining.
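The number in parentheses on the "Self-test execution status:" line is the raw ATA status byte: the upper 4 bits are the status code (15 = in progress) and, while a test is running, the lower 4 bits are the remaining percentage in 10% steps. So 249 (0xF9) is "in progress, 90% remaining", while 240 (0xF0) is "in progress, 0% remaining", which would fit a test that has reached the end but never reports completion. A sketch of decoding it, with hypothetical function names and only the most common status codes spelled out:

```shell
#!/usr/bin/env bash
# Extract the raw status byte from `smartctl -a` output on stdin,
# e.g. "Self-test execution status:      ( 240) ..." -> "240".
status_byte() {
    sed -n 's/.*Self-test execution status: *( *\([0-9]*\)).*/\1/p'
}

# Decode the byte: upper nibble = status, lower nibble = tens of percent
# remaining (only meaningful while status is 15, "in progress").
decode_selftest_status() {
    local byte=$1
    local status=$(( byte >> 4 )) remaining=$(( (byte & 15) * 10 ))
    case "$status" in
        0)  echo "completed without error (or never run)" ;;
        1)  echo "aborted by host" ;;
        2)  echo "interrupted by host reset" ;;
        15) echo "in progress, ${remaining}% remaining" ;;
        *)  echo "finished with error/status code $status" ;;
    esac
}
```

For example, decode_selftest_status "$(smartctl -a -d sat /dev/usb1 | status_byte)" would print "aborted by host" for the raw output posted below, where the byte is 16.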
So all my refreshing of the web page, then running the script, was causing the short tests to run again.
Ok, that makes sense.
I will see about adding a reload button; it should be easy to do and will prevent this issue from happening again.
For your dodgy USB disk, can you send me the raw smartctl results? I would like to compare them to everything else and try to get to the bottom of that drive...
That dodgy drive, which is in a USB dock, is a Super Talent 64GB SSD from 2008.
Dodgy USB drive's RAW smartctl results:
root@SENNA:~# smartctl -a -d sat /dev/usb1
smartctl 6.5 (build date Sep 26 2022) [x86_64-linux-4.4.302+] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Model Family: Indilinx Barefoot based SSDs
Device Model: STT_FTM64GX25H
Serial Number: redacted (not sure why I redacted the serial, as it's 16 years old)
Firmware Version: 1819
User Capacity: 64,023,257,088 bytes [64.0 GB]
Sector Size: 512 bytes logical/physical
Rotation Rate: Solid State Device
Device is: In smartctl database [for details use: -P show]
ATA Version is: ATA8-ACS (minor revision not indicated)
Local Time is: Thu Nov 14 07:52:09 2024 AEDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART Status not supported: Incomplete response, ATA output registers missing
SMART overall-health self-assessment test result: PASSED
Warning: This result is based on an Attribute check.
General SMART Values:
Offline data collection status: (0x00) Offline data collection activity
was never started.
Auto Offline Data Collection: Disabled.
Self-test execution status: ( 16) The self-test routine was aborted by
the host.
Total time to complete Offline
data collection: ( 0) seconds.
Offline data collection
capabilities: (0x1d) SMART execute Offline immediate.
No Auto Offline data collection support.
Abort Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
No Conveyance Self-test supported.
No Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x00) Error logging NOT supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 0) minutes.
Extended self-test routine
recommended polling time: ( 0) minutes.
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x0000 --- --- --- Old_age Offline - 7
9 Power_On_Hours 0x0000 --- --- --- Old_age Offline - 73903
12 Power_Cycle_Count 0x0000 --- --- --- Old_age Offline - 0
184 Initial_Bad_Block_Count 0x0000 --- --- --- Old_age Offline - 22
195 Program_Failure_Blk_Ct 0x0000 --- --- --- Old_age Offline - 17
196 Erase_Failure_Blk_Ct 0x0000 --- --- --- Old_age Offline - 4
197 Read_Failure_Blk_Ct 0x0000 --- --- --- Old_age Offline - 0
198 Read_Sectors_Tot_Ct 0x0000 --- --- --- Old_age Offline - 21523009830
199 Write_Sectors_Tot_Ct 0x0000 --- --- --- Old_age Offline - 34484549447
200 Read_Commands_Tot_Ct 0x0000 --- --- --- Old_age Offline - 393521368
201 Write_Commands_Tot_Ct 0x0000 --- --- --- Old_age Offline - 906905574
202 Error_Bits_Flash_Tot_Ct 0x0000 --- --- --- Old_age Offline - 4455533
203 Corr_Read_Errors_Tot_Ct 0x0000 --- --- --- Old_age Offline - 3487764
204 Bad_Block_Full_Flag 0x0000 --- --- --- Old_age Offline - 0
205 Max_PE_Count_Spec 0x0000 --- --- --- Old_age Offline - 10000
206 Min_Erase_Count 0x0000 --- --- --- Old_age Offline - 1
207 Max_Erase_Count 0x0000 --- --- --- Old_age Offline - 200783
208 Average_Erase_Count 0x0000 --- --- --- Old_age Offline - 30706
209 Remaining_Lifetime_Perc 0x0000 --- --- --- Old_age Offline - 81
211 SATA_Error_Ct_CRC 0x0000 --- --- --- Old_age Offline - 0
212 SATA_Error_Ct_Handshake 0x0000 --- --- --- Old_age Offline - 0
213 Indilinx_Internal 0x0000 --- --- --- Old_age Offline - 0
Warning! SMART ATA Error Log Structure error: invalid SMART checksum.
SMART Error Log Version: 1
No Errors Logged
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Short offline Aborted by host 00% 255 -
# 2 Extended offline Aborted by host 00% 255 -
Selective Self-tests/Logging not supported
When it gets stuck at "00%", do you know what the value was for this line:
Self-test execution status: ( 16) The self-test routine was aborted by
the host.
That is what the script uses to know whether a test is in progress, so I am wondering if it never changes state. It is possible the test is at 100% but stuck; I have seen server drives get stuck for days or even weeks at the 90% or 100% mark. You also mentioned that this drive already has corruption.
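Since drives can sit at the same percentage for days, one option would be to flag a test as stuck when its "% of test remaining" value has not changed for longer than some threshold. This is a sketch under stated assumptions: the state-file format, the threshold, and the function name check_stalled are all hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical stall check. The state file holds
# "<percent> <epoch seconds when that percent was first seen>".
# Returns 0 if stalled, 1 if still progressing or seen for the first time.
check_stalled() {
    local state_file=$1 percent=$2 now=$3 max_age=$4
    local last_percent="" last_change=""
    if [ -r "$state_file" ]; then
        read -r last_percent last_change < "$state_file" || true
    fi
    if [ "$percent" != "$last_percent" ] || [ -z "$last_change" ]; then
        # Progress moved, or no history yet: restart the clock.
        echo "$percent $now" > "$state_file"
        return 1
    fi
    # Same percentage as before: stalled once it has exceeded max_age.
    [ $(( now - last_change )) -gt "$max_age" ]
}
```

A call might look like check_stalled "$temp_dir/stall_sata1.txt" "$pct" "$(date +%s)" 86400 && echo "self-test appears stuck". Passing the current time in as an argument keeps the function easy to test without waiting.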
That line was
Self-test execution status: ( 240) Self-test routine in progress...
00% of test remaining.
This SSD was in a media player and contained some text config files that started refusing to load and showed weird characters when viewed from Windows, so I suspect it may have bad sectors (worn cells).
I think this issue can be closed then?
I ticked the stop test box, clicked submit in the web UI, then refreshed the UI, which cleared the "test running" state.
I used the web UI to manually start a short SMART test on each drive; these finish within 2 minutes.
20 minutes later, the web UI still thinks they are running:
The history logs say the tests were started:
I received an email for each drive saying the test had started, like:
I did not receive any emails saying the tests had stopped. Storage Manager shows that the tests have finished.
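The missing "finished" emails suggest checking how the running-to-idle transition is detected. A common pattern is to keep the previous state per disk and only report on a change. This is a sketch, not the script's actual logic; the function name test_transition and the 0/1 state file are hypothetical, with running_now assumed to be 1 when the status line contains "Self-test routine in progress" and 0 otherwise.

```shell
#!/usr/bin/env bash
# Hypothetical edge detector: prints "started" or "finished" only when a
# disk's self-test flips between running (1) and not running (0).
test_transition() {
    local state_file=$1 running_now=$2 running_before=0
    [ -r "$state_file" ] && running_before=$(cat "$state_file")
    echo "$running_now" > "$state_file"
    if [ "$running_before" = 0 ] && [ "$running_now" = 1 ]; then
        echo "started"
    elif [ "$running_before" = 1 ] && [ "$running_now" = 0 ]; then
        echo "finished"
    fi
}
```

One caveat with this design: a short test that starts and finishes entirely between two runs of the script produces no transition at all, so the "finished" edge is only seen if the state was recorded as running on an earlier pass.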