martijnvanbrummelen / nwipe

nwipe secure disk eraser
GNU General Public License v2.0

no temperature for SAS drives #497

Closed ggruber closed 10 months ago

ggruber commented 11 months ago

Originally posted by @PartialVolume in https://github.com/martijnvanbrummelen/nwipe/issues/426#issuecomment-1748969487

Yes, via the kernel (hwmon) would be the preferred method, which nwipe already supports, but it needs somebody with the SAS hardware who is proficient in C and prepared to learn about low-level access to SAS and the standards used. I'm hoping somebody that's looking for a project could add the code to hwmon to support SAS. The maintainer is happy to accept commits but doesn't have the hardware or maybe time to do this himself. If anybody wants to get involved, the source for hwmon can be found here: https://github.com/torvalds/linux/blob/master/drivers/hwmon/hwmon.c

Until such time that hwmon supports SAS we may have to use smartctl or alternatively write a low level function ourselves.

ggruber commented 11 months ago

I can put it in a thread and test it, probably would take me an hour if that. I can let you know when it's done (possibly tomorrow) and you could test it if that's ok. I'll commit your existing PR first.

would be great, will happily test that. and again: if this can help you in any way: I'd find a way to let you test directly on my test hardware

ggruber commented 11 months ago

should I add the screen-width dependent -SSD display into the open PR?

PartialVolume commented 11 months ago

Depends when you can do it, I might be committing your PR later today, not sure yet. Entirely up to you.

ggruber commented 11 months ago

while looking for the place to limit the shown length of the bustype (SATA vs. SATA-SSD): does the window system in use allow for horizontal scrolling? or would that have to be implemented "manually" like the vertical scrolling of the disk list? I mean for a true 80 column display in the layout as it was before I started interfering with the code: I think more than 26 disks could already damage the layout by having disk names like /dev/sdaa, so maybe it would be worth thinking about horizontal scrolling?

I'd suggest releasing this next version to the public with the minor flaws we have in it now. From my POV only a real minority would have an 80 column display, and on those the serial number would be clipped? or wrapped?

Just want to send this before I swap 6 of the HPs against Intel DC S3520 SSDs.

Will be online at least another hour I think waiting/hoping for feedback.

I'm afraid the bottom line doesn't fit into 80 columns, also: 2023-10-16_3

ggruber commented 11 months ago

let's check ssd smart erasure

2023-10-16_1 2023-10-16_2

PartialVolume commented 11 months ago

Will be online at least another hour I think waiting/hoping for feedback.

Sorry, just picked this up. Yes, I can commit this with these minor issues and come back to that.

What's interesting to me are those red temperatures, I'll see if I can see what's causing that. Weird.

Regarding access to your SAS server, yes that would be useful, I wouldn't need desktop access, just ssh access. If you could setup sshd and I'll provide you with my public key. No rush. Is the system on 24/7 or would I need to notify you if I needed access?

ggruber commented 11 months ago

I can commit this with these minor issues and come back to that.

fine

no problem giving you ssh access. Will you possibly come from a fixed IP? sys is currently 24/7 up. pls leave the OS disks sd[ab] as they are ;-) (meaning: do not wipe them)

PartialVolume commented 11 months ago

@ggruber There is also an alignment issue with the device type field, I noticed that the NVME was left justified, however I see the issue, the original field was right justified in four characters, and now it needs to be justified with 8 characters.

no problem giving you ssh access. Will you possibly come from a fixed IP?

Yes, most of the time, certainly over the next week. However, next week I might be coming from a dynamic IP, but it has a DDNS, so if you could allow a DDNS domain.

PartialVolume commented 11 months ago

pls leave the OS disks sd[ab] as they are ;-) (meaning: do not wipe them)

Yes, of course. Which reminds me that the suggestion that was made about checking for mounted discs and issuing a warning needs to be implemented in nwipe.

I run disk wipes on my working laptop so I'm very careful about checking and double checking I'm wiping the correct disc ! :-)
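
A minimal sketch of the mounted-disc check mentioned above, assuming the check reads the kernel's mount table (`/proc/mounts` on Linux). The function name `mounted_partitions` is hypothetical and not part of nwipe. It only matches the device and its plain partitions, so it would not catch LVM, md RAID, swap, or ZFS members.

```python
import re
from typing import List, Tuple


def mounted_partitions(device: str, mounts_text: str) -> List[Tuple[str, str]]:
    """Return (source, mount point) pairs for `device` and its partitions.

    `mounts_text` is text in /proc/mounts format: one mount per line,
    whitespace separated, source first and mount point second.
    """
    # Partitions of /dev/sda are /dev/sda1, /dev/sda2, ...; a device whose
    # name ends in a digit (e.g. /dev/nvme0n1) uses a "p" separator instead.
    suffix = r"p\d+" if device[-1].isdigit() else r"\d+"
    pattern = re.compile(re.escape(device) + "(?:" + suffix + ")?")

    found = []
    for line in mounts_text.splitlines():
        fields = line.split()
        # fullmatch keeps /dev/sda from matching /dev/sdaa1.
        if len(fields) >= 2 and pattern.fullmatch(fields[0]):
            found.append((fields[0], fields[1]))
    return found
```

On a live system the text could come from `open("/proc/mounts").read()`; a non-empty result would be the trigger for a warning before wiping.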

ggruber commented 11 months ago

fixed IP?

just a question for the firewall, I'd like to open it as much as needed. But no prob. Find me using my home page from my profile to send PubKey. I'll send you other required infos then.

checking for mounted discs and issuing a warning

should be improved for checking for active ZFS volumes ;-)

PartialVolume commented 11 months ago

I can see from those snapshots why you would like to sort the device order, would be much nicer and easier to find a drive if they were in alphabetical order. If you fancy doing that, please feel free.

ggruber commented 11 months ago

I'll have a look into it. Tomorrow I'll have not much time for this. So let's try to get your access open right now

PartialVolume commented 11 months ago

just a question for the firewall, I'd like to open it as much as needed. But no prob. Find me using my home page from my profile to send PubKey. I'll send you other required infos then.

ok, just getting my public key and checking home page ..

PartialVolume commented 11 months ago

Done, emailed. :-)

PartialVolume commented 11 months ago

@ggruber Thanks for access to the system. Much appreciated.

Looking at the large number of discs on your system and the flashing inverted HPA/DCO message, I realised that I really don't like the look of those HPA/DCO flashing messages, and I wrote it like that! It makes the display far too busy. I'll need to do something about that to make it more gentle on the eyes.

I need to just put a minimal message on the end like HPA/DCO sector/s found without all the flashing messages, could still be in red text on white to highlight it. This would make the serial numbers much easier to read.

I also noticed that one of your drives, I think it was /dev/sdh, reported one hidden sector, which looking at the log may be incorrect, but I need to check the code.

ggruber commented 11 months ago

I need to just put a minimal message on the end like HPA/DCO sector/s found without all the flashing messages, could still be in red text on white to highlight it. This would make the serial numbers much easier to read.

I'd suggest to keep the serial numbers at the end, with an arrow marker, if it's clipped. At the end we have the dilemma to have to decide which info is important enough for an 80/132 column display.

As mentioned before: when I have to decide which disk should be (time consuming) wiped I find the information of the number of reallocated sectors very important. A disk with reallocated sectors should be destroyed physically, imho.

PartialVolume commented 11 months ago

I'd suggest to keep the serial numbers at the end, with an arrow marker, if it's clipped. At the end we have the dilemma to have to decide which info is important enough for an 80/132 column display.

Agreed, done, just committing the change.
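
A small sketch of the clipping idea, assuming a single-character `>` as the arrow marker and a fixed field width. The helper name `clip_field` and the choice of marker are illustrative assumptions, not nwipe's actual code.

```python
def clip_field(text: str, width: int, marker: str = ">") -> str:
    """Fit `text` into `width` columns, ending with `marker` if it was cut."""
    if width <= 0:
        return ""
    if len(text) <= width:
        return text  # fits, nothing to clip
    if width <= len(marker):
        return marker[:width]  # no room for anything but the marker
    # Reserve space for the marker so the result is exactly `width` wide.
    return text[: width - len(marker)] + marker
```

For example, `clip_field("S/N=ABCDEFGH", 8)` keeps the first seven characters and appends the marker, so the reader can tell the serial number continues.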

As mentioned before: when I have to decide which disk should be (time consuming) wiped I find the information of the number of reallocated sectors very important. A disk with reallocated sectors should be destroyed physically, imho.

I agree with that.

That's why I like this as an optional 80 column wide alternative drive selection interface, user selectable, so if they prefer the line by line drive selection they can have that. The right hand pane would contain drive data including user selectable smart data so reallocated sectors could be seen before starting any wipes.

image

ggruber commented 11 months ago

ack, but those reallocated sectors are exactly what is missing in your screenshot. Did you find the time to have a look at eraseHDD ? This is the program we made when preparing > 300 disks for resale: get the important information on one display. That included the age of the disk/power on hours, reallocated sectors, other errors. The statistics stuff from your image above goes into a detail view, I completely agree. I'd suggest focusing on nwipe's "reason to be": wipe disks so they can be given away without fear of data appearing later. Supplemental information about the disk's health is good to know before starting the wiping. So my colleague and I created eraseHDD, which then fired up nwipe after passing all preflight checks.

Such a "comprehensive disk evaluation page" is a challenge of its own.

I suggest to focus for the next release on progress in the core functionality, which we have. (possibly with the exception of the temp scanning thread)

All the other issues (some of them are from my pov to be opened) can go into the next release.

What do you think?

PartialVolume commented 11 months ago

Did you find the time to have a look at eraseHDD ?

I tried, but got the following error, it couldn't seem to find smartctl.

.Can't exec "smartctl": No such file or directory at /usr/local/lib/site_perl/readSmartData.pm line 57.

I suggest to focus for the next release on progress in the core functionality, which we have. (possibly with the exception of the temp scanning thread)

Yes, temperature thread will go into v0.35.

The red temperature text is certainly a puzzle at the moment.

ggruber commented 11 months ago

I tried, but got the following error, it couldn't seem to find smartctl.

did you try to run it as root?

just arrived @home, will come back to keyboard later.

PartialVolume commented 11 months ago

did you try to run it as root?

Doh, no I didn't ! Evidence I need to get some sleep. 😆

PartialVolume commented 11 months ago

@ggruber I think I may have found the cause of those red temperatures. Looking at the limit data below, the max value matches the highest value, which is incorrect. Max should be the highest continuous operating temperature, critical being the temperature limit at which damage could occur. If either max or critical don't have data they can be left unassigned. I don't know if max is being assigned within the code or whether that's the actual data being produced by the drive.

Just wondered whether you are familiar enough with that part of the SCSI code to see what the problem is?

[2023/10/19 22:23:44]    info: Temperature limits for /dev/sde, critical=65c, max=30c, highest=30c, lowest=30c, min=30c, low critical=-40c.
[2023/10/19 22:23:44]  notice: no hwmon data for /dev/sdc, try to get SCSI data
[2023/10/19 22:23:44]    info: got SCSI temperature data for /dev/sdc
[2023/10/19 22:23:44]  notice: get temperature for /dev/sdc took 3.085000 ms
[2023/10/19 22:23:44]    info: Temperature limits for /dev/sdc, critical=60c, max=31c, highest=31c, lowest=31c, min=31c, low critical=-40c.
[2023/10/19 22:23:44]  notice: no hwmon data for /dev/sdd, try to get SCSI data
[2023/10/19 22:23:44]    info: got SCSI temperature data for /dev/sdd
[2023/10/19 22:23:44]  notice: get temperature for /dev/sdd took 2.949000 ms
[2023/10/19 22:23:44]    info: Temperature limits for /dev/sdd, critical=60c, max=32c, highest=32c, lowest=32c, min=32c, low critical=-40c.
ggruber commented 11 months ago

for scsi drives: critical I usually get from the drive. highest and max I fill dynamically, as if there had been no values before. Should be the same on a new drive. What are the conditions to flag a temp red? is max a drive-vendor provided "warning" level? Highest is imho the highest temp the drive recorded.

PartialVolume commented 10 months ago

Skip to the conclusion at the bottom for the nub of the issue...

Here's a table of all possible colours related to temperature.

| Color/status of temperature readout | Meaning |
|---|---|
| White | Within operating temperature |
| Red | Maximum continuous operating temperature reached |
| Red flashing | Critical temperature reached, damage may occur |
| Black | Minimum continuous operating temperature reached |
| Black flashing | Minimum critical temperature reached, damage may occur |

There are in total 7 variables that may be provided by a drive. Some drives produce all 7, some produce just one, others just three.

Here's the definition. If the drive doesn't produce all of these, the missing values can be left uninitialised in the context. There is no need for the nwipe software to fill in the missing data.

| Limit | Meaning | Display |
|---|---|---|
| Critical | Damage may occur if exceeded | Red flashing if exceeded |
| Max | Max continuous operating temperature | Red text on blue background |
| Min | Min continuous operating temperature | Black text on blue background |
| Low Critical | Damage may occur if operated below this temperature | |

In addition the drive may also produce historical high and low values; these don't need to be filled in by nwipe. I believe they are the historical high and low during the lifetime of the drive, but then maybe some manufacturers interpret this differently.

So taking the case of a drive that only produces 3 values, the current drive temperature and max and min values. They would be assigned as is leaving Critical and Low Critical unassigned.

If another different make/model produces the current drive temperature and Critical & Low critical then they can be assigned and Max and Min left unassigned.

The GUI code takes care of situations where either critical is missing but max is present, or max is missing but critical is present. So it's not necessary to artificially fill in the missing data.

What are the conditions to flag a temp red? is max a drive-vendor provided "warning" level? Highest is imho the highest temp the drive recorded.

If the drive temperature is between max and critical the text turns red. However, if one or the other is missing, whichever is present is treated as the max temperature, and if the drive temperature is above that the text turns red.

In the case above, highest (historical) and lowest (historical) are not actually used by nwipe.
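
The rules above can be sketched roughly as follows. This is an illustration of the described behaviour, not nwipe's GUI code: the function name `temperature_status` and its parameters are hypothetical, `>=` is assumed for "reached", and the handling of the lower limits simply mirrors the upper ones, which is an assumption since only the upper case is spelled out here.

```python
from typing import Optional


def temperature_status(
    temp: int,
    max_t: Optional[int] = None,
    critical: Optional[int] = None,
    min_t: Optional[int] = None,
    low_critical: Optional[int] = None,
) -> str:
    """Map a drive temperature to a display status; None means 'not provided'."""
    # Upper limits.
    if max_t is not None and critical is not None:
        if temp >= critical:
            return "red flashing"
        if temp >= max_t:
            return "red"
    else:
        # Only one upper limit (or none): whichever is present acts as max.
        upper = max_t if max_t is not None else critical
        if upper is not None and temp >= upper:
            return "red"

    # Lower limits, mirrored (assumption, see the note above the code).
    if min_t is not None and low_critical is not None:
        if temp <= low_critical:
            return "black flashing"
        if temp <= min_t:
            return "black"
    else:
        lower = min_t if min_t is not None else low_critical
        if lower is not None and temp <= lower:
            return "black"

    return "white"
```

With the values from the log, `temperature_status(30, max_t=30, critical=65)` gives `"red"`, while leaving max unassigned, `temperature_status(30, critical=65)`, gives `"white"`.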

Conclusion

Taking the example below, this would show red text. I started and stopped nwipe immediately, so all values read 30 deg.C. The problem is the value max=30 deg.C: it should either be left uninitialised, because the drive doesn't provide it, or be set to a value slightly less than 65 deg.C. Because the current drive temperature is 30 deg.C, it is >= max and so generates red text.

So as you are filling max dynamically it should just be left uninitialised and the GUI code will take care of it.

/dev/sde, critical=65c, max=30c, highest=30c, lowest=30c, min=30c, low critical=-40c.
ggruber commented 10 months ago

I introduced the wrong min and max values with commit 68a6002 in get_scsi_temp.c

As all scsi drives I've seen so far (including older ones, maybe years ago) only give critical as temp limit, what should we do? I'd prefer setting max to critical -5 as a reasonable value. lcrit I set to -40 as I saw elsewhere. For lmin I really have no idea what a reasonable value could be.

IRL the min values should occur only in rare cases. And a guessed max = crit -5 seems much better to me than none.

(I think if drive temps get high the cooling is improper. I had this experience years ago when I let a 15k SAS drive run like a SATA drive just on the table: it tripped (exceeded crit temp).) So having no max is almost pointless, you could not protect the drive.

so max = highest is my fault. Would you agree to set it to crit -5? (Don't know where a useful value should come from.)

PartialVolume commented 10 months ago

so max = highest is my fault. Would you agree to set it to crit -5? (Don't know where a useful value should come from.)

You don't really need to provide artificial values if the drive doesn't provide them but if there is no max provided it won't do any harm setting it to critical -5 if you would prefer that.
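
A sketch of the "critical -5" fallback being discussed, assuming missing limits are represented as `None`. The name `apply_max_fallback` and the `margin` parameter are hypothetical; this is not the code from the fix that was pushed.

```python
from typing import Optional


def apply_max_fallback(
    critical: Optional[int], max_t: Optional[int], margin: int = 5
) -> Optional[int]:
    """Return the max limit to use, deriving it from critical only if missing."""
    if max_t is None and critical is not None:
        # Drive gave no max: guess one slightly below the critical limit.
        return critical - margin
    # Either the drive supplied max, or there is nothing to derive it from.
    return max_t
```

For the drive in the log (critical=65, no usable max) this yields 60, so a reading of 30 would no longer be flagged.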

ggruber commented 10 months ago

pushed a fix to my repo

PartialVolume commented 10 months ago

Looking good. You can do a PR on that.

The SSD reporting a hidden sector caught my eye. Looking at the log data, nwipe's code, hdparm and libata all concur that there is one hidden block on that drive. Unless you changed the HPA/DCO on purpose for testing, you may want to reset the HPA/DCO on that drive using hdparm before wiping, as that hidden sector won't get wiped otherwise. If you reset the HPA/DCO you can always look at the last block of the drive, which will now be exposed, to see what's in there.

[2023/10/20 09:22:47]  notice: Found /dev/sdh,  ATA-SSD, SanDisk SSD PLUS,  240 GB, S/N=XXXXXXXXXXXXXXX
[2023/10/20 09:22:47]    info: HPA:  max sectors   = 468862128/468862129, accessible max address enabled
 on /dev/sdh
[2023/10/20 09:22:47]    info: HPA values 468862128 / 468862129 on /dev/sdh
[2023/10/20 09:22:47]    info: hdparm:DCO Real max sectors reported as 468862128 on /dev/sdh
[2023/10/20 09:22:47]    info: NWipe: DCO Real max sectors reported as 468862128 on /dev/sdh
[2023/10/20 09:22:47]    info: libata: apparent max sectors reported as 468862128 on /dev/sdh
[2023/10/20 09:22:47] warning:  *********************************
[2023/10/20 09:22:47] warning:  *** HIDDEN SECTORS DETECTED ! *** on /dev/sdh
[2023/10/20 09:22:47] warning:  *********************************
[2023/10/20 09:22:47]    info: func:nwipe_read_dco_real_max_sectors(), DCO real max sectors = 468862128
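
To make the arithmetic in the log explicit: the HPA line reports visible/native sector counts, and the difference is the number of hidden sectors (468862129 - 468862128 = 1). The sketch below parses only that one `max sectors = A/B` line; the function name `hidden_sectors` is hypothetical, and nwipe's real detection also consults the DCO and libata values shown above.

```python
import re
from typing import Optional

# Matches the "max sectors = <visible>/<native>" part of `hdparm -N` output.
HPA_RE = re.compile(r"max sectors\s*=\s*(\d+)/(\d+)")


def hidden_sectors(hdparm_n_output: str) -> Optional[int]:
    """Return native minus visible sectors, or None if the line is absent."""
    match = HPA_RE.search(hdparm_n_output)
    if match is None:
        return None
    visible, native = int(match.group(1)), int(match.group(2))
    return native - visible
```

A result greater than zero corresponds to the "HIDDEN SECTORS DETECTED" warning case.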

Screenshot_20231020_081321

ggruber commented 10 months ago

Looking good. You can do a PR on that.

done

ggruber commented 10 months ago

regarding the SanDisk: this is a cheap consumer SSD, don't know about the quality of the firmware. It's cheap. period.

If you don't mind: pls, perform those steps as you want.

PartialVolume commented 10 months ago

Ok, will do, I'll report the hdparm commands I use to reset the HPA/DCO here, just for reference.

ggruber commented 10 months ago

and (just in case): I have a second drive, a twin of that, could/would insert it on request

PartialVolume commented 10 months ago

Unfortunately the drive is responding with an I/O error. This may be because the drive needs to be power cycled.

and (just in case): I have a second drive, a twin of that, could/would insert it on request

Yes, if you could, no rush though as I need to go out now, so won't be working on it this morning.

nwiper@thelia:~/nwipe/min_max_fix/nwipe/src$ sudo hdparm --dco-restore /dev/sdh

/dev/sdh:
Use of --dco-restore is VERY DANGEROUS.
You are trying to deliberately reset your drive configuration back to the factory defaults.
This may change the apparent capacity and feature set of the drive, making all data on it inaccessible.
You could lose *everything*.
Please supply the --yes-i-know-what-i-am-doing flag if you really want this.
Program aborted.
nwiper@thelia:~/nwipe/min_max_fix/nwipe/src$ sudo hdparm --yes-i-know-what-i-am-doing --dco-restore /dev/sdh

/dev/sdh:
 issuing DCO restore command
 HDIO_DRIVE_CMD(dco_restore) failed: Input/output error
nwiper@thelia:~/nwipe/min_max_fix/nwipe/src$ sudo hdparm --dco-restore /dev/sdh
ggruber commented 10 months ago

sdh is powercycled, sdr added

looking for firmware updates I found only a windows tool so far

PartialVolume commented 10 months ago

There is always the possibility that the drive interface on that hardware doesn't support the dco-restore command.

Can you run sudo hdparm --dco-restore /dev/xxx on an ordinary spinning disc on the same hardware?

If the SSD is moved to a standard desktop does sudo hdparm --dco-restore /dev/xxx work on different non server hardware?

Or it could be, like you say, a cheap drive with dodgy firmware.

PartialVolume commented 10 months ago

I have come across some of these drives where you have to set the HPA first, then do a dco-restore. I'll give that a try.

PartialVolume commented 10 months ago

I ran the following commands but need a power cycle on those two SSDs before the new MAX address takes. Can you pull those two drives out and in. Thanks.

> sudo hdparm -N /dev/sdh
/dev/sdh:
 max sectors   = 468862128/468862129, ACCESSIBLE MAX ADDRESS enabled
Power cycle your device after every ACCESSIBLE MAX ADDRESS
> sudo hdparm -N p468862129 /dev/sdr

/dev/sdr:
 setting max visible sectors to 468862129 (permanent)
 max sectors   = 468862128/468862129, ACCESSIBLE MAX ADDRESS enabled
Power cycle your device after every ACCESSIBLE MAX ADDRESS
> sudo hdparm -N p468862129 /dev/sdh

/dev/sdh:
 setting max visible sectors to 468862129 (permanent)
 max sectors   = 468862128/468862129, ACCESSIBLE MAX ADDRESS enabled
Power cycle your device after every ACCESSIBLE MAX ADDRESS
PartialVolume commented 10 months ago

Yes, I remember now, on some drives you have to set the two values reported by sudo hdparm -N /dev/sdh to equal one another using the sudo hdparm -N p468862129 /dev/sdh command. Then power cycle the drive; it then accepts a dco-restore without the I/O error.

And in some other makes/models the dco-restore may not be necessary. Other drives don't care about having to set the HPA, you can just do a dco-restore.

It will certainly be a challenge to automate this in nwipe to behave consistently for any make/model of drive. Quite possible, but a challenge.
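
As a rough illustration of the sequence described for drives that need the HPA set first, the sketch below only builds the list of steps and does not execute anything. The helper name `hpa_dco_reset_plan` is hypothetical; the hdparm options are the ones used in the session above. These commands are destructive, and as noted, drives differ in which steps they need or accept.

```python
from typing import List, Optional, Tuple

# A step is a description plus the command to run (None = manual action).
Step = Tuple[str, Optional[List[str]]]


def hpa_dco_reset_plan(device: str, native_max_sectors: int) -> List[Step]:
    """Describe the HPA-then-DCO reset sequence without running it."""
    return [
        (
            "set visible max sectors to the native max (permanent)",
            ["hdparm", "-N", "p{}".format(native_max_sectors), device],
        ),
        ("power cycle the drive so the new max address takes effect", None),
        (
            "restore the DCO to factory defaults",
            ["hdparm", "--yes-i-know-what-i-am-doing", "--dco-restore", device],
        ),
    ]
```

Returning a plan rather than running it keeps the manual power cycle visible as its own step, which is the part that is hard to automate.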

ggruber commented 10 months ago

Can you pull those two drives out and in. Thanks.

done.

ggruber commented 10 months ago

It will certainly be a challenge to automate this in nwipe to behave consistently for any make/model of drive. Quite possible, but a challenge.

could stop us for a longer time. I suggest implementing one working solution for certain drives, maybe even not that in the upcoming release. And enhance it on demand.

PartialVolume commented 10 months ago

Resetting the HPA didn't work, it's back to 468862128 blocks rather than 468862129.

Could just be a 'feature' of this drive. If it behaves the same on other hardware, I guess the drive's implementation of the HPA/DCO isn't great. Perhaps it hides a block for its own purposes, whatever that might be.

Anyway the code in nwipe is detecting the hidden block correctly so I'll leave it at that.

ggruber commented 10 months ago

what about SSD smart erase? is it on the roadmap for the next release?

PartialVolume commented 10 months ago

I suggest implementing one working solution for certain drives, maybe even not that in the upcoming release. And enhance it on demand.

I think we'll freeze any new features so I can get this release out. I'm happy with what we've done, so I think we are ready for release v0.35 to be published.

ggruber commented 10 months ago

Should I have a (quick) look for a way to sort the disk list?

And what happened to the temp reading thread?

PartialVolume commented 10 months ago

what about SSD smart erase? is it on the roadmap for the next release?

Yes, that is at the top of the list for the v0.36 release

And what happened to the temp reading thread?

I nearly forgot about that ! Thanks for reminding me. Yes that has to be in this release (0.35)

PartialVolume commented 10 months ago

I'll work on the temp thread if you want to take a look at the sorting

ggruber commented 10 months ago

sorted device list is in my repo now, would be glad if someone would test it

@PartialVolume btw, for testing purposes I added another disk, so we have sdaa. please do not wipe it but enjoy its presence in the list ;-)

PartialVolume commented 10 months ago

Tested, I built it on the server and compared the original listing with the sorted one. All looks good. I randomly picked a drive, /dev/sdq, and checked its contents (0x74), then started a wipe on just that drive and aborted. Rechecked that the contents of /dev/sdq were now 0x00, which they were.

Looks good, as far as I'm concerned you can do a PR on that.

Nice bit of code. Looks like you are sorting the pointers to the contexts based on drive name?

I don't think I'm going to be quite as quick as you with the temperature thread. If I don't get it completed this evening, it may be a couple of days as I'm very busy over the weekend.

ggruber commented 10 months ago

tnx for testing/confirmation, PR started

Nice bit of code. Looks like you are sorting the pointers to the contexts based on drive name?

That's what I did
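
A sketch of the idea of reordering references to the contexts by drive name. The thread does not say how names of different length are compared, so ordering by length first (so that /dev/sdaa lands after /dev/sdz) is an assumption, as are the `device_sort_key` helper and the dictionary layout used here in place of nwipe's context struct.

```python
from typing import Tuple


def device_sort_key(name: str) -> Tuple[int, str]:
    """Shorter names first, then alphabetical: sda < sdz < sdaa."""
    return (len(name), name)


# Stand-ins for the per-drive contexts; only the name matters for sorting.
contexts = [
    {"device_name": "/dev/sdaa"},
    {"device_name": "/dev/sdb"},
    {"device_name": "/dev/sdz"},
    {"device_name": "/dev/sda"},
]

# sorted() reorders references to the same objects; nothing is copied.
ordered = sorted(contexts, key=lambda ctx: device_sort_key(ctx["device_name"]))
```

A plain string comparison would instead place /dev/sdaa between /dev/sda and /dev/sdb, which is why the length is compared first in this sketch.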

PartialVolume commented 10 months ago

Temperature thread and sort committed. Looks good, looks like those lags in the GUI have gone. If you could check it out. Thanks.

https://github.com/martijnvanbrummelen/nwipe/assets/22084881/f3995b96-3b39-43ea-a6b4-aeb0c7caf29d