martijnvanbrummelen / nwipe

nwipe secure disk eraser
GNU General Public License v2.0

do not abort on write error, and, thousands separation #550

Open mdcato opened 7 months ago

mdcato commented 7 months ago

@PartialVolume,

I've embarked on some changes but wanted to get your thoughts before forking.

  1. I'm changing the logic in pass.c/nwipe_static_pass() (others later) to NOT abort on a write() or fdatasync() error. My thinking is that if the rest of the drive can be wiped after getting past the error-prone area(s), the drive ends up more secure (even though it's likely to be physically destroyed). Not much time is lost since, most likely, other drives will continue being wiped in the same run. This could be controlled by a command-line option if not everyone would want this behavior change.
  2. I've also changed some of the calls to nwipe_perror() and nwipe_log() to show the current offset where the write() or fdatasync() failed, so the operator can know how large an area is affected and how many there are, and have changed the printf formats to "%'llu" so that the long offset numbers get thousands separators. I've also changed those log calls to "Warning" instead of "Fatal" since wiping now continues on the rest of the drive. The ' flag also requires including <locale.h> and calling setlocale( LC_ALL, "" ) in nwipe.c/main(); the thousands separator is therefore locale-specific (a minimal sketch follows this list).
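For reference, here's a minimal standalone sketch of the locale change described in point 2 (the offset value is invented for the example). The ' flag is specified by POSIX, and it only produces separators once setlocale() has selected a locale that defines them:

```c
/* Minimal sketch: thousands separators via the printf ' flag.
 * Without setlocale() the default C locale is active and %'llu
 * prints no separators at all. */
#include <locale.h>
#include <stdio.h>

int main( void )
{
    setlocale( LC_ALL, "" );  /* pick up the user's locale, e.g. en_US.UTF-8 */

    unsigned long long offset = 4000787030016ULL;  /* invented example offset */
    printf( "Warning: write error at byte offset %'llu\n", offset );
    return 0;
}
```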

I don't think the above ideas are too controversial; however, I haven't looked at the build environment for ShredOS in regard to #2 to see whether you're using a different libc (for smaller size?) that may be too old (i.e. predating POSIX.1-2008 or the Single UNIX Specification) to support the ' flag in the format string. I'm assuming you're using recent/latest tools, but it's worth asking.

At this stage, I'm only changing the log calls to use the "%'llu" format specifier, since the nwipe GUI space is more restricted and you already convert large numbers to, for example, "10 T". The log needs a more precise number if you're trying to determine how large an area the errors occur in.

I have not explored whether these changes, especially no longer raising a fatal error, would affect the PDF report.

Let me know if you have suggestions/reservations/exclamations/cautions/etc on this experiment.

[Humor] The hard part of this experiment is getting known-bad drives to fail "reliably"; I've been sandwiching them with other drives to build up heat; no air flow over them.

Firminator commented 7 months ago

1) sounds like a good improvement that PartialVolume has already been thinking about somewhere here in the past, although it was more along the lines of: if bad sectors are encountered, stop the current wipe direction and start wiping from the end of the drive... or skip, say, 100 blocks and continue the wipe. I'm all for it, and if you can offer coding help and collaborate with him, that would be even more awesome.

This is a real-world problem, since we usually wipe (soft-)failing drives (as in: the SMART value for remapped sectors met a threshold and triggered a SMART warning, which in turn triggers an alert in whatever storage-system OS is in use) before returning the drives for warranty replacement.

mdcato commented 7 months ago

I remember the idea of restarting from the end, but just continuing on forward is simpler logic and I believe gives the same result. It also handles the case of multiple bad areas.


Firminator commented 7 months ago

I found the comment @ https://github.com/martijnvanbrummelen/nwipe/issues/497#issuecomment-1789703828

  • Reverse wipe: instead of wiping from the start to the end of the disc, it wipes from the end to the start. Useful when you have a drive with bad sectors near the start (as is often the case) and you want to make sure as much of the workable part of the drive as possible is wiped.

so yeah, this was already on the radar.

Also found the old thread that had a different idea/approach with non-linear wiping @ https://github.com/martijnvanbrummelen/nwipe/issues/10

PartialVolume commented 7 months ago

@mdcato By all means please fork. I have a box full of disks that all fail in weird and wonderful ways to test your code.

@Firminator is correct, my preference is that the first I/O error that occurs triggers a reverse wipe. This whole discussion is also closely related to why, for sequentially writing a block device, using the Linux disc cache is OK (but not great) for wiping a disc, yet is not what you want when dealing with discs that have I/O errors.

If you take a system with 16GB of memory and, as an example, start the wipe on one disc, very quickly the memory will fill up with about 12GB of cached writes to the disc. We periodically flush the cache to detect an error; you can't detect the error from the write() itself because it's not direct I/O. However, when we issue the fdatasync to check that the drive is working correctly, the fdatasync won't return until the entire 12GB has been flushed to disc (hence why direct I/O will be faster, by about 5-20%). Now, say the fdatasync detects an I/O error and returns: we don't know the actual block that caused the I/O error, only that it was somewhere in the last 12GB of data written. So we have to go back some unknown number of blocks to try to find where the bad block actually is, and perform single 512-byte block writes, each followed by an fdatasync, to flush the block and detect the error. If that wasn't complicated enough, some drives don't even fail nicely and instead cause fdatasync to never return, which hangs the thread.
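A rough sketch of that "close in on the bad block" step; find_bad_sector() and SECTOR are illustrative names for this example, not nwipe code:

```c
/* Illustrative helper, not nwipe code: after fdatasync() reports an error
 * somewhere in the cached data, step back and rewrite one sector at a time,
 * syncing after each write, to locate the failing sector. */
#include <sys/types.h>
#include <unistd.h>

#define SECTOR 512

/* Returns the offset of the first sector at or after 'start' that fails to
 * write and sync, or -1 if everything up to 'end' completes cleanly. */
static off_t find_bad_sector( int fd, off_t start, off_t end, const char *pattern )
{
    for( off_t off = start; off < end; off += SECTOR )
    {
        if( lseek( fd, off, SEEK_SET ) < 0 )
            return off;
        if( write( fd, pattern, SECTOR ) != SECTOR )
            return off;
        /* With cached I/O the error often only surfaces on the flush. */
        if( fdatasync( fd ) != 0 )
            return off;
    }
    return -1;
}
```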

This is why I've always wanted to move away from using the Linux disc cache in nwipe and instead perform direct I/O with the disc, so that nwipe has total control over disc access. With direct I/O, the disk write itself returns an error if the block write fails. Nwipe would write a block of 200K bytes (I think the ideal block size for speed was discussed in the past), and if it failed we would know exactly where to start the process of 512- or 4096-byte block writes to close in on and locate the bad sector.
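As an illustration only (this is not the private branch's code), a direct I/O write loop might look roughly like this; wipe_direct() and WRITE_SIZE are hypothetical names, and O_DIRECT imposes the alignment constraints noted in the comments:

```c
/* Illustration only, not the private branch: with O_DIRECT the kernel page
 * cache is bypassed, so write() itself fails at (or near) the bad block.
 * O_DIRECT requires buffer, offset and length aligned to the sector size. */
#define _GNU_SOURCE   /* O_DIRECT is Linux-specific */
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define WRITE_SIZE ( 200 * 1024 )   /* large writes for speed; multiple of 4096 */

int wipe_direct( const char *dev )   /* hypothetical function name */
{
    int fd = open( dev, O_WRONLY | O_DIRECT );
    if( fd < 0 )
        return -1;

    void *buf;
    if( posix_memalign( &buf, 4096, WRITE_SIZE ) != 0 )  /* aligned buffer */
    {
        close( fd );
        return -1;
    }
    memset( buf, 0, WRITE_SIZE );   /* stand-in for the real pass pattern */

    /* Keep writing until a write fails, comes up short, or the end of the
     * device is reached; with no cache in the way, a failure is located at
     * (or near) the current offset. */
    while( write( fd, buf, WRITE_SIZE ) == WRITE_SIZE )
        ;

    off_t near_failure = lseek( fd, 0, SEEK_CUR );  /* start narrowing here */
    (void) near_failure;  /* e.g. hand off to a sector-by-sector search */

    free( buf );
    close( fd );
    return 0;
}
```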

I did start a direct I/O branch, which I've kept private as it needs more work, but it would take care of trying to get past bad blocks either by writing single blocks until it got past the bad section or, preferably, by doing a reverse wipe, as this would wipe out the bulk of the drive as fast as possible until it reached the bad sectors again, but from the other end.

> I don't think the above ideas are too controversial; however, I haven't looked at the build environment for ShredOS in regard to #2 to see whether you're using a different libc (for smaller size?) that may be too old (i.e. predating POSIX.1-2008 or the Single UNIX Specification) to support the ' flag in the format string. I'm assuming you're using recent/latest tools, but it's worth asking.

Yes, ShredOS uses all recent libraries & tools.

An I/O error, if you continued wiping as much of the disc as possible, would also affect the verification, which would fail, and the PDF report, which should likewise show a failure. In addition, the PDF shows the actual number of bytes successfully wiped at least once; this value would need to be calculated correctly when either writing through the bad blocks or reverse wiping after an I/O error.

> [Humor] The hard part of this experiment is getting known-bad drives to fail "reliably"; I've been sandwiching them with other drives to build up heat; no air flow over them.

I do have one or two failed drives that fail within minutes of starting the wipe, which is really handy. It would be a pain if all the faulty test drives only failed after x hours, but then you could simulate a failed block in code to exercise the code to a certain degree.

PartialVolume commented 7 months ago

I've added three features to my project list, I'll give them a priority once I get through the priority zeros and ones. https://github.com/users/PartialVolume/projects/1/views/1

gorbiWTF commented 7 months ago

With reverse wiping, what would happen if there are multiple non-consecutive bad sectors? Wouldn't this approach leave potentially good sectors in the middle not wiped?

PartialVolume commented 7 months ago

No, a reverse wipe would continue to write through bad sectors up to the point where the forward wipe aborted. In practice, since trying to write through bad blocks means single 512- or 4096-byte block writes, pushing through potentially bad blocks will drastically slow the transfer speed. So the write method would switch from, say, 100,000-byte writes for maximum transfer speed to single-block writes when it detects I/O errors, then back to large block writes once X number of blocks have transferred correctly. This would be experimental, so it may well change based on what we find happens in practice. A sketch of that adaptive switching follows.
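A hedged sketch of what that adaptive switching could look like; BIG_WRITE, SECTOR, CLEAN_RUN, and wipe_adaptive() are made-up names, and the thresholds are placeholders, not tested values:

```c
/* Made-up sketch of the adaptive write size: large writes for throughput,
 * single-sector writes around bad areas, back to large writes after a clean
 * streak. 'pattern' must point to at least BIG_WRITE bytes of pass pattern. */
#include <sys/types.h>
#include <unistd.h>

#define BIG_WRITE 100000   /* ~100,000-byte writes for maximum transfer speed  */
#define SECTOR    512      /* single-block writes while probing bad areas      */
#define CLEAN_RUN 1024     /* sectors that must succeed before scaling back up */

static void wipe_adaptive( int fd, off_t dev_size, const char *pattern )
{
    size_t chunk = BIG_WRITE;
    off_t  off   = 0;
    int    clean = 0;

    while( off < dev_size )
    {
        size_t len = ( dev_size - off < (off_t) chunk )
                         ? (size_t)( dev_size - off ) : chunk;

        if( lseek( fd, off, SEEK_SET ) < 0
            || write( fd, pattern, len ) != (ssize_t) len
            || fdatasync( fd ) != 0 )
        {
            if( chunk > SECTOR )
                chunk = SECTOR;   /* retry the same region sector by sector */
            else
                off += SECTOR;    /* single sector still failing: log and skip */
            clean = 0;
            continue;
        }

        off += len;
        if( chunk == SECTOR && ++clean >= CLEAN_RUN )
            chunk = BIG_WRITE;    /* clean streak: resume large writes */
    }
}
```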

I would imagine that for those wiping hundreds or thousands of drives for resale, the slowdown in speed would be a waste of time; so, as currently happens, if a drive has I/O errors or reallocated sectors it is pulled and physically destroyed, as it would be unsuitable for resale or reuse.

A forward wipe followed by a reverse wipe on error, writing through the bad blocks, would probably work for somebody who just wants to wipe as much as possible and doesn't want to physically destroy the platters, but will simply place the drive in electrical waste. Time is maybe not an issue for them.