"buffer overflow" crashes on systemrescue

ggruber commented 11 months ago

on systemrescue 9.06 and 10.01 included nwipe both crash with " buffer overflow detected : terminated" when called "nwipe --autonuke --nowait --method=dodshort --logfile=wipe_2023-09-09_14:1846.log /dev/sda /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf /dev/sdg /dev/sdh /dev/sdi /dev/sdj /dev/sdk /dev/sdl /dev/sdm /dev/sdn /dev/sdo /dev/sdp /dev/sdq /dev/sdr /dev/sds" on a dell r720xd same crash with similar call line seen on Supermicro X9 (ivy brigde) system with 44 disks, also systemrescue 10.01

more details or other debug support on request

PartialVolume commented 11 months ago

Thanks for reporting.

Are you able to use the same command but without specifying all the drives, autonuke should enumerate all the drives on the system anyway without specifying them in the command. If it then works without the segfault it will give me an idea where the problem is.

ggruber commented 11 months ago

first: nwipe 0.32 from systemrescue 8.07 worked finally I started 0.34 from systemrescue 10.01: in interactive mode it works. In this special case (no disk in the system that shall not be wiped) I can start with autonuke, noninteractive, as you asked: it starts.

2023-09-10 13_18_20-root@sysrescue

ggruber commented 11 months ago

any idea, why no interface type nor temperature ist schon? same effect with sas disks from other vendors

PartialVolume commented 11 months ago

Temperatures are obtained via the hwmon library which is built into the kernel. It's very likely that hwmon doesn't support this controller.https://github.com/torvalds/linux/blob/master/drivers/hwmon/hwmon.c There are ways around this, for instance by issuing low level commands to the drive however that will only work if the controller allows access to the drive using those commands. I'd need make & model information about the controller to determine whether it's possible. That said, if you can run some other utility that shows drive temperatures then low level access must be possible.

With regards to UNK, the libparted library provides this information and I suspect with this controller it can't determine that information due to either libparted not supporting the controller or the controller not providing that information.

I thought I'd posted a request to you asking to run nwipe in debug mode, but it looks like that post went astray. I'll see if I can find it and repost it here.

PartialVolume commented 11 months ago

Ok, found it.

first: nwipe 0.32 from systemrescue 8.07 worked finally I started 0.34 from systemrescue 10.01: in interactive mode it works. In this special case (no disk in the system that shall not be wiped) I can start with autonuke, noninteractive, as you asked: it starts.

Can I clarify something in the above statement just so I'm clear about this. In system rescue 10.01 which is using nwipe 0.34 when you don't specify all the drives on the command line using the command nwipe --autonuke --nowait --method=dodshort --logfile=wipe_2023-09-09_14:1846.log then it doesn't segfault on starting nwipe and when do specify all the drives on the command line it segfaults? Am I understanding that correctly?

By the way, for those that don't know, when wiping loads of drives there is a shortcut that allows you to select all drives in a single keystroke. On the drive selection screen just type Control A and all drives will be selected for wiping, you can then use the down/up arrows to deselect a drive if so required.

In the case of the CDROM /dev/sr0, you can exclude that from the list by using the command line option exclude=/dev/sr0

With reference to UNK and no temperatures showing up for these drives. Can you run the following command 'nwipe --autonuke --nowait --method=dodshort --verbose --quiet exclude=/dev/sr0 --logfile=debug.log

Once the drives have started wiping, wait a couple of seconds then abort with a Control C. This log file will give me more info about the drives, however as I've used the --quiet option it will exclude drive and system identification data such as serial numbers.

If you could post that logfile below. Thanks.

ggruber commented 11 months ago

the controller is a LSI HBA, lspci says Broadcom / LSI SAS2008 PCI-Express Fusion-MPT SAS-2 [Falcon] (rev 03) dmesg contains mpt2sas_cm0: LSISAS2008: FWVersion(20.00.07.00), ChipRevision(0x03), BiosVersion(07.27.01.01)

In the script mentioned in #357 we use just smartctl to obtain the temperature and transport type.

excluding non-disks I thought about filtering devices in a way like lsscsi -g | grep disk

Can I clarify something in the above statement just so I'm clear about this. In system rescue 10.01 which is using nwipe 0.34 when you don't specify all the drives on the command line using the command nwipe --autonuke --nowait --method=dodshort --logfile=wipe_2023-09-09_14:1846.log then it doesn't segfault on starting nwipe and when do specify all the drives on the command line it segfaults? Am I understanding that correctly?

yes

debug.log.gz

PartialVolume commented 11 months ago

Thanks, very useful. I can see hwmon doesn't populate the various fields in /sys/class/hwmon so that's why the temperatures don't appear. Probably worth reporting to the hwmon maintainer, I can do that and cc you in, however, because the maintainer doesn't have this hardware he will need somebody to provide a patch. Even so it's still worth reporting the issue and hardware details so eventually it may get fixed in the kernel.

For the time being, I can utilise smartctl to provide the temperature and transport type.

Is this hardware something you are hanging on too for a while and can do further tests over the next few weeks? I can then make the changes and provide you with a new release of ShredOS for testing to see if temperatures and transport now appear?

ggruber commented 11 months ago

yes, the hardware is available

ggruber commented 11 months ago

fyi: I could add two SATA disks now: their temperatures are displayed but also not the transport type

PartialVolume commented 11 months ago

@ggruber Regarding the segfault, I've just tested with the following command which is as close as I can get to the command you are running but using 27 loop devices, I can reproduce a segfault but only if I try to run nwipe from a directory that does not have write permissions to create the logfile.

For instance, trying to run nwipe with the --logfile options from /dev will produce the error. Running nwipe from /dev without the logfile option works ok.

Is it possible when you get this error you are running nwipe from a directory that does not have write permissions? Also, when you do this this issue does removing the --logfile option then fix the segfault?

nwipe --autonuke --nowait --method=dodshort --logfile=wipe_2023-09-09_14:1846.log /dev/loop16 /dev/loop20 /dev/loop30 /dev/loop31 /dev/loop32 /dev/loop33 /dev/loop34 /dev/loop35 /dev/loop36 /dev/loop37 /dev/loop38 /dev/loop39 /dev/loop40 /dev/loop41 /dev/loop42 /dev/loop43 /dev/loop44 /dev/loop45 /dev/loop46 /dev/loop47 /dev/loop48 /dev/loop49 /dev/loop50 /dev/loop51 /dev/loop52 /dev/loop53 /dev/loop54

ERROR:Unable to create/open 'wipe_2023-09-09_14:1846.log' for logging, permissions?

ERROR:Unable to create/open 'wipe_2023-09-09_14:1846.log' for logging, permissions?

Segmentation fault

PartialVolume commented 11 months ago

So to summarise, to produce the segfault. All these conditions have to be met:

Run nwipe with --logfile option
Run nwipe in a directory that does not have write permissions so the logfile can't be created.
Run nwipe without sudo or root or as as a user without sufficient privileges so you don't have permissions to write to the directory.

I've found the bug that was causing this. I'll commit a fix this evening.

PartialVolume commented 11 months ago

Segmentation fault fixed by #491

PartialVolume commented 11 months ago

@ggruber Regarding the missing temperatures on these SAS drives can you post the output of smartctl -a /dev/xxx and also smartctl -j -a /dev/xxx on one of the drives.

Thanks

ggruber commented 11 months ago

is there a chance to obtain a binary of nwipe with your fix included? (I do not have a dev system available) ?

PartialVolume commented 11 months ago

If you are talking about a binary for system rescue which I believe is based on ARCH Linux, then unfortunately I won't be able to do that as I don't use ARCH. I don't know if anybody else can built you a binary that would work on system rescue 10.01. Nwipe 0.35 will require a extra library file that wasn't required in 0.34 so if that library has not been included in system rescue it may not work.

I'll shortly be releasing nwipe v0.35 which will trigger a new ShredOS release. You could then test with ShredOS

ggruber commented 11 months ago

here are your requested smart informations smartdata.zip

ggruber commented 11 months ago

I'm quite sure also that the logfile could/can be written. Installed a debian bookworm (well, not exactly: it's proxmox pve with it's kernel 6.2.16) now on my testbox: nwipe --autonuke --nowait --method=dodshort --logfile=wipe_2023-10-01_01:11:23.log /dev/sda /dev/sdd /dev/sde /dev/sdf /dev/sdg /dev/sdh /dev/sdi /dev/sdj /dev/sdk /dev/sdl /dev/sdm /dev/sdn /dev/sdo /dev/sdp /dev/sdq /dev/sdr /dev/sds /dev/sdt /dev/sdu /dev/sdv /dev/sdw /dev/sdx /dev/sdy /dev/sdz buffer overflow detected : terminated

I could install the dev packages also. And you could come to the box if helpful.

ggruber commented 11 months ago

compiled your master code on the system, bug seems to be fixed, indeed. Allthough I doubt that your suggested cause/condition was the real cause on my system: definitly the logfile could/can be written, when the "buffer overflow" occurs here.

so I let nwipe run over night now. and try to find out why control characters (< 0x20) appear on the end of the serial number of these HP disks.

PartialVolume commented 11 months ago

compiled your master code on the system, bug seems to be fixed, indeed.

That's great 👍

and try to find out why control characters (< 0x20) appear on the end of the serial number of these HP disks.

Where are you seeing control characters at the end of the serial number? They don't appear in the GUI wipe screen from your snapshot above. Is it GUI selection screen, logs or summary table ? There is a 20 character limit as defined in the ATA standards and from your snapshot all those HP drives have a serial number with length of 20 characters. When you see control characters is the serial number less than 20 characters or do you see control characters after serial numbers that are 20 characters long?

ggruber commented 11 months ago

kind of opportunity to chat now would be nice. one Logs says e.g. S/N=9XGA0GLK0000C6347EAF^_V

I cloned your git repo and introduced an "isascii()" check in the reading of the serial number, besides a modified "terminating 0" writing, never saw it again. Uninitialised or reused memory? Before the terminating 0 was written hard at index 20, independently of how many characters were read in fact.

PartialVolume commented 11 months ago

here are your requested smart informations smartdata.zip

Thanks for the smart data, I think I can see why nwipe wasn't recognising them as SAS. Should be an easy fix.

ggruber commented 11 months ago

made an enhancement in the GUI as another issue asked like I would to have a little more numeric info (e.g. Bandwidth 1586MB/s instead of 1TB/s, or 1200GB disk size instead of 1TB, too.

PartialVolume commented 11 months ago

Before the terminating 0 was written hard at index 20, independently of how many characters were read in fact.

That would explain it. Are you planning on doing a pull request?

ggruber commented 11 months ago

looking now for way to sort the disk list by device name. Problem is probably the enumeration on my hba which seems quite fuzzy. Maybe nwipe reads sorted after HBA Slots

ggruber commented 11 months ago

And I also see lines like [UNKNOWN] We can't find the HPA line, has hdparm ouput unknown/changed? /dev/sdd in the output, try to investigate

btw, just starting 4 weeks of vacation with opportunities to dive into nwipe. I'd also like to have a option for the path for the PDF files, and possibly the chance to turm them off. like -P [PDF-path], and -P none turns PDF generation off.

PartialVolume commented 11 months ago

And I also see lines like [UNKNOWN] We can't find the HPA line, has hdparm ouput unknown/changed? /dev/sdd in the output, try to investigate

That may be because those SAS drives may not support DCO/HPA like SATA and hdparm returns data that is meaningless to nwipe.

ggruber commented 11 months ago

hdparm -N /dev/sda

/dev/sda: SG_IO: bad/missing sense data, sb[]: 70 00 05 00 00 00 00 0a 00 00 00 00 20 00 01 cf 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 max sectors = 0/208, HPA is enabled

PartialVolume commented 11 months ago

btw, just starting 4 weeks of vacation with opportunities to dive into nwipe. I'd also like to have a option for the path for the PDF files, and possibly the chance to turm them off. like -P [PDF-path], and -P none turns PDF generation off.

Yes, please go ahead with that, if you want to add that option. It's always been in the back of my mind to have that option.

ggruber commented 11 months ago

Before the terminating 0 was written hard at index 20, independently of how many characters were read in fact.

That would explain it. Are you planning on doing a pull request?

sure

PartialVolume commented 11 months ago

hdparm -N /dev/sda

/dev/sda: SG_IO: bad/missing sense data, sb[]: 70 00 05 00 00 00 00 0a 00 00 00 00 20 00 01 cf 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 max sectors = 0/208, HPA is enabled

Yes, that will be why. hdparm isn't getting the correct response from the drive and reports a SG_IO: bad/missing sense data from the drive. With SATA you don't generally see that except for some enterprise drives that no longer support DCO/HPA.

ggruber commented 11 months ago

probably I should do smaller and separated pull requests. Now I mixed already the serial fix with the GUI modifications.

PartialVolume commented 11 months ago

And I also see lines like [UNKNOWN] We can't find the HPA line, has hdparm ouput unknown/changed? /dev/sdd in the output, try to investigate

That will be because either SAS drives don't support DCO/HPA or hdparm is trying to talk to a SAS drive like it's a ATA or quite possibly both.

probably I should do smaller and separated pull requests. Now I mixed already the serial fix with the GUI modifications.

Yes, definitely.

ggruber commented 11 months ago

ok, I'll unwind the two fixes and make separate pull requests.

ggruber commented 11 months ago

I opened an issue for the S/N display problem and created a pull request :-)

PartialVolume commented 11 months ago

I'll open and issue for the S/N display problem (and will have to find out how do a pull request as it will be my first one ever :-) )

We've all been there :-) Are you using a IDE like kdevelop or Eclipse? I tend to use kdevelop for the coding and commits then push the branch containing the changes to my fork, then issue the pull request from the Github web interface. Once the pull request is accepted I delete the branch in my fork. Then delete the branch in kdevelop, branch (checkout) back to master then from the command line git pull upstream master to bring my local copy on my laptop level with the master. I then git push from the command line to bring my fork up to date with upstream master.

ggruber commented 11 months ago

seems I did too much already in one branch. -P option for PDFreportpath is implemented, -P noPDF supresses PDF creation.

I tried to improve the layout of page 2 of the certificate (which I'd prefer to call report, as it could be signed with the help of a certificate. well, simply in my bubble the word "certificate" has a fixed meaning"). Wanted to keep the ascii layout from smartctl by using Courier font. Did not yet understand how the two column layout was created. Could be prettier to have a one column layout which flows to page 3 if needed?

PartialVolume commented 10 months ago

Wanted to keep the ascii layout from smartctl by using Courier font. Did not yet understand how the two column layout was created. Could be prettier to have a one column layout which flows to page 3 if needed?

When I initially wrote the PDF report, it didn't look right as a single column as there was a lot of empty space on the page, the 2 columns just looked better, although I agree about having a fixed width font. I wanted to add a graph on the 3rd page showing MB/s speed during the wipe. I also want to add the option where you can select how many pages you actually want. Some people may only want the first page.

I'm adding a enable/disable option in the /etc/nwipe.conf file that allows you to enable a preview of organisation/customer/date/time prior to the drive selection screen. It will default to no preview but will be useful to verify and change prior to starting a wipe.

As I'm adding that feature I just wondered if you wanted me to go ahead and add the enable disable PDF to the config too, as you mentioned earlier.

ggruber commented 9 months ago

fixed in 0.35

martijnvanbrummelen / nwipe

"buffer overflow" crashes on systemrescue #488