no temperature for SAS drives

ggruber commented 10 months ago

Originally posted by @PartialVolume in https://github.com/martijnvanbrummelen/nwipe/issues/426#issuecomment-1748969487

Yes, via the kernel (hwmon) would be the preferred method which nwipe already supports but it needs somebody with the SAS hardware and, proficient in C and prepared to learn about the low level access to SAS and the standards used. I'm hoping somebody that's looking for a project could add the code to hwmon to support SAS. The maintainer is happy to accept commits but doesn't have the hardware or maybe time to do this himself. If anybody wants to get involved the source for hwmon can be found here. https://github.com/torvalds/linux/blob/master/drivers/hwmon/hwmon.c

Until such time that hwmon supports SAS we may have to use smartctl or alternatively write a low level function ourselves.

ggruber commented 10 months ago

after having fixed the bustype detection I looked for a way to get the drive temperature (and only the drive temperature. smartctl doesn't seem to have an option to get just the temperature. But I found hddtemp which doesn't give me the temps for my SATA drives but for all the different SAS drives. Unfortunately the programm never left the beta status, and bookworm doesn't have it in the repos anymore. So maybe the code gives us a way to include an alternative fetching of temp for SAS drives.

And another question: the interactive display of nwipe seems to refresh quite often, and the load of the machine is the disks being wiped. To ask with max speed via smart for the drive temp seems not the best idea to me. Or is there already a mechanism that makes the screen refresh with e.g. max 5 times per second?

Especially when thing about getting drive temp via smart it seems useful to me to reduce the rate of asking for temps as these do not tend to change within ms. Or am I wrong with this?

PartialVolume commented 10 months ago

Yes, may well be worth looking at hddtemp code for ideas.

Regarding screen refresh, both screen refresh and keyboard response are the same in nwipe. The screen refreshes and keyboard is checked for key hit no faster than 250ms. Any faster and it's unnecessary. Any slower and the user will notice the keyboard to feel less responsive. This is accomplished by these three lines that you will see in most gui functions.

            timeout( GETCH_BLOCK_MS );  // block getch() for ideally about 250ms.
            keystroke = getch();  // Get user input.
            timeout( -1 );  // Switch back to blocking mode.

However the temperature updates although in these loops update far less frequently.

If you look in the gui.c file at the compute_stats function you will find the if statement as shown below. This if statement limits how often nwipe obtains the temperature values from hwmon or from smartctl to only once every 60 seconds. So calling smartctl (if we used that) won't have the same impact on resources as we are not reading temperatures at screen/keyboard refresh speeds.

       /* Read the drive temperature values */
        if( nwipe_time_now > ( c[i]->temp1_time + 60 ) )
        {
            nwipe_update_temperature( c[i] );
        }

The temperature code is quite complicated as it also handles temperature limits of the device and changes the colour and blink rate depending on whether temperature critical upper and lower limits have been reached.

However it would be relatively easy to add additional code for smartctl or hddtemp by adding extra code in temperature.c you wouldn't need to worry about the GUI code.

PartialVolume commented 10 months ago

Basically you just need a function that obtains the drive temperatures from SAS and updates the relevent temperature fields in the drive context structure for each drive. This function would be called if the bus type is SAS and the temperature field in the drive context is null i.e not been found by hwmon. The SAS function would be called from within the nwipe_update_temperature function within the temperature.c file.

ggruber commented 10 months ago

tnx for that great reply. I will try to extract the SAS temp code from hddtemp (it's in two source files) and integrate it into nwipe. As it is also GPLv2 there should be no real issues with the license. For the docs I suggest to add smartmontools as mandatory required package.

PartialVolume commented 10 months ago

I was just looking at the hddtemp code https://github.com/guzu/hddtemp/tree/master/src

I think maybe it's easier to use smartctl. We already use smartctl for various things and already check it exists on the system prior to using it. We note in the logs if it's missing.

In the nwipe code there are already example of obtaining the output from smartctl. Looking at hddtemp it looks too much like a steep learning curve to extract the relevant bits we need and incorporating all of its code into nwipe looks like too much unnecessary code.

So I'd prefer to go with smartctl unless you can write a succinct low level function to access the smart data direct from the drive.

ggruber commented 10 months ago

it's true that the use of smartctl would be easier to implement. But: those HP SAS drives in my current server tend to produce a significant delay. time smartctl -A /dev/sdb shows 0.05s for a SSD drive, but 0.4s, 0.6s for HDDs and 2.7s(!) for those HP SAS HDDs. hddtemp for such a HP SAS HDD takes 0.2s, for other SAS drives 0.005s. Besides that there is a significant overhead for a fork - exec sequence it takes to run another program from nwipe. Fine for one time shots, but in a loop I'd try to avoid that.

So I think it's worth the work to integrate the hddtemp source code.

PartialVolume commented 10 months ago

I agree, those sort of delays would be unacceptably long. Especially the 2.7 seconds, bad enough wiping one drive let alone 28 simultaneously.

So hddtemp looks like a better option. What would be perfect would be a small low level function that just accesses the temperature from the smart data.

I wrote a low level function that is used to obtain hpa/dco max sector information from the drive in much the same way hddtemp and smartctl do. I would expect such a function written like this to extract temperature would be the fastest by far, assuming the drives themselves don't have significant response delays while writing.

If you want to take a look at that low level access to a drive that I use to obtain the hpa/dco real max sectors take a look at the file hpa_dco.c for a function called nwipe_read_dco_real_max_sectors( char* device ) that may help in understanding some of hddtemps code.

I believe SAS use either the SCSI command set or the STP protocol. Maybe this link would be useful too. https://www.scsita.org/sas_and_serial_ata_tunneling_protocol_stp/

Here's the low level function as an example, basically it reads a 512 block using the ioctl command and then from the contents of that block I extract the real max sectors. The block content format is defined in the ATA standards document, which is where you also find the smart data format.

u64 nwipe_read_dco_real_max_sectors( char* device )
{
    /* This function sends a device configuration overlay identify command 0xB1 (dco-identify)
     * to the drive and extracts the real max sectors. The value is incremented by 1 and
     * then returned. We rely upon this function to determine real max sectors as there
     * is a bug in hdparm 9.60, including possibly earlier or later versions but which is
     * fixed in 9.65, that returns a incorrect (negative) value
     * for some drives that are possibly over a certain size.
     */

    /* TODO Add checks in case of failure, especially with recent drives that may not
     * support drive configuration overlay commands.
     */

#define LBA_SIZE 512
#define CMD_LEN 16
#define BLOCK_MAX 65535
#define LBA_MAX ( 1 << 30 )
#define SENSE_BUFFER_SIZE 32

    u64 nwipe_real_max_sectors;

    /* This command issues command 0xb1 (dco-identify) 15th byte */
    unsigned char cmd_blk[CMD_LEN] = { 0x85, 0x08, 0x0e, 0x00, 0xc2, 0, 0x01, 0, 0, 0, 0, 0, 0, 0x40, 0xb1, 0 };

    sg_io_hdr_t io_hdr;
    unsigned char buffer[LBA_SIZE];  // Received data block
    unsigned char sense_buffer[SENSE_BUFFER_SIZE];  // Sense data

    /* three characters represent one byte of sense data, i.e
     * two characters and a space "01 AE 67"
     */
    char sense_buffer_hex[( SENSE_BUFFER_SIZE * 3 ) + 1];

    int i, i2;  // index
    int fd;  // file descripter

    if( ( fd = open( device, O_RDWR ) ) < 0 )
    {
        /* Unable to open device */
        return -1;
    }

    /******************************************
     * Initialise the sg header for reading the
     * device configuration overlay identify data
     */
    memset( &io_hdr, 0, sizeof( sg_io_hdr_t ) );
    io_hdr.interface_id = 'S';
    io_hdr.cmd_len = sizeof( cmd_blk );
    io_hdr.mx_sb_len = sizeof( sense_buffer );
    io_hdr.dxfer_direction = SG_DXFER_FROM_DEV;
    io_hdr.dxfer_len = LBA_SIZE;
    io_hdr.dxferp = buffer;
    io_hdr.cmdp = cmd_blk;
    io_hdr.sbp = sense_buffer;
    io_hdr.timeout = 20000;

    if( ioctl( fd, SG_IO, &io_hdr ) < 0 )
    {
        nwipe_log( NWIPE_LOG_ERROR, "IOCTL command failed retrieving DCO" );
        i2 = 0;
        for( i = 0, i2 = 0; i < SENSE_BUFFER_SIZE; i++, i2 += 3 )
        {
            /* IOCTL returned an error */
            snprintf( &sense_buffer_hex[i2], sizeof( sense_buffer_hex ), "%02x ", sense_buffer[i] );
        }
        sense_buffer_hex[i2] = 0;  // terminate string
        nwipe_log( NWIPE_LOG_DEBUG, "Sense buffer from failed DCO identify cmd:%s", sense_buffer_hex );
        return -2;
    }

    /* Close the device */
    close( fd );

    /***************************************************************
     * Extract the real max sectors from the returned 512 byte block.
     * Assuming the first word/byte is 0. We extract the bytes & switch
     * the endian. Words 3-6(bytes 6-13) contain the max sector address
     */
    nwipe_real_max_sectors = (u64) ( (u64) buffer[13] << 56 ) | ( (u64) buffer[12] << 48 ) | ( (u64) buffer[11] << 40 )
        | ( (u64) buffer[10] << 32 ) | ( (u64) buffer[9] << 24 ) | ( (u64) buffer[8] << 16 ) | ( (u64) buffer[7] << 8 )
        | buffer[6];

    /* Don't really understand this but hdparm adds 1 to
     * the real max sectors too, counting zero as sector?
     * but only increment if it's already greater than zero
     */
    if( nwipe_real_max_sectors > 0 )
    {
        nwipe_real_max_sectors++;
    }

    nwipe_log(
        NWIPE_LOG_INFO, "func:nwipe_read_dco_real_max_sectors(), DCO real max sectors = %lli", nwipe_real_max_sectors );

    return nwipe_real_max_sectors;
}

ggruber commented 10 months ago

wow, cool stuff.

I wrote a low level function that is used to obtain hpa/dco max sector information from the drive in much the same way hddtemp and smartctl do. I would expect such a function written like this to extract temperature would be the fastest by far, assuming the drives themselves don't have significant response delays while writing.

Well, finally we'll see what happens when we test the code on my system. With these lovely HP HDDs ;-)

Regarding your low-level suggestions: I digged in the code tonight, and I'm thinking top-down currectly: introduced that stub for switching to alternate SCSI temp reading in temperature.c. But found, that it will be useful to also implement a kind-of init-function, that would read the trip temperatures and set them to temp1_crit. So you would have a little help for your temperature watching while erasing.

I do not yet understand fully, why you do the complete cycle of trying to read the whole set of "temp1_crit", "temp1_highest", "temp1_input", "temp1_lcrit", "temp1_lowest", "temp1_max", "temp1_min" whenever it's time to "read the temp". Even if you know already the drive does not provide these data. Next: Detection of the hwmon paths could write values to temp1_crit, temp1_lcrit, and you could save the read ops during erasure. (Have to look for the meaning/difference of temp1_highest vs. temp1_max resp. temp1_lowest vs temp1_min yet.)

And: we have the nwipe_context_t_ structure: I suggest to add a flag has_hwmon_data which would at least allow to shrink the log file ;-)

And last, but not least: hddtemp knows/differentiates (at least in the header) between ATA und SATA. Could this be the same as IDE and ATA in nwipe? (Find it just a very little bit confusing to read ATA instead of the expected SATA in the drive overview.)

PartialVolume commented 10 months ago

I do not yet understand fully, why you do the complete cycle of trying to read the whole set of "temp1_crit", "temp1_highest", "temp1_input", "temp1_lcrit", "temp1_lowest", "temp1_max", "temp1_min" whenever it's time to "read the temp". Even if you know already the drive does not provide these data. Next: Detection of the hwmon paths could write values to temp1_crit, temp1_lcrit, and you could save the read ops during erasure. (Have to look for the meaning/difference of temp1_highest vs. temp1_max resp. temp1_lowest vs temp1_min yet.) And: we have the nwipe_contextt structure: I suggest to add a flag has_hwmon_data which would at least allow to shrink the log file ;-)

Yes, the temperature/limits into the log in verbose is excessive to say the least. It's left over from when I wrote the code and have been meaning to do something about it, just never got around it tidying it up. If you fancy cleaning that up please go ahead and also make the addition to the context structure as you suggest.

And last, but not least: hddtemp knows/differentiates (at least in the header) between ATA und SATA. Could this be the same as IDE and ATA in nwipe? (Find it just a very little bit confusing to read ATA instead of the expected SATA in the drive overview.)

I think the problem here is libparted can't tell the difference between SATA and IDE. It identifies them as ATA as that's the SCSI command set they all use. I don't know whether there is a way if differentiating between SATA and IDE? Any suggestions to improving are welcome although I don't want to just change ATA to SATA as IDE drives would be called SATA.

ggruber commented 10 months ago

ok, so we go ahead with ATA.

And I'll try the separation into an init function, that would fill the limit values, and the "normal" readings during wiping.

Stay tuned ;-)

PartialVolume commented 10 months ago

There is a function in temperature.c that's called nwipe_init_temperature who's job is to determine what drives support temperatures and where in the hwmon directory structure the temperature data can be found. Here the temperature limit values are also initialised.

I'm not sure what your init function is going to do but maybe it could fit inside this function? This function is called once at startup.

ggruber commented 10 months ago

exactly there it will be dropped in :-)

PartialVolume commented 10 months ago

@ggruber Just so we don't end up duplicating effort, I'm working on the PDF enable/disable and PDF preview config options on the GUI config page. The values of these will be loaded at startup from nwipe.conf but if the user specifies PDF enable/disable on the command line that will override the value in nwipe.conf.

ggruber commented 10 months ago

fine for me

ggruber commented 10 months ago

Just found that sg3-utils contains a shell wrapping script scsi_temperature which basically calls sg_logs -t <device> Will possibly use it to get the trip temperature aka reference temperature, which is temp1_crit (imho). Reading this trip temperature gave me headaches as this is not included in hddtemp. But I also found the SCSI Commands Reference Manual (from Seagate) which contains the information regarding the temperature page which I was missing.

Ok, I have the trip temp from within the hddtemp code.

ggruber commented 10 months ago

seems I have a first running version with hddtemp code included

2023-10-10

ggruber commented 10 months ago

Is there a special reason to refresh the screen with permanent polling of the temperatures during disk selection? From those lovely HP disk 4 to 5 are gathered per second, and to select the disks is heavily delayed, almost unusable.

Will try to find and fix this. Seems the limit (get temp only once in 60 seconds) is not considered here. Well, it appears to be considered in gui.c Left this as it is for now, added a check in temperature.c:

    int idx;
    int result;

+    time_t nwipe_time_now = time( NULL );
+    if( nwipe_time_now - c->temp1_time < 60 )
+    {
+        return;
+    }
    /* try to get temperatures from hwmon, standard */
    if( c->templ_has_hwmon_data == 1 )
    {

PartialVolume commented 10 months ago

It's looking good. I'm getting a implicit declaration warning, when I compile that needs fixing

hddtemp_scsi/get_scsi_temp.c:80:9: warning: implicit declaration of function ‘scsi_get_temperature’; did you mean ‘nwipe_init_temperature’? [-Wimplicit-function-declaration]
   80 |     if( scsi_get_temperature( dsk ) == GETTEMP_SUCCESS )
      |         ^~~~~~~~~~~~~~~~~~~~
      |         nwipe_init_temperature

Is there a special reason to refresh the screen with permanent polling of the temperatures during disk selection? From those lovely HP disk 4 to 5 are gathered per second, and to select the disks is heavily delayed, almost unusable.

At the time I think it was so if the drive was was too hot, nearing critical you could see the temperature change more quickly in drive selection as you attempt to cool the drive down before starting a wipe. As hwmon is relatively quick it didn't matter but in practise there is no reason really why drive selection shouldn't poll less frequently at one every sixty seconds like drive wiping does.

PartialVolume commented 10 months ago

I was just looking at the drive selection GUI code and there is this bit of code below that refreshes temperature every 1 second. The one second is important for the flashing HPA messages that appear when determining HPA/DCO drive status. So this code shouldn't be changed.

            /* Update the selection window every 1 second specifically
             * so that the drive temperatures are updated and also the line toggle that
             * occurs with the HPA status and the drive size & temperature.
             */
            if( time( NULL ) > ( temperature_check_time + 1 ) )
            {
                temperature_check_time = time( NULL );
                validkeyhit = 1;
            }

However ...

At line 828 of the master code, the following line exists

                /* Read the drive temperature values */
                nwipe_update_temperature( c[i + offset] );

this should really be in a if statement so that it updates every 60 seconds rather than 1 second. Or even a new function that contains the if statment called nwipe_temperature_update_60s as this code is duplicated elsewhere.

I'm surprised there is much of a delay when reading temperatures on SAS drives. How many milliseconds is it taking to read the temperature on each SAS drive?

PartialVolume commented 10 months ago

To measure the execution time of the existing nwipe_update_temperature() function I placed some timing code before and after the function and had it print the execution time to the log.

                // NOTE temporary timing code
                clock_t t;
                t = clock();

                /* Read the drive temperature values */
                nwipe_update_temperature( c[i + offset] );

                // NOTE temporary timing code
                t = clock() - t;
                double time_taken = ((double)t)/CLOCKS_PER_SEC; // in seconds
                nwipe_log( NWIPE_LOG_INFO, "nwipe_update_temperature() took %f seconds", time_taken);

I ran nwipe and left it at the drive selection screen then aborted it after 10 seconds. Here are the function execution times.

[2023/10/10 15:13:31]    info: nwipe_update_temperature() took 0.000091 seconds
[2023/10/10 15:13:33]    info: nwipe_update_temperature() took 0.000231 seconds
[2023/10/10 15:13:35]    info: nwipe_update_temperature() took 0.000316 seconds
[2023/10/10 15:13:37]    info: nwipe_update_temperature() took 0.000172 seconds
[2023/10/10 15:13:39]    info: nwipe_update_temperature() took 0.000204 seconds
[2023/10/10 15:13:41]    info: nwipe_update_temperature() took 0.000216 seconds

So on my 20 core I7 we are talking about anything from 91uS to 316uS, however I have only one drive and it's a NvMe WD Blue SN570 1TB.

Would be interesting if you ran the same test on your multiple SAS drives.

ggruber commented 10 months ago

time smartctl -A /dev/sdb shows 0.05s for a SSD drive, but 0.4s, 0.6s for HDDs and 2.7s(!) for those HP SAS HDDs. hddtemp for such a HP SAS HDD takes 0.2s, for other SAS drives 0.005s.

Only those old HP drives cause real pain.

I'll interrupt the running nwipe, and fill some other disks in the box, among those also NVMe.

I'll integrate your suggestions and tidy up the code a bit. gimme two hours, please. For your amusement you could download https://support.edv2g.de/owncloud/index.php/s/a2mpN5edRHY9TWA

PartialVolume commented 10 months ago

Nice 👍 I like to see lots of flashing lights, you know things are happening. Could also be because I like old 50s & 60s science fiction movies where the computers always had an abundance of flashing lights 🤣

It also demonstrates a 'feature' of nwipe, where it pauses when it's writing as opposed to syncing. This pause is more noticeable when it's doing a PRNG pass and I always thought this pause could be reduced by having the PRNG creation calculated in a thread that was free running writing into a buffer, when the next prng block was needed it had already been calculated, so the wipe thread doesn't have to wait for the next prng block to be calculated. Each wipe thread would need to create a prng thread, so for a 20 drive wipe each having its own thread there would be 20 prng threads. So for nwipe it would running 40 wipe/prng threads plus the gui status thread.

This was one of those things that has been talked about in the past but never got round to optimising that part of the wipe thread.

Its one of those things that one day we will get round to doing but for me adding ATA secure erase is the next priority for me at least.

ggruber commented 10 months ago

Its one of those things that one day we will get round to doing but for me adding ATA secure erase is the next priority for me at least.

that's why I used the chance with the arrival of 12 SATA disk which I want to wipe to chance some more disks. There are now in addition to the 2 Intel SSDs and the 2 SAS SSDs 3 SATA consumer SSDs in the box, plus 2 NVMe SATA plus 2 NVMe PCIe.

Have some trouble yet to get the box up as the NVMe are not blank but contain a former version of the OS which is now installed. ZFS doesn't like to find identical named pools, especially not from initramfs

ggruber commented 10 months ago

here is a logfile with your suggested time measurement code results included. pls nore, that there are only 3 of those super-slow HP drives left over. And: it seems to me there is a difference between update_temperature duration in idle mode and under the load of wiping.

grep nwipe_update_temperature nwipe4.log | awk '{print $6 "\t" $9}' | sort -n -u | less gave me interesting views

nwipe4.log.gz

PartialVolume commented 10 months ago

Those timings look pretty good to me, the worst being 1.360 milliseconds, in percentage terms that means 0.136% of every second is taken up obtaining the drive temperature. Even if writing to the drive stopped during that period, which I don't think it does, but there is always that possibility. Extrapolating that out to the time period of a full wipe, say 4 hrs (240 minutes or 14400 seconds) that means the wipe might take an extra 14400/100*0.136 =19.58 seconds to complete and that's checking the temperature every second.

Would you agree with that?

If I've not made a big mistake in my calculations, then I would say checking temperature at 1 second intervals is fine, however is that what happens in practise? So on your hardware, if you wiped a single drive twice to completion, once with 1 second interval between temperature checks and then repeat the wipe with 60 seconds between checks is there much difference?

PartialVolume commented 10 months ago

Also, the temperature checking is operating in a different thread to the actual disk wipe so on a multi core processor should not delay the wipe thread too much.

PartialVolume commented 10 months ago

Unrelated to the temperature, but thinking about the pause I could see when the drives were writing rather than syncing, I was just wondering, when you have 28 drives wiping and using the PRNG method what is htop -d 1 reporting for core usage, I was just wondering whether the CPU cores are hitting 100% when its calculating the next PRNG block.

PartialVolume commented 10 months ago

I think you are correct about having the smart data as a single column rather than two so it occupies page two and three. I just did a wipe on a disk and the left column text overlaps with the right column in places and the right column gets cropped to the right.

So yes, the two columns needs to be changed to a single column.

report page two smart data shoowing overlapping and cropped text

ggruber commented 10 months ago

that should go to the PDF issue, and fits with my suggestions.

I'll continue with the timing measurements and will intergrate temperature min/max saving for SAS drives.

And I prepared a cleanup function, that closes open fd und frees malloced mem (for in single SAS drive). Still looking for the best place to call it.

ggruber commented 10 months ago

another gui qustion: the firmware of my LSI/Avago/Broadcom HBA distinguishes between SAS, SATA, SAS-SSD und SATA-SSD. I'd find it at least nice, but also kind of helpful during drive selection in GUI if I could spot the SSDs e.g. this way. (but I know about the fragility of the layout)

ggruber commented 10 months ago

what are those red and black highlighted temps in the selection screen?

see https://support.edv2g.de/owncloud/index.php/s/9AjiZFMNqR7yptF

in the last quarter of 20231012_113927.mp4 you see the delayes when one of the lovely HP disks is queried where would time measurement code have to be inserted to catch those delays?

and:

Unrelated to the temperature, but thinking about the pause I could see when the drives were writing rather than syncing, I was just wondering, when you have 28 drives wiping and using the PRNG method what is htop -d 1 reporting for core usage, I was just wondering whether the CPU cores are hitting 100% when its calculating the next PRNG block.

see 20231012_120025.mp4

PartialVolume commented 10 months ago

what are those red and black highlighted temps in the selection screen?

The video and description here describe what those colour changes mean along with what they mean when also flashing.

As those temperatures are normal either the temperature values in the drive context were not initialised when the drive was detected or the values retrieved from the drive are invalid or there is a bug somewhere.

These are the temperature variables in the drive context.

    int temp1_crit;  // Critical high drive temperature, 1000000=unitialised, millidegree celsius.
    int temp1_highest;  // Historical highest temperature reached, 1000000=unitialised, millidegree celsius.
    int temp1_input;  // drive temperature, -1=unitialised. 1000000=unitialised, millidegree celsius.
    int temp1_lcrit;  // Critical low drive temperature, 1000000=unitialised, millidegree celsius.
    int temp1_lowest;  // Historically lowest temperature, 1000000=unitialised, millidegree celsius.
    int temp1_max;  // Maximum allowed temperature, 1000000=unitialised, millidegree celsius.
    int temp1_min;  // Minimum allowed temperature, 1000000=unitialised, millidegree celsius.

At startup they should all be initialised to 1000000, so that if real data is not put in it's place the GUI code knows to ignore changing the colour. In the text above I've used some contradictory wording. 1000000 is the initialised state at start up.

ggruber commented 10 months ago

saw that, searched for the terms, but found no clearly related statement. Improved the min/max handling for SCSI temp in the meantime.

Append a current log from the current commited version nwipe6.log.gz

ggruber commented 10 months ago

mostly the red temperatures are visible in lines with indeterminable HPA/DCO

as SCSI drives do not give highest/lowest temp we should let these values follow the max/min to have a "kind-of-meaningful" value in, shouldn't we?

and I have no good idea where to take lcrit from.

PartialVolume commented 10 months ago

mostly the red temperatures are visible in lines with indeterminable HPA/DCO

Yes, I noticed that too, don't know if it's just a coincidence or not.

as SCSI drives do not give highest/lowest temp we should let these values follow the max/min to have a "kind-of-meaningful" value in, shouldn't we?

Yes.

Not all drives provide all those temperatures, some provide no limit temperatures at all, some provide just the max and minimum operating temperatures, others provide all those temperatures, seems to be down to manufacturer and model.

So if no specific lcrit data is available it should be just left in it's initialised (unwritten) state = 1000000 so the GUI code just ignores it.

PartialVolume commented 10 months ago

in the last quarter of 20231012_113927.mp4 you see the delayes when one of the lovely HP disks is queried where would time measurement code have to be inserted to catch those delays?

To me that freeze in the last quarter looks like the gui wipe status update code which is running in it's own thread freezes because maybe it's sharing a core with a wipe thread. From htop I can see you have 40 cores and 51 threads running so some threads are sharing cores. Maybe that has something to do with it. If that were the case reducing the number of drives by half would either reduce the problem or make it disappear.

Also it was early in the DOD short wipe, the PRNG pass doesn't happen until pass three, so it's nothing to do with PRNG generation. But in htop I do see some cores hit 100% and above momentarily.

Very odd, something I've only ever seen in a PRNG pass when in writing and not for the length of time you are seeing.

Do all the disc lights stop flashing when the freeze occurs on screen? I'm guessing probably not.

The timing code where you had it before around the temperatures update if the numbers look good i.e no greater than 1.3ms then you can rule out a drive that is taking a long time to respond.

I do wonder if it's simply a CPU scheduling issue.

PartialVolume commented 10 months ago

Regarding the threads in nwipe, there is the main thread i.e nwipe.c and then from here there is one wipe thread created for each drive and the GUI update thread (A thread that calculates all the various values displayed during the wipe). All these threads are created with equal priority.

I was just wondering if setting a higher priority for the GUI thread over the wipe threads might make the problem disappear. Something I've never done.

Without looking at the code there could be two GUI related threads, one for the general ncurses display and one that obtains the values that are displayed.

The spinner at the end of the line may be a clue as I think that just in the main GUI thread.

You could also try changing the priority of the whole of nwipe with nice to see if that makes a difference. nice range -20 to +19, -n is an adjustment to the current nice value shown by typing nice in the terminal. So for my system ..

~$ nice
0
 then

Max priority to nwipe 
nice -n -20 nwipe

Min priority to nwipe
nice -n +19 nwipe

default is 10 according to the man pages but is 0 on my system.

Maybe that will give priority to nwipe over the OS caching and writing the data to disc.

ggruber commented 10 months ago

don't think we need to nice. As those 12 SATA drives are wiped now I can swap in those HP SAS drives again. Then it's much more visible that the temp getting takes time/blocks.

PartialVolume commented 10 months ago

another gui qustion: the firmware of my LSI/Avago/Broadcom HBA distinguishes between SAS, SATA, SAS-SSD und SATA-SSD. I'd find it at least nice, but also kind of helpful during drive selection in GUI if I could spot the SSDs e.g. this way. (but I know about the fragility of the layout)

Something for the next version, I always wanted to be able to toggle two different drive selection views, the alternate being something like this which could incorporate more info ..

Maybe an alternate wipe view too where more info is displayed.

ggruber commented 10 months ago

as I want to avoid selecting SSDs for my nwipe test easily I added the is_ssd flag and show it in GUI. Lines can be up to 85 chars now, which is not good. What about omitting the "/dev/" strings for the devices? And the n1 for the NVMe's ?

And then I added msec timemeasurement code in temperature.c

Shows really interesting results: nwipe9.log.gz

there are new videos in my cloud

ggruber commented 10 months ago

will be afk over the weekend, so I start a pull request

ggruber commented 10 months ago

updated log + video for the run that startet yesterday in the evening on cloud: looks unexpected, as all drives seem the pause at the same time. Does the GUI update (that requests temperature reading) block all threads while waiting for one of the slow HP disks? and: temperature reading take significant longer when the disks are under load

If I better knew tabcalc: based on egrep 'get temperature for.*took' nwipeA.log | awk '{print $9 "\t" $7}' | sed -e 's/\.[0-9][0-9][0-9][0-9][0-9][0-9]//' we could do some nice statistics per drive: min/max/medium/mean times per drive.

Medium column B is the medium temperature reading time in ms

Should we decouple temp reading and gui update?

ggruber commented 10 months ago

@PartialVolume : what do you think about sorting the disk list? and: group disk list by non-SSD / SSD instead of showing e.g. SATA-SSD ? Potentially the SSDs are wiped in a completely different way than non-SSDs. We could even have a cli option include-ssd (or exclude-sdd)

PartialVolume commented 10 months ago

as I want to avoid selecting SSDs for my nwipe test easily I added the is_ssd flag and show it in GUI. Lines can be up to 85 chars now, which is not good. What about omitting the "/dev/" strings for the devices? And the n1 for the NVMe's ?

This would need to be submitted as a single pull request or opened up in discussions so hopefully other people will comment. Over the years /dev/ has been removed and re-instated. Some people like it some don't. So I'm reluctant to keep changing it without a good reason. As to whether the 80 character limit is really necessary to stick to nowadays I'm not so sure. Take ShredOS it once operated without using the DRM (Linux's Direct Rendering Manager) At the time it meant the default column width was 80 but with DRM it uses native screen resolution and provides much larger column widths. Having said that using nomodeset will cause ShredOS to not use DRM, which is often used when there are display driver issues. In this mode 80 columns is the default.

I like the idea of SATA-SSD but maybe it should be coded so that it depends on the current column width. If the current column width is 80 or less [SATA] is shown, if column width is greater than 84 then [SATA-SSD] is shown. The GUI adapts the content depending on screen width. I've used column width checking a lot in the GUI code which is why you can resize the terminal and the display is immediately updated.

PartialVolume commented 10 months ago

Does the GUI update (that requests temperature reading) block all threads while waiting for one of the slow HP disks?

No, it only blocks the GUI Update thread but by doing so nothing in the display will get updated so it looks like everything froze, but it didn't.

Should we decouple temp reading and gui update?

It does look like having the temperature update code running in it's own thread would solve this problem. It wouldn't momentarily hang the GUI thread every time it obtained the temperatures. This change would need to be submitted on it's own as a single pull request. I think you are submitting direct from your master but you probably want to separate specific changes like this by creating a branch and then generating the pull request from a branch.

Those SAS drive temperature delays are definitely bad at 1800+ ms but with being called from a new dedicated thread that just calls nwipe_update_temperature( nwipe_context_t* c ), I'm pretty sure would fix the momentary freeze in the GUI.

PartialVolume commented 10 months ago

@ggruber Just thinking out loud about the temperature thread and I've made the assumption you would be happy to create the thread? If not I can do it. I know you don't really need these pointers but thought I'd just mention a couple of things to do with the threads, especially related to thread request cancellation, which once upon a time nwipe didn't have which let to thread resources not getting returned properly on exit.

I would probably leave nwipe_update_temperature( nwipe_context_t* c ) as it is and create a new function that cycles through each drive in turn calling nwipe_update_temperature( nwipe_context_t* c ). This wrapper function would repeatably call nwipe_update_temperature( nwipe_context_t* c ) for each drive but it should never call the nwipe_update_temperature( nwipe_context_t* c ) function any faster than once per second. Important for when nwipe is wiping a small number of SATA or NvMe drives that respond much faster ( 1-200 ms). This avoids the possibility of nwipe_update_temperature_wrapper being called a thousand times a second and the core hitting 100% and wasting CPU cycles.

The wrapper function

The wrapper function would be something like this, which I'll call nwipe_update_temperature_wrapper and include it in temperature.c

while ( terminate_signal !=  1 )
{
    for ( each_drive)
    {
            If ( elapsed_time_in_seconds < 1 )
            {
                    sleep 1
            }
            nwipe_update_temperature( c[i] )
    }
}

Remove the three other calls to nwipe_update_temperature(..) as shown below. This function should not be called from anywhere else except our new wrapper function nwipe_update_temperature_wrapper.

Screenshot_20231014_153425

Also the nwipe_log_drives_temperature_limits call needs to be moved from nwipe.c into the temperature thread, it should be called only once for each drive.

Screenshot_20231014_174935

The other additions would be in nwipe.c as shown below:

The thread creation code

The code below would be placed in nwipe.c following the gui thread creation at around line 612

nwipe_temperature_data.c = c2;

// I'm not sure whether the line below is actaully needed.
nwipe_temperature_data.nwipe_misc_thread_data = &nwipe_misc_thread_data;

 /* Fork the temperature update thread. *
errno = pthread_create( &nwipe_gui_thread, NULL, nwipe_update_temperature_wrapper, &nwipe_temperature_data 
        // some error checking code
);

The thread cancellation code

The thread cancellation request code where you join the thread and request it to cancel and wait for acknowledgement that it has terminated . This would be pretty much the same as the GUI thread cancellation code and that can be used as an example. The nwipe_update_temperature_thread can be placed next to the GUI thread cancellation code, before or after as it doesn't matter. This would be placed approximately at line 667 in nwipe.c

if( nwipe_update_temperature_thread )
{
        if( nwipe_options.verbose )
        {
            nwipe_log( NWIPE_LOG_INFO, "Cancelling the temperature_update thread." );
        }

        /* We don't want to use pthread_cancel as our temperature update thread is aware of the control-c
         *  signal and will exit itself we just join the GUI thread and wait for confirmation
         */
        r = pthread_join( nwipe_update_temperature_thread, NULL );
        if( r != 0 )
        {
            nwipe_log( NWIPE_LOG_WARNING, "main()>pthread_join():Error when waiting for nwipe_update_temperature_thread to cancel." );
        }
        else
        {
            if( nwipe_options.verbose )
            {
                nwipe_log( NWIPE_LOG_INFO, "nwipe_update_temperature_thread has been cancelled" );
            }
        }
 }

ggruber commented 10 months ago

back from my journey.

This would need to be submitted as a single pull request or opened up in discussions so hopefully other people will comment. Over the years /dev/ has been removed and re-instated. Some people like it some don't. So I'm reluctant to keep changing it without a good reason.

I could implement a CLI Switch to have /dev or not, with the opportunity to set it in the config file.

The GUI adapts the content depending on screen width. I've used column width checking a lot in the GUI code which is why you can resize the terminal and the display is immediately updated.

wilk try to implement this

I think you are submitting direct from your master but you probably want to separate specific changes like this by creating a branch and then generating the pull request from a branch.

will have to learn to do it better. tnx for your suggestion = help

I've made the assumption you would be happy to create the thread?

Sure I would. But my time deposit for nwipe will not be as comfortable as during the last two weeks during the next two. To be quiet honest I really programmed in C 30 years ago, since then mostly scripting languages like Perl. And I do not yet use an IDE for programming. So please do not hope to get this threading thing done during the next two weeks.Not implemented by me. Not sure how much influence it has for typical nwipe usage. Do we have any kind of telemetry information, what a typical use includes? Well, with those reports/"certificates" the use could change to more professional usage.

(just some raw thoughts)

PartialVolume commented 10 months ago

And I do not yet use an IDE for programming.

A long time ago, probably back in the 80s I used to write C just using notepad. I even remember it was a thing to say this program was written using notepad. Just a plain old text editor, like Vim or vi if your old school. Then at some point as Linux become more stable I started using Kdevelop, for any big project I can't imagine using anything else now. Although it's nice tabbing between source files for me the single most used features are, 'show uses' where it will list a variable or function you choose and where it may be used/called throughout all your source files, information screens about function usage for libraries you may not be familiar which saves you Googling it. And of course the inbuilt git functionality for the main commands like commit, checkout, branching, stash management etc. Single click build plus many other features. It has a learning curve but worth it in the end especially for large projects. It also has excellent debugging tools although I rarely need to use those nowadays.

PartialVolume commented 10 months ago

So please do not hope to get this threading thing done during the next two weeks.Not implemented by me.

I can put it in a thread and test it, probably would take me a hour if that. I can let you know when it's done ( possibly tomorrow) and you could test it if that' ok. I'll commit your existing PR first.

PartialVolume commented 10 months ago

Not sure how much influence it has for typical nwipe usage. Do we have any kind of telemetry information, what a typical use includes?

No telemetry at the moment, something to consider for the future maybe, depends how people feel about that. It would be useful though Imho.

I think there is quite a few running nwipe or ShredOS on multidisc SATA servers not sure how many are SAS drives though.

martijnvanbrummelen / nwipe