[Request] Provide a cut-down alternative script designed to run from Cron

ngandrass / truenas-spindown-timer

Monitors drive I/O and forces HDD spindown after a given idle period. Resistant to S.M.A.R.T. reads.

MIT License

[Request] Provide a cut-down alternative script designed to run from Cron #26

Open Sophist-UK opened 10 months ago

Sophist-UK commented 10 months ago

i.e. remove the loop from the existing spindown_timer.sh script and do a one-time check and spindown if needed when the script is run.

Then you create a cron job(s) as required to do a spindown check - and you have all the flexibility of cron to e.g. run the script at 0,10,20,30,40,50 minutes past the hour for hours between 8am and 6pm, only on weekdays.

ngandrass commented 9 months ago

Thanks for your suggestion!

Just to make sure that I understand you correctly: You'd like to have a mode in which the script is run once, waits for the number of seconds given by POLL_TIME, spins down inactive drives, and exits right after?

So, for example: A script invocation with -p 600 would monitor I/O for 600 seconds and exit.

Sophist-UK commented 9 months ago

No - I want a script that immediately checks to see if the disks have been inactive for a specified time, spin them down if they have been, and then exit.

In other words it doesn't stay running in the background, but instead we poll the disks by executing a cron job at each poll time.

Then I can have different cron jobs for different periods with different timeouts.

Sophist-UK commented 9 months ago

As per my offer at the end of another issue, I will be happy to enhance the installation instructions for both the original script and this cron version once it has been written and I have updated it.

I should add that I am still unclear what the difference is between this original script and the standard TrueNas Scale spindown functionality in the Disks part of the UI (provided that I tweak SMART service settings to use -n standby)?

Obviously the standard TrueNas functionality is a single standard setting that does not vary by day or time of day - hence my desire for a cron job. I have photovoltaics providing power during the day - so I want to keep the disks spinning during daylight and not add to wear and tear by spinning them down. But at night, after we have finished watching plex I would like them to spin down.

ngandrass commented 9 months ago

No - I want a script that immediately checks to see if the disks have been inactive for a specified time, spin them down if they have been, and then exit.

This is, unfortunately, not possible. The script uses iostat to actively monitor I/O. On the start of each poll interval, the script spawns an iostat instance that monitors disk I/O for POLL_TIME seconds. After this, the spindown script evaluates the I/O report and eventually spins down idle disks.

If the iostat loop is not running, no historical information about disk I/O is available. Hence, the decision of whether a disk has been idle for a specified amount of time cannot be made.

Then I can have different cron jobs for different periods with different timeouts.

If this is your only goal, you can already achieve this right now. Please refer to README.md - Using separate timeouts for different drives for more details :)

I should add that I am still unclear what the difference is between this original script and the standard TrueNas Scale spindown functionality in the Disks part of the UI

The difference is that this script should spin down drives regardless of S.M.A.R.T. reads. However, SCALE currently has an issue where the drive temperature readings keep the drives spinning regardless. See #14 for more details :)

I have photovoltaics providing power during the day - so I want to keep the disks spinning during daylight and not add to wear and tear by spinning them down. But at night, after we have finished watching plex I would like them to spin down.

Oh, I get what you are trying to achieve now ... Two possible solutions come to my mind:

1. Create a "one shot" mode as I suggested in my first reply. You would execute the script via a cronjob, it would wait for POLL_TIME seconds, then decide if disks are idle or not, and exit. You can set POLL_TIME to any positive value you like. However, I'd suggest not going below 300 seconds since this could result in false detections of the drive activity state (e.g., when caches buffer some disk writes while watching a movie).
2. Integrate an additional argument that allows specifying a time frame in which disks should be kept spinning regardless of their I/O activity. This could get a little fiddly, but is, in my opinion, the cleaner approach to this problem.

What do you think?

Sophist-UK commented 9 months ago

No - I want a script that immediately checks to see if the disks have been inactive for a specified time, spin them down if they have been, and then exit.

This is, unfortunately, not possible. The script uses iostat to actively monitor I/O. On the start of each poll interval, the script spawns an iostat instance that monitors disk I/O for POLL_TIME seconds. After this, the spindown script evaluates the I/O report and eventually spins down idle disks. If the iostat loop is not running, no historical information about disk I/O is available. Hence, the decision of whether a disk has been idle for a specified amount of time cannot be made.

Ah!!!! In that case, yes we need to have the above option. I wouldn't want to run the cron job any more frequently than the poll time anyway, so this sounds fine. But even if I did want e.g. a 10min idle time and absolutely no more than 15 mins idle time, it would be nice - but not essential - if I could run the cron job every 5 mins with a 10 min poll.

Then I can have different cron jobs for different periods with different timeouts.

If this is your only goal, you can already achieve this right now. Please refer to README.md - Using separate timeouts for different drives for more details :)

No, I only have one pool of spinning disks, and I want to have different timeouts for different times of day.

I should add that I am still unclear what the difference is between this original script and the standard TrueNas Scale spindown functionality in the Disks part of the UI

The difference is that this script should spin down drives regardless of S.M.A.R.T. reads. However, SCALE currently has an issue where the drive temperature readings keep the drives spinning regardless. See #14 for more details :)

I have not tested this myself - as per the last comment on #14, I previously tried it with -n never and although I have now set -n standby I have not yet tested this because I am transferring existing data to the NAS 24/7.

I have photovoltaics providing power during the day - so I want to keep the disks spinning during daylight and not add to wear and tear by spinning them down. But at night, after we have finished watching plex I would like them to spin down.

Oh, I get what you are trying to achieve now ... Two possible solutions come to my mind:

Create a "one shot" mode as I suggested in my first reply. You would execute the script via a cronjob, it would wait for POLL_TIME seconds, then decide if disks are idle or not, and exit. You can set POLL_TIME to any positive value you like. However, I'd suggest not going below 300 seconds since this could result in false detections of the drive activity state (e.g., when caches buffer some disk writes while watching a movie).

As above, this would be fine. We would probably have to confirm that the pre-amble commands to work out what disks to monitor don't spin up disks that are already spun down.

Integrate an additional argument that allows specifying a time frame in which disks should be kept spinning regardless of their I/O activity. This could get a little fiddly, but is, in my opinion, the cleaner approach to this problem.

What do you think?

I think that all you would need here would be a parameter to specify an end time and/or one to specify an elapsed period - rather than a time frame for keeping the disks spinning. So I run a cron job at 10:50pm which checks every 10 minutes for disks not having been used for 10 mins and spins them down (starting no earlier than 11pm).

But if the preamble commands don't spin up the disks, then I think this is much more fiddly than having a "one shot" script that you can cron to happen every 10 mins between 10:50pm and 8am.

(TBH it is just as easy in TrueNas to create a Cron job as it is to create a Post-Init script. So if the cron script has no side effects and if you wanted to have a single script and not have to maintain two, then a cron-based script would be preferable.)

ngandrass commented 9 months ago

Ok, so providing a "one-shot" mode seems to be our simplest and easiest to understand solution.

I'd integrate an -o option to achieve this exact behavior.

ngandrass commented 9 months ago

I integrated the one shot mode into the develop branch: https://github.com/ngandrass/truenas-spindown-timer/tree/develop

The README.md is updated accordingly. See One shot mode [-o] for more details. Please verify that this suits your needs :)

Sophist-UK commented 9 months ago

It may take me a day or two to try it. 😃 But thank you so much for doing this and so fast. 👍 👏 🥇

Sophist-UK commented 9 months ago

A quick review of the commits you made to the develop branch:

-l Do we need another switch so that if we use -l we can turn on stdout as well? I am not sure whether this would be useful or not. But I suspect that if you are using -d then you might want both stdout and logging output. Also, if we can detect whether we are running with some sort of console (SSH, TrueNas web UI shell) or in the background (Post-Init or Cron) I think -l should be default.
-l Logic is duplicated in the log_verbose() function - it can be simplified by calling log().
Change log needs a further update for -o.
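
The console detection mentioned in the first point could be approached with the shell's `-t` test, which reports whether a file descriptor is attached to a terminal. The following is only a minimal sketch: `has_console` is a hypothetical helper that does not exist in spindown_timer.sh, and treating "stdin and stdout are terminals" as "running interactively" is an assumption that may not hold for every way TrueNAS launches a task.

```bash
#!/usr/bin/env bash
# has_console is a hypothetical helper: it succeeds when both stdin and
# stdout are attached to a terminal (e.g. an SSH session) and fails when
# they are not (typically a cron job or a Post-Init task).
has_console() {
    [ -t 0 ] && [ -t 1 ]
}

if has_console; then
    echo "console detected: keep logging to stdout"
else
    echo "no console: behave as if -l had been given"
fi
```

Redirecting stdin from /dev/null, as in `has_console </dev/null`, is a quick way to see the non-interactive branch from a normal shell.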

Sophist-UK commented 9 months ago

@ngandrass Niels

I am testing this now. But it appears that I didn't understand how it works before - I was assuming that the o/s could tell you how long since the last i/o because for its own spindown functionality it tracks this data. But I see now that the script polls iostat for a period of (say) 10mins to measure the i/o and if it is zero it decrements the (say) 1hr timer and if not it resets it. If the timer gets to zero it spins down the drives. So after a rolling sequence of 6x 10min periods of no I/O it spins down the drives.

So for one-shot you have to have poll and timeout the same and you cannot have a rolling period.

This is not quite what I was expecting - it's my fault for not having understood the existing code before I asked, and so I gave you poor input and we do in fact need an alternative approach.

It seems to me that to get the functionality I have in mind, we probably need two separate scripts:

A script that runs permanently in the background and does (say) 300s polls and keeps track of the number of seconds since the last I/O - providing functionality that the o/s doesn't expose.
Another script that runs on cron and where you specify the spindown timeout and which checks the values of the first script against the target and does a spindown if required.

I am not sure whether this is actually achievable without writing data to a file - but this could be in /tmp which is an in-memory file system. However you would probably need to account for script 1 being part way through writing the file when script 2 executes - or you write to a new file and then delete the old file and rename the new one. You might also need to have some way of checking that script 1 hasn't crashed and the file is very stale by e.g. putting a time-stamp in it and checking that or checking the file creation/modification time.

Overall this doesn't seem that hard to achieve. However, I am really not expert enough in bash to do this myself, but if you don't have the time I can give it a bash [sic.].

A couple of very minor additional comments on the existing code:

The first line of iostat output is the stats since boot - but on my TrueNas scale system zpool iostat reads and writes since boot don't seem to change over time. Both iostat and zpool iostat can take a -y flag to omit the stats since boot - you will need to reduce the count to 1 and not tail to remove the first line as well if you use this.
Whilst you are in development mode, any chance that the -c option (or a new option) could also summarise the iostat output to give some idea of either how many I/Os were done in the x second period or alternatively how long it has been idle?

ngandrass commented 9 months ago

-l Do we need another switch so that if we use -l we can turn on stdout as well? I am not sure whether this would be useful or not. But I suspect that if you are using -d then you might want both stdout and logging output. Also, if we can detect whether we are running with some sort of console (SSH, TrueNas web UI shell) or in the background (Post-Init or Cron) I think -l should be default.

I guess the majority of people are fine with the script logging to stdout and stderr. The -l option is more or less a convenience method. If you require a more specific setup you could simply drop the -l flag and redirect stdout and stderr to your desired locations.

Example: ./spindown_timer.sh -v -m zpool -i datapool 2>/var/log/spindown_timer.error.log 1>/var/log/spindown_timer.log

In theory we could try to detect the invoking process but that is beyond scope IMHO.

-l Logic is duplicated in the log_verbose() function - it can be simplified by calling log().

That's a good suggestion. I've initially decided against it for the one liner function but it makes sense now. I've pushed an update to the develop branch. Thanks :)

Change log needs a further update for -o.

Could you make that more precise? The CHANGELOG mentions the addition of the -o option.

The script polls iostat for a period of (say) 10mins to measure the i/o and if it is zero it decrements the (say) 1hr timer and if not it resets it. If the timer gets to zero it spins down the drives. So after a rolling sequence of 6x 10min periods of no I/O it spins down the drives. So for one-shot you have to have poll and timeout the same and you cannot have a rolling period.

If you run the script in one shot mode (-o) it ignores the TIMEOUT value and only looks at POLL_TIME. This behavior is documented in the README, the help text -h, and an additional notice is displayed in the console every time the script is invoked with -o.

I just pushed another commit to develop to replace the TIMEOUT value in the logs when using one shot mode to prevent further confusion :)

I was assuming that the o/s could tell you how long since the last i/o because for its own spindown functionality it tracks this data. [...] It seems to me that to get the functionality I have in mind, we probably need two separate scripts.

If I understand you correctly, your use case is: After a specific time in the evening, run a cron job every 10 minutes that executes the script and spins down idle drives.

This behavior is possible with the one shot mode as it is currently implemented. You create a cron job with a schedule like */10 00-04,22-23 * * * and the following command: ./spindown_timer.sh -o -p 540 -l -m zpool -i <YOURPOOL>. It would result in the following chain of action:

22:00 -> Execute spindown_timer.sh
1. Monitor I/O for 540 seconds (9 min)
2. Detect drive active/idle state
3. Drives active -> Do nothing
22:10 -> Execute spindown_timer.sh
1. Monitor I/O for 540 seconds (9 min)
2. Detect drive active/idle state
3. Drives active -> Do nothing
22:20 -> Execute spindown_timer.sh
1. Monitor I/O for 540 seconds (9 min)
2. Detect drive active/idle state
3. Drives idle -> Spindown
22:30 -> Execute spindown_timer.sh
1. Monitor I/O for 540 seconds (9 min)
2. Detect drive active/idle state
3. No drives spinning -> Do nothing
4. ...

Both iostat and zpool iostat can take a -y flag to omit the stats since boot - you will need to reduce the count to 1 and not tail to remove the first line as well if you use this.

iostat does not support the -y flag, as per iostat(8). Though zpool iostat does support it, we cut the output from the bottom there anyways.

Whilst you are in development mode, any chance that the -c option (or a new option) could also summarise the iostat output to give some idea of either how many I/Os were done in the x second period or alternatively how long it has been idle?

Sadly, this is not easily possible. Since the (zpool) iostat call is wrapped inside a bash function, stdout is used for the function's return value. So if I tried to log something to it, it would end up in the function's return value instead of on the console. I could instead log this to stderr, but I'd like to keep that for real errors. It would work with syslog or with something like a debug mode that writes raw data to a separate file. But you could just run your own iostat separately or simply use the I/O graphs provided by the TrueNAS GUI instead 🎉
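
The stdout limitation described above can be reproduced in a few lines. This is an illustrative sketch only: `get_io_count` and `get_io_count_stderr` are hypothetical stand-ins for a function that wraps iostat, and the value 42 is made up.

```bash
#!/usr/bin/env bash
# Hypothetical stand-in for a function whose result is read via $(...).
get_io_count() {
    echo "debug: polling done"       # written to stdout, so it is captured too
    echo "42"
}

# Same function, but the log line goes to stderr instead.
get_io_count_stderr() {
    echo "debug: polling done" >&2   # stderr is not captured by $(...)
    echo "42"
}

polluted=$(get_io_count)
clean=$(get_io_count_stderr 2>/dev/null)

printf 'polluted result: %s\n' "$polluted"
printf 'clean result:    %s\n' "$clean"
```

Here the log line ends up inside `polluted`, while `clean` holds only the number. This only shows the mechanism; it does not argue for changing what stderr is used for in the script.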

Sophist-UK commented 9 months ago

No - my use case is to vary the idle time before spindown by day of week and time of day. Suppose that on a Friday evening I want an idle time of 1 hour and I run the cron job exactly on the hour. What I don't want is to run iostat for 1 hour where it catches the last I/O at 1 second past the hour, and then run the cron job again 1 hour later to see that the disks are now idle, and so get a spindown 1 hour and 59 minutes after the last I/O. In other words, instead of a spindown after c. 60 minutes idle, this will give me a spindown after somewhere between 60 and 120 minutes.

What I really want is the same rolling 5 minute iostat, with the spindown happening when the accumulated idle time reaches 60 minutes, i.e. when we have had 12 consecutive 5 minute idle periods, which would mean between 60 and 65 minutes of actual idle time (or a 1 minute iostat with 60 consecutive 1 minute idle periods, for a spindown between 60 and 61 minutes idle).
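
The arithmetic behind these figures can be written down as a small sketch. Both helpers are hypothetical and only restate the reasoning above: all values are in minutes, the timeout is assumed to be a multiple of the poll time, and the upper bound is approached but never quite reached.

```bash
#!/usr/bin/env bash
# rolling_bounds POLL TIMEOUT -> "MIN MAX" idle minutes before spindown
# when TIMEOUT/POLL consecutive idle polls are required.
rolling_bounds() {
    local poll=$1 timeout=$2
    echo "$timeout $(( timeout + poll ))"
}

# one_shot_bounds POLL -> "MIN MAX" idle minutes before spindown when a
# single poll of POLL minutes is started every POLL minutes.
one_shot_bounds() {
    local poll=$1
    echo "$poll $(( 2 * poll ))"
}

echo "one shot, 60 min poll:              $(one_shot_bounds 60)"
echo "rolling, 5 min poll, 60 min timer:  $(rolling_bounds 5 60)"
echo "rolling, 1 min poll, 60 min timer:  $(rolling_bounds 1 60)"
```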

My suggestion is as follows:

1. We run one script which loops and runs iostat/zpool iostat for (say) 5 minutes, and then logs to a file in /tmp the current idle time (for every spinning drive on the system or for a specific set of drives or pools): either 0 if there was an I/O in the last 5 minutes, or the accumulated idle time if not. To avoid the second script reading a partially written file, it would write this to b.txt, and then delete a.txt and rename b.txt to a.txt. A timestamp would also be written so that the second script can confirm that the file isn't stale. This would be a cut-down version of the current script, essentially without the action functionality but with file-writing functionality - or it could be provided in the existing script with a new flag.
2. The second script would also be a cut-down version of the existing script, run as a cron job. It reads the file in /tmp (waiting a second if it doesn't exist), checks that the timestamp is sufficiently recent, then checks the idle times against the specified timeout to see whether the drives should be spun down, and exits. IMO this script would best be a separate file, but it could again be done with a switch.
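
A minimal sketch of how the two halves of this proposal might exchange data is shown below. Nothing in it comes from the repository: the file name, the one-line "timestamp idle-seconds" format, and the functions `write_state` and `check_state` are all assumptions made for illustration, and it tracks a single idle value rather than one per drive. It also swaps the delete-then-rename step for a single `mv` over the old file, because a rename within one filesystem replaces the target atomically.

```bash
#!/usr/bin/env bash
# Hypothetical state file holding one line: "<unix timestamp> <idle seconds>".
STATE_FILE="${STATE_FILE:-/tmp/spindown_idle.state}"

# Collector side: publish the accumulated idle time.
# write_state IDLE_SECONDS [NOW]
write_state() {
    local idle_seconds=$1 now=${2:-$(date +%s)}
    local tmp="${STATE_FILE}.tmp.$$"
    printf '%s %s\n' "$now" "$idle_seconds" > "$tmp"
    mv -f "$tmp" "$STATE_FILE"   # readers see the old or the new file, never half of one
}

# Cron side: prints "spindown", "keep" or "stale".
# check_state TIMEOUT MAX_AGE [NOW]   (all values in seconds)
check_state() {
    local timeout=$1 max_age=$2 now=${3:-$(date +%s)}
    local stamp idle
    if [ ! -r "$STATE_FILE" ]; then
        echo "stale"
        return
    fi
    read -r stamp idle < "$STATE_FILE"
    if [ -z "$stamp" ] || [ -z "$idle" ]; then
        echo "stale"             # empty or malformed file
    elif [ $(( now - stamp )) -gt "$max_age" ]; then
        echo "stale"             # collector has probably stopped updating
    elif [ "$idle" -ge "$timeout" ]; then
        echo "spindown"
    else
        echo "keep"
    fi
}
```

A cron job could then call something like `check_state 3600 600` and only issue the spindown when the answer is `spindown`, treating `stale` as a reason to do nothing.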

I am sorry to have put you to the wasted effort you have already put in by my previous lack of understanding.

P.S. -c enhancement - I am not actually asking for the IOSTAT output (which you could read and echo if I was asking for this), but rather an echo/syslog either of the i/os read into a variable or of the accumulated idle time (or remaining countdown time).

P.P.S. What happens now with the current production script if it is looping and polling for 10 minutes, and we get 5 consecutive idle iostats, and then in the very short time whilst it is processing the output of the 5th and before it starts the 6th there is a single solitary I/O? I presume it will spin down when it shouldn't. (That said, this would hardly be the end of the world.)

ngandrass commented 9 months ago

Well ... This is getting a lot more complicated than I expected 😅 Could you provide a (partial) example schedule for a week that indicates which idle time is desired on which day and time?

For example:

[Mo. 00:00 - 07:59]: Spindown after 60 idle minutes
[Mo. 08:00 - 21:59]: No spindown
[Mo. 22:00 - 00:00]: Spindown after 60 idle minutes
[...]
[Sa. 23:00 - 23:59]: Spindown after 180 idle minutes
[Su. 00:00 - 06:00]: Spindown after 120 idle minutes
[...]

Just so that I can get the full picture of variability you require.

My suggestion is as follows:

Creating and maintaining three different spindown scripts seems to be a little bit out of scope for my taste. It would also make using and configuring the script much more complicated for people who are not so tech-savvy.

Another idea that came to my mind: You could create a wrapper script that runs/stops the spindown timer script, based on the current date and time, with the desired timeouts. This would allow us to leave the existing spindown timer as is and act as a sort of add-on.

This should work without a problem if you alternate between spindown and no spindown periods. However, if you want to switch between a short and a long timeout, the idle timers will begin counting from the full timeout length, whenever you re-create a new spindown timer instance, effectively ignoring if the drive was already idle during the previous short timeout period. Whether this solves your problem, depends on your requirements on the schedule.

P.S. -c enhancement - I am not actually asking for the IOSTAT output (which you could read and echo if I was asking for this), but rather an echo/syslog either of the i/os read into a variable or of the accumulated idle time (or remaining countdown time).

I think I don't quite understand your requirement here. Just to clarify:

I cannot log individual I/Os, since I only get a summary of the I/Os that happened during one poll interval, after POLL_TIME (-p) seconds.
The current idle timers of each drive are logged in verbose mode (-v). They are initialized with the value of TIMEOUT (-t) and are decremented by POLL_TIME (-p) after every poll interval in which a drive experienced no I/O. If, instead, I/O was performed on a particular drive, its timer is reset to TIMEOUT, which makes it possible to tell active drives apart from idle ones.
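
The countdown described in the second point can be sketched as follows. This is not the code from spindown_timer.sh: `update_timer` and `simulate` are hypothetical helpers, and the values of TIMEOUT and POLL_TIME are example settings.

```bash
#!/usr/bin/env bash
TIMEOUT=3600    # -t: seconds a drive must stay idle before spindown (example value)
POLL_TIME=600   # -p: length of one poll interval in seconds (example value)

# update_timer REMAINING HAD_IO
# Prints the new remaining time for one drive after one poll interval.
update_timer() {
    local remaining=$1 had_io=$2
    if [ "$had_io" -eq 1 ]; then
        echo "$TIMEOUT"                       # I/O seen: reset to the full timeout
    else
        local next=$(( remaining - POLL_TIME ))
        if [ "$next" -lt 0 ]; then
            next=0
        fi
        echo "$next"                          # idle: count down, never below zero
    fi
}

# simulate HAD_IO...
# Replays a sequence of poll results (1 = I/O seen, 0 = idle) for one drive
# and prints the remaining time afterwards; 0 means "ready for spindown".
simulate() {
    local remaining=$TIMEOUT had_io
    for had_io in "$@"; do
        remaining=$(update_timer "$remaining" "$had_io")
    done
    echo "$remaining"
}

echo "six idle polls:           $(simulate 0 0 0 0 0 0)"
echo "I/O during the third poll: $(simulate 0 0 1 0)"
```

With these example values, six idle polls in a row bring the timer to zero, while a single I/O during any poll sends it back to 3600.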

I could theoretically also output the names of all drives that were detected as idle or active after every poll interval. But before I think too far... What is your use case for this feature? Which information are you missing?

P.P.S. What happens now with the current production script if it is looping and polling for 10 minutes, and we get 5 consecutive idle iostats, and then in the very short time whilst it is processing the output of the 5th and before it starts the 6th there is a single solitary I/O? I presume it will spin down when it shouldn't. (That said, this would hardly be the end of the world.)

I/Os in between polling intervals will be missed, yes. This is because we do not use multiple threads, and hence have to restart iostat in between runs and wait for the script logic to finish its computation. I measured the time it takes to process the data between two consecutive poll intervals. The results show that on my rather old machine (Intel(R) Xeon(R) CPU E3-1220L 🤣), it takes no longer than 5 ms to spin up the next iostat call. Since most I/O should require more time, this should be fine in my opinion :)

Sophist-UK commented 9 months ago

My own use case might be e.g.:

M-F 10am-4pm - No timeout
Sat/Sun 10am-10pm - No timeout
M-F 4pm-10pm - 2 hour timeout
All week - 10pm-10am - 1 hour timeout
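
For a wrapper script or a set of cron jobs, the schedule above could be turned into a lookup like the following sketch. `timeout_for` is a hypothetical helper, not part of the spindown timer, and it assumes that the Sat/Sun entry means 10am to 10pm.

```bash
#!/usr/bin/env bash
# timeout_for DAY HOUR
# DAY is 1..7 with Monday = 1 (as printed by `date +%u`), HOUR is 0..23.
# Prints the idle timeout in seconds, or "none" when drives should keep spinning.
timeout_for() {
    local day=$1
    local hour=$(( 10#$2 ))       # force base 10 so "08" and "09" are accepted
    if [ "$hour" -ge 22 ] || [ "$hour" -lt 10 ]; then
        echo 3600                 # all week, 10pm-10am: 1 hour
    elif [ "$day" -ge 6 ]; then
        echo none                 # Sat/Sun, 10am-10pm: no timeout
    elif [ "$hour" -lt 16 ]; then
        echo none                 # Mon-Fri, 10am-4pm: no timeout
    else
        echo 7200                 # Mon-Fri, 4pm-10pm: 2 hours
    fi
}

# Usage: look up the timeout that applies right now.
timeout_for "$(date +%u)" "$(date +%H)"
```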

So I would run cron jobs at the start times for each of these to kick off the timeout, with the script ending by itself at the end time. Would this be handled by the current script?

It seems to me that the only issue with the current script is that the rolling window starts afresh each time you start the script - and my suggestion of a two script approach is intended to separate out the data collection using a rolling window from the code that decides whether the idle time has been exceeded, so that data collection runs continuously and maintains the current idletime without having to restart the window.

The vast majority of the two script solution would use existing code. It is not a complete rearchitecture/rewrite from scratch.

For the detailed questions:

I am not asking for individual I/Os to be logged. Only the summary of I/Os provided to the script by iostat.
I would like -v to output the number of I/Os in the poll-time interval if there were any, and the accumulated idle time if there were none.
It is not clear whether iostat will include I/Os in progress at the start and end of the interval. But as I said, if an I/O happens in the time gap between iostats and is not captured, this is "not the end of the world" (by which I mean it really, really, really isn't worth the effort to attempt to close this gap using asynchronously forked sub-process iostats).