openzfs / zfs

OpenZFS on Linux and FreeBSD
https://openzfs.github.io/openzfs-docs

Cold storage using ZFS import/export #11725

Open SyBernot opened 3 years ago

SyBernot commented 3 years ago

Describe the feature you would like to see added to OpenZFS

A service that works with nfs/samba and client-side automount to export a pool when it is not actively being used and import it when it is needed.

I'm just getting my head around this now, but I can monitor showmount and smbstatus to see which shares are actively being used pretty reliably (there can be some issues with rmtab having stale info). I have set up automount on the client side to mount a share when a user needs it and unmount it after a period of inactivity. What I'd like is: when no shares from a pool are mounted by any client, trigger an event that exports the pool so the disks can spin down; conversely, when a client requests a mount, import the pool. I'd also like some mechanism to periodically import and scrub a pool after some period of inactivity, after which the pool could be exported and spun back down. There might be a better way to accomplish this, but right now this is my best idea.
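Something like this rough sketch, run periodically from cron, is what I'm imagining for the export half (pool name and share paths are placeholders, and the stale-rmtab caveat above still applies):

#!/bin/sh
# Sketch: export "datapool" when no NFS or SMB client appears to be using it.
POOL=datapool

nfs_busy() { showmount -a 2>/dev/null | grep -q "/$POOL"; }
smb_busy() { smbstatus -S 2>/dev/null | grep -q "$POOL"; }

if ! nfs_busy && ! smb_busy && zpool list "$POOL" >/dev/null 2>&1; then
    zpool export "$POOL"    # lets the member disks spin down
fi

The import half would hang off the client's automount request, e.g. an autofs program map (or similar hook) that runs zpool import before handing back the mount.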

How will this feature improve OpenZFS?

We are a team of two and manage about a PB of data. That data is used by probably 20 people, and they are likely to only be looking at or processing a few tens of TB on any given day. ZFS is great: so far, in 13 years, not a single file has been lost to drive failure or bitrot. That being said, drives have a finite lifespan, and when half of your drives have been around for a decade, MTBF starts to really dig into your pocketbook and workday. This is an effort to extend that life in a measurable way and to maintain consistent data for much longer periods at a much lower cost, with the biggest downside being initial on-demand access time. There is also the power required to spin the disks, and the heat generated because of it; cooler disks are happier disks. I'm actually expecting to see days to weeks between accesses for some of our data, and over time that could add years to a disk's potential runtime as well as save a bunch of energy.

Additional context

SyBernot commented 3 years ago

Also, as an aside, this same mechanism could be used for an archive server: import a pool, receive a ZFS snapshot, then export the pool and spin it back down.
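Roughly like this (pool, dataset, and host names are placeholders):

zpool import archive                # spin the cold pool up and import it
ssh source-host 'zfs send -I tank/data@last tank/data@today' | zfs receive -F archive/data
zpool export archive                # export so the disks can spin back down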

tonyhutter commented 3 years ago

I assume you've already investigated and ruled out traditional HDD spindown? (hdparm -S)
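For example (per hdparm's man page, -S values 1-240 are multiples of 5 seconds and 241-251 are multiples of 30 minutes):

hdparm -S 120 /dev/sdX    # 120 * 5 s = 10-minute standby timeout
hdparm -S 241 /dev/sdX    # (241 - 240) * 30 min = 30-minute standby timeout
hdparm -C /dev/sdX        # query the current power state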

SyBernot commented 3 years ago

I hadn't considered that, because ZFS initiates a transaction group for metadata every 5 seconds and then syncs it, which kinda seems like it would make spindown pointless.

rincebrain commented 3 years ago

If your pool is otherwise idle and you don't have multihost (MMP) protection on, I don't expect it to be arbitrarily writing to the pool every 5s, and watching iostat for 60s on my idle local system seems to agree with me.

So I think you could probably see some use from hdparm -S, assuming those conditions are true.
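To reproduce that check on your side, something along these lines should do (pool and device names are placeholders):

zpool iostat datapool 1 60      # pool-level view: should stay at zero on an idle pool
iostat -d /dev/sdc 1 60         # block-layer view of one member disk over the same minute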

mlow commented 3 years ago

I've been using hdparm -S on the drives of one of my pools for a couple of years; it works as expected.

SyBernot commented 3 years ago

I'm not having much luck with this. Here's what I have tried so far. On a fresh pool with no mounts or data to speak of, I set hdparm -S241. Checking with hdparm -C, I get "drive state is: active/idle"; I wait 30 minutes to an hour and the state has not changed. hdparm -B gives me "APM_level = off", so I tried setting hdparm -B127, waited some more, and still no change. If I force spindown with hdparm -y, it does in fact spin down (confirmed with smartctl and hdparm), but within a few seconds it spins back up again, with ZFS logging an error to dmesg. All the while there is zero activity in iostat.
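To be explicit, roughly the sequence I ran (with the real device name substituted):

hdparm -S 241 /dev/sdX    # request a (241 - 240) * 30 min = 30-minute standby timeout
hdparm -C /dev/sdX        # -> drive state is: active/idle (never changes)
hdparm -B /dev/sdX        # -> APM_level = off
hdparm -B 127 /dev/sdX    # enable APM at a level that still allows spindown
hdparm -y /dev/sdX        # force standby; the drive spins back up within seconds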

rincebrain commented 3 years ago

ZFS logging an error suggests that ZFS tried to do something with the drive and it didn't respond as expected (possibly "in time"). Do you have MMP on? What error is it logging?

SyBernot commented 3 years ago

NAME      PROPERTY   VALUE  SOURCE
datapool  multihost  off    default

Generally the error takes the form of this:

[1729665.909190] sd 0:0:0:0: [sdc] tag#0 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=9s
[1729665.912186] sd 0:0:0:0: [sdc] tag#0 CDB: Write(10) 2a 00 01 80 28 b6 00 00 01 00
[1729665.915119] blk_update_request: I/O error, dev sdc, sector 25176246
[1729665.918022] zio pool=datapool vdev=/dev/disk/by-vdev/voon-disk-01-part1 error=5 type=2 offset=12889189376 size=512 flags=180880
[1729665.923797] sd 0:0:0:0: [sdc] tag#1 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE cmd_age=0s
[1729665.926696] sd 0:0:0:0: [sdc] tag#1 Sense Key : Not Ready [current]
[1729665.929550] sd 0:0:0:0: [sdc] tag#1 Add. Sense: Logical unit not ready, initializing command required
[1729665.932402] sd 0:0:0:0: [sdc] tag#1 CDB: Write(10) 2a 00 01 40 28 ab 00 00 01 00
[1729665.935205] blk_update_request: I/O error, dev sdc, sector 20981931
[1729665.937981] zio pool=datapool vdev=/dev/disk/by-vdev/voon-disk-01-part1 error=5 type=2 offset=10741700096 size=512 flags=180880

...but you only get that if you force the spindown; otherwise the drives just never spin down. I believe this is due to the transaction group sync, and it's why I believe the only way to ensure the disks spin down and stay spun down is to export the pool until data from it is requested.

rincebrain commented 3 years ago

I tried hdparm -y on a disk in my pool and as of 1 minute later, dmesg hasn't logged an error and zpool status hasn't noticed.

Manually issuing find /pool does prompt it to kick out a single read error on the drive, but the interesting point, to me, is that it hadn't automatically noticed, which suggests something is indeed issuing IOs to your pool and keeping the disks from idling.

I also tried hdparm -S1 [disk]; sleep 30; hdparm -C [disk]; sleep 30; hdparm -C [disk] and got back "drive state is: standby" both times. So this really appears to work for me.

Does zpool iostat [pool] 1 show pool IO periodically? It'd probably be worth determining where that IO is coming from, because, as others in this thread have also reported, ZFS is not expected to go touch the disks every few seconds without cause.

SyBernot commented 3 years ago

It's all zeros after that first row. There are some services running that I could kill off, and I'll double-check the crons to see if I missed something. One likely culprit is irods, but I have my doubts, as I've not added this pool as a resource yet. The weird thing is that I never see any reads or writes on the pool.

rincebrain commented 3 years ago

If zpool iostat -v [pool] 1 doesn't report any IO to the disks for a while, it's almost certainly the case that ZFS doesn't think it's doing any IO.

Does iostat [list of disks in the pool] 1 (e.g. iostat /dev/sda /dev/sdb [...] 1) (note: not zpool iostat 1) think there's any IO being done to the drives during the interval? Because if zpool iostat and iostat both think there's no IO being done by the OS to the disks, it's probably the case that the OS is doing no IO to the drives, and something else is keeping them awake. (And if it does see IO to the drives, then it's probably not from ZFS, and it'll be interesting to debug where it's coming from.)

SyBernot commented 3 years ago

Previously I was using iostat and seeing no activity. I've killed off just about every service I could think of that does monitoring or reporting; it may have been monitorix, which I forgot I had in the default profile. I'll give it some time and see if the drives go to sleep now. If that turns out to be it, this will be a very complex solution, as I'll need to either dump my monitoring or find a way to stop it when the drives go to sleep.

rincebrain commented 3 years ago

Well, if iostat doesn't see anything, then it's probably not the OS (unless some software is doing really exciting queries).

You could use hdparm -S1 [drive] to convince it to go to sleep after only 5 seconds of (its idea of) inactivity.

I also just checked, and smartctl -i does not wake a drive out of that level of sleep, but smartctl -A (and implicitly smartctl -a) does. So if you have anything regularly probing SMART attribute counters, that could do it.
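For anything that must keep polling SMART, smartctl's standby check might help; untested here, device name is a placeholder:

smartctl -i /dev/sdX               # identity info only; did not wake the drive in my test
smartctl -A /dev/sdX               # attribute read; this one does wake the drive
smartctl -n standby -A /dev/sdX    # -n/--nocheck=standby should skip the read if the drive is asleep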

I'm not aware of a good way to monitor for non-{read,write} IOs to disks. There's probably a way to hack up a BPF program to do it...
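For what it's worth, a rough, untested bpftrace sketch along those lines, assuming your kernel exposes the scsi:scsi_dispatch_cmd_start tracepoint; it prints any SCSI command other than the common reads/writes:

bpftrace -e 'tracepoint:scsi:scsi_dispatch_cmd_start
/args->opcode != 0x28 && args->opcode != 0x2a && args->opcode != 0x88 && args->opcode != 0x8a/
{ printf("host%d opcode 0x%x\n", args->host_no, args->opcode); }'
# excluded opcodes: 0x28/0x2a = READ(10)/WRITE(10), 0x88/0x8a = READ(16)/WRITE(16)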