artee666 opened 5 years ago
Did you wait 2 hours after starting?
Did you try to pause scrubbing before reboot? In theory that should make it continue from the point where it paused.
Never tried that; please report if it works.
-p
Pause scrubbing. Scrub pause state and progress are periodically synced to disk. If the system is restarted or pool is exported during a paused scrub, even after import, scrub will remain paused until it is resumed. Once resumed the scrub will pick up from the place where it was last checkpointed to disk. To resume a paused scrub issue zpool scrub again.
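For clarity, the pause/resume workflow that text describes looks roughly like this (a sketch; the pool name is a placeholder):

```sh
# pause a running scrub; pause state and progress are meant to be synced to disk
zpool scrub -p tank
# check the reported state and progress
zpool status tank
# resume the paused scrub; it should pick up from the last on-disk checkpoint
zpool scrub tank
```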
I've tried pausing the scrub (at approx. 2.54%), rebooting, and after resuming the scrub it started from scratch :(
I know that on Debian and ZFS 0.7.9 the scrub continued after reboot without any need to manually pause and resume the scrub (which is not working for me anyway).
I can confirm it resumes correctly with 0.7.13.
@devZer0 I confirm that even when pausing a scrub, a reboot causes the scrub to restart. This applies to ZFS 0.8.x when using the new sequential scrub only (legacy scrub works as expected).
@artee666 Did you try to wait zfs_scan_checkpoint_intval seconds (7200 by default) before rebooting?
@shodanshok I think I have not... Will try to set this parameter to 10 minutes and see. Will report tomorrow...
@shodanshok So I've set the zfs_scan_checkpoint_intval to 600 seconds, I've waited 11 minutes, rebooted and the scrub started all over again.
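For anyone else trying this, a sketch of how the tunable can be changed (assuming ZFS on Linux with the usual sysfs and modprobe.d paths):

```sh
# runtime change, value in seconds
echo 600 > /sys/module/zfs/parameters/zfs_scan_checkpoint_intval
# verify the current value
cat /sys/module/zfs/parameters/zfs_scan_checkpoint_intval
# make it persistent across reboots (file name is just an example)
echo "options zfs zfs_scan_checkpoint_intval=600" >> /etc/modprobe.d/zfs.conf
```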
It would also be nice to write such a checkpoint when properly rebooting or shutting down the computer (on ZFS module unload?), because in the worst case 7199 seconds of scrub progress could be lost.
Can you test with a 300-second interval, waiting 20 minutes?
@scineram I've tried it and the scrub was started from 0 after reboot.
I've set the param zfs_scan_legacy to 1 and this kinda solves this issue for me.
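For reference, a sketch of how that workaround is applied (same assumptions about the Linux module parameter paths; it affects scans started after the change):

```sh
# fall back to the legacy (unsorted) scrub/resilver code
echo 1 > /sys/module/zfs/parameters/zfs_scan_legacy
# persist the setting across reboots (example modprobe.d file name)
echo "options zfs zfs_scan_legacy=1" >> /etc/modprobe.d/zfs.conf
```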
Just to make sure everyone understands the gravity of this: it affects not only scrubs but also resilvers, see #9646
ZFS 0.8.2, kernel 5.2.7-arch1-1-ARCH: same problem on sequential resilver.
I can absolutely understand why the new sequential scrub and resilver behavior may be confusing. The code is working as intended. However, unlike the legacy scrub, the sequential scrub design necessitates a tradeoff between maximizing performance and the frequency of on-disk checkpoints.
The default settings lean towards the performance end of the spectrum, which means checkpoints are relatively infrequent (about every 2 hours). This behavior is desirable for large HDD-based pools which are rarely exported. It's less ideal for pools which are frequently imported/exported since a new checkpoint is not written when a pool is exported (or a scrub is paused). At a minimum we should update the scrub section of the zpool man page to better explain this.
As mentioned above setting zfs_scan_checkpoint_intval to write checkpoints more frequently may help. Though be aware this isn't a hard limit and depending on your exact pool layout and hardware it may still take significantly longer than this between checkpoints.
The heart of the issue is that for sequential scrub / resilver to write a checkpoint it must first drain the in memory scan queues it has built up. To do this, IO needs to be issued for everything in the queue; depending on the size of the queue and speed of the scrub this can take a considerable amount of time (many minutes). This time is in addition to the requested zfs_scan_checkpoint_intval, which is why it's about 2 hours.
It's for this reason that the scan queues are discarded when running zpool export instead of drained. The last on-disk checkpoint is then used for import, which is why you can see the overall progress regress or reset. Pausing a scrub will also not result in the queues being drained, though that functionality could be added with a little development work.
Revisiting the default settings may also be worthwhile. It wouldn't be unreasonable for the scrub to broadly take into consideration your pool geometry and hardware when sizing the memory queues and checkpoint frequency. For example, for an all-SSD pool where maximizing sequential access isn't as important, a smaller memory footprint for the scan queues and more frequent checkpoints would make sense.
@behlendorf thanks for sharing this information. However, here the user set zfs_scan_checkpoint_intval to 5 mins and waited for 20 mins, and still the scrub restarted after a reboot.
Is it the expected behavior?
Thanks.
I can see an issue with the current design: I have a pool of about 1PB, where one dataset has 377T. If a scan for scrub or resilver is in progress, it stalls other write operations to the dataset with the big data. For example: one node was offline for one day and was put back, a resilver scan was started, and it blocked writes to the dataset for about 4h.
The confusion is that pause is about stopping I/O, not about draining the queues and committing a checkpoint. The problem with the latter is that it can take a long time, basically depending on average blocksize with the given memory limits.
For example, a few days ago I scrubbed my FreeNAS box with 794 GiB allocated (they ported the same code with the same default 2h zfs_scan_checkpoint_intval). Simple mirrored vdev, mostly multimedia, so 1 MiB blocks. It took 96 minutes; however, the scanned amount reached the allocated size at just over 30 minutes, when only about 300 GiB had been issued. So basically the last hour was spent draining the queue, since the pool was otherwise idle. Even if zfs_scan_checkpoint_intval had been set to anything over 35 minutes, there would have been no scrub checkpoint.
That's exactly right. I believe part of the confusion here is that we don't currently provide any administrative interface to report on when the last scrub checkpoint was taken. Nor is there a nice way to bias the scrub towards more frequent checkpoints and away from maximizing performance. These are both areas where the UI could be improved.
See also https://github.com/openzfs/zfs/issues/9646#issuecomment-652854973
This bites hard on USB drives which fail to resilver every time, and so makes USB mirrors unusable in this scenario:
ZFS on USB (external) drives user here.
Laptop out of space, so I'm making use of two external USB drives: the first was created for backups and offloading the internal drive, and the second I attempted to add later as a mirror of the first, but so far I have been unable to do so!
First USB adapter is USB3 and seems to work ok. Second is really cheap, also USB3, but fails consistently after ~580MB (failed at about the same point, tested twice now) as follows (syslog):
Jul 02 17:10:35 eye kernel: usb 1-1.2: reset high-speed USB device number 69 using ehci-pci
Jul 02 17:10:45 eye kernel: usb 1-1.2: device not accepting address 69, error -110
Jul 02 17:10:45 eye kernel: usb 1-1.2: reset high-speed USB device number 69 using ehci-pci
Jul 02 17:10:56 eye kernel: usb 1-1.2: device not accepting address 69, error -110
Jul 02 17:10:56 eye kernel: usb 1-1.2: reset high-speed USB device number 69 using ehci-pci
Jul 02 17:10:58 eye kernel: sd 6:0:0:0: [sdb] tag#0 FAILED Result: hostbyte=DID_TIME_OUT driverbyte=DRIVER_OK cmd_age=35s
Jul 02 17:10:58 eye kernel: sd 6:0:0:0: [sdb] tag#0 CDB: Write(10) 2a 00 ce a3 2c 88 00 00 80 00
Jul 02 17:10:58 eye kernel: blk_update_request: I/O error, dev sdb, sector 3466800264 op 0x1:(WRITE) flags 0x700 phys_seg 1 prio class 0
Jul 02 17:10:58 eye kernel: zio pool=zb2t01 vdev=/dev/disk/by-id/usb-WDC_WD20_SPZX-22UA7T0_0000000000016742-0:0-part1 error=5 type=2 offset=1774999703552 size=1048576 flags=40080caa
So at least the first drive still works, but every reconnect of the second mirror drive restarts a full resilver, which always fails - after nearly 7 hours of resilver!
Also wrote my steps as part of ZFS tutorials here: https://github.com/zenaan/quick-fixes-ftfw/blob/master/zfs/zfs.md#user-content-step-5b---clear-resilvering-errors
This "restart resilver on every interruption" ZFS bug, may be considered a sort of ultra conservative, ultra paranoid thing.
But the fact is that the ultra paranoid can run a scrub after an interrupted resilver later finishes, and ZFS could anyway (possibly) auto schedule a scrub if the resilver has been interrupted at all...
The irony of this "ultra paranoid" ZFS behaviour is that it is not a usable filesystem on USB drives which might be connected through cheap USB adaptors.
And the sweet irony of reverting ZFS to its previous (slightly less paranoid) behaviour, is that it will instantly become the ONLY safe filesystem to use in such situations.
So +1 for reverting this behaviour and allowing interrupted resilvers to continue when the drive is reconnected.
@zenaan Slightly off-topic, but you could try to use the usb_storage driver instead of the UAS driver for the failing disk. I had a USB3 disk failing in a similar fashion and blacklisting the UAS driver fixed it for me. You can do this by adding usb_storage.quirks=xxxx:yyyy:u to your kernel options or a corresponding /etc/modprobe.d entry. xxxx:yyyy is the vendor and product ID of the disk. You can get them with lsusb.
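A sketch of what that looks like in practice (the xxxx:yyyy ID is a placeholder to be replaced with the values reported by lsusb):

```sh
# find the vendor:product ID of the failing enclosure
lsusb
# option 1: add a kernel command line parameter
#   usb_storage.quirks=xxxx:yyyy:u
# option 2: modprobe configuration (example file name)
echo "options usb-storage quirks=xxxx:yyyy:u" > /etc/modprobe.d/disable-uas.conf
# then reboot (or re-plug the device) so the quirk takes effect
```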
I think there is something wrong with the sequential resilver, even if @behlendorf says it is expected. The zfs_scan_checkpoint_intval should indicate the time between checkpoints, with the scan queue fully drained at each one.
I set zfs_scan_checkpoint_intval to 30. I then added a disk to a single-disk vdev to make a mirror. What I noticed was that the 900GB was scanned almost immediately, in a few seconds. So from what I understood, ZFS was trying to process these 900GB before writing a checkpoint, even though it would have needed more than 30 seconds to do so.
Correct me if I'm wrong.
From the wiki I see that there is a parameter called zfs_scan_mem_lim_fact that should limit how much metadata ZFS holds in memory, forcing it to flush these blocks to disk before scanning new blocks.
Correct me if I'm wrong.
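For anyone wanting to check what their system is currently using, the scan-related tunables can be inspected like this (assuming the Linux sysfs parameter interface):

```sh
# print the current value of every scan-related module parameter, one per line
grep . /sys/module/zfs/parameters/zfs_scan_*
```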
At this point I ask: 1) why does ZFS scan 900 GB immediately at resilver start? 2) to honor zfs_scan_checkpoint_intval, shouldn't we allow a checkpoint to be written even if the queue is not empty?
From my point of view, not knowing the actual implementation, the sequential scrub/resilver being the default seems a poor solution compared to the previous algorithm.
Using zfs_scan_legacy solved the problem for now.
Any updates on this matter?
Hello, I have the same problem on my test server with ZFS 2.0.3-1. Scrub resume doesn't work properly. It starts from 0% after zpool export/import, even if I pause the scrub and wait over 16 hours before exporting the pool.
Scrub status before zpool export:
root@192.168.175.85:~$ zpool status
  pool: Pool-0
 state: ONLINE
  scan: scrub paused since Mon Mar 8 15:16:39 2021
        scrub started on Mon Mar 8 15:16:16 2021
        6.76G scanned, 4.03G issued, 6.76G total
        0B repaired, 59.61% done
Scrub status after zpool import:
  pool: Pool-0
 state: ONLINE
  scan: scrub paused since Mon Mar 8 15:16:39 2021
        scrub started on Mon Mar 8 15:16:16 2021
        0B scanned, 0B issued, 6.76G total
        0B repaired, 0.00% done
This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions.
@behlendorf wrote:
It's less ideal for pools which are frequently imported/exported since a new checkpoint is not written when a pool is exported (or a scrub is paused). .. Revisiting the default settings may also be worthwhile. It wouldn't be unreasonable for the scrub to broadly take into consideration your pool geometry and hardware when sizing the memory queues and checkpoint frequency.
Any update on this in general? I ask sort of from necessity due to the github "inactivity" setting/bot.
No update, but I've gone ahead and marked this as "Not Stale" to make sure it stays open.
The heart of the issue is that for sequential scrub / resilver to write a checkpoint it must first drain the in memory scan queues it has built up.
Thanks for the explanation! Though I wonder if you could give a very basic overview (or link to one) of how the sequential scan is actually implemented? This might help us understand the problem better and come up with ideas for "fixes" or improvements?
If I've understood the basics correctly, legacy scrub just scrubs records in the order it finds them, meaning the scrub can be quite randomised in its order on disk (especially on a long running, fragmented pool) which is why it can hurt performance (lots of random I/O). So where sequential differs is that it scans all the record metadata first and somehow builds up a list (or chunks of a list) of records in the order that they appear on disk, so it can scrub in fewer, linear passes (assuming no other activity)? Does that sound about right?
With that in mind I'm assuming the first pass doesn't actually build a list of literally every record in the pool, as that would require more memory (especially for very large pools), so I would guess it identifies the sector(s) of the disk it wants to scan, looks for all records located within those, sorts them into order then scrubs those, while it grabs records for the next sector(s) and repeats the process (presumably scanning and sorting in advance of the actual scrubbing)?
What confuses me is why checkpointing the sequential scrub should be complicated, as surely all that needs to be saved is a note of which sector(s) were most recently scrubbed for each disk, so the scrubbing can pick up from there when resumed? I guess I'm just unclear on what the scan queues would actually contain or why flushing them should be necessary before storing a basic note of where it left off?
I'm guessing it has something to do with atomicity, but since checksumming and scrubbing is really a statistics game anyway I wonder how careful we really need to be? The chances of a record being corrupted, and then being missed repeatedly in periodic scrubs that are frequently being paused and resumed, seems extremely low? Obviously that's not acceptable for a resilver, but for a scrub it should matter less if there's a tiny chance that something might be missed on a single pass?
Sorry for the text wall, but it would be useful to know what these queues actually store and why they need to be flushed, as there might be an alternative we can use?
@Haravikk The scrub checkpoint does not include "which sector(s) were most recently scrubbed for each disk". It includes a pool-wide bookmark (objset, object, level and offset) that it has scrubbed up to; this came from the original unsorted scrub. For the sorted scrub this obviously represents only the metadata scan stage, still done in that order, but it tells nothing about the actual data scrub stage, which executes pretty chaotically, trying to scan the most sequential data regions found by the metadata scan. For a system with relatively small RAM a checkpoint should be created every couple of hours by flushing the block queue. That is not a problem because the metadata scan will in any case be done in many small chunks. But if the system has enough RAM to keep all the block pointers, I think it may never create a bookmark at all until it completes the whole scrub. Maybe I am stretching it a bit, but it is not impossible. By collecting more pointers we make the scrub more sequential (ideally we collect all the pointers), but at the same time we increase the time between possible checkpoints.
Just ran into this issue again: I had to do a restart with a scrub in progress that had been running for 16 hours, and when I re-imported it resumed the scrub from 0.00%, even though I had my checkpoint interval set to 30 minutes?
If the new scrub ignores the interval, then at the very least we need a way to force a new checkpoint manually, e.g. when we run zpool scrub -p to pause, or zpool export?
+1 to @Haravikk: the new scan code has been totally useless in my use case where I frequently have to export and later reimport the pool before the scrub/resilver is over.
I've set the checkpoint interval to as low as 1 minute, have tried scrub -p before exporting, and every other piece of advice provided here, and nothing helps: even when the scrub is over 90% done, when the pool is reimported it restarts from zero (and therefore, in my use case, it never finishes).
The only solution is to turn on legacy scan -- then the scrub resumes from where it was and everything is golden.
What I've done is to set legacy scan on in all my machines and forget about the new scan code, advantageous as it may be for cases where the scrub never needs to be interrupted.
Scrub restart behavior depends on the amount of memory. The more memory the system has, the more block pointers it can accumulate for the scan, and the more time it will take to checkpoint the process. In the extreme case the whole pool may get scanned at once and a checkpoint won't happen until the scrub completes. Restrictions on the amount scanned could make it easier, but then the scrub would be less sequential.
@amotin thanks for the additional data. But then, what the heck does zfs_scan_checkpoint_intval do? Is it just a placebo?
Also, why doesn't the effing scan checkpoint get written when the pool is exported? It stands to reason that it should get written then and there at the very least, or else all the work so far just gets lost... just like it's indeed getting lost for me, for @Haravikk, and quite a few others just here in this thread. And this should be independent of the amount of memory, the relative speed of the CPU, or whatever else, no? Or am I missing something?
Sorry for the rant, but this does get irritating, seeing an obvious issue like that just being dismissed time and again for almost 3 years straight with (AFAICS) no proper reason.
Checkpoints are implemented only for the metadata scan stage, since that is the only part of the scan that is reliably restartable. So on a system with less RAM and a pool of many small blocks, the sorting code has to stop the metadata scan from time to time to free some more memory. At those points it should be able to more or less respect zfs_scan_checkpoint_intval. On a system with lots of RAM the metadata scan works in such big chunks that, when it is time to create a checkpoint, it is already too late to create it.
Thanks again @amotin. If checkpointing only applies to the metadata scan phase, I think it's indeed useless, as that phase, even in my largest pools, never takes more than a few minutes. It's the next phase that takes many hours and even days.
I think at least the documentation should be amended to make it clear that any scan (i.e. scrub or resilver) that gets interrupted will restart from zero, and that this is only 'fixable' by turning on the legacy scan code.
@behlendorf: perhaps thought should be given to making the legacy scan code the default, as it's better to take longer than to never finish (never finishing breaks a number of use cases, mine among them, while being slower breaks nothing). Whoever wants faster scans and can guarantee they won't get regularly interrupted before they can finish can then turn it off and use the new scan code him/herself.
Sorry for the rant, but this does get irritating, seeing an obvious issue like that just being dismissed time and again for almost 3 years straight with (AFAICS) no proper reason.
I've described the technical reason why it is so. There is just no perfect solution. BTW I've recently committed a bunch of patches that should reduce CPU usage during scrub by half. It won't help disks much, but just to sweeten the pill. ;)
There is just no perfect solution.
See my edited comment above: at least the documentation should be clearer, and perhaps the default scan code should be set to legacy.
BTW I've recently committed a bunch of patches that should reduce CPU usage during scrub by half. It won't help disks much, but just to sweeten the pill. ;)
Thank you (and of course @behlendorf and all the other devs) for your hard work in making ZFS better. The world in general, and my world in particular, is a much better place thanks to you.
But, this scan code restarting from zero even after an export, indeed sucks rocks, and big rocks at that :-)
I don't think legacy is a good default. We had pools where legacy scrub took weeks. Most systems are not rebooted that often. Though maybe the metadata scan phase should be limited not only by memory usage, but by the respective amount of data divided by the average disk speed over a few hours and the number of disks. But then the scan will be less sequential, i.e. potentially slower at the end, depending on fragmentation level.
My point is, taking weeks or even months or years is, in my book, not broken: worst comes to worst, the affected sysadmin can read the documentation (or come a-googling, even to this very comment) and see it can be sped up by turning off the legacy scan code.
A scan that never finishes and always restarts from zero is broken: at the very least, performance will suffer forever (as the scan will always be running) and worse, there can be errors (further 'up' scan-wise) that never get detected nor corrected even if the pool has working redundancy.
It becomes even worse if the scan is due to a resilver: as it never finishes and always restarts from zero, the pool data is exposed and the next device failure can well lead to total pool data loss.
None of this seems in line with the stated ZFS 'philosophy' of protecting the data before any thing else.
It sounds to me like the real problem here is that the new scan doesn't really support checkpointing at all; surely there must be some way to track its progress on the actual data scrubbing part of the operation?
We need to be storing how far into each disk the scrub got (and when) and the scanned metadata progress at the time, so that if the scrub resumes, we can scan to the same metadata progress, skipping over any records before the disk location and older than the checkpoint, until we reach the last scanned block and can continue?
Otherwise it seems to me like a rethink is needed in how the old method is being treated as "legacy mode". It should be renamed "interruptible scrub" and an option added to zpool scrub so that we can request an interruptible rather than sequential scrub on demand (and only for a single pool at a time), rather than having to change a global setting each time it becomes a problem. This wouldn't be unlike how we can already decide when zfs receive is interruptible by setting the -s option, except that in this case we'd be selecting which scrub behaviour to use, and saving it on the pool somewhere (so it resumes using the correct behaviour).
While some form of working check-pointing would be my preferred solution, at least having control over the behaviour in the command being run would make the option readily available, and it can be documented as an option on zpool scrub where it's easily seen and enabled when required.
Though maybe the metadata scan phase should be limited not only by memory usage, but by the respective amount of data divided by the average disk speed over a few hours and the number of disks.
This makes good sense to me since it would allow us to set zfs_scan_checkpoint_intval in a meaningful way, which was always the point of this module option after all. As previously mentioned, there is a design tradeoff here between maximizing performance and the checkpoint frequency. I think it's entirely reasonable to allow the user a level of control over this since there are very different usage models.
Better yet, as @Haravikk suggested, we could expose this as a command line option for zpool scrub. For example, a -c <frequency> option could be added to allow the target checkpoint frequency to be specified. Dynamically selecting the default frequency is probably also desirable; for instance, in the case of non-HDD based pools there's significantly less benefit to the new scrub.
And of course, we should better document the behavior.
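To make the proposal concrete, hypothetical usage could look like the following (this flag does not exist today; it is only the interface suggested above):

```sh
# HYPOTHETICAL: request an on-disk scrub checkpoint roughly every 30 minutes
zpool scrub -c 1800 tank
```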
@behlendorf what you described seems the best approach to me: after all, currently zfs_scan_checkpoint_intval is basically useless if the system is able to scan metadata in a single sweep.
Scan metadata is already limited to 5% of allocated space. Should that be >>10? How much memory does the scan metadata for one block take?
Is there any progress on this? I've been trying to script around it and just found myself hitting this issue.
It's still an issue; I just encountered this yesterday with a system that had to be shut down; zpool scrub -p does not force a checkpoint, so the scrub resumed from 0%, having lost a significant amount of progress.
What confuses me is why we can't just force a checkpoint when the scrub is paused?
My understanding is that sequential scrub is a two step process, the first (scanning) reads metadata and assembles a list of records sorted into the order they appear on disk, then the second step (issuing) actually works through this list comparing checksums. If a system has enough memory a complete scan thus enables all records to be issued sequentially with no random access except for any newly written data since the scrub began.
However, if a system doesn't have enough RAM then at some point the scrub has to just start issuing from the list it has, until it can clear enough memory to resume scanning for more. But for checkpointing we don't really need to worry about the "upcoming" list, only the current one. So to store a checkpoint what we would need would be something like:
1) how far the metadata scanning has progressed, and
2) the last record that was actually issued from the current list.
I would assume that for 1) we'd be using something that corresponds to the order in which the metadata is scanned (transaction group ID?), while for 2) we could use something that can identify the last record that was issued – since we're scrubbing sequentially, could we not use the physical location in the pool for that?
In this way, to resume from a given checkpoint you would use the start and end points to recreate the same ordered list of records (minus any that no longer exist), then skip through this until you reach (or pass) the last record issued. At this point you can continue issuing for the current list, while scanning (memory permitting) to start building the next.
Am I even close to how this works, or have I completely misunderstood how the sequential scrub operates?
2) we could use something that can identify the last record that was issued – since we're scrubbing sequentially, could we not use the physical location in the pool for that?
@Haravikk You are not totally wrong here, but it is not so easy. When the system does not have enough RAM to complete the metadata scan, it starts periodically switching between the scan and issue phases unless checkpoint creation is requested. When issuing, it does not do it sequentially (unless the scan phase has completed), but only issues the biggest and most contiguous of the available ranges (it tries to be sequential if ranges are equally good, but that is not guaranteed). The score of individual ranges, and so the issue order, may not be exactly reproducible after restart if the pool has changed. Also, that position would be different for each top-level vdev.
What I think we should do is make the scrub process accumulated sequential ranges earlier, if they are big and sequential enough, even when it is not limited by RAM. This way it would both save us some RAM and reduce the backlog when checkpoint creation is requested. It would not cover all the cases of fragmented pools, but it should be easy to do.
Ah, so it's doing some kind of grouping by range in addition to sorting? That does seem like it would make it harder to work out what ranges might have been completed already or not…
Makes sense, although I'm curious what data the checkpoints store in this case, or does it need to write out the issue order that's been generated thus far, which is why it's currently so reluctant to store checkpoints?
Still seems like zpool export and zpool scrub -p should be able to force a checkpoint though; is there a reason they currently don't? Even if it would involve writing a lot of restore data, that's got to be better than losing all of the progress.
I'm curious what data the checkpoints store in this case, or does it need to write out the issue order that's been generated thus far, which is why it's currently so reluctant to store checkpoints?
Right. To create a checkpoint the scrub process blocks new metadata scans and has to issue everything in the collected issue queue. Doing it too often may negate most of the sorted scrub's benefits. IIRC it is done once every few hours now.
Still seems like zpool export and zpool scrub -p should be able to force a checkpoint though; is there a reason they currently don't? Even if it would involve writing a lot of restore data, that's got to be better than losing all of the progress.
Right. On large fragmented HDD pools with lots of RAM it may take hours to issue everything queued. And if we tried to save the list of blocks instead, it could take gigabytes.
Just had a thought: on a system with sufficient free space, would it be possible to use the same mechanism as zpool checkpoint to guarantee a scrub, as a way to simplify persisting the scrub's progress?
Since a pool-level checkpoint should guarantee that all data on the pool when the scrub began remains available to scrub, it should be possible to fully recreate the scrub in the event it was interrupted, i.e. re-run the scanning step, performing issuing as it would have occurred before (using the same limits), minus the actual loading and checksumming etc. for records we've already done?
Once the scrub is completed the pool checkpoint can be discarded. It shouldn't be necessary to continue beyond the checkpoint anyway, since any newly written data should already be verified.
This would probably make the most sense as its own option signalling to use a checkpoint for this, but for any pool with a bit of free space there should be no drawbacks to using a pool-level checkpoint this way.
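For context, the existing pool-checkpoint commands this idea would build on are sketched below; note that today these do not make the scrub itself resumable, they only pin the pool's on-disk state (the pool name is a placeholder):

```sh
# take a pool-wide checkpoint before starting the scrub
zpool checkpoint tank
zpool scrub tank
# ... if the scrub is interrupted, the idea is it could be recreated from the pinned state ...
# once the scrub has completed, discard the checkpoint to release the held space
zpool checkpoint -d tank
```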
Distribution Name | archlinux
Distribution Version | rolling
Linux Kernel | 5.4.0-rc1-mainline
Architecture | x86_64
ZFS Version | 0.8.2
SPL Version | 0.8.2
Describe the problem you're observing
Scrub resets its progress after reboot
Describe how to reproduce the problem
Start scrub, check the progress, reboot, check the progress again
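In command form (pool name is a placeholder), roughly:

```sh
zpool scrub tank
zpool status tank      # note the scanned / issued / done figures
reboot                 # any clean reboot
# after the system is back up:
zpool status tank      # progress is back at 0B scanned / 0.00% done
```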