behlendorf opened this issue 6 years ago
The following was originally posted by @ryao in #7401.
@ahrens @behlendorf Here is what I propose. We can extend zpool to handle this by:
We can also add a new feature flag that will allow the errata scrub to mark pools as being unimportable on certain versions of certain platforms in the future, so that those versions will fail to import the pools while saying that a later driver marked them as having a known integrity bug. A force import would then be needed to override it. That way, if this should ever happen again in a tagged release, we would not need to abuse feature flags to prevent a repaired pool from ending up imported on a buggy version. This might be more complicated to do in Illumos than in FreeBSD or Linux because Illumos distributions handle versioning, but I expect that it is doable.
I still need to flesh out a few details, but that is the general gist of my thinking on how to deal with affected pools. What do you think?
Discussion from a few minutes ago:
14:23 | <trisk> | ryao: that seems fine if we can keep a reachability refcount on objects by looking at all zap entries during scrub, but can we detect that a was orphaned in the current dataset but is present in an earlier snapshot?
14:23 | <trisk> | that a file was orphaned I mean
14:23 | <PMT> | oh boy, the hole_birth problem with a different metadata property.
14:24 | <trisk> | (it would presumably look the same as a deleted file)
14:29 | <+ryao> | trisk: We can keep a reachability count while doing the scrub, but I have not worked out how to detect issues in snapshots efficiently yet. The worst case is that we need to iterate through each one. -_-
14:30 | <+ryao> | trisk: We also will need to handle the unlimited queue so that we don't make mistakes and try relinking something that really should be dead.
14:33 | <trisk> | ryao: if a file was present in a previous snapshot (birth time is prior to snapshot _and_ it has a link) but not the current tree, unless there's some other metadata available how would we know it wasn't subsequently deleted intentionally?
14:38 | <+ryao> | trisk: There is an unlimited queue. If it is not there and not in a ZFS directory, then it is an orphan. We will need to keep track of that while this is running to avoid getting confused.
14:40 | <+ryao> | trisk: I meant an unlinked queue.
14:41 | <trisk> | yeah for each file object traversed we have to keep a reference until we locate a directory entry for it, at which point it can be dropped
14:43 | <+ryao> | trisk: I'd rather not add references. That would pin memory or require unnecessarily many writes short of doing a hack. I'd rather just divert the unlinked queue's responsibilities to us so that we can handle it while this is running.
14:48 | <+ryao> | trisk: We'd keep a scan object around that would contain information about the inodes that were placed into the unlinked queue and those that are orphans so that we can wipe them out after we finish. We might be able to drain the queue while we run, but I'd rather think about that after I have a working implementation.
14:49 | <+ryao> | s/inodes/dnodes/
14:49 | <+ryao> | This feels alot like how a sequential resilver works.
14:50 | <trisk> | we might need fast random access to it during scrub because we'll encounter directory zaps with the links and have to remove from said queue
14:52 | <+ryao> | trisk: The ZAP itself could work nicely for that.
@behlendorf It is more than just finding orphans. Directory sizes on directories that lost file links will be wrong. Also, the files' dnode SAs will contain a link count that disagrees with what is actually in the file system. If a file had been linked into multiple directories, it will not be an orphan, but it will still have an incorrect link count. While this is running, we will need notification of all link and unlink operations on the tip so that we do not get confused.
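To make the reachability idea from the IRC discussion concrete, here is a minimal sketch in Python (purely illustrative; it models directories, link counts, and the unlinked queue as plain dictionaries and sets rather than ZAPs, SAs, or any actual OpenZFS API): walk every directory, count the links actually found for each object, and compare against the recorded link count, treating objects on the unlinked queue as legitimately link-free.

```python
# Illustrative model only: object IDs, directory entries, recorded link
# counts, and the unlinked queue are plain Python structures, not ZFS APIs.

def check_filesystem(directories, recorded_links, unlinked_queue):
    """directories: {dir_objid: {name: child_objid}}
       recorded_links: {objid: link count stored in the object's SA}
       unlinked_queue: set of objids queued for deletion"""
    observed = {objid: 0 for objid in recorded_links}

    # Count how many directory entries actually reference each object.
    for entries in directories.values():
        for child in entries.values():
            observed[child] = observed.get(child, 0) + 1

    orphans, wrong_counts = [], []
    for objid, recorded in recorded_links.items():
        if objid in unlinked_queue:
            continue  # legitimately unlinked, waiting to be freed
        found = observed.get(objid, 0)
        if found == 0:
            orphans.append(objid)  # candidate for lost+found
        elif found != recorded:
            wrong_counts.append((objid, recorded, found))  # SA needs fixing
    return orphans, wrong_counts
```

Directory-size corrections could ride along with the same pass, since the number of entries actually present in each directory is known once the walk finishes.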
More thought is needed on how to handle snapshots. I don't want to assume that the system time is correct and that anything before a certain date is okay (although that would be a great simplification), but if people were doing per-minute snapshots since the event happened, iterating through them inefficiently will be incredibly slow.
@trisk and I bounced thoughts back and forth to come up with a way to handle things. We have an algorithm that should handle snapshots in a way that is still somewhat slow, but not unreasonably slow.
I will produce a design document that describes how it works (ideally later today) to aid implementation/review before I begin implementation.
@ryao a nice way to explore implementing your and @trisk's design would be to refresh #6209 and use it as a base for your work. This PR extends scrub so it can be run in user space on an offline pool. This would be convenient for development and for the moment would sidestep the issue of having to worry about the filesystem changing from underneath you. A user space scrub which simply reported any inconsistencies would be a good first step.
Additionally, let me reference https://github.com/openzfs/openzfs/pull/604, which adds basic likely-leaked-object detection to zdb. This patch does correctly detect leaked files, but it also incorrectly reports hardlinks as leaked.
I think that the minimum useful thing to do would be to relink orphaned files of filesystems into lost+found. Everything else is gravy. A couple of notes:
zfs verify <filesystem>
that would fix up just that filesystem (while other things are running, but not in the background, and not preserving progress across reboots - this is similar to zfs remap (from the device removal PR)).
@ahrens The design document would have addressed all of your points, but here is what @trisk and I came up with for each:
@behlendorf Thanks for the suggestion, but I have brought the KVM-with-virtfs-root setup that I used a few years ago back into service for this. It is almost as good as developing in userspace. It should work well for this.
@ryao this may not be relevant, but note that the sequential scrub code is not in the zfs-0.7-release branch. We're planning to fully support it in 0.8.0. So you may want to avoid using any sequential scrub code in master for orphan recovery. That will make backporting to 0.7 less risky.
@tonyhutter Thanks for the heads up. That makes the design decision to implement this separately from the conventional in-kernel scrub feel much better. It will be in the design document that will go up either today or tomorrow. Hopefully today.
@ryao another thing to be aware of is that Lustre servers using ZFS do not strictly keep their link counts up to date and may also not strictly conform to the ZPL in other small, harmless ways. Historically this has been fine since these datasets are not meant to be mounted through the POSIX layer. But the checker needs to be aware of these differences so that it's safe to run on these datasets.
If someone is in the unfortunate situation of cloning a damaged snapshot, they will need to copy the data off and destroy it.
Why? Is there something preventing them from using a clone that has orphaned files?
I would suggest that we should fix clones too, but even if you don't want to do that due to the performance cost, it should be perfectly fine to keep using them (with the caveat that there might be orphaned files).
@ahrens The plan is to fix the clones, but we can never fix the origin snapshot without BPR. Unless we can somehow destroy an affected origin snapshot, we will not be able to completely repair a pool containing one.
In the common case of a file that never had multiple hard links, yeah, you can just search the parent directory like zfs_obj_to_path() does. If it has one link, and it's present in the parent dir, then great - that file is not orphaned. If it isn't in that dir, then either it's orphaned, or it's on the delete queue, or it had multiple hard links at some point in the past and the link in the parent dir was removed.
You might be able to take advantage of this to reduce the size of the data structures you need while searching for orphaned files. But you'll still need to traverse all directories to distinguish between the had-multiple-links case and the orphaned case. It's too bad we didn't keep track of all the parent dir's.
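As a rough illustration of that fast path (hypothetical Python, not zfs_obj_to_path() or any real ZFS structure): a file whose SA records a single link and which is still present in its recorded parent directory can be cleared immediately, while everything else has to wait for the full directory traversal.

```python
def classify_fast_path(objid, recorded_links, parent_dir_entries):
    """recorded_links: link count from the object's SA.
       parent_dir_entries: set of object IDs found in the recorded parent dir."""
    if recorded_links == 1 and objid in parent_dir_entries:
        return "not orphaned"  # single link, still reachable via its parent
    # Otherwise: orphaned, on the delete queue, or it once had multiple
    # hard links and this particular link was removed; only the full
    # directory traversal can tell these cases apart.
    return "needs full traversal"
```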
@ahrens I deleted the remarks after posting that I had interpreted the suggestion from IRC incorrectly and clearly have been looking at my screen for too long. I had hoped that you would see that and disregard it.
Anyway, thanks for the quick response. Exploiting this to reduce the size of the data structures is an interesting idea. I'll give it some thought.
@ahrens
It's too bad we didn't keep track of all the parent dir's.
It's not exactly sexy, but it does work :)
https://github.com/openzfsonosx/zfs/commit/af36adb5382235b578da524c3a75cb53d1ca69c3
@ahrens @lundman If we were to keep track of all of the parent links in the future following a disk format change, I would have scalability concerns. The clones property works similarly and had been a scalability issue at ClusterHQ. If this does not scale well, I could see it being something that an end user could exploit by making and destroying an inordinate number of hard links to slow down the system.
@lundman I think the commit you referenced updates the one parent pointer when we can. It doesn't change the on-disk format to track all parents. And you could still end up with an incorrect z_parent (after removing the link). That said, your change is an improvement over what we have today and it would be good to get it upstream.
@ryao I suspect that your poor experience was with the performance of the clones property, rather than with the underlying data structures that track the list of clones (dd_clones and ds_next_clones_obj).
Retrieving the clones property requires determining the name of each clone, so it's O(number of clones * average number of filesystems per parent filesystem). In contrast, iterating over the clones is O(number of clones); adding or removing a clone is O(1), and the storage for the clones is O(number of clones).
I did think about a solution like HFS's (which stores a list in an xattr), but that required cooperation from everyone, so I went with the simple "self healing" approach, so that imported foreign pools could be corrected. But yes, if you delete the 'currently valid' parent, it will have an incorrect parentid until the next lookup. I did not see a nice way around that. OS X probably relies on parentid more than most, especially with the IOCTLs for 'next/prev hardlink' in the set.
@ahrens @behlendorf @lundman @tonyhutter @trisk The errata scrub design draft proposal document is up at:
https://dev.gentoo.org/~ryao/errata-scrub-design-doc.odt
It has sha256 863e3e01bb23ac19d9c2e0afed4d10250e9a332cccc2820aa7bf8bae3024e6db.
@ryao regarding errata-scrub-design:
If it has multiple references, its name shall be $OBJECTID-$NUM, where $NUM ranges from 0 to $REFCOUNT-1. If a collision occurs due to the user placing a file into lost+found, the name shall become $OBJECTID-$RANDOM. $RANDOM shall be a 64-bit pseudo-randomly generated number. If a collision still occurs, we iterate generating pseudo-random numbers until a collision does not occur. We could optionally add an iteration limit to prevent the theoretically impossible scenario of infinite recursion. In the event of hitting the iteration limit, we panic.
I would suggest to link a file with $OBJID-RECURSIONLIMIT (to mark the file that ran into the iteration limit) instead of 'we panic' - a panic will take down the whole system and would IMHO be worse than an incomplete repair.
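A small sketch of the quoted naming scheme, with the $OBJID-RECURSIONLIMIT fallback suggested above in place of a panic (Python, purely illustrative; the iteration limit value is arbitrary):

```python
import random

ITERATION_LIMIT = 16  # arbitrary cap for the sketch

def lost_and_found_name(objid, refnum, existing_names):
    """existing_names: names already present in lost+found."""
    name = f"{objid}-{refnum}"  # $OBJECTID-$NUM, one per reference
    if name not in existing_names:
        return name
    # Collision (e.g. the user placed a file there): try $OBJECTID-$RANDOM.
    for _ in range(ITERATION_LIMIT):
        name = f"{objid}-{random.getrandbits(64)}"
        if name not in existing_names:
            return name
    # Suggested fallback: mark the file instead of panicking.
    return f"{objid}-RECURSIONLIMIT"
```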
In its initial implementation, snapshot creation, snapshot deletion and cloning are blocked for the duration of an erratum scrub.
This would basically turn the 'initial implementation' into an offline version (at least for some systems, should they be dependent on working snapshots/clones).
Additional thoughts:
As the defect in question looks more like a filesystem/dataset-level issue (and can, at least as far as I understood it - please correct me if I'm wrong on this - propagate through send/recv) than a pool-level one: wouldn't it make more sense to address it on the zfs side by adding a zfs subcommand? This would allow targeting individual filesystems, thus enabling received ones to be checked, and could potentially also be extended for other functions (ideas: change xattr mode, increase copies, recompress to the current algorithm - for existing files).
Also, the errata state of datasets|snapshots could IMHO nicely be reflected in a property: there it could be queried through the zfs subcommand (which is way more practical than copy/paste from zpool status output), and the existing property inheritance mechanics could be employed to implement the scan/repair process in a way that doesn't block create/destroy of snapshots/clones - while correctly tagging newly created ones while the process is running.
I would suggest to link a file with $OBJID-RECURSIONLIMIT (to mark the file that ran into the iteration limit) instead of 'we panic' - a panic will take down the whole system and would IMHO be worse than an incomplete repair.
After some thought, I have decided to go with .zfs/lost+found. It avoids this problem entirely, although I expect it to make the code harder to port to other platforms and less maintainable.
This would basically turn the 'initial implementation' into an offline version (at least for some systems, should they be dependent on working snapshots/clones).
I suspect that a version that does not allow concurrent snapshot/clone operations will be alright for the majority of people. I would prefer to allow snapshots and clones in the initial implementation too, but I believe that getting a correct tool out sooner without that functionality would be more desirable to end users than doing it later. We'll see how things go as I get further along in the implementation.
As the defect in question looks more like a filesystem/dataset-level issue (and can, at least as far as I understood it - please correct me if I'm wrong on this - propagate through send/recv) than a pool-level one: wouldn't it make more sense to address it on the zfs side by adding a zfs subcommand? This would allow targeting individual filesystems, thus enabling received ones to be checked, and could potentially also be extended for other functions (ideas: change xattr mode, increase copies, recompress to the current algorithm - for existing files).
@ahrens had a similar remark. After some thought, I have decided to go with that. I will make a zfs repair subcommand instead of an erratum extension to scrub. It gives us more flexibility to control the individual things being checked too.
I probably should update the design document, although the deficit of comments made me wonder if there was much point. The updated version would basically be what it is already, plus the two changes noted in this post, so I'll delay updating it until the patch to implement zfs repair is ready for review.
Also, the errata state of datasets|snapshots could IMHO nicely be reflected in a property: there it could be queried through the zfs subcommand (which is way more practical than copy/paste from zpool status output), and the existing property inheritance mechanics could be employed to implement the scan/repair process in a way that doesn't block create/destroy of snapshots/clones - while correctly tagging newly created ones while the process is running.
I like that idea. I will try adding that to the implementation. :)
@ahrens @GregorKopka Does this look better to you?
zfs repair
  Displays the status of any in-flight repairs and whether anything must be fixed.

zfs repair -s
  Stops all pending or running repair operations.

zfs repair [-rs] filesystem|snapshot
  Checks for and non-destructively repairs consistency issues in filesystems or snapshots caused by bugs in ZFS. Filesystems that cannot be fixed automatically will be marked as degraded. This operation is not designed to repair things not specified. It is intended to be used only as instructed by the OpenZFS developers or your operating system's developers.

  -s  Stops pending or running repair operations on the specified filesystems.
  -r  Repairs all descendent filesystems and snapshots.

...
Native Properties
...

consistency
  The status of a filesystem or snapshot as determined by zfs repair. If no problem has ever been detected, it is set to default. If a problem that cannot be repaired is detected, it is set to degraded and the dataset should be destroyed. If a repair is in progress, it is set to either checking or repairing. This will only ever show that a filesystem is degraded if a known consistency regression that zfs repair was adapted to catch is detected. Contact your OpenZFS platform's developers for instructions if this shows degraded.
It's a real bummer that this is necessary at all. I would prefer that we be very clear about when you might need to run this (i.e., almost never), and what it does (i.e., re-link orphaned files). I'd be happy to work with you on the wording.
Documentation aside, I agree that doing it on a per-filesystem basis will probably make this easier to implement. If the repair runs synchronously, I don't think you need zfs repair [no args] or zfs repair -c (just hit ^c). Think of it by analogy with zfs receive or zfs remap.
How would we detect orphaned files aside from running this command (e.g. to update the consistency property)? I think that instead of a consistency property, we might want a property that has the timestamp of the last repair.
What problems can be detected but not repaired?
@ahrens
It's a real bummer that this is necessary at all. I would prefer that we be very clear about when you might need to run this (i.e., almost never), and what it does (i.e., re-link orphaned files). I'd be happy to work with you on the wording.
This was written more to explain to people here how the command would work than as what I intend to give to users, although my working directory does have it in the man page at the moment. You are right that it should almost never be used. We could probably deprecate it almost immediately and then drop it from the documentation a year later, assuming that the man page changes make it into the final patch. I would be happy to take any wording suggestions that you have.
Documentation aside, I agree that doing it on a per-filesystem basis will probably make this easier to implement. If the repair runs synchronously, I don't think you need zfs repair [no args] or zfs repair -c (just hit ^c). Think of it by analogy with zfs receive or zfs remap.
I plan to make this asynchronous. That way I don't have to worry about what happens if the user does Ctrl+C. Would you prefer that I make it synchronous?
How would we detect orphaned files aside from running this command (e.g. to update the consistency property)? I think that instead of a consistency property, we might want a property that has the timestamp of the last repair.
We don't detect orphan files aside from running this command. The consistency property is a taint flag. It follows the snapshot through clone operations and follows the clones through snapshot operations unless the clones are repaired. My tentative thinking is not to set the property on a dataset unless the dataset is damaged.
I will add hidden repair_timestamp and repair_txg properties to explain what has most recently been done on the dataset, although I will omit them from documentation. There is no point in showing this information to users.
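A toy model of the taint-flag behaviour described here (Python, hypothetical; it only mirrors the propagation rules stated above, not any real property plumbing): the flag follows a damaged filesystem into its snapshots and a damaged snapshot into its clones, and a repair clears it on the clone.

```python
class Dataset:
    """Illustrative model of a dataset carrying the proposed consistency taint."""

    def __init__(self, tainted=False):
        self.tainted = tainted

    def snapshot(self):
        # A snapshot of a damaged filesystem is itself marked damaged.
        return Dataset(tainted=self.tainted)

    def clone(self):
        # A clone of a damaged snapshot starts out marked damaged as well.
        return Dataset(tainted=self.tainted)

    def repair(self):
        # Clones (unlike their origin snapshots) can be repaired, clearing the flag.
        self.tainted = False
```

For example, a clone made from a damaged snapshot would start out tainted and lose the flag after repair(), while the origin snapshot itself would remain tainted.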
What problems can be detected but not repaired?
The only problems in the scope of #7401 that we can detect, but not repair, are inconsistencies in snapshots. Clones of inconsistent snapshots are an interesting case though. We can repair them, but we cannot efficiently generate the information needed to repair them for every snapshot during our scan to be able to store it (imagine 1 million snapshots). That means that we must set the consistency property to damaged. Perhaps I should make it display "damaged, repairable" in that case though.
@ryao Synchronous sounds fine to me, should also be a bit easier to implement.
Hidden repair_timestamp and repair_txg should IMHO only be set in case a repair operation was actually performed; should the filesystem pass the check, these could well be omitted.
Possibly the consistency property could, at some point in the future, also be updated by a normal zpool scrub in case data loss is detected in the dataset? It would be really nice if scrub would flag defective datasets/snapshots.
On further thinking, it would be quite cool if this could some day be extended with a syntax of zfs repair dataset filename < file; the use case would be to repair on-disk file data corruption by supplying the original data (ZFS already knows the checksums, so it should be able to locate bad data and rewrite it in place with known, and checked good, data). I would guess this would be welcomed by quite a few people, as it could be used to heal data errors which appear on non-redundant vdevs (like root pools in desktops/notebooks) without destroying the snapshot chain.
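A rough sketch of that idea (Python, entirely hypothetical; a real implementation would compare against the checksums already stored in the block pointers rather than recomputing both sides with SHA-256): walk the damaged file and the supplied original block by block, and flag for rewrite only the blocks where the on-disk data no longer matches the known checksum but the supplied data does.

```python
import hashlib

def find_repairable_blocks(stored_checksums, on_disk_blocks, supplied_blocks):
    """All three arguments are per-block lists; checksums are hex digests
    (illustrative only, not the actual ZFS checksum machinery)."""
    repairable = []
    for i, known_good in enumerate(stored_checksums):
        current = hashlib.sha256(on_disk_blocks[i]).hexdigest()
        candidate = hashlib.sha256(supplied_blocks[i]).hexdigest()
        # Rewrite only where the disk is wrong and the supplied data is provably right.
        if current != known_good and candidate == known_good:
            repairable.append(i)
    return repairable
```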
It's a real bummer that this is necessary at all.
You are right that it should almost never be used. We could probably deprecate it almost immediately and then drop it from the documentation a year later,
Agreed. While I think having such a tool would be great, we need to decide up front what we're designing here and why. It seems to me we have two reasonable options.
Option 1 - Commit to doing this as a fully supported feature. That would mean a significant amount of development effort and doing the following:
Personally, I think it would make the most sense to build these additional coherency checks into zpool scrub. They could be requested with a flag, or, if the performance impact turns out to be minimal, they could always be done on filesystem datasets (and maybe alternate checks for Lustre datasets at some point). Additionally, there's nothing preventing us from adding a zfs scrub sub-command as proposed in #7257.
Option 2 - Extend an existing tool like zdb to inspect a pool offline and determine if there are orphans which need to be recovered. Other developer tools could be extended as needed to provide a way to recover these orphans and relink them in a lost+found directory. This would be a much simpler and quicker way to get an immediately useful tool.
@behlendorf The algorithm for doing these checks operates at a different level from zpool scrub and won't cover everything. Being needed for both 0.7.x and 0.8.x means that I would need to handle both incremental send/recv and the regular one in addition to operating in a different layer. It would be buggy and error prone.
Option 1 - Commit to doing this as a fully supported feature. That would mean a significant amount of development effort and doing the following:
I would rather have this be a feature that is only meant to be invoked when we mess up, after it has been updated to support the new issue. The words "fully supported" worry me somewhat because I do not want to encourage users to run this except when we tell them to run it.
It must be clearly documented in the man page.
An initial draft of man page documentation is already written:
https://paste.pound-python.org/show/LFodNYCJJtRtIHj5C4is/
It must include ZTS test coverage which verifies it works as designed.
I am alright with designing ZTS test coverage. Would it be satisfactory to have a small (say, 2MB compressed) file-based pool with damage, which we extract into a 1GB file, import, repair, and then verify to have been properly repaired? I already have one of those. I had planned to use it, with some poking around in zdb, as the basis for my own testing.
It must address all of the notes above.
I believe that my design documentation + notes on revisions to make address all of the comments.
It must be maintained long term as a fully supported feature.
We can maintain it in case anyone who was affected realizes that they have a problem years in the future, but I would prefer not to keep it easily discoverable by users for the long term. If we do, let us at least have warnings galore saying that running it is almost always unnecessary unless a developer asked that they run it. I do not want this being treated like a fsck tool. I would like to deprecate it in 0.8.0 and remove the documentation in 0.9.0 so that users are not discovering it long after the need has passed. If a bug report reaches us that merits its use, we could always instruct the user on how to invoke it.
Option 2 - Extend an existing tool like zdb to inspect a pool offline and determine if there are orphans which need to be recovered. Other developer tools could be extended as needed to provide a way to recover these orphans and relink them in a lost+found directory. This would be a much simpler and quicker way to get an immediately useful tool.
While I have worked out a way to prune quite a bit of what needs to be checked to determine what is damaged, the time complexity is still sub-cubic and super-quadratic on snapshots for a single filesystem. For all filesystems' snapshots, it is super-cubic. I want the repair process for any damage caused by #7401 to be as painless as possible. An offline tool is not going to achieve that, especially not with the projected time complexity.
Since orphaned files don't become un-orphaned (except by this process), you could consider operating off of a snapshot (or a zpool checkpoint openzfs/openzfs#560).
Is the OpenZFS pool checkpoint feature being worked on for integration into ZoL?
@mailinglists35 I did a dry-run port of the checkpoint patch to ZoL as part of the work on the recently-committed import updates (which are a prerequisite to checkpoint) and it was sufficient to show me that it's going to be a fairly heavy lift to get pool checkpoint integrated to ZoL. That said, I'll likely give it a try this weekend.
Describe the problem you're observing
An orphan file is a file which is fully intact but cannot be accessed through the filesystem hierarchy. By design, typical damage to a ZFS filesystem can never result in an orphan file being created. Therefore ZFS does not currently contain an automatic mechanism to detect and recover orphan files. However, with a little bit of work these orphan files can be manually read using developer tools such as zdb and are not permanently lost. This issue is to discuss possible strategies for the efficient automated detection and recovery of orphaned files when recovering the files from a recent snapshot is not possible.
Describe how to reproduce the problem
Issue #7401 introduced a difficult-to-reproduce bug which, under very specific conditions, could result in an orphan file being created.