dnode accounting for quotas

behlendorf commented 9 years ago

During the OpenZFS summit @thegreatgazoo and I had a chance to discuss this. We came up with a nice clean solution based on @johannlombardi initial patch. If Issac has the time to do the heavy lifting on this I'm happy to work with him to get the patch reviewed and in to a form where it can be merged. There's a fair bit of work remaining to get it where it needs to be. But it should be pretty straight forward:

Required functionality:

Add a feature flag.
Rework the code so this feature can be enabled without rewiring every dnode.
Rework the code to use sync tasks so it can be resumed across pool import/exports.
Update the zpool status command to provide status information while the feature is being enabled.
Add the ioctl() handlers so utilities such as repquota work.
Update send/recv to handle this change.
Update the zfs(8) and zpool-features(5) man pages.

Implementation details:

One way to enable it without rewriting every dnode is to take a snapshot

This is a good idea but we can do better. If we were to use full snapshots there are some significant downsides.

We'd need to take a snapshot per dataset and it's not uncommon for pools to have 1000s, 10,000s, or 100,000s of datasets. Doubling this just to enable the feature is a bit heavy handed.
While the snapshots exist we can't free any data in the pool since it will be referenced by a snapshot. This would be problematic for pools which are already near capacity.
I suspect cleanly handling all the possible failure modes will be fairly complicated. You'll need to do the creation of all the snapshots in a sync task and be able to unwind them all in the event of a failure. You'll also want to do it in a single tx so that either all the snapshots exist or none of them do. When we're talking about a large number of snapshots this may take a significant amount of time (several seconds).
Doing any operation which spans datasets complicates things considerably. If this could be avoided it would greatly simplify the problem.

Luckily, for this specific case a full snapshot isn't needed. Storing the TXG number in which the feature was enabled and the per-dataset dnode number for the traversal is enough. This is possible because every dnode already stores the TXG number it was originally allocated in (dn->dn_allocated_txg). We can also leverage the fact that the traversal will strictly happen for lowest to highest numbered dnodes. Which means we can split the problem up like this:

Newly allocated dnodes always update the quota ZAP
Freed dnodes update the quota ZAP as follows
- if (dn->dn_object < dataset->scan_object): dnode has been traversed by scan update quota ZAP
- if (dn->dn_object >= dataset->scan_object): dnode has NOT been traversed by scan no update needed
Dnode traversal scan, this can be done with the existing dnode iterator
- if (dn->dn_allocated_txg < feature_txg): dnode is not accounted for update quota ZAP
- if (dn->dn_allocated_txg >= feature_txg): dnode is new has been accounted for during create no update needed

The traversal part of this would need to be done in small batches in a sync task. This would allow us to transactional update the dataset->scan_object on disk so the traversal can be resumed and it simplifies concurrency concerns. Doing it this way addresses my concerns above and nicely simplifies the logic and the amount of code needed. For example, all that's needed to abort the entire operation or disable the feature is to stop the traversal sync task (if running) and remove the two new ZAPs.

paboldin commented 9 years ago

Brian, I want to participate as well. If you have some pointers where to start please share.

behlendorf commented 9 years ago

@paboldin the help would be appreciated! Pull request #2577 was a good initial prototype for adding this support but it suffered from one major drawback. Specifically it required that every dnode be rewritten when enabling the feature. That could be prohibitively expensive and it introduces some unfortunate failure modes if your pool is full so we needed a different approach.

I've attempted to summarize the updated design above so it can be implemented. At a high level it basically involves taking note of the specific TXG when this new feature flag is enabled. Then using the existing dnode iterator to traverse all the dnodes to build update the required accounting. Using just the TXG number when the feature was activated this can be done using the logic described above.

@paboldin your definitely not the only person interested in this functionality but so far no one else has had the time to implement it. My suggestion would be to start by looking at how some of the other feature flags have been implemented to familiarize yourself with that infrastructure. Then look in to using a sync task to traverse the dnodes and build up the accounting. See the dsl_sync_task() function. Just let me know if you get stuck and I'll try and point you in the right direction!

jxiong commented 9 years ago

I pushed a new patch and fixed the style issues. Please inspect.

behlendorf commented 9 years ago

@jxiong great news. I'll look it over, I think have @don-brady do a review would be good too.

paboldin commented 8 years ago

@behlendorf ,

As far as I see from the code, upgrading to a user/group space accounting version is done in a blocking manner at the time (note that dmu_objset_userspace_upgrade is waiting for txg to sync).

Should the scheme described above be applied to this code as well? If so, then this will require a separate commit.

Can I use @jxiong's patches as base or are these under active (internal) development?

behlendorf commented 8 years ago

@paboldin aside from the work in #3723 I'm not aware of any active development on these patches. However, I'd very much like to get @jxiong's patch reviewed and finalized so they can be used as a base for further development. If you could help review these changes that would be great. But the basic scheme described here is still the preferred approach so the quota information can be recalculated online.

adilger commented 8 years ago

There is work for ext4 to add "Project Quotas" to match an equivalent feature in XFS - http://lists.openwall.net/linux-ext4/2015/09/13/2 and support for project quotas is also being added to Lustre.

This would require the addition of a new "Project ID" for each dnode (in a new SA), and additional quota accounting for the project ID. It would be great if the patch being implemented here also took this feature into account, to avoid needless conflict/incompatibility in the future. It may be that this is trivially handled through the use of a "pr-" prefix for project IDs like this patch uses "dn-", but I thought I would mention it before the patch is landed.

paboldin commented 8 years ago

@jxiong @behlendorf I was able to find another couple of bugs. One of them is in ZFS code already, another one is in the patch proposed.

The proposed patch misses the readout section for z_userobjquota_obj and z_groupobjquota_obj analogous to reading of values for z_userquota_obj and z_groupquota_obj. Since the code is not yet merged I'm going just leave this here.
The zfs_userquota_prop_to_obj function is that is used to fetch the object number for a given quota type returns ENOTSUPP in case the type is unknown. The caller never checks for this value being returned and can use incorrect object with number matching ENOTSUP. I think that the function must emit a warning and return 0 in that case.

paboldin commented 8 years ago

@jxiong do you need additional information from me? If some of the issues described in unclear manner please do not hesitate to ask for more info. I hope patches attached to the last commit are self-explanatory but willing to finish this commit ASAP and move forward.

jxiong commented 8 years ago

i'm tied up with other work, will take a look today. Thanks for testing and the patch.

On Tue, Mar 15, 2016 at 7:08 AM, Pavel Boldin notifications@github.com wrote:

@jxiong https://github.com/jxiong do you need additional information from me? If some of the issues described in unclear manner please do not hesitate to ask for more info. I hope patches attached to the last commit are self-explanatory but willing to finish this commit ASAP and move forward.

— You are receiving this because you were mentioned. Reply to this email directly or view it on GitHub: https://github.com/zfsonlinux/zfs/issues/3500#issuecomment-196836591

paboldin commented 8 years ago

It looks like in the absence of quota set for the user zfs userspace shows the objused value instead. Need to investigate further though.

behlendorf commented 8 years ago

All of the functional quota tests from Illumos have been added to master. A good percentage of these tests needed to be disabled in the tests/runfiles/linux.run runfile. Getting all of these tests passing would go a long way towards getting thing ship-shape on the quota front. Plus then you'll be able to add a few new test case to verify this feature is working properly when merged.

tests/zfs-tests/tests/functional/quota/
tests/zfs-tests/tests/functional/userquota/
tests/zfs-tests/tests/functional/refquota/

paboldin commented 8 years ago

Patch to add output of the variables to the zfs userspace/groupspace (Probably wrong place though) https://github.com/paboldin/zfs/commit/3e4ce22d9e1d40b3d41289054f201438d06a4629

Also fixes a bug when all unresolved names are squashed to one entry.

[root@localhost ~]# zfs userspace nirvana/1100 -ip
TYPE        NAME             USED  QUOTA
POSIX User  1500966552   16198144   1671
POSIX User  root        258401280  14870

[root@localhost ~]# zfs userspace nirvana/1100 -inp
TYPE        NAME             USED  QUOTA
POSIX User  0           258401280  14870
POSIX User  100               512      1
POSIX User  2000         10158080   1152
POSIX User  531572586      393216    256
POSIX User  1008467872    5180416    642
POSIX User  1500966552    2429440    899
POSIX User  1791674584   13706240   1414
POSIX User  2202513206   16198144   1671
POSIX User  2388961035     393216    256
POSIX User  3314219856   12459520   1285
POSIX User  3577253261    2295296    129
POSIX User  4142225625    9321472   1290

adilger commented 8 years ago

@behlendorf - Jinshan pushed an updated patch on April 13, what is needed to move this forward?

behlendorf commented 8 years ago

@adilger thanks for reminding me about this one. What's needed is review and testing. I'll make another review pass on this but it would be great if @paboldin @ahrens could comment too. I'll also add this patch to the stack of changes we're testing locally with Lustre to get some real run time on this code.

paboldin commented 8 years ago

@behlendorf Well, I have looked through it few times and did quite a stress test of the code for my (soon to be released) module for ZFS quotatools support.

It works just great, we double checked that by comparing actual disk usage with a zfs {user,group}space output.

behlendorf commented 8 years ago

@paboldin thanks for reviewing and testing this feature. I'm glad to know that more people have put this through its paces! It's almost ready to be merged once the last few remaining issues get wrapped up. Speaking of which you wouldn't happen to have written any test scripts we should add the the test suite to test this feature?

paboldin commented 8 years ago

Unfortunately -- no. But I think I will do it.

behlendorf commented 8 years ago

@paboldin you should be able to extend the existing userquota tests to get pretty good coverage. Although more is always better!

adilger commented 8 years ago

Any news on moving forward with this patch? It has been refreshed recently, and got 3rd party testing. Not sure what remains to be done...

behlendorf commented 8 years ago

@adilger @jxiong thanks for reminding me about this one. I've looked over the patch and it looks like there are just two things left to be wrapped up.

The userquota and groupquota tests in tests/zfs-tests/tests/functional/userquota/ need to be enabled and extended to test this new feature.
Once the tests are enable it needs to be rebased on master to get a clean run of those test cases. The patch is only about 92 commits behind so the rebase should be straight forward.

paboldin commented 8 years ago

I run into couple of issues with the testing of this.

First, it will require to add a few new users to check the usage. I have no other place to do it than the initalization of quota-specific tests. I'm on the tests now, but I barely have time to get this ahead.

Second, there is actual rebase required since some of the things around the code have changed (large-block feature added).

jxiong commented 8 years ago

@paboldin I have done rebase and there was no conflict at all. If you have started the test work, can you please share me your work to save me some effort?

behlendorf commented 8 years ago

First, it will require to add a few new users to check the usage.

The setup.ksh script in userquota should create the required groups and users for the purposes of the tests. This functionality may not have worked in the past but was resolved when support for delegations was merged. The only remaining caviot is that when running in-tree group or world read permission must be set on the zfs directory and its parents so the newly created uses can access the test scripts and utilities.

behlendorf commented 8 years ago

At some point we're also going to need to reconcile this feature with #4709. Laying this on #4709 may make it easier to push back upstream to OpenZFS.

jxiong commented 8 years ago

@behlendorf I pushed a new patch and add new tests to verify user object accounting. Please take a look.

behlendorf commented 8 years ago

@jxiong thanks for the quick turnaround. After building the change locally for some manual testing I ran in to a few more things which need to be resolved. Functionally everything worked as designed with one exception so these are mainly user interface concerns which @adilger and @paboldin may have some thoughts on.

The userspace and groupspace sections of the zfs(8) man page needs to be updated to mention the new properties userobj@user, userobjquota@user, groupobj@user, groupobjquota@user.
You are able to set a user object quota with zfs set userobjquota@behlendo=1000 but nothing is enforced. From a user perspective this will appear completely broken. Checks need to be added to the create path to enforce this, you've already done 90% of the work by adding the properties for an objquota. The enforcement can be done lazily just like for bytes.

$ zfs userspace tank/fs
TYPE        NAME       USED  QUOTA  OBJUSED  OBJQUOTA
POSIX User  behlendo  9.08M    10G    9.08K      1000
                                      ^^^^^

There was no change to the filesystem version number (as mentioned by the commit) and there's no way for a user to know they need to run zfs upgrade to activate this feature per-dataset. For consistency with the other features it would make more sense to automatically upgrade all datasets when the pool feature is enabled. This has the benefit of allowing you to shed the need for a new ioctl.

zpool set feature@userobj_accounting=enabled tank

When testing this I originally thought the patch was over counting the number of objects by 3x. Then I remembered that when selinux is enabled each file actually consumes 3 objects when xattr=on (file, xattr dir, xattr). Setting xattr=sa dropped the object usage back down to the expected 1 object per file. Functionally this is all correct but it will be confusing for users who are trying to set quotas. We should either:
- Exclude xattr-related objects, or
- Clearly document this as expected behavior in zfs(8)

My preference would be to update the documentation. I think there's value in reporting the actual usage and not misleading our users who are pretty savy about this kind of thing. It just needs to be explained.

jxiong commented 8 years ago

@behlendorf thanks for your suggestion. My only question is about upgrading all dataset automatically when pool is upgraded - the difficulty is that it has to iterate all existing objects in dataset to turn this feature on. In order to implement your proposal, we need to first iterate all dataset in the pool, and then iterate existing objects for each DMU_OST_ZFS, and the problem is that there is such a callback in feature_enable_sync(). Should we add a callback into zfeature_info_t or do you have any better ideas?

I will address your other concerns in the next revision.

behlendorf commented 8 years ago

Good point. How about in dmu_objset_do_userquota_updates() detecting a non-upgraded objset and then spawn a new kernel thread which can run dmu_objset_userspace_upgrade(). This has a few advantages:

It will be activated immediately for any in-use datasets effectively prioritizing them.
IO won't be wasted upgrading rarely used unmounted filesystems until they are used.
The potentially long running upgrade process will run asynchronously without the need for user involvement.
In the case of a crash the upgrade process will get automatically restarted.

The user need only enable the feature and eventually the quota information will be up to date.

jxiong commented 8 years ago

Based on the facts as follows:

this feature is not much frequently used;
the code you mentioned won't be exercised after some time because newly created pool will enable this feature automatically;
your proposal is hard to implement - for example, if a dataset is mounted during the iteration, it will end up with an actively used dataset is not upgraded.

I tend to keep the code as-is to keep it easy to understand and let user decide which 'old' dataset should be upgraded. Lustre is so far the only user and I will take care of it on my side.

behlendorf commented 8 years ago

this feature is not much frequently used;

Hopefully this won't be the case. Once this functionality is available I'm sure many people will find it useful.

The code you mentioned won't be exercised after some time

That's true, but it's not an uncommon thing there are many places where something in the filesystem gets upgraded on next access. For example updating objects in a v4 dataset to SAs once upgraded to v5. No walk is performed in this case they're just updated as used but this is also dead code for any newly created v5 datasets. In fact, a dmu_objset_userused_enabled() check already happens in spa_sync()->dsl_pool_sync()->dsl_dataset_sync()->dmu_objset_sync() for sightly different reasons. This may be a better place to trigger the upgrade.

your proposal is hard to implement - for example, if a dataset is mounted during the iteration, it will end up with an actively used dataset is not upgraded.

I don't see why this poses new problems or adds any significant complications. All I'm suggesting is that the call to zfs_ioc_userobjspace_upgrade() be done in the context of a new kernel thread created for that purpose instead of in the context of the zfs upgrade process. Functionally, these things are equivalent. If there's a problem with upgrading mounted datasets it's an issue for both approaches.

The major reason I don't care for the zfs upgrade approach is because unless we bump the version number the command line utility will report that there is nothing to be upgraded. Yet running the command will still trigger the upgrade. That's going to be confusing.

So if we do decide zfs upgrade is the right way to go we need to make sure to get buy in from the other OpenZFS developers. There's the related issue that version 6 was already used by Oracle's ZFS, but these days I don't think many people confuse that version of ZFS with OpenZFS.

jxiong commented 8 years ago

Functionally, these things are equivalent. If there's a problem with upgrading mounted datasets it's an issue for both approaches.

I thought you suggested to upgrade all active dataset in the system - this will pose problems of race condition, for example, during this process a new dataset is mounted, etc. This may need some efforts to make it correct. I meant if this were useful code and exercised often, it would be worth doing.

In fact, a dmu_objset_userused_enabled() check already happens in spa_sync()->dsl_pool_sync()->dsl_dataset_sync()->dmu_objset_sync() for sightly different reasons. This may be a better place to trigger the upgrade.

The problem is that we have to dirty all dnodes in the dataset, which is not feasible to perform in sync context because this will hold txg sync way too long if the dataset has huge amount of objects.

I dislike the current way of upgrading either - actually I often forgot the command to upgrade individual dataset. But this is the best way I can find so far.

jxiong commented 8 years ago

I pushed a new update and in this new version, I updated:

object quota check
new test cases for object quota check(userquota_013_pos.ksh)
man page update for object usage and quota

Please take a look.

behlendorf commented 8 years ago

The problem is that we have to dirty all dnodes in the dataset, which is not feasible to perform in sync context because this will hold txg sync way too long if the dataset has huge amount of objects.

Right, I wasn't suggesting that we dirty the inodes in the sync context. Just that we use dmu_objset_sync() or even zfs_domount() as an appropriate location to spawn a new dedicated thread which calls zfs_ioc_userobjspace_upgrade(). This thread would then handle the traversal and exit when complete. This is the same as if this had been done in the context of the zfs upgrade ioctl.

There are some cases to consider like potentially needing to kill the thread before it completes to allow the pool to be exported. Allowing datasets to be mounted/unmount at the same time. And making sure there's only one thread doing the upgrade. But this should be relatively straight forward and many of these things could happen with the zfs upgrade approach too.

We're looking at needing to activate this feature on datasets with 10's of billions of objects. Even optimistically doing this upgrade is going to take many hours, maybe days, to complete. Being able to just enable the feature and let it run in the background until its complete is pretty appealing to us..

behlendorf commented 8 years ago

Thanks for the update, I'll put it through it's paces early next week. I made a few suggested changes to the wording in the man page updates. I also noticed some of the newly enabled test cases failed so we'll need to look in to exactly why.

adilger commented 7 years ago

Anything left to do for this patch? We'd really like to get this into 0.7.0 before it is released, so that we can fix the Lustre inode accounting, which causes fairly significant performance overhead and has correctness issues without this patch.

ahrens commented 7 years ago

@adilger FYI, on Tuesday Brian and I discussed some relatively minor changes that we'd like. I left a comment on the PR https://github.com/zfsonlinux/zfs/pull/3983

behlendorf commented 7 years ago

@adilger agreed, this feature is definitely a blocked for us for an 0.7.0 tag. Hopefully just some minor fixes required.

behlendorf commented 7 years ago

Merged in PR #3983.

yuri-ccp commented 7 years ago

I tested this new dnode support in the current 0.7 release and it's working greatly.

But I thing metioned here is not working.

@behlendorf mentioned in head post that one of required of required functionality is to implement ioctl() handlers so utilities such as repquota work. But I tested in CentOS 7 and the repquota still don't recognize the ZFS user/groupquotas. This is wasnt implemented?

paboldin commented 7 years ago

Here is a thirdparty module i authored: https://github.com/FastVPSEestiOu/zfs-quota

Please feel free to use it! Any feedback is welcome!

behlendorf commented 7 years ago

@paboldin now that this functionality is an a released version of ZFS, are you interested in integrating your thirdparty work in to ZoL?

paboldin commented 7 years ago

@behlendorf yes. We will need to discuss how you guys see the integration.

behlendorf commented 7 years ago

@paboldin can't we simply register a quotactl_ops struct with the super block at mount time? Then use your existing implementation to add the needed handlers? Or are there additional complications?

paboldin commented 7 years ago

@behlendorf IIRC, quota-tools require FSs to have a valid block device. I will take a look at this.

adilger commented 7 years ago

This also relates to issue #2922.

nickcmaynard commented 1 year ago

Hi folks, what sort of chance is there of @paboldin's work being integrated here?

Though small, we have a production multi-user Linux system, on which we have enabled ZFS user quotas. We can't report, warn, etc. on that, because the existing tools (repquota, warnquota) rely on the standard quota tools. I don't understand much of this, but I hope that @paboldin's work might help to solve that.

openzfs / zfs

dnode accounting for quotas #3500