Closed by behlendorf 7 years ago.
Brian, I want to participate as well. If you have some pointers where to start please share.
@paboldin the help would be appreciated! Pull request #2577 was a good initial prototype for adding this support, but it suffered from one major drawback. Specifically, it required that every dnode be rewritten when enabling the feature. That could be prohibitively expensive, and it introduced some unfortunate failure modes if your pool was full, so we needed a different approach.
I've attempted to summarize the updated design above so it can be implemented. At a high level it basically involves taking note of the specific TXG when this new feature flag is enabled. Then the existing dnode iterator can be used to traverse all the dnodes and build up the required accounting. Using just the TXG number from when the feature was activated, this can be done with the logic described above.
@paboldin you're definitely not the only person interested in this functionality, but so far no one else has had the time to implement it. My suggestion would be to start by looking at how some of the other feature flags have been implemented to familiarize yourself with that infrastructure. Then look into using a sync task to traverse the dnodes and build up the accounting; see the dsl_sync_task() function. Just let me know if you get stuck and I'll try and point you in the right direction!
I pushed a new patch and fixed the style issues. Please inspect.
@jxiong great news. I'll look it over; I think having @don-brady do a review would be good too.
@behlendorf, as far as I can see from the code, upgrading to a user/group space accounting version is currently done in a blocking manner (note that dmu_objset_userspace_upgrade waits for the txg to sync).
Should the scheme described above be applied to this code as well? If so, this will require a separate commit.
Can I use @jxiong's patches as a base, or are these under active (internal) development?
@paboldin aside from the work in #3723 I'm not aware of any active development on these patches. However, I'd very much like to get @jxiong's patch reviewed and finalized so they can be used as a base for further development. If you could help review these changes that would be great. But the basic scheme described here is still the preferred approach so the quota information can be recalculated online.
There is work for ext4 to add "Project Quotas" to match an equivalent feature in XFS - http://lists.openwall.net/linux-ext4/2015/09/13/2 and support for project quotas is also being added to Lustre.
This would require the addition of a new "Project ID" for each dnode (in a new SA), and additional quota accounting for the project ID. It would be great if the patch being implemented here also took this feature into account, to avoid needless conflict/incompatibility in the future. It may be that this is trivially handled through the use of a "pr-" prefix for project IDs like this patch uses "dn-", but I thought I would mention it before the patch is landed.
@jxiong @behlendorf I was able to find another couple of bugs. One of them is in the existing ZFS code, the other is in the proposed patch.
First, the values for z_userobjquota_obj and z_groupobjquota_obj need to be read analogous to the reading of the values for z_userquota_obj and z_groupquota_obj. Since the code is not yet merged I'm just going to leave this here.
Second, the zfs_userquota_prop_to_obj function that is used to fetch the object number for a given quota type returns ENOTSUP when the type is unknown. The caller never checks for this value being returned and can end up using an incorrect object whose number matches ENOTSUP. I think the function should emit a warning and return 0 in that case.
@jxiong do you need additional information from me? If any of the issues are described in an unclear manner please do not hesitate to ask for more info. I hope the patches attached to the last commit are self-explanatory, but I'm willing to finish this commit ASAP and move forward.
I'm tied up with other work; will take a look today. Thanks for testing and the patch.
It looks like in the absence of a quota set for the user, zfs userspace shows the objused value instead. Need to investigate further though.
All of the functional quota tests from Illumos have been added to master. A good percentage of these tests needed to be disabled in the tests/runfiles/linux.run runfile. Getting all of these tests passing would go a long way towards getting things ship-shape on the quota front. Plus then you'll be able to add a few new test cases to verify this feature is working properly when merged.
tests/zfs-tests/tests/functional/quota/
tests/zfs-tests/tests/functional/userquota/
tests/zfs-tests/tests/functional/refquota/
A patch to add output of the new variables to zfs userspace/groupspace (probably the wrong place though):
https://github.com/paboldin/zfs/commit/3e4ce22d9e1d40b3d41289054f201438d06a4629
It also fixes a bug where all unresolved names are squashed into one entry.
[root@localhost ~]# zfs userspace nirvana/1100 -ip
TYPE NAME USED QUOTA
POSIX User 1500966552 16198144 1671
POSIX User root 258401280 14870
[root@localhost ~]# zfs userspace nirvana/1100 -inp
TYPE NAME USED QUOTA
POSIX User 0 258401280 14870
POSIX User 100 512 1
POSIX User 2000 10158080 1152
POSIX User 531572586 393216 256
POSIX User 1008467872 5180416 642
POSIX User 1500966552 2429440 899
POSIX User 1791674584 13706240 1414
POSIX User 2202513206 16198144 1671
POSIX User 2388961035 393216 256
POSIX User 3314219856 12459520 1285
POSIX User 3577253261 2295296 129
POSIX User 4142225625 9321472 1290
@behlendorf - Jinshan pushed an updated patch on April 13, what is needed to move this forward?
@adilger thanks for reminding me about this one. What's needed is review and testing. I'll make another review pass on this but it would be great if @paboldin @ahrens could comment too. I'll also add this patch to the stack of changes we're testing locally with Lustre to get some real run time on this code.
@behlendorf Well, I have looked through it a few times and did quite a stress test of the code for my (soon to be released) module for ZFS quotatools support.
It works just great; we double-checked that by comparing actual disk usage with the zfs {user,group}space output.
@paboldin thanks for reviewing and testing this feature. I'm glad to know that more people have put this through its paces! It's almost ready to be merged once the last few remaining issues get wrapped up. Speaking of which, you wouldn't happen to have written any test scripts we should add to the test suite to test this feature?
Unfortunately -- no. But I think I will do it.
@paboldin you should be able to extend the existing userquota tests to get pretty good coverage. Although more is always better!
Any news on moving forward with this patch? It has been refreshed recently, and got 3rd party testing. Not sure what remains to be done...
@adilger @jxiong thanks for reminding me about this one. I've looked over the patch and it looks like there are just two things left to be wrapped up.
The tests in tests/zfs-tests/tests/functional/userquota/ need to be enabled and extended to test this new feature.
I ran into a couple of issues with testing this.
First, it will require adding a few new users to check the usage. I have no place to do this other than the initialization of the quota-specific tests. I'm working on the tests now, but I barely have time to push this ahead.
Second, an actual rebase is required since some of the code around this has changed (the large-block feature was added).
@paboldin I have done a rebase and there were no conflicts at all. If you have started the test work, can you please share your work with me to save me some effort?
First, it will require adding a few new users to check the usage.
The setup.ksh script in userquota should create the required groups and users for the purposes of the tests. This functionality may not have worked in the past, but that was resolved when support for delegations was merged. The only remaining caveat is that when running in-tree, group or world read permission must be set on the zfs directory and its parents so the newly created users can access the test scripts and utilities.
At some point we're also going to need to reconcile this feature with #4709. Layering this on top of #4709 may make it easier to push back upstream to OpenZFS.
@behlendorf I pushed a new patch and added new tests to verify user object accounting. Please take a look.
@jxiong thanks for the quick turnaround. After building the change locally for some manual testing I ran into a few more things which need to be resolved. Functionally everything worked as designed with one exception, so these are mainly user interface concerns which @adilger and @paboldin may have some thoughts on.
1. The zfs(8) man page needs to be updated to mention the new properties userobj@user, userobjquota@user, groupobj@user, groupobjquota@user.
2. You can run zfs set userobjquota@behlendo=1000 but nothing is enforced. From a user perspective this will appear completely broken. Checks need to be added to the create path to enforce this; you've already done 90% of the work by adding the properties for an objquota. The enforcement can be done lazily, just like for bytes.
$ zfs userspace tank/fs
TYPE NAME USED QUOTA OBJUSED OBJQUOTA
POSIX User behlendo 9.08M 10G 9.08K 1000
^^^^^
3. Currently zfs upgrade is needed to activate this feature per-dataset. For consistency with the other features it would make more sense to automatically upgrade all datasets when the pool feature is enabled, e.g. zpool set feature@userobj_accounting=enabled tank. This has the benefit of allowing you to shed the need for a new ioctl.
4. With xattr=on multiple objects are charged per file (file, xattr dir, xattr). Setting xattr=sa dropped the object usage back down to the expected 1 object per file. Functionally this is all correct, but it will be confusing for users who are trying to set quotas. We should either explain this behavior in the zfs(8) man page or change what gets reported.
My preference would be to update the documentation. I think there's value in reporting the actual usage and not misleading our users, who are pretty savvy about this kind of thing. It just needs to be explained.
@behlendorf thanks for your suggestion. My only question is about upgrading all datasets automatically when the pool is upgraded - the difficulty is that it has to iterate all existing objects in a dataset to turn this feature on. In order to implement your proposal, we need to first iterate all datasets in the pool, and then iterate the existing objects for each DMU_OST_ZFS, and the problem is that there is no such callback in feature_enable_sync(). Should we add a callback to zfeature_info_t, or do you have any better ideas?
I will address your other concerns in the next revision.
Good point. How about detecting a non-upgraded objset in dmu_objset_do_userquota_updates() and then spawning a new kernel thread which can run dmu_objset_userspace_upgrade(). This has a few advantages:
The user need only enable the feature and eventually the quota information will be up to date.
Based on the following facts, I tend to keep the code as-is, to keep it easy to understand, and to let the user decide which 'old' datasets should be upgraded. Lustre is so far the only user and I will take care of it on my side.
- this feature is not frequently used;
Hopefully this won't be the case. Once this functionality is available I'm sure many people will find it useful.
- The code you mentioned won't be exercised after some time
That's true, but it's not uncommon; there are many places where something in the filesystem gets upgraded on next access. For example, updating objects in a v4 dataset to SAs once upgraded to v5. No walk is performed in this case; they're just updated as used, but this is also dead code for any newly created v5 datasets. In fact, a dmu_objset_userused_enabled() check already happens in spa_sync()->dsl_pool_sync()->dsl_dataset_sync()->dmu_objset_sync() for slightly different reasons. This may be a better place to trigger the upgrade.
- your proposal is hard to implement - for example, if a dataset is mounted during the iteration, it could end up with an actively used dataset that is not upgraded.
I don't see why this poses new problems or adds any significant complications. All I'm suggesting is that the call to zfs_ioc_userobjspace_upgrade() be done in the context of a new kernel thread created for that purpose, instead of in the context of the zfs upgrade process. Functionally, these things are equivalent. If there's a problem with upgrading mounted datasets, it's an issue for both approaches.
The major reason I don't care for the zfs upgrade approach is that unless we bump the version number, the command line utility will report that there is nothing to be upgraded, yet running the command will still trigger the upgrade. That's going to be confusing.
So if we do decide zfs upgrade is the right way to go, we need to make sure to get buy-in from the other OpenZFS developers. There's the related issue that version 6 was already used by Oracle's ZFS, but these days I don't think many people confuse that version of ZFS with OpenZFS.
Functionally, these things are equivalent. If there's a problem with upgrading mounted datasets it's an issue for both approaches.
I thought you suggested upgrading all active datasets in the system - this will pose race condition problems, for example if a new dataset is mounted during the process. This may take some effort to make correct. I meant that if this were frequently exercised code, it would be worth doing.
In fact, a dmu_objset_userused_enabled() check already happens in spa_sync()->dsl_pool_sync()->dsl_dataset_sync()->dmu_objset_sync() for slightly different reasons. This may be a better place to trigger the upgrade.
The problem is that we have to dirty all dnodes in the dataset, which is not feasible to perform in sync context because it would hold the txg sync for way too long if the dataset has a huge number of objects.
I don't like the current way of upgrading either - actually, I often forget the command to upgrade an individual dataset. But this is the best way I can find so far.
I pushed a new update and in this new version, I updated:
Please take a look.
The problem is that we have to dirty all dnodes in the dataset, which is not feasible to perform in sync context because it would hold the txg sync for way too long if the dataset has a huge number of objects.
Right, I wasn't suggesting that we dirty the dnodes in the sync context. Just that we use dmu_objset_sync() or even zfs_domount() as an appropriate location to spawn a new dedicated thread which calls zfs_ioc_userobjspace_upgrade(). This thread would then handle the traversal and exit when complete. This is the same as if it had been done in the context of the zfs upgrade ioctl.
There are some cases to consider, like potentially needing to kill the thread before it completes to allow the pool to be exported, allowing datasets to be mounted/unmounted at the same time, and making sure there's only one thread doing the upgrade. But this should be relatively straightforward, and many of these things could happen with the zfs upgrade approach too.
We're looking at needing to activate this feature on datasets with tens of billions of objects. Even optimistically, this upgrade is going to take many hours, maybe days, to complete. Being able to just enable the feature and let it run in the background until it's complete is pretty appealing to us.
Thanks for the update, I'll put it through its paces early next week. I made a few suggested changes to the wording in the man page updates. I also noticed some of the newly enabled test cases failed, so we'll need to look into exactly why.
Anything left to do for this patch? We'd really like to get this into 0.7.0 before it is released, so that we can fix the Lustre inode accounting, which causes fairly significant performance overhead and has correctness issues without this patch.
@adilger FYI, on Tuesday Brian and I discussed some relatively minor changes that we'd like. I left a comment on the PR https://github.com/zfsonlinux/zfs/pull/3983
@adilger agreed, this feature is definitely a blocker for the 0.7.0 tag. Hopefully just some minor fixes are required.
Merged in PR #3983.
I tested this new dnode support in the current 0.7 release and it's working great.
But one thing mentioned here is not working.
@behlendorf mentioned in the head post that one of the required pieces of functionality is to implement ioctl() handlers so utilities such as repquota work. But I tested on CentOS 7 and repquota still doesn't recognize the ZFS user/group quotas. Was this not implemented?
Here is a third-party module I authored: https://github.com/FastVPSEestiOu/zfs-quota
Please feel free to use it! Any feedback is welcome!
@paboldin now that this functionality is in a released version of ZFS, are you interested in integrating your third-party work into ZoL?
@behlendorf yes. We will need to discuss how you guys see the integration.
@paboldin can't we simply register a quotactl_ops struct with the super block at mount time? Then use your existing implementation to add the needed handlers? Or are there additional complications?
@behlendorf IIRC, quota-tools require FSs to have a valid block device. I will take a look at this.
This also relates to issue #2922.
Hi folks, what sort of chance is there of @paboldin's work being integrated here?
We have a small but production multi-user Linux system, on which we have enabled ZFS user quotas. We can't report, warn, etc. on that usage because the existing tools (repquota, warnquota) rely on the standard quota interfaces. I don't understand much of this, but I hope that @paboldin's work might help to solve it.
During the OpenZFS summit @thegreatgazoo and I had a chance to discuss this. We came up with a nice clean solution based on @johannlombardi's initial patch. If Issac has the time to do the heavy lifting on this, I'm happy to work with him to get the patch reviewed and into a form where it can be merged. There's a fair bit of work remaining to get it where it needs to be, but it should be pretty straightforward:
Required functionality:
Implementation details:
This is a good idea, but we can do better. If we were to use full snapshots there would be some significant downsides.
Luckily, for this specific case a full snapshot isn't needed. Storing the TXG number in which the feature was enabled and the per-dataset dnode number for the traversal is enough. This is possible because every dnode already stores the TXG number it was originally allocated in (dn->dn_allocated_txg). We can also leverage the fact that the traversal will happen strictly from lowest- to highest-numbered dnodes, which means we can split the problem up like this:
The traversal part of this would need to be done in small batches in a sync task. This would allow us to transactionally update the dataset->scan_object on disk so the traversal can be resumed, and it simplifies concurrency concerns. Doing it this way addresses my concerns above and nicely simplifies the logic and the amount of code needed. For example, all that's needed to abort the entire operation or disable the feature is to stop the traversal sync task (if running) and remove the two new ZAPs.