Poor ctime resolution on all Linux filesystems

chucklever commented 7 months ago

Description

[Chuck Lever 2021-09-20 17:48:27 UTC] The Linux VFS layer clamps the resolution of file timestamps based on a set of per-filesystem type parameters. This clamping has been around since at least the beginning of the git era (2.6.12). The most modern filesystems clamp timestamp resolution to 1 jiffy.

NFS clients typically use ctime to detect changes made by other clients. But if multiple changes to a file can happen during the same jiffy, a client can't detect more than one of those changes.
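To make the collision concrete, here is a toy model rather than kernel code. It assumes a 4 ms tick (HZ=250), and the helper names `coarse_stamp` and `client_sees_change` are made up for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed tick length: 4 ms, i.e. HZ=250. Real kernels may use other values. */
#define TICK_NS 4000000ULL

/* Hypothetical helper: the timestamp a file would get if the clock only
 * advanced once per tick (round the true time down to a tick boundary). */
static uint64_t coarse_stamp(uint64_t true_ns)
{
    return true_ns - (true_ns % TICK_NS);
}

/* Hypothetical client-side check: a change is visible only if the ctime
 * differs from the value the client cached earlier. */
static bool client_sees_change(uint64_t cached_ctime, uint64_t server_ctime)
{
    return cached_ctime != server_ctime;
}
```

In this model, two writes landing 1 ms and 3 ms into the same tick both get the stamp of the tick boundary, so a client that cached the ctime after the first write has no way to notice the second.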

One proposed solution was to force users to move to NFSv4, which mandates strict changeattr behavior. This does not help users who prefer NFSv3 or are stuck on it for the foreseeable future.

Create a parking space for new ideas to address this shortcoming.

Comment 1

[J. Bruce Fields 2022-01-21 16:40:33 UTC] There's some question how much time we want to invest in NFSv3.

But do you see any reason we couldn't mix the i_version into the low bits of the ctime?

Comment 2

[Jeff Layton 2022-08-02 16:22:47 UTC] That might help mitigate the problem. The difficulty would be in ensuring that, when you fold those bits in, the ctime never appears to go backward. Since we do use a steadily increasing value for i_version, masking in the lower bits of i_version might be sufficient to do that.
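A rough sketch of what "masking in the lower bits" could look like, purely as an illustration. The 8-bit width and the name `fold_version` are assumptions, not anything taken from a patch.

```c
#include <stdint.h>

/* Assumed width of the folded field. 8 bits keeps the result below
 * 1,000,000,000 for any valid tv_nsec, since 999,999,999 is 0x3B9AC9FF. */
#define FOLD_BITS 8
#define FOLD_MASK ((1U << FOLD_BITS) - 1U)

/* Hypothetical helper: overwrite the low bits of a nanosecond value with
 * the low bits of a change counter such as i_version. */
static uint32_t fold_version(uint32_t nsec, uint64_t i_version)
{
    return (nsec & ~FOLD_MASK) | (uint32_t)(i_version & FOLD_MASK);
}
```

With the same coarse nanosecond value of 4000000, a counter of 255 yields 4000255 while 256 yields 4000000. In this toy model the folded field wraps after 256 changes within one coarse interval, which is exactly the "appears to go backward" case that would need handling.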

Comment 3

[Jeff Layton 2023-04-14 14:25:45 UTC] It's not so much that filesystems clamp resolution at 1 jiffy, but rather that they call ktime_get_coarse_real_ts64, which returns a cached time value that is only updated once per jiffy.

In principle we could call ktime_get_real_ts64, but:

1/ it's less efficient to poll the actual clock every time

...and...

2/ it would make it look to the filesystem like the ctime is always changing with writes, so we'd end up logging a lot more metadata transactions (which would be costly). With the way the filesystems use the coarse timestamps, they get to elide a lot of those now.
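The difference between the two clocks can be observed from userspace. The sketch below assumes Linux with glibc, where CLOCK_REALTIME_COARSE and CLOCK_REALTIME are roughly the userspace counterparts of the coarse and fine kernel calls. The helper names are made up.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE /* for CLOCK_REALTIME_COARSE on glibc; Linux-specific */
#endif
#include <time.h>

/* Read the tick-cached wall clock and the precise wall clock back to back.
 * Returns 0 on success, -1 if either read fails. */
static int read_both_clocks(struct timespec *coarse, struct timespec *fine)
{
    if (clock_gettime(CLOCK_REALTIME_COARSE, coarse) != 0)
        return -1;
    if (clock_gettime(CLOCK_REALTIME, fine) != 0)
        return -1;
    return 0;
}

/* Report the granularity of the coarse clock in nanoseconds, or -1 on error. */
static long coarse_resolution_ns(void)
{
    struct timespec res;

    if (clock_getres(CLOCK_REALTIME_COARSE, &res) != 0)
        return -1;
    return (long)res.tv_sec * 1000000000L + res.tv_nsec;
}
```

Calling `read_both_clocks` in a tight loop typically shows the coarse value standing still for a whole tick while the fine value keeps moving. `coarse_resolution_ns` reports the length of that tick.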

What I'm working on at the moment is a set of patches to make it use high res timestamps on a conditional basis. When the ctime is queried via getattr, we set a flag in the inode. If that flag is set on the next update, we'll use a high-res timestamp instead of the coarse value.

This should give us the holy grail, I think. We should be able to ensure that the ctime apparently changes when things change between getattr calls, but at fairly minimal cost.
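The conditional scheme can be modelled in a few lines. This is a toy sketch, not the patchset: `struct toy_inode` and both functions are hypothetical, and the clock readings are passed in by the caller so the behaviour is easy to follow.

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy inode: only the fields this sketch needs. */
struct toy_inode {
    uint64_t ctime_ns;  /* last recorded change time */
    bool queried;       /* set when someone has looked at ctime_ns */
};

/* getattr path: report the ctime and remember that it was observed. */
static uint64_t toy_getattr(struct toy_inode *inode)
{
    inode->queried = true;
    return inode->ctime_ns;
}

/* Update path: pay for the fine-grained clock only if the previous value
 * was observed; otherwise the cheap coarse value is good enough.
 * The caller supplies both clock readings so the sketch stays testable. */
static void toy_update_ctime(struct toy_inode *inode,
                             uint64_t coarse_now_ns, uint64_t fine_now_ns)
{
    uint64_t now = inode->queried ? fine_now_ns : coarse_now_ns;

    /* Never move the timestamp backward: the coarse clock lags the fine one. */
    if (now > inode->ctime_ns)
        inode->ctime_ns = now;
    inode->queried = false;
}
```

In this model a change that follows a getattr always gets a fresh fine-grained stamp, while a burst of writes nobody is watching keeps reusing the coarse value, so the filesystem can still elide most metadata updates.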

The main problem at this point is where to put the flag, and how to ensure that we can do this on a per-fs basis.

Comment 4

[J. Bruce Fields 2023-04-14 14:30:35 UTC] "We should be able to ensure that the ctime apparently changes when things change between getattr calls"

As with changeattr, though, behavior across reboots can be more complicated.

Comment 5

[Jeff Layton 2023-04-20 15:53:17 UTC] Absolutely. This shouldn't make anything worse.

Also, handling the ctime across a reboot is a safer proposition than with i_version. If you crash and the i_version has an apparent rollback, then you could end up with an i_version collision when a new and different change causes the value to be bumped again.

For that to happen with the ctime, you'd have to crash and roll back, and suffer a juuuust right RTC rollback at the same time. I think that's a little tougher to contrive.

My latest draft patchset for a high-res c/mtime uses the lowest bit in the tv_nsec field as a QUERIED flag (much like we do with i_version). If we're willing to sacrifice some more bits of ts granularity, then we have a little space to store some extra info to mitigate these sorts of issues.
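For illustration only, stealing the lowest bit of a nanosecond field for a flag might look like the following. The macro and function names are hypothetical and are not taken from the draft patchset.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed layout: bit 0 of the stored nanosecond field is the flag. */
#define TOY_QUERIED 1U

/* Store a fresh timestamp with the flag clear. */
static uint32_t toy_store_nsec(uint32_t nsec)
{
    return nsec & ~TOY_QUERIED;
}

/* Mark the stored value as having been observed. */
static uint32_t toy_mark_queried(uint32_t stored)
{
    return stored | TOY_QUERIED;
}

/* Test whether the stored value has been observed. */
static bool toy_was_queried(uint32_t stored)
{
    return (stored & TOY_QUERIED) != 0;
}

/* Value handed back to callers: the flag bit is masked out. */
static uint32_t toy_report_nsec(uint32_t stored)
{
    return stored & ~TOY_QUERIED;
}
```

The cost in this sketch is that reported timestamps have 2 ns granularity instead of 1 ns. Each additional stolen bit halves the granularity again, which is why only a few bits are available for extra information.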

The question of course is what to put in there. I guess we could put some sort of boot counter in there.

a/ We can't steal that many bits

...and...

b/ a boot counter requires some sort of persistent storage, which probably means something in userland will need to pass that info down

Maybe we could have nfsdcld do that? We might have to set that part of the ctime to 0's until the daemon is up and running though.

It wouldn't be foolproof, but it could help mitigate the problem, and probably wouldn't be too costly or difficult.
