linux-nfs / nfsd

Linux kernel source tree

Server-side persistent NFSv4 sessions #5

Open chucklever opened 6 months ago

chucklever commented 6 months ago

This was bugzilla.linux-nfs.org 356

[Chuck Lever 2020-12-08 20:15:16 UTC] Remembering NFSv4 sessions across server restarts can help speed up state recovery and close the window on certain edge cases. There is a latency cost to maintaining session state on durable storage, however.

chucklever commented 6 months ago

[J. Bruce Fields 2020-12-08 21:04:12 UTC] Seems like low-latency storage is getting easier to find these days. Also, I think the operations we care about already require a commit to stable storage. (Is there an exception?)

Note that if knfsd just put its current duplicate reply cache in stable storage, that wouldn't quite be enough:

Suppose for example the server gets a CREATE call for a new directory FOO, records the fact that the rpc is in progress, calls vfs_mkdir(), then crashes.

When it comes back up it gets the replayed CREATE request, matches it to the DRC record, runs vfs_mkdir() again, and gets EEXIST. Is that because our previous vfs_mkdir succeeded, or was FOO already there? (Or was it created by someone else at the same time?)

It's exactly the same problem the client faces when it resends the call and gets EEXIST.

So, I'm not sure what to do. I think some cooperation from the filesystem is required.
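The ambiguity can be reproduced with a small toy model. This is only a sketch under stated assumptions: `ToyFilesystem`, `handle_create`, and `run_scenario` are hypothetical names, not knfsd or VFS interfaces, the DRC is a plain dict assumed to survive the crash, and the directory created by the first attempt is assumed to be durable.

```python
IN_PROGRESS = "in_progress"


class Crash(Exception):
    """Stands in for the server dying mid-request."""


class ToyFilesystem:
    """Hypothetical stand-in for the exported filesystem (not a kernel API)."""

    def __init__(self, existing=()):
        self.dirs = set(existing)

    def mkdir(self, name):
        if name in self.dirs:
            return "EEXIST"
        self.dirs.add(name)
        return "OK"


def handle_create(fs, drc, key, name, crash_after_mkdir=False):
    """Naive handler: persist an in-progress DRC record, then call mkdir."""
    cached = drc.get(key)
    if cached is not None and cached != IN_PROGRESS:
        return cached  # a completed reply was cached, so it is safe to resend
    drc[key] = IN_PROGRESS  # assumed to reach stable storage immediately
    status = fs.mkdir(name)
    if crash_after_mkdir:
        raise Crash()  # the reply is never recorded
    drc[key] = status
    return status


def run_scenario(foo_preexists):
    """Return (DRC record seen at restart, status the replay produces)."""
    fs = ToyFilesystem(existing={"FOO"} if foo_preexists else ())
    drc = {}  # survives the simulated crash
    key = ("session-1", 0, 1)  # made-up (session ID, slot ID, sequence ID)
    try:
        handle_create(fs, drc, key, "FOO", crash_after_mkdir=True)
    except Crash:
        pass
    record_at_restart = drc[key]
    replay_status = handle_create(fs, drc, key, "FOO")
    return record_at_restart, replay_status


our_mkdir_succeeded = run_scenario(foo_preexists=False)
foo_was_already_there = run_scenario(foo_preexists=True)
```

Both scenarios leave the same in-progress record and produce the same EEXIST on replay, even though the right answer to the client differs between them. The persistent record alone gives the server nothing to tell them apart.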

chucklever commented 6 months ago

[J. Bruce Fields 2020-12-08 21:30:56 UTC] See https://tools.ietf.org/html/rfc5661#section-2.10.6.5 for the relevant RFC language.

The paper they reference there also looks interesting: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.33.7106

As described in sections 3.1 and 4.1, their solution is to store DRC entries in the filesystem log.

chucklever commented 6 months ago

[J. Bruce Fields 2020-12-08 22:03:46 UTC] I'm surprised by this language in the spec: "The execution of the sequence of operations (starting with SEQUENCE) and placement of its results in the persistent cache MUST be atomic."

Consider a persistent session implementation that works something like this:

When compound processing reaches a nonidempotent op, the server records to the filesystem log, along with the filesystem operation itself, a DRC entry including the session ID, slot ID, sequence ID, and xdr encoding of the response so far.

On recovery after a crash, when the server receives the replay, it looks up this entry in the filesystem log, copies the partial xdr encoding into its response buffer, and continues where it left off.

The execution won't be at all atomic, in the sense that the filesystem may process operations from other RPCs or non-NFS users concurrently. But it doesn't seem to weaken semantics compared to normal NFSv4.1+ behavior.

We've never before required servers to treat compounds as atomic filesystem transactions, and I don't think there's any need to start with this feature.
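The implementation sketched above might look roughly like the following toy model. Everything here is an assumption made for illustration: `ToyJournaledFs`, `DrcEntry`, and `run_compound` are hypothetical names, a single list append stands in for one atomic filesystem log record, a tuple of status strings stands in for the partial XDR encoding, and only `lookup` and `mkdir` operations are modeled.

```python
from dataclasses import dataclass


class Crash(Exception):
    """Stands in for the server dying mid-compound."""


@dataclass(frozen=True)
class DrcEntry:
    session_id: str
    slot_id: int
    seq_id: int
    ops_done: int         # how many ops of the compound are already answered
    partial_reply: tuple  # stand-in for the XDR-encoded response so far


class ToyJournaledFs:
    """Hypothetical filesystem whose log holds (operation, DRC entry) pairs."""

    def __init__(self):
        self.log = []

    def dirs(self):
        return {name for (kind, name), _ in self.log if kind == "mkdir"}

    def commit(self, op, entry):
        # One append models one atomic log record covering both pieces.
        self.log.append((op, entry))

    def find(self, session_id, slot_id, seq_id):
        wanted = (session_id, slot_id, seq_id)
        for _, e in reversed(self.log):
            if (e.session_id, e.slot_id, e.seq_id) == wanted:
                return e
        return None


def run_compound(fs, session_id, slot_id, seq_id, ops, crash_after=None):
    """Execute ops, resuming from a logged DRC entry if one exists."""
    entry = fs.find(session_id, slot_id, seq_id)
    reply = list(entry.partial_reply) if entry else []
    start = entry.ops_done if entry else 0
    for i in range(start, len(ops)):
        kind, name = ops[i]
        if kind == "lookup":  # idempotent: nothing to log
            reply.append("FOUND" if name in fs.dirs() else "NOENT")
        elif name in fs.dirs():  # a failing mkdir changes nothing
            reply.append("EEXIST")
        else:  # nonidempotent mkdir: log the op and the DRC entry together
            reply.append("OK")
            fs.commit((kind, name),
                      DrcEntry(session_id, slot_id, seq_id, i + 1, tuple(reply)))
        if crash_after == i:
            raise Crash()
    return reply


ops = [("lookup", "FOO"), ("mkdir", "FOO"), ("lookup", "FOO")]
fs = ToyJournaledFs()
try:
    run_compound(fs, "session-1", 0, 1, ops, crash_after=1)  # dies after mkdir
except Crash:
    pass
replayed = run_compound(fs, "session-1", 0, 1, ops)  # client replays the call
```

In this model the replay picks up after the logged mkdir and reports it as successful rather than re-running it and getting EEXIST. Nothing here prevents other callers from touching the filesystem between operations, which matches the point that the compound as a whole need not be atomic.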

chucklever commented 6 months ago

[J. Bruce Fields 2020-12-08 22:56:06 UTC] A problem with the implementation sketched in the previous comment is that it seems to require an unbounded filesystem journal.

The journal would be required to keep an entry for the most recent operation on each session slot. But any number of operations could be processed without touching a given slot, so assuming the journal is append-only, the journal size is unbounded.

I think that can probably be solved by requiring the server to keep its own copy of the DRC in stable storage and only requiring the filesystem journal to store IDs generated from an integer counter. The counter could be used to cross-reference journal entries with DRC entries. IDs found in the DRC but not found in the journal could either be "too old" (so they represent changes that were already committed to the filesystem and removed from the journal) or "too new" (so they represent changes that never even made it to the journal before the crash). The server should be able to distinguish the two cases by integer comparison, as long as the journal just knows the largest ID it ever recorded.

The server can therefore copy the partially encoded result from its DRC and either retry the next operation (in the "too new" case) or encode a successful result and continue (in the "too old" case).
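The integer comparison could be as simple as the sketch below. The function and parameter names are hypothetical, and the sketch assumes IDs reach the journal in counter order, so that any ID at or below the largest one the journal ever recorded must have been journaled at some point. What to do with an ID that is still in the journal is not covered above, so the sketch only reports that case.

```python
def classify_drc_id(drc_id, ids_in_journal, largest_id_journaled):
    """Classify a DRC entry's ID against the filesystem journal after a crash.

    Assumes IDs come from a monotonically increasing counter and are
    journaled in counter order.
    """
    if drc_id in ids_in_journal:
        return "in journal"
    if drc_id <= largest_id_journaled:
        # Journaled once, then committed to the filesystem and trimmed.
        return "too old"
    # Never made it to the journal before the crash.
    return "too new"


# Recovery actions for the two cases discussed above.
RECOVERY_ACTION = {
    "too old": "encode a successful result and continue",
    "too new": "retry the operation",
}

# Example: the journal currently holds IDs 5 and 6, and 6 is the largest
# ID it has ever recorded.
journal_ids = {5, 6}
older = classify_drc_id(3, journal_ids, largest_id_journaled=6)
newer = classify_drc_id(7, journal_ids, largest_id_journaled=6)
present = classify_drc_id(5, journal_ids, largest_id_journaled=6)
```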

I don't know, I probably need someone with a good understanding of filesystem internals to get this right.

chucklever commented 6 months ago

[J. Bruce Fields 2020-12-09 01:07:51 UTC] We'd also need to figure out how to handle the case where the server reboots after the filesystem has replayed its journal, but before clients have had a chance to replay their RPC requests.

chucklever commented 6 months ago

[J. Bruce Fields 2020-12-09 22:51:24 UTC]

> We'd also need to figure out how to handle the case where the server reboots after the filesystem has replayed its journal, but before clients have had a chance to replay their RPC requests.

I think the server has to get some hooks into the journal replay process somehow so that it can rebuild the DRC and make sure it reaches the server's own stable storage before the filesystem throws away important information?

The NFSv4 working group says I'm interpreting that spec language unnecessarily strictly, that what it really means is that the compound execution has to be atomic with the encoding somehow, not that the compound execution has to be atomic; see "[nfsv4] persistent sessions and compound atomicity" and followups:

https://mailarchive.ietf.org/arch/msg/nfsv4/LAeppkhbUHD-P-9vK0TuaNIIzgI/

chucklever commented 6 months ago

[J. Bruce Fields 2022-01-10 21:15:04 UTC] Though they likely wouldn't remember at this point, I did chat with Dave Chinner and other xfs people at the 2018 Utah LSF&MM meeting, and they were amenable to helping with the journaling work we'd need here, given a good specification of what we need from them. I didn't get around to writing up that specification.