linux-nfs / nfsd

Linux kernel source tree

Server-side dynamic session slot allocation #28

Open chucklever opened 7 months ago

chucklever commented 7 months ago

This was bugzilla.linux-nfs.org 375

[J. Bruce Fields 2022-01-11 16:44:39 UTC] One of the problems with the old DRC was that there was no way for the server to limit client concurrency or lifetime of DRC entries, therefore no way to limit DRC size.

With sessions, on the other hand, we can limit the number of slots, and we know when a slot is retired and available to be reused. We negotiate the number of slots at CREATE_SESSION time, so we can fail a mount up front if available memory wouldn't allow us to provide that client exactly-once semantics.

The decision as to how much session DRC to grant to a client is a memory/performance tradeoff. Our current implementation is mainly in nfsd4_get_drc_mem and set_max_drc. It tends to err on the conservative side. We've gotten complaints over the years of mounts failing unnecessarily. We probably also grant clients fewer slots than they could use in some cases.

One of the possible optimizations we could make here is dynamic slot allocation: the RFCs allow the server to grant or take away slots after CREATE_SESSION, so it can allocate slots to clients as they need them instead of having to guess everything up front. See the discussion of highest_slotid in https://datatracker.ietf.org/doc/html/rfc8881.html#section-2.10.6.1
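
For reference, the reply side of that mechanism is a pair of fields carried in every SEQUENCE result; this is the XDR from RFC 8881, section 18.46, with comments added:

```c
struct SEQUENCE4resok {
        sessionid4      sr_sessionid;
        sequenceid4     sr_sequenceid;
        slotid4         sr_slotid;
        slotid4         sr_highest_slotid;        /* current server limit */
        slotid4         sr_target_highest_slotid; /* where the server wants
                                                     the client to converge */
        uint32_t        sr_status_flags;
};
```

Because these fields ride on every reply, the server can raise or lower a client's slot count continuously, without any extra round trips.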

Trond posted some patches in 2012 that didn't get merged: https://lore.kernel.org/linux-nfs/1354159063-17343-1-git-send-email-Trond.Myklebust@netapp.com/

The thread at https://lore.kernel.org/linux-nfs/1529598933-16506-2-git-send-email-manjunath.b.patil@oracle.com/ also has some further discussion. Also note, from Trond, "if I were redoing the patches today, I'd probably change them to ensure that we only grow the session slot table size using the sequence target_slot, but shrink it using the CB_RECALL_SLOT mechanism. That makes for a much cleaner implementation with less heuristics needed on the part of both the client and server."

chucklever commented 7 months ago

[Chuck Lever 2022-02-20 17:02:57 UTC] Copied from https://lore.kernel.org/linux-nfs/194829a2c280364faa6e9c70dbaee463101453a7.camel@hammerspace.com/T/

IIRC the only downside to a large default slot count on the server is that it can waste memory, and it is difficult to handle the corner cases when the server is running on a small physical host (or in a small container).

I would have a small default slot count (one page of slots??), which automatically grew when it reached some level - say 70% - provided the required kmalloc succeeded (with GFP_NORETRY or similar so that it doesn't try too hard). It would register a "shrinker" so that it could respond to memory pressure and scale back the slot count when memory is tight.

Freeing slot memory would not be quick, as you might need to wait for the client to stop using it, so allocating new memory should be correspondingly sluggish.

Shouldn't be too hard.... Definitely don't want a tunable for this.
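
A minimal sketch of that grow-on-demand policy, assuming hypothetical bookkeeping (the real nfsd structures differ, and the pointer array is assumed to have spare capacity):

```c
#include <linux/slab.h>
#include <linux/types.h>

/* Purely illustrative; all names here are hypothetical. */
struct slot_table {
	u32	nr_slots;	/* slots currently allocated */
	u32	highest_used;	/* highest slot ID the client has used */
	void	**slots;	/* per-slot reply cache entries */
	size_t	slot_size;	/* negotiated per-slot cache size */
};

#define SLOT_GROW_PCT	70	/* grow when ~70% of slots are in use */

/* Called after each SEQUENCE op. Grows one slot at a time, and gives up
 * quietly under memory pressure: with __GFP_NORETRY a failure is cheap,
 * and the client simply keeps its current slot count. */
static void maybe_grow_slots(struct slot_table *tbl)
{
	void *slot;

	if ((tbl->highest_used + 1) * 100 < tbl->nr_slots * SLOT_GROW_PCT)
		return;

	slot = kzalloc(tbl->slot_size, GFP_KERNEL | __GFP_NORETRY);
	if (!slot)
		return;

	tbl->slots[tbl->nr_slots++] = slot;
}
```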

I would add that a shrinker seems like the correct architecture. However I don't believe the shrinker callback would have to wait. It could immediately free any slot table memory that was unused, and then reduce the maximum slot ID in existing slot tables. Eventually the maximum slot IDs would be small enough to enable any extra slot table space (above a single page, say) to be released if the shrinker callback is invoked repeatedly.
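
A sketch of how that shrinker could look, using the classic struct shrinker callbacks; the helpers are hypothetical, and recent kernels would use shrinker_alloc()/shrinker_register() rather than a static struct:

```c
#include <linux/shrinker.h>

/* Report only slot memory no client is currently using; that much can
 * be freed without waiting for anything. */
static unsigned long slot_shrink_count(struct shrinker *s,
				       struct shrink_control *sc)
{
	return count_unused_slots();	/* hypothetical helper */
}

static unsigned long slot_shrink_scan(struct shrinker *s,
				      struct shrink_control *sc)
{
	/* Free unused slot memory right away... */
	unsigned long freed = free_unused_slots(sc->nr_to_scan);

	/* ...then lower sr_target_highest_slotid in active sessions so
	 * future SEQUENCE replies shrink their tables; repeated calls
	 * eventually make more memory freeable. */
	if (freed < sc->nr_to_scan)
		lower_session_slot_targets(sc->nr_to_scan - freed);
	return freed;
}

static struct shrinker slot_shrinker = {
	.count_objects	= slot_shrink_count,
	.scan_objects	= slot_shrink_scan,
	.seeks		= DEFAULT_SEEKS,
};
```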

I'm not familiar enough with the server's current slot table data structure to say how easy it would be to increase or reduce its size.

chucklever commented 7 months ago

[NeilBrown 2022-02-21 05:40:18 UTC] Background/design: The client has minimal control of the number of slots in use. At session creation it can suggest a number, but the server is free to override this.

During the session it can use any (idle) slots that the server has permitted but cannot ask for more. It is encouraged to use lower-numbered slots whenever possible, but the only way it can indicate that there aren't enough slots is to keep them all busy.

The server has ultimate authority over the number of slots. While this should generally be a "last resort", it can use the NFS4ERR_BADSLOT error to prevent the client from using a slot.

There are two mechanisms for (more gently) requesting that the client use fewer slots. The server can initiate a CB_RECALL_SLOT callback to suggest a maximum number of slots to use. It also sets a preferred maximum in the reply to every SEQUENCE op. It isn't clear to me how CB_RECALL_SLOT is useful ... unless there are no SEQUENCE ops for the server to reply to.
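
For completeness, the callback carries exactly one field (RFC 8881, section 20.8), which fits the reading that it is mainly useful when there is no SEQUENCE traffic to piggyback on:

```c
struct CB_RECALL_SLOT4args {
        slotid4 rsa_target_highest_slotid;
};
```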

The Linux NFS server currently allocates each "slot" separately (the size depends on negotiated parameters) and allocates an array of pointers to these slots. I think it would probably be appropriate to use an xarray to store these pointers, as this can grow and shrink dynamically, and lookup is fast.
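
A minimal sketch of what that could look like; the session field and helper names are hypothetical, but the xarray calls themselves (xa_load/xa_store/xa_erase) are the stock <linux/xarray.h> API:

```c
#include <linux/xarray.h>
#include <linux/slab.h>

static struct nfsd4_slot *slot_lookup(struct xarray *slot_xa, u32 slotid)
{
	/* lock-free, RCU-safe lookup */
	return xa_load(slot_xa, slotid);
}

static int slot_install(struct xarray *slot_xa, u32 slotid,
			struct nfsd4_slot *slot)
{
	/* xa_store() allocates tree nodes as needed, so the table grows
	 * without reallocating or copying existing entries. */
	return xa_err(xa_store(slot_xa, slotid, slot,
			       GFP_KERNEL | __GFP_NORETRY));
}

static void slot_remove(struct xarray *slot_xa, u32 slotid)
{
	kfree(xa_erase(slot_xa, slotid));
}
```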

We currently have a nfsd_drc_max_mem setting that aims to limit the total memory used for slots. I think we should deprecate that. Instead, slots (apart from the first) should be allocated with __GFP_NORETRY, and a shrinker registered to free slots.

New slots could be allocated for a given session when the client uses (close to) all of the slots allocated. The server doesn't (need to) track the number of slots in use, only the highest currently in use. As the client is supposed to use low slots first, it is probably safe to consider allocating more when any slot in the last 5% (??) is used.
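
A sketch of that trigger; the 5% figure is the placeholder from the comment above, everything else is hypothetical:

```c
#include <linux/types.h>

/* The client is told to prefer low slot IDs, so a request arriving on a
 * slot in the top 5% of the table is a cheap signal that the table is
 * nearly full and growing it is worth attempting. */
static bool slot_table_nearly_full(u32 slotid, u32 nr_slots)
{
	return (slotid + 1) * 100 >= nr_slots * 95;
}
```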

When the shrinker indicates that some slots need to be returned, we need to consider each active session and trim them all a little bit. We could possibly register a separate shrinker for each session, but I'd like to avoid that.

Slots can only be freed immediately if the sa_highest_slotid most recently reported by the client is less than the number of allocated slots (and hence, presumably, less than the sr_highest_slotid being returned by the server). Those slots can be returned immediately, but if there aren't as many as the VM asked for, we would need to start asking clients to use fewer slots.

I'd be tempted to ask each session to reduce usage by a fraction of current usage, corresponding to the fraction requested by the shrinker callback. This would be done using SEQUENCE replies in the first instance, but a periodic job might send callbacks to idle clients.
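
A sketch of that arithmetic, with all names hypothetical:

```c
#include <linux/math64.h>
#include <linux/types.h>

/* If the shrinker asks for nr_to_scan slots back out of `total` allocated
 * across all sessions, ask each session to give up the same fraction of
 * its own table, but never its last slot. */
static u32 session_trim_target(u32 session_slots, unsigned long nr_to_scan,
			       unsigned long total)
{
	u32 trim = div64_u64((u64)session_slots * nr_to_scan, total);

	if (trim >= session_slots)
		return 1;
	return session_slots - trim;
}
```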

When a SEQUENCE op confirms that the client has reduced usage, the slots can be freed, and then the session is allowed to try allocating again. If a given session never drops below the slot target, then ..... maybe we worry about that later.

Shrinker configuration has a magic number called "seeks", for which "2" is "A good number if you don't know better". It is a measure of cost: if one shrinker reports a higher cost (higher seeks number) than another, it will be subject to less pressure. I wonder if maybe we should use a higher seeks number for the slot shrinker, though I'm having trouble coming up with a clear justification. Partly it is because we grow the slot table gently - using __GFP_NORETRY - so it seems appropriate to request that it be shrunk gently too.
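
Relative to the shrinker sketch above, that would be a one-field change; DEFAULT_SEEKS is 2 in include/linux/shrinker.h:

```c
	/* Cost the slot tables higher than DEFAULT_SEEKS so the VM applies
	 * proportionally less pressure, matching the gentle __GFP_NORETRY
	 * growth path. */
	.seeks		= 2 * DEFAULT_SEEKS,
```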