Not clear whether we can commit to having this done in the 3.12 timeframe.
Some earlier discussions on this subject: [1] http://lists.gluster.org/pipermail/gluster-devel/2016-January/047975.html [2] http://lists.gluster.org/pipermail/gluster-devel/2016-March/048539.html
@jdarcy What do you think of making a separate xlator for throttling instead of modifying io-threads to understand tags generated by another (tagging) xlator? I was thinking we should put all QoS related logic into one separate xlator, with options for changing the type of algorithm used. Does it make sense to use the token bucket filter approach ? (TBF is now a part of libglusterfs; I wrote some code to use it as an xlator on the brick to regulate self-heal traffic at a FOP level).
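For reference, the token bucket filter idea boils down to only a few lines; here is a minimal sketch at FOP granularity (illustrative only, not the actual libglusterfs TBF code, and all names are made up):

```c
/* Minimal token-bucket sketch (illustrative; not the libglusterfs API).
 * Tokens accrue at `rate` per second up to `burst`; each FOP spends
 * `cost` tokens and gets queued or delayed when the bucket runs dry. */
#include <time.h>

struct tbf {
    double          tokens;  /* tokens currently available      */
    double          rate;    /* tokens added per second         */
    double          burst;   /* maximum tokens the bucket holds */
    struct timespec last;    /* time of the last refill         */
};

static void tbf_refill(struct tbf *t)
{
    struct timespec now;
    clock_gettime(CLOCK_MONOTONIC, &now);
    double elapsed = (now.tv_sec - t->last.tv_sec) +
                     (now.tv_nsec - t->last.tv_nsec) / 1e9;
    t->tokens += elapsed * t->rate;
    if (t->tokens > t->burst)
        t->tokens = t->burst;
    t->last = now;
}

/* Returns 1 if the FOP may proceed now, 0 if it should be queued. */
static int tbf_admit(struct tbf *t, double cost)
{
    tbf_refill(t);
    if (t->tokens >= cost) {
        t->tokens -= cost;
        return 1;
    }
    return 0;
}
```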
@raghavendrahg you might be interested in this.
I like the WFQ-type approach as a good starting point.
@jdarcy would the design support IOPS-based minimums? Thinking of the future, flash devices are much more consistent about IOPS, making QoS targets a bit easier. I think a model of min X IOPS w/ weights for excess would be a good long-term vision to get to.
@itisravi I'm not that enthused about creating a second translator to do queuing etc. Besides the overhead - including locks and allocations - of queuing twice, queuing for fairness and queuing to optimize back-end I/O are pretty closely related. Keeping all relevant information in one translator seems better to me. By contrast, the tagging doesn't incur the same kind of overhead or risk interfering with what we do in io-threads. It might also be useful for things besides scheduling, such as monitoring or implementing certain kinds of quota, so I'd say that does not belong in the same translator with the queue-management code.
@JohnStrunk I'm a pretty big fan of plain old-fashioned WFQ. A lot of the team is in love with (d)mclock, which is explicitly a WFQ variant but does support both minima and maxima as well. Fortunately, that part's easy to tweak later. Once we get the basic infrastructure - including configuration and metrics as well as the actual I/O path code - in place, switching between WFQ and DRR and anything else that comes along should be pretty straightforward.
In bitrot detection, throttling was needed to control the scrubbing, so a throttling mechanism was introduced along with bit-rot detection. But the throttling is not an xlator; it is part of the bitrot scrubber as of now. Scaling the threads up/down is admin-configured: based on the scrubbing mode (normal, lazy, or aggressive), threads are scaled up or down.
We can check if we can make the throttling mechanism a generic consumable thing (i.e. remove throttling from bit-rot scrubber and make it a part of libglusterfs)
We can check if we can make the throttling mechanism a generic consumable thing (i.e. remove throttling from bit-rot scrubber and make it a part of libglusterfs)
@raghavendrabhat, http://review.gluster.org/14846 already did that, on top of which I started working on making it an xlator for throttling shd traffic.
Just wanted to paste what @jdarcy posted on #255 and continue the discussion here as more people are on this issue than that one.
What I'm currently leaning towards is a bit different than what Shreyas presented, in a couple of ways.
(1) Using weights/shares instead of absolute IOPS rates. I believe that an absolute-number approach scales poorly, as it requires that numbers for current I/O streams be adjusted whenever a new one is added (and when the system's performance changes). That works for a few streams, but not for dozens let alone hundreds.
Could you elaborate a bit more on how the weights are allocated per stream? I want to understand how weights are allocated to a stream which could potentially be doing costly FOPs like FSYNC.
(2) Single thread pool for all request queues instead of separate queues. Again, the issue is scalability. It's not feasible to have separate thread pools for hundreds of I/O streams (same issue exists for multiplexing BTW), and just dropping "bad actors" into a common least-pri pool doesn't provide enough flexibility. Also, if the pools are small enough to be effective for throttling, they're probably so small that capability gets stranded. Ideally, any one stream should be able to use the system's whole capability if the system is otherwise idle, but still back off quickly if a higher-priority stream appears.
It is actually the same thread pool which acts on all the queues now as well. It just scales based on the limits on each prio-queue.
The rest looks good to me. The only thing I have sometimes wanted is priority inversion; if we can pull that off this time around, that would be great.
Separately, and a bit more strategically, I really want us to strive toward a single solution as much as possible. I think we can converge on a single queue-management methodology and implementation, whether it's TBF or DRR or WFQ or whatever. Can we also converge on a single instance of that implementation, so that any process only has one queue manager instead of several (e.g. for self-heal and rebalance and bitrot all interacting with one in io-threads)? I still think so, but I'm less certain. The problem with having multiple instances is that they interact in very hard-to-predict ways. When I've worked on such systems before, it has always been very hard - sometimes impossible - to set all of the knobs so that throttling under high contention is effective, and there always seems to be a side effect of stranding capability under lower contention. This chaotic-interaction problem becomes even worse when the different instances are controlling different resources (e.g. threads vs. individual requests) or even controlling one in very different ways. Even if that weren't the case, making users learn about several different throttling methods would be unwelcome.
The idea of consolidating the implementation completely in io-threads looks good to me as well; making users learn different throttling methods is unwelcome. But what I am increasingly seeing is the need for different profiles for different workloads. From that perspective, if TBF gives better performance for one kind of workload whereas WFQ does for another, it would be nice if the implementation didn't prevent setting different solutions for different workloads. Within a brick group (i.e. a mux process which loads some bricks), though, we can limit it to only one of TBF/DRR/WFQ.
I'm not saying that no other kind of solution is acceptable. If we need to implement point solutions as temporary workarounds, so be it. That's often an engineering and product necessity. However, I predict that such incrementalism will not lead us to a system that scales in the ways we need it to, and that we'll eventually be forced to implement a more comprehensive solution. At that point, a lot of the point-solution code will have to be abandoned. I don't like seeing anyone's hard work wasted, so as much as possible I'd like to push for applying that effort to creating a robust general solution.
I want to understand how weights are allocated to a stream which could potentially be doing costly FOPs like FSYNC.
I can't even imagine why you think fsyncs would be involved. Setting weights is exactly like setting absolute limits. At the point where they're set, they're just numbers. They only become different when we're making decisions about which requests to dequeue/issue, and then it's just some simple arithmetic calculation (per-stream requests vs. total requests in current round, compared to per-stream weight vs. total weight). In fact, I'd say a weight-based approach is slightly cheaper than an absolute-number approach, because it doesn't require keeping track of the absolute rates as requests complete - timing calls, more locking, and so on. All of the work is on the front end, where we're already trying to make decisions anyway.
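To make that arithmetic concrete, a weight-based dequeue decision could look roughly like the sketch below (names and structure are hypothetical, not the actual io-threads code): pick the runnable stream whose share of the requests issued in the current round lags furthest behind its share of the total weight.

```c
/* Illustrative weight-based dequeue decision (not actual io-threads code).
 * Choose the stream whose issued/total ratio lags its weight/total-weight
 * ratio the most.  Note there are no timing calls or rate counters. */
struct stream {
    const char *tag;      /* stream identity from the tagging layer */
    unsigned    weight;   /* configured share                       */
    unsigned    issued;   /* requests dequeued in the current round */
    int         has_work; /* nonzero if this stream has queued FOPs */
};

static int pick_stream(struct stream *s, int nstreams,
                       unsigned total_weight, unsigned total_issued)
{
    int    best = -1;
    double best_deficit = 0.0;

    for (int i = 0; i < nstreams; i++) {
        if (!s[i].has_work)
            continue;
        /* deficit = share this stream is entitled to minus the share
         * it has actually received so far in this round */
        double entitled = (double)s[i].weight / total_weight;
        double received = total_issued ?
                          (double)s[i].issued / total_issued : 0.0;
        double deficit  = entitled - received;
        if (best < 0 || deficit > best_deficit) {
            best         = i;
            best_deficit = deficit;
        }
    }
    return best; /* index of the stream to dequeue from, or -1 if idle */
}
```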
It is actually the same thread pool which acts on all the queues now as well.
It sort of is and it sort of isn't. Yes, threads can move among different priorities, but the per-prio limits (high-prio-threads down to least-prio-threads) effectively segregate those threads into multiple pools - which might or might not overlap or add up to more than the actual threads available based on those numbers. That's where the tuning difficulties come in. Set low-prio-threads very low, and you'll effectively prevent those types of fops from crowding out others. (BTW dividing by type is kind of awful IMO but that's a separate discussion.) You'll also effectively prevent those kinds of fops from using the full capability of the system even when it's idle otherwise. If that work isn't done during the idle period, it will be forced to contend with others when the system gets busy again. Everyone's worse off. :( I don't think the "actually" really changes the conclusion.
if say TBF gives better performance for one kind of workload where as WFQ gives for another. It would be nice if the implementation doesn't prevent setting different solutions for different workloads
I don't think it's so much better/worse performance as better/worse fairness or capability utilization. Certainly we should be able to try different algorithms here. Maybe we should even be able to select one dynamically, i.e. implement a plugin system much like the Linux kernel has done both for networking and for CPU scheduling. Should we be able to arrange multiple instances hierarchically? I can see uses for that, e.g. if user I/O has 80% weight via plain WFQ (vs. 20% for internal ops such as rebalance) then it might be useful to subdivide that 80% among tenants using a different algorithm - e.g. d/mclock which can enforce maxima/minima as well as weights. Not sure I see many other use cases for that, and I'm not sure even that one will really gain us much, but if we implement this correctly it shouldn't take much extra effort to do the experiments and see if it's worth it.
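As a sketch of what a hierarchical, pluggable arrangement might look like (purely hypothetical; not an existing Gluster interface), each scheduler node could carry a weight relative to its siblings plus an algorithm chosen per node, e.g. a root splitting 80/20 between user I/O and internal ops, with the user branch subdivided per tenant by d/mclock:

```c
/* Hypothetical pluggable, hierarchical scheduler node.  A node either
 * queues requests itself (leaf) or splits its weight among children,
 * and the queuing algorithm is selected per node via a small vtable. */
struct sched_ops {
    void *(*create)(void);                  /* allocate per-node state     */
    void  (*enqueue)(void *state, void *req);
    void *(*dequeue)(void *state);          /* NULL if nothing is runnable */
};

struct sched_node {
    const char             *name;       /* e.g. "user-io", "internal"    */
    unsigned                weight;     /* share relative to siblings    */
    const struct sched_ops *ops;        /* WFQ, DRR, d/mclock, TBF, ...  */
    void                   *state;      /* per-algorithm private data    */
    struct sched_node      *children;   /* NULL for leaf nodes           */
    int                     nchildren;
};
```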
Maybe what I understand by stream and what you mean by stream are different. Could you define what you mean by an I/O stream? I was assuming it to be the I/O from a client.
For whatever solution we come up with, it is important for us to understand how the system will behave for other clients/tenants if one client/tenant is doing a lot of O_SYNC/FSYNC-heavy operations. Will the system suck for others too while those are in progress? Can the solution we come up with guard against that?
A stream is sort of whatever we want it to be. Sorry if that's vague; I'll try to make it less so. Let's say that there's a queue-management component, say in io-threads. Somewhere above that is a component that's adding stream tags to requests. Currently we (at FB) do this in the NFS daemon, so it only works for NFS. We have an intern working on a translator that sits below protocol/server to do this for all access methods. In any case, these two pieces are almost oblivious to one another. The tagging component doesn't care what's done with the tags, and the queue-management component doesn't care where they came from.
So where did they come from? Right now each namespace (top-level directory) gets a tag. We could also associate tags with TLS identities. For internal operations, we could associate tags with the special negative PIDs we use for such. In any case, the key is that it's not strictly tied to a connection. There might be many connections getting the same tag or even different requests on the same connection getting different tags. It's up to us, via the tagging component, to decide what kinds of entities we want to balance between.
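A tagging translator along those lines might derive the tag roughly as follows (a sketch with hypothetical names, not the FB translator's actual code): internal daemons are recognized by their reserved negative PIDs, and ordinary client I/O is tagged by its top-level directory.

```c
/* Illustrative tag derivation for a tagging translator (hypothetical
 * names; not the actual implementation).  Internal operations carry
 * reserved negative PIDs; everything else is tagged by namespace,
 * i.e. the first component of the path. */
#include <sys/types.h>
#include <string.h>
#include <stdio.h>

static void derive_tag(pid_t frame_pid, const char *path,
                       char *tag, size_t tag_len)
{
    if (frame_pid < 0) {
        /* self-heal, rebalance, bitrot etc. use special negative PIDs */
        snprintf(tag, tag_len, "internal:%d", (int)frame_pid);
        return;
    }
    /* tag by top-level directory: the first path component */
    const char *start = (path[0] == '/') ? path + 1 : path;
    const char *slash = strchr(start, '/');
    size_t      n     = slash ? (size_t)(slash - start) : strlen(start);
    snprintf(tag, tag_len, "ns:%.*s", (int)n, start);
}
```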
As for O_SYNC/fsync, now I see where you're coming from. As long as local filesystems suck with respect to fsync entanglement, that's going to be a hard problem for any QoS approach. (Something like BlueStore would make it easier.) The problem's going to be even harder, or even impossible to solve, with some approaches. I think any controls we apply client-side, or any controls on numbers of threads (instead of requests) fall into that category. That's why I favor a server-side request-based approach. From that position, we can do what networking QoS often does: assign different costs to different requests (packets for them). If a mkdir costs five points, maybe an fsync should cost ten, or twenty, or some variable number depending on how much has been written since the last one (though that would require more tracking). Then we're trying to make per-stream points / total points match per-stream weight / total weight, instead of per-stream fops. Users who run fsync-heavy workloads will be able to issue fewer of those or anything else, so their ability to perturb others will be similarly limited. It's not perfect, but if we can get the cost estimates right it should be about as close as we can expect to get.
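As a rough illustration of that cost-based variant (the numbers below are as made-up as the ones above), the scheduler would charge each request a cost and then balance per-stream points against per-stream weight:

```c
/* Illustrative per-FOP cost table for a points-based scheduler (the
 * costs are made-up examples).  The dequeue logic then balances
 * per-stream points / total points against per-stream weight / total
 * weight, instead of counting raw FOPs. */
#include <stddef.h>

enum fop_type { FOP_LOOKUP, FOP_WRITE, FOP_MKDIR, FOP_FSYNC };

static unsigned fop_cost(enum fop_type fop, size_t bytes_since_last_fsync)
{
    switch (fop) {
    case FOP_LOOKUP: return 1;
    case FOP_WRITE:  return 2;
    case FOP_MKDIR:  return 5;
    case FOP_FSYNC:
        /* base cost plus a charge proportional to the dirty data the
         * fsync will force out (this would require extra tracking) */
        return 10 + (unsigned)(bytes_since_last_fsync >> 20); /* +1/MiB */
    }
    return 1;
}
```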
I am pretty sure assigning static weights is not going to give us good results, because of the range of time it takes to complete fsync calls. I have seen fsyncs take 4-5 minutes on the same system where they took 2-3 milliseconds, depending on the amount of writes the system witnessed before the fsync. So your next point looks better, i.e. assigning weight based on the writes that happened before the FSYNC. Another thing I observed is that fops arriving while an FSYNC is in progress suffer higher syscall latency too; I am not sure if there is a way to predict that.
With that said, do we want to solve this problem on HDD? Because it is extremely difficult to predict weights.
do we want to solve this problem on HDD?
I think we have to try. Many Gluster use cases are more about cost, capacity, or aggregate throughput than latency, so they'll still align more with disk than flash. At the same time, at the scale necessary to achieve those kinds of capacity/throughput goals, sharing resources between users becomes pretty important. Nobody wants two million-dollar clusters each sitting idle half the time when a single cluster with the right software could do the work of both. I can assure you that we need this within Facebook, and I don't think we're anywhere near alone.
If fsyncs are taking four or five minutes, I would contend that either we or the local filesystem are not doing a very good job of resolving the impedance mismatch between client demand and back-end speed. This is essentially the same problem in storage as bufferbloat in networking. It's an interesting problem in its own right, but since it can affect even a single client/user/tenant perhaps it's orthogonal to the issue of ensuring fairness among many. Perhaps it should be a separate issue?
Is it a reasonable approach to QoS (for volumes, but not tenants) to use cgroups (blkio) to limit the IOPS of the glusterfsd process? Does a brick server have just one process (glusterfsd) to read/write the disk?
Thank you for your contributions. We noticed that this issue has not had any activity in the last ~6 months, so we are marking it as stale. It will be closed in 2 weeks if no one responds with a comment here.
Closing this issue as there has been no update since my last comment. If this issue is still valid, feel free to reopen it.
In any case where a brick daemon is serving requests from multiple tenants, whether it's traditional users with separate top-level directories in a single volume or hyperconverged workloads with brick multiplexing, the "noisy neighbor" problem needs to be addressed. Some efforts have already been made in this area.
https://review.gluster.org/#/c/14904/ avoids the worst kinds of starvation at a connection level.
Facebook has implemented various patches to throttle "misbehaving" hosts based on observed IOPS rates and manually configured limits, by dropping their request priority in io-threads.
Others have implemented partial solutions for DHT rebalance, bitrot, etc.
An ideal solution would recognize tenants across connections and access protocols, based on a flexible combination of UID (for internal operations such as self-heal and rebalance), brick, top-level directory, or explicit tag. It would also allow for assigning different weights and corresponding shares of total I/O for different tenants, instead of equal sharing. Lastly, approaches based on absolute IOPS numbers are undesirable, as a brick's IOPS rate tends to vary according to workload even when hardware remains constant. This often leads to under-utilization (if limits are set too low) or loss of throttling effectiveness (if limits are set too high).
To this end, work is already ongoing at Facebook to implement a tenant/namespace tagging translator (this is almost done) and mechanisms within io-threads to use these tags for throttling. The current approach is functionally equivalent to Weighted Fair Queuing, though the algorithm used to implement it is quite different.