I added DeadlineQueue.lua as a gist to illustrate the approach. https://gist.github.com/bengardner/77795ba08c89bf9da6dd4b4e1a1e5466
I was going to try something similar once I had a base big enough to encounter performance issues. My idea is based on this thread from the item network mod. The gist is that you have a fast queue and a slow queue (in the case of ARR these would be a fast and slow queue for each of the existing queues) and move entities between them based on whether or not they are active. I think I got as far as having the handlers return true if they did some work.
This of course would have the issue of entities taking a while to "wake up" after being placed in the slow queue. As well as requiring lots of tuning to figure out the optimal rates for the queues. The deadline queue would probably solve this, as the entity handlers would control when their next service happens.
> It doesn't appear to vary the service rate.
It does: the queues are constant-latency queues, meaning the service rate increases with the number of entities so that every entity is serviced at a constant interval.
This produces something akin to ordered dithering, so servicing 10 entities with an interval of 20 ticks produces an update pattern of {1, 0, 1, 0, ...}.
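Roughly, the per-tick count works out like this (just a sketch to illustrate the constant-latency spreading, not the mod's actual code; the function name is made up):

```lua
-- Illustrative only: how many entities a constant-latency queue services on
-- tick t, given n entities spread over a `period`-tick interval.
local function services_on_tick(t, n, period)
  local due_by_now = math.ceil(t * n / period)
  local due_before = math.ceil((t - 1) * n / period)
  return due_by_now - due_before
end

-- 10 entities over 20 ticks -> 1, 0, 1, 0, ...
for t = 1, 8 do
  io.write(services_on_tick(t, 10, 20), " ")
end
print()
```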
I'm not sure how your deadline queue idea would deal with spreading services across updates. I think we'd need to limit the max number of updates per tick based on how many entities are in the queue, the idea being that if, for example, 1000 entities want to update now, we delay some of them to prevent lag spikes, and over time the update rate ends up roughly consistent.
Anyways, I'd be happy to merge your changes if they bring a measurable performance improvement to the mod.
> I'm not sure how your deadline queue idea would deal with spreading services across updates. I think we'd need to limit the max number of updates per tick based on how many entities are in the queue, the idea being that if, for example, 1000 entities want to update now, we delay some of them to prevent lag spikes, and over time the update rate ends up roughly consistent.
The number of entities serviced per tick is limited, so the load will naturally spread out.
Let's go with a real example. Assume that the entities per tick is set to 100. I have a map that consists of 1000 assemblers.
We just started up and `reload_entities()` was called. All 1,000 entities are now sitting in the +120 tick bucket. We service nothing for 119 ticks and then we hit that bucket. At tick 120, we service the max of 100 entities.
They get re-queued into the +240 slot. At tick 121, we service the next 100 entities; those get put in the +241 slot.
At tick 122, another 100 go, and so on. At tick 130, we finally reach the slot for +121 (which is 9 ticks late). There is nothing to do, so we look at the next expired slot, and so on, until we catch up to +130. From tick 131 on, we are empty again. That continues until tick 240, where we see those first 100 entities that were processed at +120.
It is all smoothed out.
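To make that concrete, here is a tiny standalone simulation of the example (a hypothetical sketch, not the gist's DeadlineQueue implementation; all names are made up):

```lua
-- 1000 entities all due at tick 120, at most 100 services per tick, and each
-- serviced entity re-queued 120 ticks later.
local MAX_PER_TICK = 100
local PERIOD = 120

local buckets = { [120] = 1000 }  -- due tick -> number of entities due then
local backlog = 1                 -- oldest bucket not yet fully drained

for tick = 1, 400 do
  local serviced = 0
  -- drain expired buckets, oldest first, up to the per-tick limit
  while backlog <= tick and serviced < MAX_PER_TICK do
    local due = buckets[backlog] or 0
    local n = math.min(due, MAX_PER_TICK - serviced)
    if n > 0 then
      buckets[backlog] = due - n
      serviced = serviced + n
      -- re-queue the serviced entities PERIOD ticks from now
      buckets[tick + PERIOD] = (buckets[tick + PERIOD] or 0) + n
    end
    if (buckets[backlog] or 0) == 0 then
      backlog = backlog + 1
    end
  end
  if serviced > 0 then
    print(tick, serviced)  -- 100 per tick for ticks 120..129, then 240..249, ...
  end
end
```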
The max entities to process per tick needs to be user adjustable to match the performance of the machine.
On my machine, the UPS starts to drop around 100 entities per tick. I'd expect most to start lagging a bit around 50 entities per tick. The default for Item Network was 20 per tick. A bit conservative, but that was plenty with my variable timing modifications.
With ARR, an assembler has a desired service period of 120 ticks. If the service period doesn't vary, and assuming I pick a max per-tick rate of 100 (for easy math), then I can service 12,000 assemblers before the actual period exceeds the desired one. Beyond that point, some servicing will be delayed. At 24,000 entities, the period stretches to 240 ticks; 48,000 goes to 480 ticks, etc.
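Spelled out with the same numbers (just a sketch of the arithmetic, not mod code):

```lua
-- With a fixed per-tick limit, the effective period stretches once the
-- entity count exceeds per_tick * period.
local per_tick = 100
local period = 120   -- desired ticks between services

local function effective_period(entity_count)
  return math.max(period, math.ceil(entity_count / per_tick))
end

print(effective_period(12000))  --> 120
print(effective_period(24000))  --> 240
print(effective_period(48000))  --> 480
```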
There are two ways to support more entities: either increase the number serviced per tick (at the cost of UPS) or extend the service period beyond 120 ticks. In my fork of Item Network, I took the approach of making the service period as large as possible. Due to the priority scheme, the max service period was ~10 seconds, and most entities stayed around the max. In that scheme, I could also have around 12,000 entities (10 s x 60 ticks/s x 20 per tick) before the service period went beyond the desired amount.
This scheme doesn't really pay off until the service period is extended; or rather, it gives a consistent service rate until overload. For example, I adjusted the ingredients added to an assembler based on the actual service period, and I extended the service period as much as possible. The upper limit was running out of ingredients or filling the output. Some assemblers would go the full 10 seconds; others needed to be serviced quicker.
I see what you mean about the dithering. That solution is quite clever. I'll admit that I had to plug it into a spreadsheet before seeing how it worked. I don't know if that could be used if the service rate is variable.
I agree that the number of serviced entities per tick should scale with the total number of entities. It is a bit more complicated with a variable service rate. Maybe some sort of running average, with configurable minimum and maximum values? Or maybe a simple adaptive limit that varies based on how many were serviced in a tick? For example, if we hit the limit before the last entity for the current slot, then increase the rate by 1. If we run out of entities before hitting the limit, then lower it by 1. If the slot was empty, then lower it by 2. Keep the rate bounded between, say, 20 and 50. I'll have to try it, but we won't really see any useful results until the handler functions are changed to be adaptive.
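Something like this, in rough Lua (names and bounds are just the example values above, not tuned or real code):

```lua
local RATE_MIN, RATE_MAX = 20, 50
local rate = RATE_MIN

-- slot_size is how many entities were waiting in the current slot when we
-- started servicing it (we service at most `rate` of them per tick).
local function adjust_rate(slot_size)
  if slot_size == 0 then
    rate = rate - 2        -- slot was empty
  elseif slot_size > rate then
    rate = rate + 1        -- hit the limit before the last entity
  else
    rate = rate - 1        -- ran out of entities before hitting the limit
  end
  rate = math.max(RATE_MIN, math.min(RATE_MAX, rate))
  return rate
end
```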
It could be a running average, but that would involve keeping a table of the number of entities serviced in each of the previous N ticks; for a similar computational cost, it could instead calculate the average of the next N buckets in the queue.
Something that I've used before for calculating a weighted average of sorts, similar to your adaptive limit idea, is `avg = avg * 0.99 + x * 0.01`, where `x` is the new data point, in this case the number of entities in the queue before we started processing it.
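In code that is just an exponentially weighted average (a tiny sketch; the weight is 0.01 here, 1/120 in the plots below):

```lua
-- Each new sample nudges the running value, so no history table is needed.
local avg = 0
local WEIGHT = 0.01

local function add_sample(x)
  avg = avg * (1 - WEIGHT) + x * WEIGHT
  return avg
end
```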
It is hard to tell how such an algorithm would work in practice, though, so I built a crude simulation to see how it would look when started with 1000 entities scheduled at tick 120.
These are all using my adaptive limit idea (with the weight of the new term being 1/120), which is plotted alongside the average of the previous 120 ticks. It seems to follow the average reasonably well for being a single calculation.
This will take quite a while to stabilise, and I don't really like how it constantly has to run behind by a couple ticks (because the only smoothing mechanism is delaying entities) - so what if I tried the opposite?
My fix for this was to reduce the number of extra entities that can be serviced by how far in the future the next set of entities is, so if we have extra capacity for 50 services, but the next set of services is due in 10 ticks, then we can only service 40 extra entities. It seems to solve the issue reasonably well.
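In rough code, the adjustment is something like this (illustrative names only):

```lua
-- The extra capacity shrinks by however many ticks away the next scheduled
-- bucket is, so entities aren't pulled forward much earlier than needed.
local function extra_allowed(extra_capacity, current_tick, next_due_tick)
  local ticks_until_next = math.max(0, next_due_tick - current_tick)
  return math.max(0, extra_capacity - ticks_until_next)
end

print(extra_allowed(50, 100, 110))  --> 40 (capacity 50, next set due in 10 ticks)
```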
Now the only problem is that because it only focuses on increasing the number of items in a queue to meet the average, small spikes above the average take a while to flatten out. Which brings me to the last version of what I came up with.
The last two do have the downside of always running a couple ticks earlier than they should, but it's more desirable than being late.
These are pretty data visualisations, but how long do they actually take to stabilise? I plotted the standard deviation of the number of entities processed by each of the algorithms over time (a sliding window of 120, for 3600 ticks):
The combination algorithm does cause a faster smoothing but doesn't achieve a better final result, so it's probably better to stick with something similar to 3 (process extra items to meet the average). Not sure how it could be tweaked to get rid of the small hump that can be seen at the end of the video, maybe incorporate the maximum value into the average somehow? Or perhaps it's good enough for now.
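For reference, the metric is just the standard deviation of the per-tick service counts over a 120-tick sliding window, roughly like this (a rough Lua sketch; the actual simulation code is in the gist linked below):

```lua
-- Standard deviation of samples[first..last_index] for a sliding window.
local function window_stddev(samples, last_index, window)
  local first = math.max(1, last_index - window + 1)
  local n, sum = 0, 0
  for i = first, last_index do
    sum = sum + samples[i]
    n = n + 1
  end
  local mean = sum / n
  local var = 0
  for i = first, last_index do
    var = var + (samples[i] - mean) ^ 2
  end
  return math.sqrt(var / n)
end
```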
This is only a simulation of one entity type, though; I don't know how it will fare with multiple types and changing delays in a more realistic simulation.
I also had the idea to plot each entity separately in the colours of the rainbow so that any shuffling would be visible, but didn't have the time to try that out.
I put the code used to generate these in a gist: https://gist.github.com/udf/8b156d8c9fc6f7ae0778a2194aad9108
Really interesting visualization. Super-helpful. I played with it over the weekend and found something that is simple and works quite well.
The 'entities-per-tick' is the true average over the last 180 cycles plus a factor based on how far behind we are.
```lua
function EntityManager.on_tick()
  local ra = global.deadline_average
  if ra == nil then
    ra = RunningAverage.new(180)
    global.deadline_average = ra
  end
  local now = game.tick
  local dlq = global.deadline_queue

  -- entities-per-tick: the running average plus a catch-up term based on how
  -- far behind the current deadline we are, clamped to a sane range
  -- (module-style call: the RunningAverage instance is a plain table with no methods)
  local ept = Util.clamp(
    math.ceil(RunningAverage.get_average(ra) + TICK_FACTOR * (now - dlq.deadline)),
    SERVICE_PER_TICK_MIN, SERVICE_PER_TICK_MAX)

  local processed_cnt = 0
  for _ = 1, ept do
    local entity_id, _ = DeadlineQueue.next(dlq)
    if entity_id == nil then
      break -- assumes DeadlineQueue.next() returns nil when nothing is due
    end
    -- ... handle entity and re-queue, increment processed_cnt ...
  end
  RunningAverage.add_sample(ra, processed_cnt)
end
```
The average length and tick factor can be adjusted, but average length=180 and TICK_FACTOR=1 seems to work well.
My DeadlineQueue implementation pulls up to that many entries off the queue each tick. It 'falls behind' when the deadline for the current queue is in the past. The tick factor helps it catch up without messing up the running average too much.
The running average is computed using the usual circular queue implementation, where the CPU cost is constant regardless of the number of samples:
```lua
local RunningAverage = {}

function RunningAverage.new(size)
  if size < 1 then
    size = 1
  end
  local inst = {
    sum = 0,
    idx = 1
  }
  -- samples live in the array part of the instance table
  for i = 1, size do
    inst[i] = 0
  end
  return inst
end

function RunningAverage.reset(self)
  self.sum = 0
  self.idx = 1
  -- also clear the stored samples so add_sample() doesn't subtract stale values
  for i = 1, #self do
    self[i] = 0
  end
end

function RunningAverage.add_sample(self, value)
  -- subtract the old sample and add the new sample
  self.sum = self.sum - self[self.idx] + value
  self[self.idx] = value
  -- point to the next sample
  self.idx = 1 + (self.idx % #self)
end

function RunningAverage.get_sum(self)
  return self.sum
end

function RunningAverage.get_average(self)
  return self.sum / #self
end

return RunningAverage
```
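It is used module-style, since the instance is a plain table with no methods attached (which also plays nicely with Factorio's `global` serialization). A quick usage sketch:

```lua
-- Updates cost the same no matter how big the window is.
local ra = RunningAverage.new(180)
RunningAverage.add_sample(ra, 40)
RunningAverage.add_sample(ra, 55)
print(RunningAverage.get_sum(ra))      --> 95
print(RunningAverage.get_average(ra))  --> 95 / 180 (the window starts zero-filled)
```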
I'll see if I can clean up the script and post it tonight.
That running average code is really clever, I never thought of doing it like that. I like the queue processing algorithm too.
I uploaded the files here: https://gist.github.com/bengardner/34d5746ef13c261139636a4782a4d14f
`dlq_simulation.py` has a function named `preload_entities` that will let you tweak the entities a bit.
Servicing items only when they need it can save a bunch of CPU time.
The Item Network has a unified service queue. It would pull off and service 20 items per tick. That scales really well, but the service period gets long with many entities: 20,000 entities at 1,200 entities per second => ~16.6 seconds between services. It was observed that most items don't need to be serviced that often. We considered using multiple queues to separate out high-priority items from low-priority ones.
My fork added a priority based queue, with 32 queues that span 20 ticks each. Servicing 20 entities per tick means 400 items per queue or 12,800 entities before it can no longer keep up. In my game runs, I didn't get near that limit, so I deemed that 'good enough'.
The max time between services was ~10.6 seconds (32 * 20 / 60). Most items don't need to be serviced more often than every 10 seconds. Some could go even longer. Some need to be serviced every few seconds.
The entity queue is a simple circular queue of tables, each with a 'deadline'. Each entity has a priority and the bucket is determined by the priority. If the current table index is 1, then priority=0 would put it at index 2, priority=1 is at index 3, etc. Entities are pulled from the current queue until empty AND the deadline has been reached. Then it goes to the next queue index.
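In other words, roughly this (an illustrative sketch, not the actual fork code):

```lua
-- 32 buckets spanning 20 ticks each; an entity with priority p lands p+1
-- buckets ahead of the current one, wrapping around.
local NUM_QUEUES = 32

local function bucket_for(current_index, priority)
  return 1 + (current_index + priority) % NUM_QUEUES
end

print(bucket_for(1, 0))   --> 2
print(bucket_for(1, 1))   --> 3
print(bucket_for(32, 0))  --> 1 (wraps around)
```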
Pretty simple stuff, but I had trouble fine-tuning the priorities. I ended up adjusting the priority up or down by 1 depending on the state, and pegging it to the max in some instances. For example, if a provider chest was full at service time and was empty after service, the priority would be increased (-1), as it was filling faster than it could be emptied. If the chest was over half-full and then emptied, the priority would stay the same. If it was less than half-full, the priority would be lowered (+1). If it couldn't be emptied (global item limit), the priority would go to the lowest value.
Assemblers were treated in a similar manner. Ingredients were calculated based on the elapsed time since the last service. The priority was only increased if the input ingredients were maxed out and still ran out before the next service. Likewise with the output: if it was full, the priority would be increased to service the assembler more often.
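The provider-chest rules boiled down to something like this (a paraphrased sketch with made-up names, not the real code; lower number = higher priority, i.e. serviced sooner):

```lua
local PRIORITY_MIN, PRIORITY_MAX = 0, 31

local function adjust_chest_priority(priority, fill_before, emptied, blocked)
  if blocked then
    return PRIORITY_MAX            -- couldn't be emptied (global item limit)
  elseif fill_before >= 1.0 and emptied then
    priority = priority - 1        -- full then drained: filling faster than it empties
  elseif fill_before < 0.5 then
    priority = priority + 1        -- less than half-full: can wait longer
  end
  -- over half-full (but not full) and emptied: priority stays the same
  return math.max(PRIORITY_MIN, math.min(PRIORITY_MAX, priority))
end
```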
Since then, I changed the Service Queue again to be a tick-based deadline queue. When an entity is queued, it will be serviced on the requested tick or later (due to overload).
Anyway, this is getting a bit long, but ARR is using a bunch of independent service queues. It doesn't appear to vary the service rate. The net effect is that this will run into performance issues when there are too many entities.
I'll probably end up ripping out the current queues and replacing them with a unified queue at some point in my fork. That change will break any hope of keeping my fork in sync. Same thing happened with my fork of Item Network. My changes were too much to merge back upstream.
Anyway, my question is: would a unified deadline-based service queue be something of interest for this mod?
We'd have to:
If interested, I'll try to implement it in small reviewable chunks, merging the various queues one at a time.