Closed OgelGames closed 3 years ago
If going for an average, would it be possible to use https://github.com/mt-mods/technic/blob/fd4bece11a633e665cf73c782bb1d2032169ee88/technic/machines/network.lua#L382-L394, which already exists and almost certainly outperforms anything else when it comes to the overhead of calculating averages? It might be possible to drop max_lag.lua/lag.lua completely while reducing the total lines of code and increasing performance.
However, if doing so would require changes to the sma function, then it probably should not be done: the sma function can actually benefit somewhat from its performance, while the lag.lua functions, executed only once per globalstep, won't benefit that much.
I don't think that should be used in this case, the method I used aims to sample every globalstep, while avoiding calculations being done every globalstep; instead they are only done when the functions are called (which is no more than once a second).
> I don't think that should be used in this case, the method I used aims to sample every globalstep, while avoiding calculations being done every globalstep; instead they are only done when the functions are called (which is no more than once a second).
Yes, I think that fact makes it better to have a separate function that calculates the average over the whole table, instead of a rolling average updated for each added value. Even though the table contains 100 entries and the average globalstep is only about 100 ms, which shifts the cost balance towards per-value calculation rather than per-call calculation, I don't think the difference is enough to warrant changing it.
Also, if someone wants to base the get_avg_lag average on only 10 values (or probably even 20), then continuous calculation in every globalstep would be worse, and its only benefit would be spreading the (very small) workload over a longer time.
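To make the trade-off above concrete, here is a minimal sketch of both approaches in plain Lua. It is neither the code from this PR nor the linked sma function; the names `new_lazy_average` and `new_rolling_average` are made up for illustration, and the window size is a parameter rather than the 100 entries mentioned above.

```lua
-- Variant 1 (hypothetical): store a sample every step, and only
-- compute the average when somebody asks for it.
local function new_lazy_average(size)
	local samples, index = {}, 0
	local self = {}
	function self.add(value)
		-- Cheap: one table write per globalstep
		index = index % size + 1
		samples[index] = value
	end
	function self.get()
		-- O(size), but only runs when called
		local sum = 0
		for i = 1, #samples do
			sum = sum + samples[i]
		end
		return #samples > 0 and sum / #samples or 0
	end
	return self
end

-- Variant 2 (hypothetical): keep a running sum so the average is
-- updated with a little arithmetic on every added value.
local function new_rolling_average(size)
	local samples, index, sum, count = {}, 0, 0, 0
	local self = {}
	function self.add(value)
		index = index % size + 1
		sum = sum - (samples[index] or 0) + value
		if not samples[index] then
			count = count + 1
		end
		samples[index] = value
	end
	function self.get()
		return count > 0 and sum / count or 0
	end
	return self
end
```

Both return the same average for the same samples. The difference is only where the work happens: in `get()` for the first variant, in `add()` for the second.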
ok, this is nicer than my ugly hack :+1:
Just a heads-up: the `dtime` value may never exceed 2 seconds: https://github.com/minetest/minetest/blob/e9bc59e376f88f1d4d1c6d3fedf62d9049e3e60d/src/server.cpp#L524-L526
> the `dtime` value may never exceed 2 seconds
That's strange... Explains this though: https://github.com/minetest-monitoring/monitoring/blob/master/builtin/lag.lua#L28-L30
Tested a bit: generated 5 seconds of artificial globalstep lag (a busy loop in globalstep) and checked how this responds. The engine reported 4.9 seconds of lag; technic.get_avg_lag reported 0.158 seconds.
With multiple 15-second lag spikes, technic.get_avg_lag reported a bit over 0.2 seconds, while the engine reported just under 15 seconds.
This completely changes how the technic globalstep lag check responds to different lag spikes. It might be fine, or it might need some adjustments. Maybe make it respond a bit faster by reducing the size of the table where values are collected, possibly to somewhere between 20 and 50?
Keep in mind that what I tested was probably nowhere near any real situation, but it still shows that the response may be a bit too slow, especially compared to the previous instant response. Reducing the table size also has a similar effect when high averages drop back to normal once actual globalstep lag goes down: the response is faster, and the technic globalstep frequency rises a bit sooner when the lag is over.
Because averages are collected over a very long time, this does not really detect high lag spikes (and respond to them quickly), which can be useful when attempting to fix a situation in game (you know, that has happened many times).
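A rough worked example of why a single spike barely moves a long average. It assumes, purely for illustration, that every other step in the window takes 0.1 seconds (the approximate globalstep length mentioned earlier in this thread). The helper name is hypothetical.

```lua
-- Average of `size` samples where one sample is a spike and all the
-- others are normal steps (an idealized model, not a measurement).
local function average_with_one_spike(size, normal, spike)
	return ((size - 1) * normal + spike) / size
end

print(average_with_one_spike(100, 0.1, 5)) -- about 0.149
print(average_with_one_spike(20, 0.1, 5))  -- about 0.345
```

Under these assumptions, a 100-entry table puts a 5-second spike at roughly 0.15 seconds, in the same range as the value reported above, while a 20-entry table would roughly double that. Real numbers depend on details this model ignores.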
For actual server testing, it could be useful to add the current average lag to the output of a command, possibly by just adding another value for technic.get_avg_lag here: https://github.com/mt-mods/technic/blob/64237957e472c35c81e257ed59026500098808ee/technic/machines/init.lua#L96-L99
So that running `/technic_get_active_networks` would say something like:

    Cached network data: 2 active networks, 4 total networks, 16 network nodes, 0.256s avg lag
    ... network data report rows ...
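As a sketch of how such a summary line could be built: the helper below is hypothetical and takes the counts as plain arguments, since the actual variables in the linked command code are not shown here.

```lua
-- Hypothetical helper: formats the summary line from already-known
-- counts plus an average lag value in seconds.
local function format_network_summary(active, total, nodes, avg_lag)
	return string.format(
		"Cached network data: %d active networks, %d total networks, "
			.. "%d network nodes, %.3fs avg lag",
		active, total, nodes, avg_lag)
end

print(format_network_summary(2, 4, 16, 0.256))
```

In the mod, the last argument would presumably come from technic.get_avg_lag().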
> Tested a bit, generated 5 seconds artificial globalstep lag (busy loop in globalstep) and checked how this responds. Engine 4.9 seconds lag reported. Technic.get_avg_lag 0.158 seconds lag reported.
Hmm... maybe the method MT uses is better, but then that is a bit too slow to return back to normal after a short lag spike... 🤔
> Hmm... maybe the method MT uses is better, but then that is a bit too slow to return back to normal after a short lag spike...
Yes, I think this new system is a lot better, as the max_lag returned by the engine takes an extremely long time to go down if it gets high enough. But some tuning is needed, as it should also rise a lot faster. The simplest way to do that (and also to return to normal faster when the lag is over) is to reduce the table size, so that a single value affects the average a lot more.
Another way to get a faster response would be to scale the added values by multiplying the input (and compensating for this accordingly when reading averages), which makes higher values more significant.
Or calculate averages using only, say, the 10 highest values, which would give more weight to short lag spikes. That might or might not be useful... It could also be tuned, for example, to drop the 40 highest and 40 lowest values and use the remaining 20 to calculate the average.
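The "highest values only" and "drop the extremes" ideas can both be expressed with one helper. This is only a sketch in plain Lua; `trimmed_average` is a made-up name and nothing like it exists in the PR.

```lua
-- Hypothetical helper: sorts a copy of `samples`, drops the `drop_low`
-- smallest and `drop_high` largest entries, and averages the rest.
local function trimmed_average(samples, drop_low, drop_high)
	local sorted = {}
	for i = 1, #samples do
		sorted[i] = samples[i]
	end
	table.sort(sorted)
	local first, last = drop_low + 1, #sorted - drop_high
	if first > last then
		return 0 -- nothing left to average
	end
	local sum = 0
	for i = first, last do
		sum = sum + sorted[i]
	end
	return sum / (last - first + 1)
end
```

With a 100-entry table, `trimmed_average(samples, 90, 0)` averages the 10 highest values, and `trimmed_average(samples, 40, 40)` averages the middle 20. Note that the second form discards isolated spikes entirely, and that sorting on every call costs more than a plain sum.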
Maybe this would work? It's basically the same as the way the engine calculates max_lag, but it decreases faster. (`* 0.9` instead of `* 0.9998`)

    local last_step = minetest.get_us_time()
    local max_lag = 0

    minetest.register_globalstep(function()
        -- Calculate own dtime as a workaround to 2 second limit
        local now = minetest.get_us_time()
        local dtime = now - last_step
        last_step = now
        max_lag = max_lag * 0.9 -- Decrease slowly
        if dtime > max_lag then
            max_lag = dtime
        end
    end)

    function technic.get_max_lag()
        return max_lag / 1000000
    end

Could also tie that multiplier to a setting too... 🤔
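A minimal sketch of what a configurable multiplier could look like, written in plain Lua so it runs outside the game. `new_peak_tracker` and `half_life_steps` are hypothetical names. In the mod, the multiplier might be read with something like `tonumber(minetest.settings:get(...))`, but no setting name is proposed here.

```lua
-- Hypothetical factory: `multiplier` is the per-step decay factor.
local function new_peak_tracker(multiplier)
	local peak = 0
	local self = {}
	function self.step(dtime)
		peak = peak * multiplier -- decay the previous peak
		if dtime > peak then
			peak = dtime -- a larger sample replaces it immediately
		end
		return peak
	end
	function self.get()
		return peak
	end
	return self
end

-- Number of steps for a peak to decay to half its value, ignoring the
-- floor that new samples put under it.
local function half_life_steps(multiplier)
	return math.log(0.5) / math.log(multiplier)
end
```

By this formula, a peak halves after roughly 7 steps with 0.9, roughly 69 steps with 0.99, and roughly 3465 steps with 0.9998, which is one way to compare candidate values for such a setting.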
> Maybe this would work? It's basically the same as the way the engine calculates max_lag, but it decreases faster. (`* 0.9` instead of `* 0.9998`)
I think this would work fine, better than what the engine does anyway. However, I liked the average values a bit more, as they would smooth out occasional single-step lag spikes while still providing very useful averages.
You don't really need many values for useful averages: network lag detection did it very well with just 3 values but still used 5 (no specific reason for 5... it was just the largest count for which I could not measure any overhead with get_us_time at all).
> Could also tie that multiplier to a setting too... 🤔
Yes, it would be a good idea to allow easier adjustment.
So I decided the only way to figure this out properly was to test the different methods, and plot them on a graph:
- `max_lag` from the engine (pulled from server status)
- `technic.get_max_lag()` in this PR
- `technic.get_avg_lag()` in this PR
- `max_lag` calculated with a multiplier of 0.99

You were definitely right about `technic.get_avg_lag()` not responding to lag spikes... 👀
> So I decided the only way to figure out this properly was to test the different methods, and plot them on a graph:
I did not test in game again, but I think this can be merged. While it does not provide any smoothing for very short lag spikes, the graphs you provided show that this is clearly much better for technic lag control purposes than parsing the engine status, and it responds to lag spikes immediately, just like the engine status parsing did.
Maybe I'll add the lag value to the network stats command. It's not exactly required, but it can still be useful for confirming in game what is happening.
Lag is now calculated inside the mod, instead of being extracted from server status.
Fixes #166 and #191