pandorabox-io / pandorabox.io

Pandorabox infrastructure code
https://pandorabox.io
31 stars 4 forks source link

`digiline_routing` crash #516

Closed OgelGames closed 4 years ago

OgelGames commented 4 years ago
2020-05-26 12:49:32: ERROR[Main]: ServerError: AsyncErr: ServerThread::run Lua: Runtime error from mod 'default' in callback item_OnPlace(): stack overflow
2020-05-26 12:49:32: ERROR[Main]: stack traceback:
2020-05-26 12:49:32: ERROR[Main]:   [C]: in function 'get_name_from_content_id'
2020-05-26 12:49:32: ERROR[Main]:   /data/world//worldmods/digilines/util.lua:133: in function 'get_node_force'
2020-05-26 12:49:32: ERROR[Main]:   /data/world//worldmods/digilines/internal.lua:31: in function 'getAnyOutputRules'
2020-05-26 12:49:32: ERROR[Main]:   /data/world//worldmods/digilines/internal.lua:44: in function 'rules_link'
2020-05-26 12:49:32: ERROR[Main]:   /data/world//worldmods/digilines/init.lua:52: in function 'receptor_send'
2020-05-26 12:49:32: ERROR[Main]:   /data/world//worldmods/digiline_routing/diode.lua:25: in function 'action'
2020-05-26 12:49:32: ERROR[Main]:   /data/world//worldmods/digilines/internal.lua:106: in function 'transmit'
2020-05-26 12:49:32: ERROR[Main]:   /data/world//worldmods/digilines/init.lua:53: in function 'receptor_send'
2020-05-26 12:49:32: ERROR[Main]:   /data/world//worldmods/digiline_routing/diode.lua:25: in function 'action'
2020-05-26 12:49:32: ERROR[Main]:   /data/world//worldmods/digilines/internal.lua:106: in function 'transmit'
2020-05-26 12:49:32: ERROR[Main]:   /data/world//worldmods/digilines/init.lua:53: in function 'receptor_send'
2020-05-26 12:49:32: ERROR[Main]:   ...
2020-05-26 12:49:32: ERROR[Main]:   /data/world//worldmods/digilines/init.lua:53: in function 'receptor_send'
2020-05-26 12:49:32: ERROR[Main]:   /data/world//worldmods/digiline_routing/diode.lua:25: in function 'action'
2020-05-26 12:49:32: ERROR[Main]:   /data/world//worldmods/digilines/internal.lua:106: in function 'transmit'
2020-05-26 12:49:32: ERROR[Main]:   /data/world//worldmods/digilines/init.lua:53: in function 'receptor_send'
2020-05-26 12:49:32: ERROR[Main]:   /data/world//worldmods/digiline_routing/filter.lua:62: in function 'action'
2020-05-26 12:49:32: ERROR[Main]:   /data/world//worldmods/digilines/internal.lua:106: in function 'transmit'
2020-05-26 12:49:32: ERROR[Main]:   /data/world//worldmods/digilines/init.lua:53: in function 'f'
2020-05-26 12:49:32: ERROR[Main]:   .../worldmods/monitoring/monitoring/metrictypes/counter.lua:46: in function 'receptor_send'
2020-05-26 12:49:32: ERROR[Main]:   /data/world//worldmods/digistuff/switches.lua:19: in function 'on_rightclick'
2020-05-26 12:49:32: ERROR[Main]:   /usr/local/share/minetest/builtin/game/item.lua:426: in function </usr/local/share/minetest/builtin/game/item.lua:419>
BuckarooBanzay commented 4 years ago

@Lejo1 do you have time to look at this and maybe #515? (I don't know if i get to it the next few days :confounded:)

Lejo1 commented 4 years ago

I think it's one and the same problem because if the modules doesn't drop they create an overflow. Will fix it.

S-S-X commented 4 years ago

Seems that someone else also discovered cloning. It is simple message feedback loop around nodes. Same thing that burns multi luac network if you dont check who sent message and if you have to respond it. Components keep answering back to each other and finally component that originally sent message will answers again to that same message it first generated.

So this is reason why it can literally generate thousands of items within single cycle (for example 1 button press). Components will do same thing multiple times per each node.

Not sure how to explain it well but it is like infinite busy loop built using digiline components, imagine if digiline wires wont ever stop forwarding message you throwed in and you build loop using those wires. It is reason why digiline_routing uses dig_node as an attempt to mitigate it.

Lejo1 commented 4 years ago

The Problem was that dig_node doesn't bypass protection...

S-S-X commented 4 years ago

It would also be good to find proper way to fix it, remove_node hack does work to mitigate issue but does not really fix it. Not sure if it is possible to fix issue using only custom digiline rules for wire, recepetor and effector so that digilines would handle messages instead of sending messages again but that's what I hoped for when opened https://github.com/numberZero/digiline_routing/issues/9

Lejo1 commented 4 years ago

It would also be good to find proper way to fix it, remove_node hack does work to mitigate issue but does not really fix it.

It's pretty much fixed. All other bugs aren't up to digiline_routing but are general digilines problems.

S-S-X commented 4 years ago

It's pretty much fixed.

No, it is not fixed at all. Problem with message loop is still there even while it is mitigated to not crash server by removing nodes when they overheat because of too many messages. That does not fix actual problem, it only makes it harder to intentionally crash server.

Quick googling says that LUA_MAXCALLS max stack size is 255 and that's not much if you count 20 calls per node and active mapblock is 16x16x16, without problems you can grow lua callstack easily well over 255 frames.

All other bugs aren't up to digiline_routing but are general digilines problems.

Try building similar setup from just digiline wires, will it crash server because of callstack getting filled? No it wont. Will it still pass messages? Yes it will.

Lejo1 commented 4 years ago

Quick googling says that LUA_MAXCALLS max stack size is 255 and that's not much if you count 20 calls per node and active mapblock is 16x16x16, without problems you can grow lua callstack easily well over 255 frames.

Yep that's not very much but anyway enough for normal setups

Try building similar setup from just digiline wires, will it crash server because of callstack getting filled? No it wont. Will it still pass messages? Yes it will.

That's only because this mod uses the full possibility of digilines and the others don't. I think this should be handled by digilines either by delaying or by limiting.

S-S-X commented 4 years ago

@Lejo1 digilines does not do things that digiline routing does. digiline_routing does it wrong. Here's how digilines does transmit over whatever count of wires without growing stack at all while digiline_routing causes recursion: https://github.com/minetest-mods/digilines/blob/master/internal.lua#L99-L124

Similar thing should be done in digiline_routing mod because it defines low level wire like components, digiline_routing causes recursion because it is not designed that well.

It is true that digilines mod could add queue or async operations for receptor and effector but that would add significant overhead to every receptor and/or effector action call, there is API for digilines that other mods can use but it does not mean that digilines mod should attempt to fix all problems in other mods (like digiline_routing recursion)

Yep that's not very much but anyway enough for normal setups

That is max C code call frame stack size... not some max size for digiline component setups... it only relates to digiline_routing because that mod causes recursion which fills stack and finally causes stack overflow.

Anyway, I am trying to look for proper fix to remove call recursion problem.

S-S-X commented 4 years ago

Simply put, this is essentially what digiline_routing does:

function foo()
  foo()
end

then you call that foo() by sending digiline message

Lejo1 commented 4 years ago

I know, I'm just looking for a long-term solution but droping them is a good way to prevent that on short term.

S-S-X commented 4 years ago

Probably simplest solution is to call through minetest.after to break recursion and let digilines.transmit handle message forwarding and next node rules. Heat values can be removed after that, I do believe that overheat was actually added because of recursion and lack of interest to investigate deeper as overheat solution works for simple things.

However "fix" by overheating has problem that each node adds possibility to add +20 call frames to stack and therefore hard limit of 255 is just 13 nodes * 20 messages = 260 in stack. And that is case only if it was simple loop like above example ^^ but it is not.

Breaking recursion (and transmit queue) with async operation also allows mesecons limits to work properly and calculate actual penalty for each mapblock.

Here's simple example how it would work:

        digiline = {
                effector = {
                        action = function(pos, node, channel, msg)
                                function async()
                                        digilines.transmit(
                                                { x= pos.x - 1, y = pos.y, z = pos.z },
                                                channel,
                                                msg,
                                                {[minetest.hash_node_position({ x= pos.x + 1, y = pos.y, z = pos.z })]=true}
                                        )
                                end
                                minetest.after(0.2, async)
                        end,
                        rules = {{x=1,y=0,z=0}},
                },
                receptor = {
                        rules = {{x=-1,y=0,z=0}},
                },
        },

(Example for diode operation, for complex nodes each input position should be added to checked position hash table to prevent direct transmit loop ~and therefore 0.2 second timer using diode or other digiline_routing component)~ .edit: actually this wont prevent building timers as previous state of transmit queue is lost, it just needs chaining more than 1 component but it will still allow enforcing mesecons limits.

Do you see any problems with that solution?

Lejo1 commented 4 years ago

Do you see any problems with that solution?

If you add an check if the node still exist, not.

S-S-X commented 4 years ago

Do you see any problems with that solution?

If you add an check if the node still exist, not.

not needed as transmit already does tha. no reason to add redundant code

Lejo1 commented 4 years ago

Even better :+1:

S-S-X commented 4 years ago

So that would now need to get correct: receptor.rules effector.rules digiline.transmit first argument pos which is pos offset by receptor.rule (signal output)

Table containing loopback cutter rules, this one: {[minetest.hash_node_position({ x= pos.x + 1, y = pos.y, z = pos.z })]=true} where positions in table should be positions based on pos argument offset by all effector.rule (input) positions.

diode_rules_in and diode_rules_out functions can be used to get all data, I just removed those so that core idea is easier to read and understand without looking through other code.

BuckarooBanzay commented 4 years ago

thanks @Lejo1 for the patch! :+1:

S-S-X commented 4 years ago

BuckarooBanzay pointed out that

lua has tail call optimization in some cases

If that works in this case then overheating might be enough to fix this issue.

Flow goes like this (simplified): digilines.transmit -> diode.digiline.effector.action -> digiline:receptor_send -> digiline.transmit

Looking quickly through code digiline.transmit cannot be optimized for tail call recursion.

Async calls through minetest.after creates another problem which is possible timer stacking and can therefore cause more problems without state variable for timer, adding state variable would mean having position hash table containing timer state for each diode. Not good plan, I'd say test what happens and fix rest of it if it still causes problems.

Now I also tested stack size with simple recursion and it seemed to be 21826 for mod code for my minetest not 255 that I found from internet earlier so with overheating will be enough as nobody can really build large enough thing to overcome that stack size.

Also tested tail call recursion and it works properly as this does not cause stack overflow but instead keeps minetest in busy loop until killed:

function x(i)
  return x(i+1)
end
x(1)
numberZero commented 4 years ago

However "fix" by overheating has problem that each node adds possibility to add +20 call frames to

Hey, how? Are you sure there are 20 nested calls per node?

digiline_routing causes recursion because it is not designed that well.

It does, but before blaming for a poor design tell me please, do digilines allow messing with their BFS loop?

numberZero commented 4 years ago

If you fear server crashes, try this in a LuaC:

x="."
for k = 1, 64 do
    x = x .. x
end
S-S-X commented 4 years ago

Hey, how? Are you sure there are 20 nested calls per node?

I think it was default overheating limit for digiline_routing components and 20 calls per node is based on that. And yes I am sure about that as I've counted those and before problems with protection were fixed I did also build demo machine for cloning hundreds of digiline_routing components / digiline message sent so yes I tested that in game and verified behavior in code.

So actually there's a lot more nested calls than 20 per node but each node allows ~multiplying~ addition of whatever count there is by ~factor of~ 20 but that count is not really core of the problem. It would be if max call stack would have been 255 as I did say at first but as already pointed out it is not that small number and it would need huge setup to cause stack overflow if overheating works correctly.

Actually now that I checked that it seems to be 50 for filter and splitter, 20 for diode. That value is basically limit how many times you can call through overheat method in tight loop without removing node.

If you fear server crashes, try this in a LuaC:

Memory limit prevents these things and many other problems will be catched as luac creates sandbox for code excution, disables JIT and checks calls to add timeout. So should work fine.

before blaming for a poor design tell me please,

At least I think that I'm trying to find different possible solutions, tell about those to get feedback and trying to find problems from solutions more than just blaming anything.

That was probably about this line:

digiline_routing causes recursion because it is not designed that well.

Sorry if that felt bad, did not mean it like that but still I do have opinion and it is that I think it was not designed that well because those components do not have any means to break infinite recursion other than overheating and dropping nodes as items. I think finding problems is good way to solve problems, did you notice that most critique I've given here is about my code and I can call that it is also not well designed. I do not mean to blame anything but instead open discussion about possible problems, strong opinions usually helps there.

do digilines allow messing with their BFS loop?

It wont other than messing up visited nodes via checked argument (last arg for transmit) and that does not completely solve problems with timer based solutions I suggested earlier for breaking up recursion. And on top of that timers do bring more problems in like possible message duplication which in turn causes stacking of timers.

However as there's clearly a lot more space in stack this probably is not that big problem anymore because practically nobody is going to build setup large enough to overcome call stack and that is most probably going to be enough to mitigate recursion issue.

There might possibly still be other issues caused by this as it will still allow generating huge amount (compared to any other device) of digiline messages.

All that said I think it is worth to see if there's any real problems (crashes, performance) now that overheating works properly unless overheating was already working when this issue was opened... I think it was not yet fixed back then.

numberZero commented 4 years ago

Memory limit prevents these things

Oh, okay, it wasn’t the case IIRC. So it doesn’t crash now. Nevertheless it loads the server so heavily it can’t do anything else for significant time (remember MT is mostly single-threaded), and doesn’t even overheat so can repeat...

demo machine for cloning hundreds of digiline_routing components

Such cloning is a real bug, in any circumstances. Thanks for spotting that.

each node allows multiplying whatever count there is by factor of 20

That means 128M for just 6 nodes, for example. Are you sure?

S-S-X commented 4 years ago

Memory limit prevents these things

Nevertheless it loads the server so heavily it can’t do anything else for significant time

That's matter of perspective, if you think that few hundred microseconds is significant then yes but I don't think that's really significant... there you will hit instruction limits first long before loop finishes.

That means 128M for just 6 nodes, for example. Are you sure?

No, not really. Meant to say same thing as before: each node adds possibility to add +20 call frames not multiplying.

I've also added my latest possible suggestion on how to fix things to https://github.com/numberZero/digiline_routing/issues/9 Suggestion was solution based on updating design in a way that considers possibility to add few more API features for digilines mod so that wire like components can actually behave like digiline wires.

But for this issue about stack overflow happened on server it seems that mod was updated to use @Lejo1's fork after this issue was opened: https://github.com/pandorabox-io/pandorabox-mods/commit/bafebc3c8b43378b4266b83e6e1045ccbe0008ff Closing this issue as stack overflow probably have been caused by someone building similar component cloner that I built on test server. Overheating fix provided by @Lejo1 should be enough to mitigate that. Reopen or better create new clean ticket if that happens again.

I think recursion problem and design choices do not belong here, there's that another ticket, also linked to this one, open for discussing those things: https://github.com/numberZero/digiline_routing/issues/9

Lejo1 commented 4 years ago

Yeah I also think this should be enough first. I also tried to crash the game with my fork and I placed tooooooo much diodes until I started to use worldedit which than were able to crash the game, but that were much more than 60 diodes sooo this shouldn't be a real problem.