monome / norns

norns is many sound instruments.
http://monome.org
GNU General Public License v3.0
614 stars 144 forks source link

race condition with metro callbacks on script cleanup #1033

Open catfact opened 4 years ago

catfact commented 4 years ago

see this thread: changing scripts results in some errors on metro callback defined in the old script: https://llllllll.co/t/cheat-codes-v1-2/30329/16?u=zebra

one guess: maybe the metro callback events are already in the main event queue when the script change happens.

i'm looking at cleanup in script.lua, and it looks like calls metro.free_all to stop all threads, but the metro tables in lua are untouched. when a metro callback event is pulled off the stack, it just uses a numerical index to look up the callback fn in the corresponding metro table.

so, possible solution: script cleanup should also set all handlers to no-op. (if it does this already somewhere i've missed, then nvm sorry.)

could also be related to #769

tehn commented 4 years ago

looks like the global state gets cleared/reset before the metros are freed:

https://github.com/monome/norns/blob/master/lua/core/script.lua#L45

i can PR this (but it may be hard to reproduce??) and it shouldn't cause any ill effects

tehn commented 4 years ago

nope that didn't work.

also clearing the metro events didn't work, so now i'm totally confused but still working on it.

catfact commented 4 years ago

what i'm saying is that metro.free_all is not sufficient:

function Metro.free(id)
  Metro.metros[id]:stop()
  Metro.available[id] = true
  Metro.assigned[id] = false
end

--- free all
function Metro.free_all()
  for i=1,Metro.num_script_metros do
    Metro.free(i)
  end
end

it calls down into C to stop the threads, and it makes the metros assignable again, but the tables are still there and still populated with handler functions.

clearing the metro events didn't work

you mean you tried setting each .event field? or actually clearing the event queue of metro events?

tehn commented 4 years ago

you mean you tried setting each .event field? or actually clearing the event queue of metro events?

sorry, yes, that's what i did. added

function Metro.free(id)
  Metro.metros[id]:stop()
  Metro.available[id] = true
  Metro.assigned[id] = false
  Metro.metros[id].event = nil
end

which would make me think any pending metro calls in the event loop would then abort when trying to call a nil event handler.

will be able to pick this back up later tonight

catfact commented 4 years ago

hm.. i guess another possibility is that the C layer is somehow still posting events after the metros are stopped.

it's not supposed to do that: metro.c:metro_stop() calls pthread_cancel, and the metro threads call pthread_testcancel() on wakeup before banging out an event. which AFAIK is the best we can do.

(TBH, i'm not sure when i will next have the brainspace for a thorough deep-dive and review of the metro.c synchronization sequences. the natural consequence of which would be "lets just rewrite this in c++ or Rust.")

can we repro with a script that just makes as many very fast metros as possible?

tehn commented 4 years ago

might be related to #769