monome / norns

norns is many sound instruments.
http://monome.org
GNU General Public License v3.0

clean up / document component management methods #1014

Closed catfact closed 4 years ago

catfact commented 4 years ago

recently there were some issues reported on the forum related to the way program lifecycle management works on norns (or doesn't work.)

it made me realize that the use cases and the final design concept are fuzzy to me, and i guess to most users.

here are a few things that can be used to reset parts of the norns stack:

in other words, i think we need to clearly inform users about which method to use in which circumstance (troubleshooting, recovering, updating, etc.), and we may want to refine the offerings. for example:

what do yall think?

catfact commented 4 years ago

oh another thought, sorry

we tell users with locked-up norns that the rude button is a method of last resort (correct), and imply that matron repl ;restart is preferred. two issues with that:

tehn commented 4 years ago

thanks for bringing this up.

SYSTEM > RESET should do a "clean" no-script restart, ie, delete the system.pset etc and reset to known working values. the current behaviour is misleading and doesn't solve all problematic use cases.

is it frequent that people somehow lock up their onboard UI? (ie knobs/screen?)

catfact commented 4 years ago

yes, it is easy to lock up the UI with scripting errors or due to system bugs. for example if we print some error message at 1000hz then the main loop is borked.

hence the proposal to have a clean reset executed directly from the GPIO module, which seems easy (i can PR it if you like)

simonvanderveldt commented 4 years ago

Some quick questions

> issuing ;restart to the matron repl in maiden. this will cause a lot of runtime problems, since matron needs metadata about running engines/polls.

Why is this? Does this mean that if I restart matron I get into a broken state? It doesn't just get the required info when it starts?

Do we need four different types/ways (2-5 in the list) to reset stuff?

> hence the proposal to have a clean reset executed directly from GPIO module, which seems easy (i can PR it if you like)

We could also use a watchdog for this, although that's of course quite invisible to the user/not user initiated.

tehn commented 4 years ago

here's my proposal:

tehn commented 4 years ago

ugh. on another note, undocumented feature: K1 held + SELECT from the menu clears the current script. though like SYS > RESET it's not accessible if the UI is locked up

tehn commented 4 years ago

partial fix by https://github.com/monome/norns/pull/1015

TODO: documentation

still open to conversation about watchdog/GPIO/etc

simonvanderveldt commented 4 years ago

> a hard reset (white button) will have a proceeding boot in "not clean" state which means a script will not be loaded, so the user will not be locked out. hence i don't think it's necessary to have a GPIO detection method for clean boot. if we want a dirty boot to self-delete config data i'm fine with that, but it doesn't seem necessary as the user could then execute SYSTEM > RESET though that would only reset the levels, vports, etc.

I think it would be nice to ensure that after startup everything is always guaranteed to be in a working state, no additional actions needed. From @catfact's description I gather we already made sure everything was in the correct state when this happened based on the clean_shutdown flag he mentioned. Or is that not correct?

IMHO it would make sense if this would do the same thing as SYSTEM > RESET (maybe not restarting the services depending on when/where in the boot process we do this)

Also being able to reset (or restart) outside of SYSTEM > RESET and the hardware reset button means that we should be able to shutdown properly in more cases. Resets with the hardware reset button should only be necessary if the whole system (ie the kernel/Linux) hangs, not when the components of the norns stack hang, this prevents potential issues with disk/filesystem corruption/broken journals.

And @tehn you're right, watchdogs can be tricky and false positives are something we definitely don't want. Just wanted to mention it as an option. Personally I like the suggestion of having a complex key combination, but I also see the issue of that combo then not being available for scripts to use, so I figured a watchdog might be a possible alternative solution.

tehn commented 4 years ago

honestly there isn't much to break at this point:

neither of these are a Fix. they are a quick solution to user error. which is why this hasn't gotten much attention up until now. the clean_shutdown flag is really just to prevent re-loading a failed script. and that totally works 100% even prior to today's fix.

the case of users truly stuck is uncommon and should only happen if someone is writing a script and gets in trouble... which hopefully can be solved via maiden resets/etc.

generally there should not be a common case where normal use of a script should cause a full lockup for someone, but of course this is possible.

i also don't think we have realistically seen repeatable lockups of the kernel or underlying norns components, just the lua environment. so any fix should mostly address that. key combination is an option.

catfact commented 4 years ago

> Why is this? Does this mean that if I restart matron I get into a broken state? It doesn't just get the required info when it starts?

well, it doesn't signal any change to the sclang process. so the running engine can be doing stuff, and in the case of engine polls it happens to be sending OSC that matron now doesn't understand.

we can patch this of course. i think reset_audio should just broadcast OSC to any/all connected process, engine interfaces should just handle this by shutting down current engine, and we should always broadcast this on startup just in case.
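A minimal sketch of what such a reset broadcast could look like, written in Python purely for illustration (the real change would live in the norns components themselves). The OSC address and the port numbers are placeholders invented for this example, not values taken from the norns source; only the OSC 1.0 string padding rule is standard.

```python
import socket


def osc_message(address: str) -> bytes:
    """Encode an OSC 1.0 message that carries no arguments."""
    def pad(raw: bytes) -> bytes:
        # OSC strings are null-terminated, then padded to a multiple of 4 bytes.
        raw += b"\x00"
        return raw + b"\x00" * (-len(raw) % 4)

    # Address pattern followed by an empty type-tag string (",").
    return pad(address.encode("ascii")) + pad(b",")


# Hypothetical address and ports, for illustration only.
RESET_ADDRESS = "/norns/reset"
PEER_PORTS = (8888, 9999, 57120)


def broadcast_reset(ports=PEER_PORTS, host="127.0.0.1") -> int:
    """Send the reset message to every port over UDP; return how many were sent."""
    packet = osc_message(RESET_ADDRESS)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        for port in ports:
            sock.sendto(packet, (host, port))
    return len(ports)
```

Each receiver would then treat the message as "shut down the current engine", and sending it once at startup would cover the case where a peer was left running from a previous session.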

but here's a simplifying suggestion:

make ;restart in either REPL on maiden cause both services to restart, and maybe the whole norns- service stack.

in fact: maybe rename ;restart to ;reset, and make it do the exact same thing as SYSTEM>RESET.
wouldn't that clarify things?

my assumption is that this:

> SYSTEM > RESET should delete all system settings (levels, vport assignments, current script) which will get the software in a "clean" state (these settings are pretty minimal so it's not a hassle)

can be accomplished just by resetting all the systemd services, (causing a dirty-boot), and ensuring that handling a dirty-boot means deleting vports/psets. right?
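As a rough sketch of the "dirty boot deletes settings" half of that assumption, in Python for illustration only: the function removes a fixed list of settings files from a data directory. Only `system.pset` is named in this thread; the other file name and the function itself are hypothetical.

```python
from pathlib import Path

# Hypothetical file names; only system.pset is mentioned in this thread.
SETTINGS_FILES = ("system.pset", "vports.data")


def clear_system_settings(data_dir: Path, names=SETTINGS_FILES) -> list:
    """Delete the listed settings files if present; return the names removed."""
    removed = []
    for name in names:
        path = data_dir / name
        if path.exists():
            path.unlink()
            removed.append(name)
    return removed
```

Anything not on the list is left alone, so user scripts and data would survive the reset.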


> generally there should not be a common case where normal use of a script should cause a full lockup for someone, but of course this is possible.

it happens. the norns API is very complicated and we can't pretend to have tested every possible script interaction for errors in the system stack on every update. so we really can't assume that it's user error or even a scripting error when the UI hangs for whatever reason.

the main reason i'm bringing up a separate hardware-driven reset, is: i have some suspicion that the white button is really very bad.

i spent a little time trying to research the susceptibility of XNAND flash to permanent damage from power loss during write. i didn't do enough to say for sure and it's a complicated topic, but i think it's a risk. (basically has a chance of creating a permanently-bad sector.)

even if it "only" corrupts the filesystem, in this environment that can mean hardware damage.

anyways, of course it's completely up to you whether a key-combo to restart is acceptable. it could be a really long hold (30s) or even something like a triple-hold plus a sequence of encoder turns. the purpose of such a thing would be to really do as much as possible to give people as few excuses as possible for using the white button under any circumstance.

catfact commented 4 years ago

in fact, here's an additional suggestion:

;reset can reset both processes and run journalctl and dmesg or whatever to assemble an error report. is that crazy?

tehn commented 4 years ago

i actually think a long hold (10s) in a particular sequence (k3,k2,k1 or something) would be a good idea. granted this would be managed by matron, so it's less immune than maiden's ;reset... but i still think it's a good idea.
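A minimal sketch of how such a sequence-plus-hold detector might behave, in Python for illustration only (on norns this logic would sit in matron or its Lua layer). The class name, the key numbering, and the rule that the timer starts once all three keys are down in order are assumptions made for the example.

```python
class HoldSequence:
    """Detect keys pressed in a given order and then all held for hold_s seconds."""

    def __init__(self, order=(3, 2, 1), hold_s=10.0):
        self.order = tuple(order)
        self.hold_s = hold_s
        self.down = []      # keys currently held, in the order they were pressed
        self.since = None   # time at which the full sequence became held

    def key(self, n: int, pressed: bool, now: float) -> None:
        """Feed one key event; `now` is a timestamp in seconds."""
        if pressed:
            if n not in self.down:
                self.down.append(n)
        elif n in self.down:
            self.down.remove(n)
        # The hold timer runs only while exactly the expected sequence is held.
        if tuple(self.down) == self.order:
            if self.since is None:
                self.since = now
        else:
            self.since = None

    def triggered(self, now: float) -> bool:
        """True once the full sequence has been held for at least hold_s."""
        return self.since is not None and now - self.since >= self.hold_s
```

Releasing any key clears the timer, which keeps short accidental chords inside a script from triggering a reset.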

logging, sure. would be possibly sensible to dump a log to ~/dust/data so it's view-able in maiden. i'm unsure which logs exactly we'd want.
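One way the log dump could be sketched, in Python for illustration only. Which logs to collect is still an open question here, so the two commands below are just a guess, and the report file name in the usage note is hypothetical. `journalctl -b --no-pager` prints the journal for the current boot without paging.

```python
import subprocess
from pathlib import Path

# Assumed set of logs; the thread leaves open which ones are actually wanted.
DEFAULT_COMMANDS = {
    "journal": ["journalctl", "-b", "--no-pager"],
    "dmesg": ["dmesg"],
}


def write_report(out_path: Path, commands=DEFAULT_COMMANDS) -> Path:
    """Run each command and write its output under a labeled heading."""
    sections = []
    for label, argv in commands.items():
        try:
            result = subprocess.run(argv, capture_output=True, text=True, timeout=30)
            body = result.stdout or result.stderr
        except (OSError, subprocess.TimeoutExpired) as err:
            # A missing or hung tool should not prevent the rest of the report.
            body = f"(could not run {argv[0]}: {err})\n"
        sections.append(f"==== {label} ====\n{body}")
    out_path.write_text("\n".join(sections))
    return out_path
```

Calling `write_report(Path.home() / "dust" / "data" / "reset-report.txt")` would put the result where maiden can display it.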

tehn commented 4 years ago

here's my proposal:

note that if the norns is shut down (or matron restarted) in any way other than SLEEP it will boot up the next time dirty.
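The clean/dirty distinction can be sketched as follows, in Python for illustration only. The marker-file mechanism and both function names are assumptions; the thread only says that a clean_shutdown flag exists and that a dirty boot does not load a script.

```python
from pathlib import Path


def mark_clean_shutdown(flag: Path) -> None:
    """Called on the SLEEP path only: leave a marker for the next boot."""
    flag.write_text("clean\n")


def boot(flag: Path, last_script):
    """Return the script to load at boot, or None after a dirty shutdown."""
    clean = flag.exists()
    # Consume the marker so a crash later in this session reads as dirty.
    flag.unlink(missing_ok=True)
    return last_script if clean else None
```

Because only SLEEP writes the marker, every other exit path (white button, service restart, crash) ends up on the dirty branch without needing any code of its own.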

simonvanderveldt commented 4 years ago

> matron: add lower level reset function (which basically issues systemd commands)

Would this be different from systemd restart norns-* as listed in point 4?

> maiden: add ;restart in either window which does systemd restart norns-* (this will cause a dirty load of matron)

Could the dirty startup, which will trigger item 3 if I understand it correctly, somehow interfere with/be annoying when developing Lua (like the menu code you mentioned) or SC code from maiden?

P.S. We could probably use systemd's PartOf dependency so you only need to restart one item (probably norns.target)

PartOf=

Configures dependencies similar to Requires=, but limited to stopping and restarting of units. When systemd stops or restarts the units listed here, the action is propagated to this unit. Note that this is a one-way dependency — changes to this unit do not affect the listed units.
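As a config sketch of that idea: a fragment like the one below, added to each service that should follow the target. The unit name in the comment is an assumption about how the norns services are named; `PartOf=` is the real systemd directive quoted above.

```ini
# Hypothetical fragment for a unit such as norns-matron.service
[Unit]
PartOf=norns.target
```

With this in place, `systemctl restart norns.target` would propagate the restart to every unit that declares the dependency, while restarting a single service would leave the others untouched.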

tehn commented 4 years ago

> Would this be different from systemd restart norns-* as listed in point 4?

bullet 1 isn't really needed. just issuing an os command is sufficient.


i was not suggesting removing maiden's ;restart ;start and ;stop for matron. so that could still be used for development.