regression: `[loadbang]->[; pd quit(` is deferred

umlaeute commented 3 months ago

i just noticed that some of my unit-tests started to fail, because since Pd-0.55, a message sent with commandline arguments is executed before anything triggered with [loadbang]. With Pd<0.55 it was the other way round.

I used this behaviour for unit-testing:

a patch checks some behavior, and if it succeeds, it quits Pd with exitcode 0
if the patch does not work properly at all (e.g. because an external is being tested, and the external could not be loaded, so we have a non-functional object that stalls the execution), Pd exits with some other exitcode, triggered by -send "pd quit 100"

here's a super-simple example (call it quitme.pd):

#N canvas 2655 473 450 300 12;
#X obj 86 108 loadbang;
#X msg 86 133 \; pd quit 0;
#X connect 0 0 1 0;

(aka:

[loadbang]
|
[; pd quit 0(

The patch is run with pd -nogui -open quitme.pd -send "pd quit 2" ; echo $?

With Pd<0.55, this gives me 0, with Pd>=0.55, it returns 2.

umlaeute commented 3 months ago

on further inspection, it's slightly weirder.

a pd quit triggered by a [loadbang] is deferred until some time later. the pd quit that is injected via -send is executed immediately.

this patch:

[loadbang]
|
[t b b]
|     |
|     [; pd quit(
|
[print ouch]

will print ouch: bang with Pd-0.55, but will not print anything with Pd-0.54. I think the old behaviour is correct

Spacechild1 commented 3 months ago

I think the old behaviour is correct

It depends on your definition of "correct".

The "quit" message used to force quit Pd with exit(). Since Pd 0.55 it just sets a flag to let Pd break from the scheduler loop, close all canvases and in the future also clean up all resources. Naturally, Pd cannot break in the middle of a message, that's why you see the left trigger being executed.

I guess we could add second argument to "quit" which lets you force quit.

In the meantime, I think you need to restructure your test suites.

Spacechild1 commented 3 months ago

In your case, I think this can be easily fixed with a [del 0] after the left trigger.

Now, if you think that the new (correct :) behavior of "quit" is too much of a breaking change, we could let the "quit" message force-quit by default, with the option of doing a clean quit with a second argument (which would be used by the GUI).

Personally, I would prefer to keep the current behavior as the default.

umlaeute commented 3 months ago

It depends on your definition of "correct".

so why does a -send "pd quit 1" take precedence? without looking at the code, i guess that the exit code is set by the last "pd quit" that was received by Pd.

e.g.

[loadbang]
|
[t b b]
|     |
|     [; pd quit 0(
|
[; pd quit 1(

now exits with 1.

i still think that this is buggy behaviour.

doing a clean quit with a second argument

note that the quit message already has a 2nd argument: the exit code.

Personally, I would prefer to keep the current behavior as the default.

may i ask why? i do like a clean quit, but for me this means that the destructors of the various objects are called, not that the control flow gets re-ordered.

In the meantime, I think you need to restructure your test suites.

the problem with test-suites is, that there are often many many (small) testpatches that need to be updated.

Spacechild1 commented 3 months ago

without looking at the code, i guess that the exit code is set by the last "pd quit" that was received by Pd.

Yes. I didn't anticipate that someone would send several "quit" messages in the same clock tick... Here's a possible fix:

#define NOEXITCODE INT_MAX

static int sys_exitcode = NOEXITCODE;

if (sys_exitcode == NOEXITCODE)
    sys_exitcode = status;
else
    pd_error(0, "quit already called with exit code %d", sys_exitcode);

note that the quit message already has a 2nd argument: the exit code.

The exit code is the 1st argument ;-)

i do like a clean quit, but for me this means that the destructors of the various objects are called, not that the control flow gets re-ordered.

Well, you cannot call destructors in the middle of a clock tick. That's why "quit" only takes effect after the current clock tick has finished. There is no way around it. It's just that the existing behavior of "quit" was wrong and people apparently started to rely on it... If you read the "quit" message as "quit Pd after the current clock tick", there is no reordering of control-flow.

the problem with test-suites is, that there are often many many (small) testpatches that need to be updated.

Hmmm. That's a bummer. I'm not sure what to do. On the one hand, I don't want you to do lots of tedious work. On the other hand, I really think the current behavior is correct...

chr15m commented 3 months ago

This backwards incompatible change could affect a number of real world installations where people are relying on existing behavior.

Spacechild1 commented 3 months ago

This backwards incompatible change could affect a number of real world installations where people are relying on existing behavior.

Maybe. What about a compatibility flag? Alternatively, we could keep the old (wrong) behavior of "quit" as the default and offer the new (correct) behavior via a second argument. I don't really mind. @millerpuckette Any preferences?

Either way, the GUI would use the correct behavior.

chr15m commented 3 months ago

we could keep the old (wrong) behavior of "quit" as the default

I guess this typically what Miller does, erring on the side of backwards compatibility. Though I saw on the list he is looking at changing some stuff a bit so maybe it's a good time to change other things.

Anyway I don't have huge stakes in this I am just always thinking about the silent majority of users of software and thought I'd vouch for them!

Spacechild1 commented 3 months ago

I am just always thinking about the silent majority of users of software and thought I'd vouch for them!

Generally, that's also my position. You both are certainly right that there is a breaking change in Pd 0.55. Now the question is how to handle/fix it.

umlaeute commented 3 months ago

Yes. I didn't anticipate that someone would send several "quit" messages in the same clock tick... Here's a possible fix:

oh, i would just do it like:

void sys_exit(int status)
{
    static int firsttime = 1;
    pthread_mutex_lock(&sched_mutex);
    if (fisttime) {
       sys_exitcode = status;
    } else {
       pd_error(0, "quit already called with exit code %d", sys_exitcode);
    }
    firsttime = 0;
    sys_quit = SYS_QUIT_QUIT;
    pthread_cond_signal(&sched_cond);
    pthread_mutex_unlock(&sched_mutex);
}

Well, you cannot call destructors in the middle of a clock tick.

why not? (I'm sure there are reasons, but it's better to be explicit). afaik, all objects always have a consistent internal state, and this is what the destructor is supposed to cleanup. i don't see how the clock tick is related to that.

multithreading complicates things, but multithreading is not related to clock ticks either.

That's why "quit" only takes effect after the current clock tick has finished. There is no way around it. It's just that the existing behavior of "quit" was wrong and people apparently started to rely on it...

how is it wrong? and why can there be no way around it? (e.g. one possibility would be to stop the dispatcher once sys_quit is set. this way the stack would properly unwind (and no new "pd quit" messages could possibly appear)

If you read the "quit" message as "quit Pd after the current clock tick", there is no reordering of control-flow.

yes, but why would I read the "quit" message like that? if could also read the "quit" message as "quit Pd after working hours at 5pm", but that doesn't make my (our your) interpretation any more "correct".

oh, and please let us not get into compatibility flags. (at least not, until you have convinced me, that the old behaviour is indeed buggy/incorrect)

umlaeute commented 3 months ago

actually i think the proper (at least: simplest) way would be to just:

if (SYS_QUIT_QUIT != sys_quit) {
   sys_exitcode = status;
} else {
   pd_error(0, "quit already called with exit code %d", sys_exitcode);
}

Spacechild1 commented 3 months ago

actually i think the proper (at least: simplest) way would be to just:

+1 So obvious in hindsight 🤦‍♂️

Spacechild1 commented 3 months ago

Well, you cannot call destructors in the middle of a clock tick.

why not? (I'm sure there are reasons, but it's better to be explicit). afaik, all objects always have a consistent internal state, and this is what the destructor is supposed to cleanup. i don't see how the clock tick is related to that.

Obviously, we cannot free Pd's resources while message passing is still in progress, as this would essentially pull the rug from under our feet. We have to wait after the current clock tick has been finished. That's what sched_tick is doing: https://github.com/pure-data/pure-data/blob/14d8b9702db5ff2e448a600f4e32e998271dbd8a/src/m_sched.c#L289-L313

Note that I did not change the implementation of sched_tick (except for a minor detail), so that was apparently the intention from the start. At the same time, glob_exit used to just call exit, defeating the code in sched_tick.

(e.g. one possibility would be to stop the dispatcher once sys_quit is set. this way the stack would properly unwind (and no new "pd quit" messages could possibly appear)

We could add checks for sys_quit in all outlet_* and bindlist_* functions to stop the message passing, but this would add lots of additional if-checks just to workaround an edge case.

So I stay with my two suggestions:

keep the old behavior of "quit" as the default, but allow to get the new behavior with a new second argument; the GUI would always use the latter to initiate a clean shutdown
get the old behavior with a compability flag

Spacechild1 commented 3 months ago

I can make a PR for 1.

umlaeute commented 3 months ago

i've pushed the change from https://github.com/pure-data/pure-data/issues/2339#issuecomment-2160452524 to develop. so i think the most important problem for me is solved right now.

it doesn't yet fix the problem of

[bang(
|
[t b b]
|     [; pd quit(
|
[start ww3]

(but your tentative PR might).

We could add checks for sys_quit in all outlet_* and bindlist_*

the obvious entry point for protecting the outlet_* (and i think bindlist_*) is to just hack into stackcount_add():

static INLINE int stackcount_add(void)
{
    if (SYS_QUIT_QUIT == sys_quit)
        return 0;
        /* set overflow flag to prevent any further messaging */

(it will give an ugly "stack overflow" error for each "fan-out" (including triggers) that is currently being processed, but that's probably good enough...)

Spacechild1 commented 3 months ago

the obvious entry point for protecting the outlet* (and i think bindlist*) is to just hack into stackcount_add():

bindlist currently does not call stackcount_add. Also, I think it would be hard to catch all edge cases. I wouldn't even give it a try.

Spacechild1 commented 3 months ago

So what's the problem with suggestion 1.? That would be a minimal local change which would completely restore the previous behavior (while still getting a nice shutdown when quitting from the GUI).

umlaeute commented 3 months ago

bindlist currently does not call stackcount_add

not directly, but indirectly.

also, i haven't been able to crash [text sequence] with following stackcount hack on top of 48cad5fb333a169bbb766717970808077e8f1620:

diff --git a/src/m_obj.c b/src/m_obj.c
index f58bfc32..1ad8f37a 100644
--- a/src/m_obj.c
+++ b/src/m_obj.c
@@ -364,8 +364,11 @@ static PERTHREAD int outlet_eventno;
     messages so that  the outlet functions can check to prevent stack overflow]
     from message recursion.  Also count message initiations. */

+extern int sys_quit;
 static INLINE int stackcount_add(void)
 {
+    if (1 == sys_quit)
+        return 0;
         /* set overflow flag to prevent any further messaging */
     if (++stackcount >= STACKITER)
         overflow = 1;
diff --git a/src/m_sched.c b/src/m_sched.c
index e2638c85..f08b8218 100644
--- a/src/m_sched.c
+++ b/src/m_sched.c
@@ -28,7 +28,7 @@
 #define SYS_QUIT_QUIT 1
 #define SYS_QUIT_REOPEN 2
 #define SYS_QUIT_CLOSE 3
-static int sys_quit;
+int sys_quit = 0;
 static pthread_cond_t sched_cond;
 static pthread_mutex_t sched_mutex;
 static int sched_useaudio = SCHED_AUDIO_NONE;

using this patch (invoked with pd -nogui -open test.pd -send "start bang"

#N canvas 2479 309 450 434 12;
#X obj 128 187 r foo;
#X obj 128 212 t b b b;
#X obj 176 251 print foo.1;
#X obj 64 261 print foo.3;
#X obj 212 90 text define -k \$0.foo;
#A set foo 1 \; foo 2 \;;
#X obj 124 149 print seq;
#X obj 124 124 text sequence \$0.foo -g;
#X obj 80 31 r start;
#X msg 146 316 \; pd quit 0;
#X msg 91 78 line 0 \, bang;
#X connect 0 0 1 0;
#X connect 1 0 3 0;
#X connect 1 1 8 0;
#X connect 1 2 2 0;
#X connect 6 0 5 0;
#X connect 7 0 9 0;
#X connect 9 0 6 0;

i think this patch is fairly minimal (but spans two files, and turns a static variable into a non-static variable)

So what's the problem with suggestion 1.?

well, i do think that a nice shutdown would be good for everyone. esp, I do not want to get rid of it, just because i would like my quit messages to be acted on promptly.

iiuc, suggestion #1 would be a way to defeat the orderly shutdown. stopping the dispatcher would instead keep the orderly shutdown but make it behave correctly (obviously with my definition of "correct": which is "everybody stop doing their work, cleanup and shut down")

Spacechild1 commented 3 months ago

well, i do think that a nice shutdown would be good for everyone. esp, I do not want to get rid of it, just because i would like my quit messages to be acted on promptly.

I see your point. However, adding all these if-checks would be a rather invasive change. It's certainly not something I would do willy-nilly. Also, with solution nr. 1 anybody can get an orderly shutdown with the "quit" message if they want, they just need to slightly change their patch.

Anyway, I think we really need the opinion of @millerpuckette.

danomatika commented 3 months ago

Without going into the details myself, I feel like pd finishing the current graph, then exiting makes sense. However, I acknowledge that I don't have anything that really relies on the old behavior.

Maybe the quit message could have an optional argument or float to mean "quit NOW" to exit immediately? ala pd quit 1

Spacechild1 commented 3 months ago

also, i haven't been able to crash [text sequence] with following stackcount hack on top of https://github.com/pure-data/pure-data/commit/48cad5fb333a169bbb766717970808077e8f1620:

Sorry, yes, there wouldn't be a crash because the cleanup would happen after we break from the dispatcher. But "quit" would not prevent further messaging in all cases:

#N canvas 656 319 450 300 12;
#X obj 59 162 text sequence foo -g;
#X obj 59 104 bng 19 250 50 0 empty empty empty 0 -10 0 12 #fcfcfc #000000 #000000;
#X msg 59 131 line 0 \, bang;
#X obj 57 33 array define bar;
#X obj 59 69 text define -k foo;
#A set bar resize 200 \; pd quit 0 \; bar resize 500 \;;
#X connect 1 0 2 0;
#X connect 2 0 0 0;

(EDIT: forgot -k flag in [array define]...)

My point is that objects can also be messaged directly, so we would actually need to add these checks to pd_typedmess, pd_float, pd_list, etc., which IMO would be gross.

Spacechild1 commented 3 months ago

Maybe the quit message could have an optional argument or float to mean "quit NOW" to exit immediately? ala pd quit 1

That is exactly what I was suggesting, but with the reversed meaning: quit <code> and quit <code> 0 would quit NOW and quit <code> 1 would quit gracefully (-> used by the GUI).

In theory, we could also make quit <code> default to quit <code> 1, where 1 means "NOW", but that would be a bit unusual, as optional arguments in Pd typically default to 0.

danomatika commented 3 months ago

Alternatively, there could be a new exit message which quits immediately also set's Pd's return code. Then updating would be find/replace pd quit with pd exit 1 or whatever. Silver linings?

Spacechild1 commented 3 months ago

Alternatively, there could be a new exit message which quits immediately also set's Pd's return code. Then updating would be find/replace pd quit with pd exit 1 or whatever. Silver linings?

I like this! Not really backwards compatible, but as you said, patches would be easy to fix with a simple find/replace. "exit" would simply call exit (like "quit" used to do).

umlaeute commented 3 months ago

@Spacechild1 wrote:

I see your point. However, adding all these if-checks would be a rather invasive change.

that's what i initially thought. but then, the implementation is now a single additional if, within a(n inline) function that already does checks some condition. the code is run for each message, so overhead matters. i still trust that the actual overhead is small enough.

@danomatika wrote

My point is that objects can also be messaged directly, so we would actually need to add these checks to pd_typedmess, pd_float, pd_list, etc., which IMO would be gross.

all these funtions already check the stack (stackcount_add()). which is probably gross already, so I don't think that my suggestion is much worse. we could of course just lower the stacksize to 0, so we only need a single comparison (but would need to compare against a variable, rather than a constant)

@Spacechild1

But "quit" would not prevent further messaging in all cases:

is your example missing the data?

@danomatika wrote

Alternatively, there could be a new exit message which quits immediately also set's Pd's return code. Then updating would be find/replace pd quit with pd exit 1 or whatever.

that seems to be backward. we already had an old quit message that quits immediately and also sets Pd's return code. rather than adjusting each and every patch that uses quit in the old ways (however few), it would make more sense for those who want the new behavior to use the (tentative) new exit message. this way, nothing breaks, and we still get the new beahviour.

(i guess the motivation is, to get the cleanup for free for patches that use pd quit without caring)

Silver linings?

probably

Spacechild1 commented 3 months ago

all these funtions already check the stack (stackcount_add()).

They don't! For example:

https://github.com/pure-data/pure-data/blob/14d8b9702db5ff2e448a600f4e32e998271dbd8a/src/m_pd.c#L278-L284

Same for pd_symbol, pd_list, etc. and also for pd_typedmess, pd_vmess, etc.

is your example missing the data?

See my EDIT :)

rather than adjusting each and every patch that uses quit in the old ways (however few), it would make more sense for those who want the new behavior to use the (tentative) new exit message.

But do we really then need a new message? Why not add an argument to "quit"? I think we are back to my suggestion nr. 1

umlaeute commented 3 months ago

Same for pd_symbol, pd_list, etc. and also for pd_typedmess, pd_vmess, etc.

I think I actually tried...

Anyhow, I still do not understand what would be wrong in trying to stop the dispatcher, even if not all messages could be suppressed.

I start to understand better what (I think that) @millerpuckette had in mind when saying that the patch shouldn't get in the way of quitting Pd (that's from the top of my head, so I might misremember).

Spacechild1 commented 3 months ago

Anyhow, I still do not understand what would be wrong in trying to stop the dispatcher, even if not all messages could be suppressed.

I'm not a fan of half-baked solutions :) Also, it is a rather invasive change just to workaround a fringe use case.

So why not just add a second argument to "quit"? That would be a small local change with no side effects.

millerpuckette commented 3 months ago

Or... we could have 'quit' simply call exit() as before, and have a new message to quit politely. Then we're totally compatible and can still do clean exits from a patch if anyone ever wants to (I can't think of a reason but someone else probably will.) I'm not sure what to call the clean-quit message though since 'quit' is the natural name. Maybe "quit 4" should exit immediately and "quit clean 4" should quit politely with exit code 0.

Spacechild1 commented 3 months ago

Maybe "quit 4" should exit immediately and "quit clean 4" should quit politely with exit code 0.

That's basically what I was suggesting, but I would use a second (optional) argument instead: quit <exitcode> [<clean>] (default: clean = 0)

umlaeute commented 3 months ago

Same for pd_symbol, pd_list, etc. and also for pd_typedmess, pd_vmess, etc.

I think I actually tried...

i checked again, and noticed that my testing was flawed.

i'm still not convinced that it is a bad idea to hook into the overflow-protection to finish the graph as fast as possible without destroying the stack. (though i understand the concerns about "half-baked solutions").

Spacechild1 commented 3 months ago

i'm still not convinced that it is a bad idea to hook into the overflow-protection to finish the graph as fast as possible without destroying the stack.

I am :) Since the solution would be necessarily "half-baked", I don't see what it would buy us.

AFAICT, there only two practical ways of quitting Pd:

exit immediately (without) clean up
break from the scheduler after the current clock tick

Regarding the "quit" message, the concensus seems to be to keep 1. as the default behavior, but allow 2. as an option (either with a second argument or another message).

umlaeute commented 3 months ago

well.

just to add another data point: while https://github.com/pure-data/pure-data/commit/48cad5fb333a169bbb766717970808077e8f1620 fixed the problems with one testsuite, for another project i used a pattern like this:

[bang(
|
[until]
|
[textfile]
|        |
|        [; pd quit(

which now also stalls (and obviously must stall, if the actual quitting is deferred until [until] has finished). (and yes, the fix here is to also stop the [until] once the [textfile] is done)

i think i'll stop buying that this is a "fringe case". the [; pd quit( message (as opposed to "Quit via GUI") is basically broken.

yes, i know that once "the consensus" is implemented, the old and working behaviour will be restored)

Spacechild1 commented 3 months ago

the [; pd quit( message (as opposed to "Quit via GUI") is basically broken.

I think I've already acknowledged that there is a regression.

yes, i know that once "the consensus" is implemented, the old and working behaviour will be restored)

I'm not sure what we're still discussing then...

Spacechild1 commented 3 months ago

Here's a PR: https://github.com/pure-data/pure-data/pull/2346

@millerpuckette Actually, now I prefer your idea to have a dedicated "exit" message for graceful shutdown. I interpret it as "exit from scheduler", so the name can be justified :)

pure-data / pure-data

regression: `[loadbang]->[; pd quit(` is deferred #2339