squeak-smalltalk / squeak-object-memory

Issues and assets related to the Squeak object memory.
https://bugs.squeak.org
MIT License
11 stars 1 forks source link

BlockCannotReturn when terminating a time-restricted process #108

Open LinqLover opened 6 months ago

LinqLover commented 6 months ago
process := [[1 second wait] valueWithin: 2 seconds onTimeout: []] newProcess.
process runUntil: [:c | c isClosureContext and: [c size > 0 and: [c top isKindOf: Process]]]. "<-- simulate a possible context switch in primitiveResume"
process step.
process terminate.

Expected: The stack of the process is fully unwound, and the process is terminated correctly. Actual: The following debugger appears (BlockCannotReturn):

image

This is not a regression and failed in 5.3, too (but with a different error).

I have no idea yet why this happens, but it is blowing up every few minutes in my current multiprocessing-intensive project. Fix or workaround would be highly desired.

Cc'ing @isCzech in case this bug interests you :)

isCzech commented 6 months ago

Hi Christoph,

here's what's happening:

  1. your runUntil: drives the process right before watchdog resume in #valueWithin:onTimeout:
  2. process step then resumes the watchdog process but won't enter the ensure: argument block, hence
  3. theProcess := nil assignment won't happen when the process is terminated and...
  4. because the process is not nil when the timout expires, the watchdog assumes the receiver hasn't finished, and...
  5. as a result it raises the TimedOut exception via #signalException: and there things finally fail miserably (you can try to place self halt to ##signalException: before self resume and see)

I'm not sure what would be the best way to fix it but a suggestion addressing your example is in Chronology-Core-jar.89

Let me know what you think...

Thanks for this delightful example ;) best, Jaromir

The original version:

valueWithin: aDuration onTimeout: timeoutBlock
    "Evaluate the receiver.
    If the evaluation does not complete in less than aDuration evaluate the timeoutBlock instead"

    | theProcess delay watchdog tag |

    aDuration <= Duration zero ifTrue: [^ timeoutBlock value ].

    "the block will be executed in the current process"
    theProcess := Processor activeProcess.
    delay := aDuration asDelay.
    tag := self.

    "make a watchdog process"
    watchdog := [
        delay wait.     "wait for timeout or completion"
        theProcess ifNotNil:[ theProcess signalException: (TimedOut new tag: tag)]   "<--- here's a problem"
    ] newProcess.

    "Watchdog needs to run at high priority to do its job (but not at timing priority)"
    watchdog priority: Processor timingPriority-1.

    "catch the timeout signal"
    ^ [ watchdog resume.                "start up the watchdog"
        self ensure:[                       "evaluate the receiver"
            theProcess := nil.              "it has completed, so ..."
            delay delaySemaphore signal.    "arrange for the watchdog to exit"
        ]] on: TimedOut do: [ :e | 
            e tag == tag 
                ifTrue:[ timeoutBlock value ]
                ifFalse:[ e pass]].
isCzech commented 6 months ago

Hi Christoph, IMO it really is a bug and this seems to be the right fix: Chronology-Core-jar.90

LinqLover commented 6 months ago

Hi Jaromir, thanks a lot for the investigation! Both your fixes look good to me but the second one (Chronology-Core-jar.90) seems more simple as it avoids the complexity of testing process states. I have tested it in my image for one day and the number of unexpected BCRs has dropped significantly. :-)

(Still, I keep seeing some rare BCRs with a different stack that unfortunately freeze my image so I cannot investigate them, but I don't even know whether they are related to valueWithin:onTimeout:, so let's put that aside for now ...)

By the way, initially I assumed that when the watchdog is preempted right before signaling the timeout, there could be another synchronization issue ... but this will never happen because it has a higher priority than theProcess. Could there be any other synchronization issues with this method? Right now I can't find any. :-)

isCzech commented 6 months ago

Hi Christoph, out of curiosity, could you try Chronology-Core-jar.91 ? My guess is it's equivalent to Chronology-Core-jar.90 but terminating the watchdog seems a cleaner solution and there's a chance it may have a slightly different outcome.

Could there be any other synchronization issues with this method?

It's a tricky neighborhood :) #signalException: may have unexpected side effects because it resumes the receiver process. If this process is waiting on a condition variable, e.g. on a semaphore or a delay, it will be taken out of this wait and will continue after signaling the exception. See these examples (edit: wrong examples; see the next post):

[Transcript show: 'I''ll wait for 10 seconds... '. 10 seconds wait] valueWithin: 1 seconds onTimeout: [Transcript show: 'oops, maybe not!  ']
[Transcript show: 'I''ll wait on a semaphore... '. Semaphore new wait] valueWithin: 1 seconds onTimeout: [Transcript show: 'oops, maybe not!  ']

I'm not sure what should be the right expectation... to do what it currently does or continue waiting? What would you expect? Is this scenario something you use in the Trace debugger or the Simulation studio?

isCzech commented 6 months ago

Sorry, wrong examples in my previous post - they don't demonstrate the point. Try these instead:

[Transcript show: 'I''ll wait for 10 seconds... '. 10 seconds wait] valueWithin: 1 seconds onTimeout: [Transcript show: 'time is up... '].
Transcript show: 'am I cheating?  '
[Transcript show: 'I''ll wait on a semaphore... '. Semaphore new wait] valueWithin: 1 seconds onTimeout: [Transcript show: 'time is up... '].
Transcript show: 'bye, semaphore!  '

Something doesn't feel right here... In this example the process won't leave the semaphore:

p := [Semaphore new wait] forkAt: Processor activePriority+1.
p signalException: Notification.
p isBlocked  "<----- true"
isCzech commented 6 months ago

Hi Christoph, I guess the above examples show the intended semantics of #valueWithin:onTimeout: after all. Sorry for the noise :)

If you find the changsets correctly fixing the bug in #valueWithin:onTimeout:, could you please merge? I'd prefer Chronology-Core-jar.91 but Chronology-Core-jar.90 is also fine. (and please remove the remaining two from the Inbox)

Thanks, Jaromir

LinqLover commented 5 months ago

Hi Jaromir,

sorry for not replying earlier ...

out of curiosity, could you try Chronology-Core-jar.91 ? My guess is it's equivalent to Chronology-Core-jar.90 but terminating the watchdog seems a cleaner solution and there's a chance it may have a slightly different outcome.

Yes, that's probably cleaner! Nice find! I have loaded it into my image and will report after a few days ... (by the way, with Chronology-Core-jar.90, I still saw a - though much smaller - number of errors from this corner, but I unfortunately did not record them. Will do so when I see them again.

Regarding your other examples: I agree with your conclusion that this is expected behavior. Just like Semaphore>>#signal, Process>>#signalException: indicates to me the intent of moving out the possibly waiting process of any suspension state.

LinqLover commented 5 months ago

Hi Jaromir,

I have merged Chronology-Core-jar.91 (and Chronology-Core-jar.90 because you based jar.91 on its ancestry) to the trunk and added a test from my above example. By the way, the both are not fully equivalent - jar.90 did not pass my test because, surprise, delay delaySemaphore signal does not make the waiting process answer true to isTerminated. Decide you whether that is expected behavior of isTerminated or not. :-)

Apart from that, I noticed some other strange thing: If you try to step over TestValueWithinFix debug: #testValueWithinTimingBasic (or some of the other tests), the test will fail. stepThrough is even weirder. Just in case you are interested in investigating that, too. :-)