hpi-swa / trufflesqueak

A Squeak/Smalltalk VM and Polyglot Programming Environment for the GraalVM.
MIT License
286 stars 14 forks source link

Squeak 6.0 beta image can not be saved #159

Closed dram closed 2 years ago

dram commented 2 years ago

In TruffleSqueak, Squeak 6.0 beta image can not be saved, no error is displayed, neither in Squeak window, nor in the console.

dram commented 2 years ago

Have a more close look at this issue, the saving process seems be stalled at SmalltalkImage>>#processShutDownList: in SmalltalkImage>>#snapshot:andQuit:withExitCode:embedded:.

And I found something suspicious, there is an extra process Delay>>wait in TruffleSqueak with 6.0 beta image, which does not exist in OpenSmalltalk VM, i.e.:

image

BTW, this extra process also does not exist in TruffleSqueak with TruffleSqueak-22.1.0 image.

fniephaus commented 2 years ago

Thanks for the report and for doing some digging. It seems that EventTicklerProcess terminate causes the issue. Not sure what has changed recently, but using the following allows me to save the image again:

EventSensor>>#shutDown

    InterruptWatcherProcess ifNotNil: [
        InterruptWatcherProcess terminate.
        InterruptWatcherProcess := nil ].

    EventTicklerProcess ifNotNil: [
        "EventTicklerProcess terminate".
        EventTicklerProcess := nil. ].

    inputSemaphore ifNotNil:[Smalltalk unregisterExternalObject: inputSemaphore].

Any idea what could be going on here? /cc @marceltaeumel

fniephaus commented 2 years ago

Apparently, the semantics of Process>>#terminate have changed a bit. Maybe @LinqLover has an idea what could be going on.

LinqLover commented 2 years ago

A few months ago, #terminate has started to complete all ensure: contexts on the receiver stack first, even those that were already active before the termination request. Could there be any ensure: context like this on the stack of the event tickler process?

Also /cc @isCzech

isCzech commented 2 years ago

Hi @LinqLover, all, it's been a year since the extended #terminate logic has been introduced :) Before digging deeper, could you please filein and check the latest #terminate version (attached) to see if anything changes? The trunk #terminate still contains some bugs which have been fixed in the latest enclosed version (expected to be merged into the release image).

And yes, the change in the #terminate logic only affects terminating processes suspended/terminated inside #ensure: unwind blocks. Kernel-jar.1447.followUp.cs.zip

isCzech commented 2 years ago

Follow up: To be precise, the enclosed changeset Kernel-jar.1447.followUp.cs.zip is meant to be used along with the updated #suspend attached here: RevisedSuspend.2.cs.zip

The new release VM (bundled in the beta image) supports new #suspend primitive 578 fixing incorrect #suspend semantics of the old primitive 88. The attached RevisedSuspend.2.cs.zip enables the image to use the new primitive.

Alternatively, instead of filing-in these two changesets you can try Kernel-jar.1447 from the Inbox to see if anything changes.

fniephaus commented 2 years ago

Thanks for the details! Could you summarize how primitive 578 is different to primitive 88?

isCzech commented 2 years ago

The change affects processes waiting on a Semaphore or Mutex; so far #suspend removed the waiting process from the semaphore or mutex and placed it in a run queue when resumed. That's a bug. Now, #suspend backs up the process's code before the wait so as a result the process returns to the same wait state when resumed.

Here's part of Eliot's comment for the revised #suspend: "...If the given process is not the active process, take it off its corresponding list. If the list was not its run queue assume it was on some condition variable (Semaphore, Mutex) and back up its pc to the send that invoked the wait state the process entered. Hence when the process resumes it will reenter the wait state. ..."

I'm not sure applying these changes will help as I don't understand the root cause. But it's a start :)

isCzech commented 2 years ago

I'm not familiar with TruffleSqueak but I've downloaded TruffleSqueakImage-22.1.0, used the latest VM version VMMaker.oscog-mt.3184 and tried two experiments: 1) updated the image via Squeak -> Update Squeak menu to the latest... saving works OK; then filed-in the two changesets attached above... again saving the image works OK 2) only filed-in the two changesets above to verify #terminate and #suspend work correctly... and again saving the image works OK

What was your scenario when saving the image failed and the eventTickler process didn't terminate. Where is the difference from my scenarios?

dram commented 2 years ago

Hi @isCzech , sorry that I should give more details in the issue description.

This issue can be reproduced using TruffleSqueak VM with recent Squeak 6.0 images, following are detailed steps under Windows:

  1. download graalvm-ce-java17-windows-amd64-22.1.0.zip from https://github.com/graalvm/graalvm-ce-builds/releases and unpack
  2. download trufflesqueak-installable-svm-java17-windows-amd64-22.1.0.jar from https://github.com/hpi-swa/trufflesqueak/releases
  3. install TruffleSqueak into GraalVM directory with command: graalvm-ce-java17-22.1.0\bin\gu –local-file install trufflesqueak-installable-svm-java17-windows-amd64-22.1.0.jar
  4. download Squeak6.0alpha-21736-64bit-202204190959-Windows-x64.zip from https://files.squeak.org/ and unpack (latest image is not compatible with TruffleSqueak 22.1.0, so a slightly older one is used here)
  5. copy Squeak6.0alpha-21736-64bit.image, Squeak6.0alpha-21736-64bit.changes and SqueakV50.sources into directory graalvm-ce-java17-22.1.0\languages\smalltalk\resources
  6. Launch TruffleSqueak with command graalvm-ce-java17-22.1.0\bin\trufflesqueak

I'll try your attached patches later.

dram commented 2 years ago

Hi @isCzech ,

I have tried your two changesets, image saving works after applied!

But there is some other problem, for each time I do a saving, there will be an extra process created, following is a screenshot of Process Browser:

image

fniephaus commented 2 years ago

Thanks for checking, @dram! TruffleSqueak does not implement primitive 578 yet, so maybe that's why you get to see these additional processes. Let me try and push an implementation.

isCzech commented 2 years ago

Hi @dram, I'm a bit confused now: you use Squeak6.0alpha-21736 and say later ones are not compatible. How did you test Squeak beta which is 21757 and later? About the incompatibility: Again, I'm not familiar with TruffleSqueak; I understand you don't use OpenSamlltalkVM but GraalVM - is it possible the GraalVM somehow doesn't work correctly with the latest Squeak images? I mean even the not so latest, causing incorrect behavior? Is there a version number that works and the next one doesn't? That would help :)))

isCzech commented 2 years ago

Hi @fniephaus,

TruffleSqueak does not implement primitive 578 yet, so maybe that's why you get to see these additional processes. Let me try and push an implementation.

Worth a try but as long as #suspend uses primitive 88, things should still work as before... A question: You tried using Cuis image - does it work ok? The reason I'm asking is Cuis implemented the latest #terminate a few months ago, so if Cuis image works ok, that means #terminate should work ok and the problem might lie somewhere else. (i.e. the inability to terminate the eventTickler could be a consequence of something else).

fniephaus commented 2 years ago

I tried implementing primitive 578 (see https://github.com/hpi-swa/trufflesqueak/commit/b926e17dfa332ff24746f936bab0de6b25df02f2) but unfortunately, I still see stale processes when saving the image.

You tried using Cuis image - does it work ok? The reason I'm asking is Cuis implemented the latest #terminate a few months ago, so if Cuis image works ok, that means #terminate should work ok and the problem might lie somewhere else. (i.e. the inability to terminate the eventTickler could be a consequence of something else).

I just opened the Cuis6.0-5171.image with a TruffleSqueak built from https://github.com/hpi-swa/trufflesqueak/commit/b926e17dfa332ff24746f936bab0de6b25df02f2 and saving works just fine, without any extra processes.

isCzech commented 2 years ago

I tried implementing primitive 578 (see b926e17) but unfortunately, I still see stale processes when saving the image.

You tried using Cuis image - does it work ok? The reason I'm asking is Cuis implemented the latest #terminate a few months ago, so if Cuis image works ok, that means #terminate should work ok and the problem might lie somewhere else. (i.e. the inability to terminate the eventTickler could be a consequence of something else).

I just opened the Cuis6.0-5171.image with a TruffleSqueak built from b926e17 and saving works just fine, without any extra processes.

Well, I guess it means the extended #terminate logic itself is not the culprit... I'll try to follow @dram 's steps and see :)

isCzech commented 2 years ago

Hi again, I can confirm I can reproduce the incorrect behavior (installed things as per steps 1 to 3, then for some reason step 4 didn't work at all - the image didn't start but I used TruffleSqueakImage 22.1.0 instead and updated the terminate method) It's strange that the same Truffle/Squeak image works ok with the OpenSmalltalk VM and behaves incorrectly with the GraalVM, while at the same time the Cuis image with the updated terminate works ok with both VMs!

I've noticed Cuis uses different priorities for the interrupt watcher and the low space watcher while Squeak runs all of them with the same priority, however, changing the priorities doesn't help. But there's another difference - for some reason Cuis has modified the EventSensor implementation, including the evenTickler, which may explain Cuis works flawlessly :) Maybe @jvuletich could shed some light here why (Hi Juan, I hope you don't mind :) )

fniephaus commented 2 years ago

Just in case you, @isCzech, are not aware: TruffleSqueak is an alternative Smalltalk VM implementation, so it's very likely that something on the VM level isn't working correctly. However, things used to work fine with older Squeak images and with Cuis images. I noticed that newly introduced ProcessTest crash TruffleSqueak, so I expect there to be at least a VM-level problem. But there's probably also something wrong on the image level as Cuis has demonstrated.

isCzech commented 2 years ago

Thanks indeed! I've realized you must have a different VM (not sure how different though and how the latest VM changes map into your VM); nonetheless you're absolutely right there must be something off on the image level if Cuis works fine either way :) (I haven't tested the new Process tests on an old image but yes, they may be very disruptive with the old image)

dram commented 2 years ago

Hi @dram, I'm a bit confused now: you use Squeak6.0alpha-21736 and say later ones are not compatible. How did you test Squeak beta which is 21757 and later? About the incompatibility: Again, I'm not familiar with TruffleSqueak; I understand you don't use OpenSamlltalkVM but GraalVM - is it possible the GraalVM somehow doesn't work correctly with the latest Squeak images? I mean even the not so latest, causing incorrect behavior? Is there a version number that works and the next one doesn't? That would help :)))

Hi @isCzech, the incompatibility which I mentioned is about issue #156. It is fixed recently, but not released yet. I tested Squeak6.0beta-21772 with TruffleSqueak build from main branch.

If you need to experiment with latest Squeak image, you can follow instructions in https://github.com/hpi-swa/trufflesqueak/blob/main/docs/development.md

isCzech commented 2 years ago

Hi @dram, thanks, I understand now. In the meantime I've come to a tentative conclusion there may be something wrong with the Squeak image other than the new termination logic. Not sure what though; it's possible fixing #terminate bugs may have unmasked some other issues - something Cuis has fixed or never suffered from. See above observations; I hope Juan (@jvuletich) could enlighten us why he changed the EventSensor implementation recently - whether he was possibly fixing some bugs...? In which case Squeak could follow suit :)

dram commented 2 years ago

Hi @isCzech, when looking around code of Process>>#terminate, I find following line a bit suspicious:

suspendedContext runUntilErrorOrReturnFrom: suspendedContext

According to the doc of Context>>#runUntilErrorOrReturnFrom:, the argument of it should be a sender of self, so I try to change this line to following, after that, image saving works without blocking:

suspendedContext runUntilErrorOrReturnFrom: ctxt

There is some similar code in your patched version, i.e.:

top runUntilReturnFrom: top

Similarly, after I change it to following, no more stale processes created after saving:

top runUntilReturnFrom: ctx

I'm not familiar with Squeak's process scheduling system, so not sure if such change is reasonable. Also I'm curious why there is no problem in OpenSmalltalk VM.

isCzech commented 2 years ago

According to the doc of Context>>#runUntilErrorOrReturnFrom:, the argument of it should be a sender of self,

Yes, the description "ASSUMES aSender is a sender of self" always bothered me :) I probably used the method beyond it's original intended use; however, it works correctly even when the receiver and 'aSender' are identical.

so I try to change this line to following, after that, image saving works without blocking:

suspendedContext runUntilErrorOrReturnFrom: ctxt

yes, it will cover most cases but I'm attaching a test that will fail with your modification; it is a bug in the original #terminate. It fails to recognize the case when the unwind block is the top context (of the stack). #runUntilReturnFrom: is a trivialized (or less restricted) version of #runUntilErrorOrReturnFrom: designed as a helper method for #unwindTo:; it's purpose is to run a fragment of another context stack. ProcessTest-testTerminateEnsureAsStackTop.zip

However, file-in the attached fix - it works in my case. I've compared Cuis's code and found this difference - thanks @jvuletich !! Nice catch :) EventSensor-eventTickler.zip

I don't understand yet why the same problem won't show with the Squeak VM...

isCzech commented 2 years ago

I don't understand yet why the same problem won't show with the Squeak VM...

I wonder if speed of the processing could have anything to do with it?

Anyway, if you let me know whether the fix works in all your scenarios, I'll send the fix to Squeak Inbox. Thanks

dram commented 2 years ago

Hi @isCzech, I have a test of EventSensor-eventTickler patch, it seems that the stale process problem still exists.

Tests are taken in two environments:

Steps:

  1. launch the clean Squeak image
  2. file in EventSensor-eventTickler.st, Kernel-jar.1447.followUp.cs and RevisedSuspend.2.cs
  3. open Process Browser
  4. Trigger saving from menu bar
dram commented 2 years ago

And regarding to ProcessTest-testTerminateEnsureAsStackTop, seems that it failed in TruffleSqueak with EventSensor-eventTickler.st and Kernel-jar.1447.followUp.cs filed in.

image

The assertion failure is caused by following line:

self assert: p1 isTerminated & p2 isTerminated & p3 isTerminated.
jvuletich commented 2 years ago

WRT #eventTickler in Cuis, it was last tweaked in December 2021 by Andrés Valloud. There is a possible weakness in Delay. If a Delay is waiting, but its process is terminated, and then the same delay is sent #wait without checking #beingWaitedOn, the system wil hang. See senders of #wait and #beingWaitedOn in Cuis. In any case, I don't know if this has any relation with the problems when saving Squeak in TruffleSqueak. I'm just answering Jaromir's question about #ventTicker in Cuis.

isCzech commented 2 years ago

And regarding to ProcessTest-testTerminateEnsureAsStackTop, seems that it failed in TruffleSqueak with EventSensor-eventTickler.st and Kernel-jar.1447.followUp.cs filed in.

image

The assertion failure is caused by following line:

self assert: p1 isTerminated & p2 isTerminated & p3 isTerminated.

yes, sorry, I haven't realized you don't have the full set of patches... It also depends on the combination of the VM+image :)

isCzech commented 2 years ago

Hi @isCzech, I have a test of EventSensor-eventTickler patch, it seems that the stale process problem still exists.

Tests are taken in two environments:

  • Debian Linux 11, TruffleSqueak (main branch), Squeak6.0beta-21772
  • Windows 11, TruffleSqueak 22.1.0, Squeak6.0alpha-21736

Steps:

  1. launch the clean Squeak image
  2. file in EventSensor-eventTickler.st, Kernel-jar.1447.followUp.cs and RevisedSuspend.2.cs
  3. open Process Browser
  4. Trigger saving from menu bar

Hi @dram, at this point it shouldn't matter whether you apply Kernel-jar.1447.followUp.cs and RevisedSuspend.2.cs because they have nothing to do with the issue. Could you please tell me whether the problem persists when you apply only the EventSensor-eventTickler.st patch? (with the recent Squeak image indeed) ?

Does it mean that the problem disappeared on Windows 10 and possibly some other scenarios? (I have no way to test Win11/Linux/Mac)

It seems, as Juan indicated, it may be a timing issue caused by the eventTickler implementation. I only guess the new terminate made the problem more visible in certain scenarios. It never occurred in Cuis/Squeak though and I may only hypothesize it could be because of the speed of the VM?? I noticed GraalVM is slower... Thanks!

isCzech commented 2 years ago

WRT #eventTickler in Cuis, it was last tweaked in December 2021 by Andrés Valloud. There is a possible weakness in Delay. If a Delay is waiting, but its process is terminated, and then the same delay is sent #wait without checking #beingWaitedOn, the system wil hang. See senders of #wait and #beingWaitedOn in Cuis. In any case, I don't know if this has any relation with the problems when saving Squeak in TruffleSqueak. I'm just answering Jaromir's question about #ventTicker in Cuis.

Thank you @jvuletich for your thoughts. I noticed you refactored EventSensor in December and Andres tweaked the eventTickler and I wondered whether you discovered some problem/bug you were fixing or just improving things proactively :)

Here is for sure an interplay between the eventTickler and termination and I guess it's exactly what you say but I didn't go as deep as to fully understand it yet :) Thanks again!

isCzech commented 2 years ago

Hi @dram,

I have a test of EventSensor-eventTickler patch, it seems that the stale process problem still exists.

Oops, apologies, I forgot to replace the line top runUntilReturnFrom: ctx you suggested while testing the eventTickler "fix" so no, the eventTickler "fix" isn't a fix :) I'll keep looking...

dram commented 2 years ago

Hi @isCzech,

After some digging, I think that the bug revealed by ProcessTest-testTerminateEnsureAsStackTop is not related to runUntilErrorOrReturnFrom:, as this test also failed in clean Squeak beta image running under Open Smalltalk VM, I'm using Squeak6.0beta-21829-64bit-202205110711-Windows-x64 to run this test.

And while experimenting, I find that test can be fixed by following change, not sure if it is reasonable:

image

dram commented 2 years ago

Based on ProcessTest-testTerminateEnsureAsStackTop, I constructed a simplified test, which reveals something interesting regarding to TruffleSqueak.

Printing following code in Workspace, in Squeak and Cuis running under OpenSmalltalk VM, both return reached here.

p := [[] ensure: []] newProcess.
[p suspendedContext isUnwindContext] whileFalse: [p step].
p terminate.
'reached here'

Printing previous code in Squeak6.0alpha-21736 under TruffleSqueak 22.1.0, a nil is returned, with following message in TruffleSqueak console:

[trufflesqueak] Running Squeak6.0alpha-21736-64bit.image on GraalVM CE (latency mode)...
[trufflesqueak] Image loaded in 1890ms.
[trufflesqueak] Unwind error: sender of CTX [] in [] in BlockClosure>>newProcess @756b58a7 is nil, unwinding towards CTX Compiler>>evaluateCue:ifFail: @2cc04358 with return value: nil

Printing previous code in Cuis6.0-5171 under TruffleSqueak 22.1.0, VM will be crashed, with following error:

[trufflesqueak] Running Cuis6.0-5171.image on GraalVM CE (latency mode)...
[trufflesqueak] Image loaded in 453ms.
[trufflesqueak] Unwind error: sender of CTX [] in Unknown behavior>>newProcess @76f856a8 is nil, unwinding towards CTX Unknown behavior>>evaluateMethod:to:logged:profiled: @7c853486 with return value: nil
ERROR: java.lang.ArrayIndexOutOfBoundsException: Index -178 out of bounds for length 89
org.graalvm.polyglot.PolyglotException: java.lang.ArrayIndexOutOfBoundsException: Index -178 out of bounds for length 89
        at de.hpi.swa.trufflesqueak.nodes.ExecuteBytecodeNode.fetchNextBytecodeNode(ExecuteBytecodeNode.java:164)
        at de.hpi.swa.trufflesqueak.nodes.ExecuteBytecodeNode.interpretBytecode(ExecuteBytecodeNode.java:101)
        at de.hpi.swa.trufflesqueak.nodes.ExecuteBytecodeNode.execute(ExecuteBytecodeNode.java:79)
        at de.hpi.swa.trufflesqueak.nodes.ResumeContextRootNode.execute(ResumeContextRootNode.java:44)
        at <smalltalk> CTX Unknown behavior>>evaluateMethod:to:logged:profiled: @7c853486(Unknown)
        at org.graalvm.sdk/org.graalvm.polyglot.Value.execute(Value.java:839)
        at de.hpi.swa.trufflesqueak.launcher.TruffleSqueakLauncher.execute(TruffleSqueakLauncher.java:131)
        at de.hpi.swa.trufflesqueak.launcher.TruffleSqueakLauncher.launch(TruffleSqueakLauncher.java:82)
        at org.graalvm.launcher.AbstractLanguageLauncher.launch(AbstractLanguageLauncher.java:296)
        at org.graalvm.launcher.AbstractLanguageLauncher.launch(AbstractLanguageLauncher.java:121)
        at org.graalvm.launcher.AbstractLanguageLauncher.runLauncher(AbstractLanguageLauncher.java:168)
isCzech commented 2 years ago

Hi @dram, the test ProcessTest-testTerminateEnsureAsStackTop requires the new #terminate method, however, the Kernel-jar.1447.followUp.cs version of #terminate is a simplified version requiring the latest OpenSmalltalk VM implementing the new #suspend primitive 578 - this setup should be the default in the new release. Unfortunately I gave you this simplified version before I realized GraalVM doesn't have the new primitive 578 - so the test will fail. That's the point - to show the current #terminate has this bug :) I don't think it has anything to do with the incorrect behavior we're observing here (I hope) and I may have an inkling where the problem is, just bear with me :) It's so darn complicated :D (You should be able to verify the test works if you filein Kernel-jar.1447.followUp.cs and RevisedSuspend.2.cs into the beta image running on the lastest OpenSmalltalk VM; it will FAIL with the TruffleSqueak VM though - that's expected).

If you wanted to play more, try the following version of #terminate that was supposed to work even with the older VM (no guarantee it will in you specific scenario though): Kernel-jar.1447-Terminate.1.cs.zip I hope all this mess will sort out with the new Squeak release. The good news is Cuis works with the latest #terminate and and TruffleSqueak, so the method and the test are ok. There must be an issue with Squeak's implementation (there's a difference in Squeak's Semaphore>>#critical: implementation which is my guess why Cuis works and I'll have to adjust the termination code in Squeak. Thanks for digging :)

dram commented 2 years ago

Hi @isCzech, according to the test case I mentioned in previous comment https://github.com/hpi-swa/trufflesqueak/issues/159#issuecomment-1138547121, Cuis's #terminate is also broken with TruffleSqueak, it will crash the VM.

isCzech commented 2 years ago

Hi @isCzech, according to the test case I mentioned in previous comment #159 (comment), Cuis's #terminate is also broken with TruffleSqueak, it will crash the VM.

True, this [trufflesqueak] Unwind error: ... in both cases is weird but I can't say what it means; I guess it's related to the TruffleSqueak VM which I know nothing about :)

isCzech commented 2 years ago

Hi @dram, I have some preliminary results: 1) I've fixed the test ProcessTest-testTerminateEnsureAsStackTop, you had a good hunch something was wrong - it was supposed to work with the change you suggested (top runUntilReturnFrom: ctx); check this: ProcessTest-testTerminateEnsureAsStackTop-fixed.zip

2) The change you suggested (top runUntilReturnFrom: ctx) should not change the semantics of termination (but I haven't tested it as thoroughly as the current version and there might be some border situations, e.g. when the unwind context is at the bottom of the stack); what it does is it just finishes executing the unwind context and if you look at the code, finishing it should be harmless; however - it is not very logical to finish what is not supposed to be finished (the process has been terminated and only unwind blocks are supposed to be executed). What's most interesting is it fixes your problem with TruffleSqueak :) BUT I still think there's something off with TruffleSqueak VM because the new terminate works on both Squeak and Cuis (actually it runs on Pharo as well!) so I'm inclined to believe it's TruffleSqueak VM's issue (I guess your results above also point in that direction if I understand correctly. And nice example "reached here", thanks!).

3) For info: Kernel-jar.1447.followUp.cs and RevisedSuspend.2.cs are supposed to work with the latest VM with #suspend primitive 578 (without it #isTerminated needs fixing) and Kernel-jar.1447-Terminate.1.cs (enclosed above) should work with the older VMs and with TruffleSqueak's (In both cases there may be some tests that need fixing, e.g. testAtomicSuspend)

4) I'd be very interested if you find out more - whether it's a VM problem and what's wrong; I traced the computation and can't find anything wrong. It's really surprising that the small change you suggested causing the unwind context finish executing fixes the issue for you! Maybe it has something to do with the fact that the top of the stack in this particular case is just a primitive call. But I have no experience with that :)

Thanks for letting me know. best, Jaromir

dram commented 2 years ago

Hi @isCzech,

I kind of think that the new #terminate may rely on some "tricky" feature in OpenSmalltak VM, which is not supported in TruffleSqueak. So there are two approaches to fix this problem:

  1. tweak #terminate to not rely on that feature
  2. add that feature to TruffleSqueak

Anyway, the first thing to do would be determine what that feature is. The error message and exception stack trace mentioned in https://github.com/hpi-swa/trufflesqueak/issues/159#issuecomment-1138547121 may give some hints.

isCzech commented 2 years ago

Hi @dram, why would you think it may be a "tricky" feature in OpenSmalltak VM and not in the TruffleSqueak VM? ;)

One of the differences between the old and new terminate is the old terminate uses the simulation machinery (the whole unwind is simulated using self popTo: suspendedContext bottomContext). The new terminate on the other hand replaced all simulation with a "regular", "no simulation, no exceptions" code. One of the advantages is you can more easily debug (debugging a simulation is tricky).

I've compared the termination step by step in both the failing case and the "fixed" case with top runUntilReturnFrom: ctx line replaced: The simulation (debugging) in both cases is identical and in both cases leads to correct termination of the tickler process. Debugging the same thing with the old version of terminate crashed the VM but that's expected with regard to the note above (debugging a simulation...).

Now, from that follows, I hope, that if the simulation (debugger) correctly terminates the event tickler process but the live run fails, the difference is most likely within the TruffleSqueak VM. So I'd say #terminate in TruffleSqueak relies on being run via the simulation machinery (#popTo & company). Once the TruffleSqueak VM trie to terminate using "regular" code, it fails, while at the same time when running the same "regular" code as a simulation (in debugger) it works correctly.

(It's a mystery why changing the one line "fixes" the issue for TruffleSqueak but as discussed in https://github.com/hpi-swa/trufflesqueak/issues/159#issuecomment-1138933169. I say it's not really a fix - it's rather a reminder something doesn't work as expected :) )

What's your thought?

Hi @fniephaus, does the exception stack trace mentioned in https://github.com/hpi-swa/trufflesqueak/issues/159#issuecomment-1138547121 tell anything to you? Would you agree with the conclusion above? Thanks!

fniephaus commented 2 years ago

Hi @dram and @isCzech, Thanks a lot for digging into this so much. Yes, I agree: it very much looks like there is at least one bug in TruffleSqueak somewhere. However, that "somewhere" could be relatively hard to find because scheduling, in general, is rather complex and has some interesting corner cases.

What would be very helpful are very short doIts that demonstrate problems in TruffleSqueak or a difference in OSVM vs TruffleSqueak behavior. Something like the doIt in https://github.com/hpi-swa/trufflesqueak/issues/159#issuecomment-1138547121 and some of the ProcessTests are already rather long. A lot of change sets have been posted in this thread and I almost lost track: how do I obtain a Squeak/Smalltalk image that has all the "right" methods and tests that can be expected to land in the next Squeak release?

isCzech commented 2 years ago

A lot of change sets have been posted in this thread and I almost lost track: how do I obtain a Squeak/Smalltalk image that has all the "right" methods and tests that can be expected to land in the next Squeak release?

The crucial point that makes the difference for TruffleSqueak image is the replacement @dram discovered in this comment. The suggested change will make TruffleSqueak work but I can't see any logical explanation why - IMO the answer lies within the TruffelSqueak VM. For OpenSmalltalk VM @dram's change has no effect and the original version (not working in TruffleSqueak) is more logical/consistent.

As for the expected release image, I propose including methods in Kernel-jar.1447.followUp.cs and RevisedSuspend.2.cs (attached previously) - but these two will require the latest VM with the 578 suspend primitive. The final decision should be made soon (two weeks timeframe?)

Just to make sure: the issue doesn't appear to be a timing or other issue with the event tickler.

isCzech commented 2 years ago

Just to make sure: the issue doesn't appear to be a timing or other issue with the event tickler.

Maybe it's too soon to rule this out... Cuis image with the new terminate seems to work fine with TruffleSqueak VM in terms of saving the image (and terminating the even tickler) but @dram's observation here shows there's an issue with Cuis as well.

fniephaus commented 2 years ago

I have a hunch what's going on: Process>>#terminate was changed and the actual termination logic was moved into an ensure block:

    [] ensure: [
        self suspend.
        context := suspendedContext ifNil: [^self].
        suspendedContext := 
            [context releaseCriticalSection; unwindTo: nil. self suspend] asContext.
        self priority: Processor activePriority + 1; resume]

It seems that the ensure block is never executed as part of the top-level unwinding logic: https://github.com/hpi-swa/trufflesqueak/blob/a068eca13ca09ac1de50bc4572dd53333fab8718/src/de.hpi.swa.trufflesqueak/src/de/hpi/swa/trufflesqueak/nodes/context/UnwindContextChainNode.java#L44-L57

That, at least, explains why all three processes of ProcessTest>>#testTerminateEnsureAsStackTop are never terminated. I'm trying to work out a fix, but this part of the VM is everything but trivial.

isCzech commented 2 years ago

Interesting! I wish you were right :) Cuis use the same approach; they just use a slightly different form: they use a method to get the receiver evaluated as an unwind block:

valueEnsured
    [] ensure: self

Applying it to the Squeak's #terminate at TruffleSqueak makes no difference though...

dram commented 2 years ago

Hi @isCzech, while rethinking about runUntilErrorOrReturnFrom:, I realized that if the whole point of using it instead of value is to deal with non-local return, why not do something like following:

unwindBlock hasMethodReturn ifTrue: [
    ... runUntilErrorOrReturnFrom: ...
] ifFalse: [
    unwindBlock value
]

After this change, image saving in TruffleSqueak works with no problem.

isCzech commented 2 years ago

Hi @dram, nice try ;) evaluate this though: [[] valueWithExit] hasMethodReturn Yes, the whole point of using #runUntilReturn is to deal with non-local returns. They may be nested however. But it's a great experiment, thanks!

dram commented 2 years ago

Hi @isCzech, thanks for pointing to valueWithExit, really a complex story. :)

Anyway, following is another try:

  1. filed in Kernel-jar.1447.followUp.cs and RevisedSuspend.2.cs into Squeak6.0beta-21829-64bit.image, and running under Open Smalltalk VM
  2. do following change in Context>>#unwindTo:safely:
        [ctx isNil] whileFalse: [
            (ctx tempAt: 2) ifNil: [
                ctx tempAt: 2 put: true.
    -           top := (ctx tempAt: 1) asContextWithSender: ctx.  "see the note below"
    -           top runUntilReturnFrom: top].
    +           [(ctx tempAt: 1) value] on: BlockCannotReturn do: []].
            ctx := ctx findNextUnwindContextUpTo: aContext]
  3. Run all tests in KernelTests-Processes group

All tests passed, so I'm a bit curious that why not use [... value] on: BlockCannotReturn do: [] instead of runUntilReturnFrom:? In my opinion semantics of these two are similar, and the former is more simple and straightforward.

isCzech commented 2 years ago

Hi @dram, this one's even better but nope, try this: [ [[Processor activeProcess terminate] ensure: [1/0]] on: ZeroDivide do: [] ] fork You have to run unwind blocks on the original stack to make sure you won't miss exception handlers. But you've discovered a missing test situation - thanks!

fniephaus commented 2 years ago

I just remembered that lots of Process-related tests from Cuis fail or don't even terminate on TruffleSqueak: https://github.com/hpi-swa/trufflesqueak/blob/b34dc080fc7b81b559faffe6e5df0798ece97606/src/de.hpi.swa.trufflesqueak.test/src/de/hpi/swa/trufflesqueak/test/runCuisTests.st#L12-L28

So I guess there are plenty of tests that could potentially help find the bug(s) in TruffleSqueak.

isCzech commented 2 years ago

These are the tests added to complement the new (fixed/extended) termination logic introduced last year. They should work with newer Squeak or Cuis images...