part-cw / lambdanative

LambdaNative is a cross-platform development environment written in Scheme, supporting Android, iOS, BlackBerry 10, OS X, Linux, Windows, OpenBSD, NetBSD, FreeBSD and OpenWrt.
http://www.lambdanative.org
Other
1.39k stars 86 forks source link

EVENTLOOP: ensure exclusive Gambit thread exists. #418

Open 0-8-15 opened 3 years ago

0-8-15 commented 3 years ago

At 2020-08-07 Marc Feeley wrote that the situation how lambdanative uses Gambit (at least on Android) is not supported by Gambit, he wrote:

Any number of Scheme threads can call C functions, but only one Scheme thread at a time can do a Scheme to C call that calls back to Scheme. This is because the C stack is shared by all the Scheme to C calls... if the Scheme threads T1 and T2 call C (in the order T1 then T2) and if T1 returns to Scheme while T2's call is still in progress, then T2's C stack frame will be removed when T1's C stack frame is removed (because T2's frame was added after T1's frame).

Under Android both the UI thread and the OpenGL thread call into Gambit; all bets are off. Race conditions are not avoidable. This seems to explain random crashes I observed eventually to the point that I could reproduce them by tapping the screen fast enough.

The only reasonable escape I could conceive was to use Gambit as Marc intented: one single thread running contineously.

The upshot is: for Android and X11 based systems running Gambit from a dedicated thread avoids jitter and event loop related delays for Gambit threads. In practice this work much smoother.

Status

This change will certainly break some applications:

  1. At least those, where race conditions are exacerbated to sure conflicts.
  2. A batch of untested modifications required for compatible with this patch is half prepared already. Notably the 'serial' module is unlikely to work out of the box.
  3. I assume that those other X11 based system should mostly work. Non tested, non available here.

This change avoids many C-to-Scheme calls, which where not properly protected and runs the event loop in a pthread of its own for X11 and Android. Under win32 it still calls the previos way, but used EVENT_IDLE to better wait with the Gambit thread system than block threads in Windows.

For Android the change includes a new NativeGLSurfaceView and supporting GLState class. These work so far more reliable than the version before. But the are premature. Frequently the android app will come up first time after installation but not again. Forced stop of the app and restart helps. Further restarts magically work.

Tested are so far on Android, Linux, Windows.

mgorges commented 3 years ago

'EVENTLOOP: ensure exclusive Gambit thread exists.' is where the actual changes are, correct - everything else are duplicates of things in #417 ? Breaking serial will be a big problem for us as we rely on this. Also, I am personally concerned about breaking clinical data collection systems which use Linux (without GUI) and which will cause irrecoverable data loss if things go bad.

0-8-15 commented 3 years ago

Am Thu, 29 Apr 2021 12:18:19 -0700 schrieb Matthias Görges @.***>:

'EVENTLOOP: ensure exclusive Gambit thread exists.' is where the actual changes are, correct - everything else are duplicates of things in #417 ?

Yes, so to speak. I've been bit surprised to see how githug presents the difference. Or rather lack of presentation of a difference.

This eventloop change is a major change. I'd suggest that there should be a development branch, where this can be tried out until deemed stable.

This eventloop change was developed at top of those minor things in

417. And collected as a branch off from #417. Githib presents this

in one piece.

Breaking serial will be a big problem for us as we rely on this.

I know. That's why I specifically pointed out that there is a risk. I lack a test case, you have it. I gathered the knowledge how to deal with the situation. There should be a way to get it to work again.

The big elephant in the room IMHO is, that there are apps out there, possibly serving critical services, which are all subject to race conditions. I.e., bound to fail.

Also, I am personally concerned about breaking clinical data collection systems which use Linux (without GUI) and which will cause irrecoverable data loss if things go bad.

Do you have a rough theory why, where, how those would be impacted?

For GUI apps things become better, even in the Linux case, because we know now there is only one thread in Gambit and the Gambit threading is not hampered by waits outside of Gambits control.

For non-GUI apps nothing at all is supposed to change. (If it is, that's a bug to be handled.)

mgorges commented 3 years ago

Can't say that I even understand 20% of this new code but I have a few questions/comments:

  1. You disable glgui-timings-at-10msec!, which we use in a lot of apps, including all apps (including non-gui) that use the modules/scheduler/scheduler.scm module and need proper timing? The minimum as per https://github.com/part-cw/lambdanative/pull/418/files#diff-b3b574bd3e4985858199598c697225c69645bbe5420cc7a4402f8e62f5625b04R268 seems to be 0.05 sec sleeps now?
  2. As for your question where glCoreTextureReset is used - it is in glCoreInit
  3. You disable the EVENT_IDLE in event_timer_callback on win32, why? - I think win32 has this annoying behaviour if a GUI app is active but not in focus the main loop will stop running after ~30 seconds (or maybe a minute, can't recall), and when you return to it it throws lots of events and tries to catch up which messes up the output plugins in the scheduler, but this mitigated as introduced in fdc3626e ?
  4. The wrong event code (127 instead of 128) in ANDROID_JAVA_ONDESTROY in https://github.com/part-cw/lambdanative/pull/418/files#diff-2098455a1eec986f28edb431eaec9868112f3f3dc606d46435c07066363a5e11R287 might explain some strange app closing behaviour on Android. Nice catch!
  5. We still need a way to ensure all the other Android things like camera, serial, audio, etc work and I am not sure how to do that short of building this branch of LN and waiting for things to break?
0-8-15 commented 3 years ago

Am Wed, 26 May 2021 20:43:33 -0700 schrieb Matthias Görges @.***>:

Can't say that I even understand 20% of this new code but I have a few questions/comments:

  1. You disable glgui-timings-at-10msec!, which we use in a lot of apps, including all apps (including non-gui) that use the modules/scheduler/scheduler.scm module and need proper timing? The minimum as per https://github.com/part-cw/lambdanative/pull/418/files#diff-b3b574bd3e4985858199598c697225c69645bbe5420cc7a4402f8e62f5625b04R268 seems to be 0.05 sec sleeps now?

OK, so we need to bring that back.

There is no reason to have a 0.05 minimum.

At the other hand: I don't understand what this scheduler is good for and supposed to do.

Personally I have a hard time to envision any need for a 100Hz poll loop. Even for visual feedback it is too fast, for audio it is too slow (let alone that there is usually specialized sound hardware these days).

Perhaps there is a better solution to the problem solved by that scheduler, but as I said: we probably better simply bring it back.

  1. As for your question where glCoreTextureReset is used - it is in glCoreInit

Thanks. So if that's the only case, great!

  1. You disable the EVENT_IDLE in event_timer_callback on win32, why? - I think win32 has this annoying behaviour if a GUI app is active but not in focus the main loop will stop running after ~30 seconds (or maybe a minute, can't recall), and when you return to it it throws lots of events and tries to catch up which messes up the output plugins in the scheduler, but this mitigated as introduced in fdc3626e ?

Because it looked to me as a useless piece of code which I did not understand. So I wanted to see what breaks when I disable it. Now I copy&pasted you comment above into event_timer_callback in order to avoid loss of that knowledge.

  1. We still need a way to ensure all the other Android things like camera, serial, audio, etc work and I am not sure how to do that short of building this branch of LN and waiting for things to break?

Unfortunately I don't have a better suggestion to find out.

For serial I'm pretty sure it will break.

Let me repeat my offer: I'm ready to help fixing the breakage as soon as I can either understand or repeat it.

For serial, this means I need to understand how I could set up a test case or learn what is is supposed to do. (That is: there might be an alternative, which could perhaps be even better: Android is old Linux under the hood. Maybe it's possible to use native device access? I dunno.)