Reworked fix for #2375 - Githubissues

haikusw commented 1 year ago

Last fix was not a fix. Broke multiple copy running test (on macOS at least), despite all tests passing.

haikusw commented 1 year ago

Ok. Reworked this fix and added at least partial coverage for clearing the shared memory on exceptions. Updated concurrent version tests, and added new test for exception handler clearing of shared memory on exception before exiting.

Basic idea was to add a SharedMemoryMutex.release() method that can be called before app exit to clean up the shared memory attachment done by this instance. Then call that method before calling exit and for exception exits (crash).

Implementation in app.py:

add the release() method to SharedMemoryMutex.
update the excepthook implementation to call _shared_memory.release() before calling sys.exit.
remove the incorrect _shared_memory.detach() call in __exit__.
update the end of run(): to call app_exec_(), save the exit_status, call _shared_memory.release(), and then call sys.exit with the saved exit_status.

Then I had to modify the explicit sys.exit(0) in logic.py on line 1431 so that it wouldn't short circuit exit the app and cause app_exec_() to never return and thus our exit clean up of the shared memory not get called.

Changing logic.py line 1431 from sys.exit(0) to QtCore.QCoreApplication.exit(0) results in app_exec_() returning and the cleanup code I added in app.py run(): to call the new _shared_memory.release() method and this appears to fix the issue for real for the clean no-error exit path.

I did have to rework test_logic.py's test_quit_calls_sys_exit(mocked_session) to expect the right thing instead of sys.exit explicitly.

Then I modified the existing tests in app.py:

test_only_running_once() to clean up after itself by calling _shared_memory.release() explicitly since it is a focused test that doesn't run the app event loop.
similarly test_running_twice() had to be modified to also have the first run process clean up after itself when it exits.

Finally, I added a new test to test_app.py test the exception shared memory cleanup handling, test_running_twice_after_exception().

Caveats - there may be other places where sys.exit is being called that short circuits the app_exec() event loop and prevents the shared memory cleanup code from getting called. That will be no worse than the master was before I tried to fix this the first time, but it would be good to track those down and convert them to QtCore.QCoreApplicaton.exit or QApplication.exit, as appropriate for the scope they are in.

Sorry this got a bit more complicated, but I believe it's a more complete fix now since it adds at least some of the exception path cleanup whereas previously it did not.

haikusw commented 1 year ago

Looks like even though make check works here on macOS 10.13 with python 3.8.16, there are other configurations where test_subprocess_run_invalid_utf8() is now failing.

I'll have to look at those a bit later since it's now 5am here.

haikusw commented 1 year ago

@tjguk As noted above and in #2376 - my last attempt (merged to master) wasn't a good fix - it just disabled the prevent running twice code, effectively (and yet the test_running_twice() test still passed?!).

This seems closer to what we want. What do you think?

tjguk commented 1 year ago

Thanks for putting the work in here @haikusw -- I'm about to go away for a week so I might not get to comment substantively on your PR before I go. But I will definitely look at it when I get back. And feel to ping me for an update if I don't!

haikusw commented 1 year ago

@tjguk Thanks for letting me know. Hoping to work on this a bit more anyway because it bothers me that the test_running_twice() test didn't catch my error.

I hope you have safe and smooth travels and a good week away. Cheers.

haikusw commented 1 year ago

The test_running_twice() test did not seem to be doing what it appeared to be trying to do. It also had timing issues in that the subprocesses could run out of order which would make the test fail sometimes (on this older macOS machine at least). Rewrote it so that it's reliable on this mac, but looking to see what CI and others see elsewhere.

This seems about as good as I can make this without more input. I do not think this approach is particularly good because any code that is added that calls sys.exit is going to circumvent the clean up of the shared memory block and require the user to restart their computer (for macOS at least) to be able to run Mu again. This feels brittle.

If the current approach is retained,

I think it would be good to let the user know they have to restart to run Mu again in the case where it crashes and excepthook isn't called to clean up and so Mu finds the shared memory but there is no process with that process id running.
Also, as noted in a commit comment, setup_logging() should probably be changed to always install excepthook instead of only if MU_LOG_TO_STDOUT env var isn't set.

Anyway, we'll see what CI says about my latest changes and, when @tjguk gets back from traveling, perhaps we can discuss this more.

haikusw commented 1 year ago

Weird that this piece of code I didn't change is now causing a test failure…

out: self._shared_memory.data()[:8] = struct.pack("q", os.getpid())

… and only on 2 of 3 PiOS test runners? 🤔

haikusw commented 1 year ago

(yep).

If there was a way to wait until a Popen-ed process had completed launching then we could do that instead of the sleep calls, but I didn't see a way. Can wait until it exits, and do later in the test, but that's not what we want for the middle where we wait to launch cmd2.

haikusw commented 1 year ago

The discussion in #276 seem relevant. In particular, QLockFile seems like it might be a better choice than QSharedMemory since QLockFile will, apparently, auto-cleanup the lock file if it's stale (if you use it correctly).

https://stuff.mit.edu/afs/athena/software/texmaker_v5.0.2/qt57/doc/qtcore/qlockfile.html#details

ntoll commented 1 year ago

@haikusw hey hey - thank you so much for the effort put into this. I'm hoping to make a release of Mu soon (ish) and would love to get this PR into the release. How close is this to getting across the line? Thanks one again.

haikusw commented 1 year ago

Hey @ntoll.

This needs review by someone more experienced with Python and pyQt/Qt. I am a novice when it comes to these technologies and this is not trivial stuff. My lack of knowledge with these technologies and the number of people potentially impacted seems like it suggests a careful review by those more experienced than I. That said, I think it’s good (with the caveats / additional work to consider I noted above), but…. I'm a strong believer in PR reviews.
I noted above some of the limitations in what I’ve done so far and of the general approach that was present (shared memory) which this PR tries to improve. (brittle: any misses of exception hook - such as any added later by developers who don't know not to use it, or missed existing calls to sys.exit - will require a restart by users before they can run Mu again but they have no way to know that; this approach is complex to test), a few more things that could be done to improve it for sure but I asked the question about why you didn’t go with the QLockFile approach mentioned in #276.

I’m not sure it’s worth making the current approach as good as it could be if, ultimately, QLockFile is a better approach. That's why I'm looking for feedback.

If you did want to stick with this approach then it would be good to know that basic some manual testing of this branch on other platforms (Windows/Linux) indicates it seems to work for everyone. Tests all pass and it generally seems to work on macOS but… tests currently pass for main also and that doesn’t actually work (this PR fixes the false positive signaling test).

I'm happy to discuss, answer questions about my changes or additional suggested work, etc, here or elsewhere. Could potentially do some of the additional improvements after a bit of conversation to validate my thinking once I know there's a reason why this won't just all get thrown out due to a move to QLockFile.

haikusw commented 1 year ago

So @carlosperate outlined some important limitations with QFileLock in #276 and that clarifies this alternate approach being worth pursuing further to completion (thank you).

Work I think should be done on this PR to complete it:

Seems Required

[ ] Rework global exception handler hook (excepthook)
- [x] Move installation of it out of setup_logging() since it'll be used for more than logging.
- [x] Exception handler hook needs to always be installed and not be contingent upon the MU_LOG_TO_STDOUT environment variable.
- [x] excepthook needs some rework to add the exclusion of exception logging if the MU_LOG_TO_STDOUT env var is set. Is this accurate??? I'm unclear why the code was written as it was

Would be good

Strongly recommend this, but isn't required I guess.

[ ] Add an Alert when check_only_running_once() says a launched instance has to quit that alerts the user to there being another instance running and that if they are not finding one then they should restart their computer. (yuck, I know, but better than a confused user thinking the app is just broken and crashing on launch).

This seems potentially complex because it sort of "soft runs" the app just to show this alert and then exits but should avoid all the other initialization work that gets started (where?) and that might need to be prevented from running (for example, settings.init() is called immediately after check_only_running_once() is called...).

This complexity of the app startup sequence internals, plus my zero experience with PyQt or Qt and very limited python experience, makes me hesitate to volunteer to implement this particular piece.

Phase two improvement

Later improvement to avoid the user having to restart their computer in the case of a crashed app that didn't clean up the shared memory buffer. This is a proposed mechanism and should be discussed for anything I'm missing/haven't thought of etc:

[ ] Add a similar "auto-recovery" mode to the shared memory multiple-instance exclusion mechanism that the QFileLock has:
- [ ] Add a periodic "I'm still here" timestamp field to the shared memory buffer.
- [ ] Have the registered PID process update that at some TBD interval with the current time.
- [ ] Enhance check_only_running_once() to check this timestamp and to take over the shared memory buffer if too much time had passed. update that only the registered process id would update and the ability for another process to "take over" the shared memory buffer by overwriting the PID field with its PID and to start updating the timestamp and then to allow the app to run.
I don't think this time-stamp and auto-recovery mode addition is necessary to ship the basic feature. That's why the "phase two" of implementing this, but it would be a significant usability improvement in my opinion.

My biggest concerns with the shared memory multiple-instance mechanism as implemented are:

a) The user will not understand why the app is "crashing" on launch (from their perspective) and won't know that to do.

b) This approach feels brittle, as noted above. Any current sys.exit calls or ones that are added in the future are going to skip the global exception hook and thus not clean up the shared buffer and put the user in the stuck situation of a "crashing on launch" Mu app.

If those are acceptable, perhaps by adding the alert telling the user what to do, and if there are ways to reduce the risk of calls to sys.exit being added later (hook it? that seems iffy, but I'm too much a novice on python to know), then I can work on the additions in the "Required** section above.

(Hope that's clear. Medication is making my brain a little foggy today, apologies).

haikusw commented 1 year ago

Do either of you know why the exception handler is not installed by setup_logging() if MU_LOG_TO_STDOUT is set in the environment?

I have changes made (as updated via checkboxes above), but I'm unsure about this detail. TIA.

haikusw commented 1 year ago

(test failure seems to be the timing of the running twice tests on the slow PiOS testers. I can't kick the workflow to get it to try again to see if it was just bad luck or if I need to increase those tests times more).

carlosperate commented 1 year ago

Do either of you know why the exception handler is not installed by setup_logging() if MU_LOG_TO_STDOUT is set in the environment?

I have changes made (as updated via checkboxes above), but I'm unsure about this detail. TIA.

I'm not 100% sure, but because the goal of setting up the MU_LOG_TO_STDOUT env var is to get all the data on the terminal, it also makes sense to get the exception printed in the terminal instead of captured and opened in a browser.

(test failure seems to be the timing of the running twice tests on the slow PiOS testers. I can't kick the workflow to get it to try again to see if it was just bad luck or if I need to increase those tests times more).

Okay, I've restarted the two failed PiOS jobs.

Thanks again for your work on this feature!

haikusw commented 1 year ago

Thanks @carlosperate

I'm not 100% sure, but because the goal of setting up the MU_LOG_TO_STDOUT env var is to get all the data on the terminal, it also makes sense to get the exception printed in the terminal instead of captured and opened in a browser.

Reading the logging documentation, it seems like log handlers added with addHandler are additive (?). If so, as long as setting MU_LOG_TO_STDOUT results in the standard out log handler getting added via addHandler then having the exception logging to all installed handlers should result in that information still showing up on standard out?

The odd thing is that StreamHandler() as setup in setup_logging() doesn't specify stdout as the arguement, so it's going to go to stderr, if the documentation is to be believed (?). To get it to go to stdout one would have to do something like:

stdout_handler = logging.StreamHandler(stream=sys.stdout)

haikusw commented 1 year ago

Ok - added two things:

Now using a dynamic app name for the registration of the multi-instance execution prevention shared memory buffer during automated test running. This means test failures don't forcing the developer to reboot their computer to clear out the one and only shared memory buffer. This is enabled by setting the environment variable MU_TEST_SUPPORT_RANDOM_APP_NAME_EXT and is set for all automated tests via a test fixture in conftest.py
Add an environment variable MU_TEST_SUPPORT_SLOW_RUNNER_FLAG to signal a need to increase process launch waiting times for the multi-instance prevention automated test test_running_twice() in test_app.py because PiOS CI instances take a really long time to launch processes. Simply increasing the timeouts across all test execution targets slowed automated testing everywhere despite only being necessary for PiOS targets. Environment flag is set in GitHub action file test.yml for the test-pios: target/runs only, though it could be set on the command line if someone had a slow machine they were testing on. NOTE: if this test fails on non PiOS machines because I dialed the "normal" test case wait times down too far, they can be increased in test_running_twice() as documented here in app_test.py.

The only outstanding "before merge" issue I'm aware of is the conversation around the excepthook always installed change as discussed above and the follow-on comments.

The other thing I'd like to do is document this multi-instance prevention strategy & implementation somewhere (and these new environment variables I've added), but I haven't had a chance to go figure out where that belongs (pointers gladly accepted!).

haikusw commented 1 year ago

Thanks for putting the work in here @haikusw -- I'm about to go away for a week so I might not get to comment substantively on your PR before I go. But I will definitely look at it when I get back. And feel to ping me for an update if I don't!

hey @tjguk you asked to be pinged if I didn't hear back from you, so here's your ping :)

I need some clarity/feedback on the excepthook always install portion of this (any anything else), as noted in my previous comment, but otherwise I'm hoping this is ready, or close to ready.

I need to wrap my effort on this up - I was just trying to fix the monitor issue I ran into and reported (#2315) and this "side-issue" of fixing up the multi-instance execution prevention code/strategy that blocked me from working on that has turned into a much larger task than the one I came here to do :) I learned a lot though so 👍.

I can document the "future work" noted here as new issues, if you folks would like. And I'm willing to put some time into documenting all this in the technical documentation, but need to figure out where that should go. And, obviously, if this PR needs more work to land based on you folks' feedback, then I'm willing to work on that as well.

Let me know. Thanks. /cc @ntoll

tjguk commented 1 year ago

Thanks for pinging @haikusw -- I'm afraid I managed to catch pneumonia while I was away and ended up in hospital for a few days. So not only have I been playing catch-up ever since: I'm also running a little bit under 100% still. I shall definitely take a look at what you've done; and thank you so much for persevering in this.

haikusw commented 1 year ago

Thanks for pinging @haikusw -- I'm afraid I managed to catch pneumonia while I was away and ended up in hospital for a few days. So not only have I been playing catch-up ever since: I'm also running a little bit under 100% still. I shall definitely take a look at what you've done; and thank you so much for persevering in this.

Yikes! No worries at all. Getting fully well is priority one. Pneumonia can take a while to fully kick. Best of good luck to you on a speedy recovery!

haikusw commented 1 year ago

Looking at this code more, the lack of result checking in the original SharedMemoryMutex acquire function is concerning. Additionally, don't we have to lock the shared memory after attach fails and we call create on line 304 (and don't we have to check for that result before writing to data()?). The call to lock on __enter__ isn't going to work because it hasn't been created yet, seems like.

I started looking at this after reading this in the log for the failed test (which likely just failed due to not enough time, but I haven't confirmed that):

1271

       assert result1 == 0
out: E       assert 1 == 0
out: tests/test_app.py:508: AssertionError
out: ----------------------------- Captured stdout call -----------------------------
out: processOneLaunchTime: 2
out: processTwoLaunchAndExitTime: 4
out: process 2 id: 1065
out: process 1 id: 1064
out: ----------------------------- Captured stderr call -----------------------------
out: /usr/lib/python3/dist-packages/requests/__init__.py:91: RequestsDependencyWarning: urllib3 (1.26.14) or chardet (3.0.4) doesn't match a supported version!
out:   RequestsDependencyWarning)
out: ERROR:root:Unrecoverable error
out: Traceback (most recent call last):
out:   File "<string>", line 1, in <module>
out:   File "/home/pi/mu/mu/app.py", line 330, in check_only_running_once
out:     _shared_memory.acquire()
out:   File "/home/pi/mu/mu/app.py", line 305, in acquire
out:     self._shared_memory.data()[:8] = struct.pack("q", os.getpid())
out: TypeError: 'NoneType' object does not support item assignment
out: /usr/lib/python3/dist-packages/requests/__init__.py:91: RequestsDependencyWarning: urllib3 (1.26.14) or chardet (3.0.4) doesn't match a supported version!
out:   RequestsDependencyWarning)
out: Traceback (most recent call last):
out:   File "<string>", line 1, in <module>
out:   File "/home/pi/mu/mu/app.py", line 330, in check_only_running_once
out:     _shared_memory.acquire()
out:   File "/home/pi/mu/mu/app.py", line 305, in acquire
out:     self._shared_memory.data()[:8] = struct.pack("q", os.getpid())
out: TypeError: 'NoneType' object does not support item assignment
out: =============================== warnings summary ===============================

out: TypeError: 'NoneType' object does not support item assignment

got me started digging. Maybe just my ignorance.

Ok. I have to stop for now. Maybe one of you more experienced folks can review the SharedMemoryMutex code that started all this and see what you think. I think I've moved things forward, but not sure we're there on this functionality yet.

haikusw commented 1 year ago

The PiOS tests aren't passing 100% reliably on CI on my branch when I redo them manually. Going to have to add some more logging to be able to understand what's going on more. Tests sensitive to timing are such a drag. May need to rethink how this test is written entirely. Figure out a way to know 100% that the first process has finished launching and has the shared memory buffer allocated before running the second process. Or maybe the second process runs but spins testing until the first one has created the memory buffer (as a way to know it's running) and then does it's test... Or maybe the second process can launch (detached somehow?) the second process? Something to ponder a bit. May have to put together a small PyQt stand-alone test app to make it easier to work on efficiently. Not today though.

haikusw commented 1 year ago

I was working on removing the timing issues with the tests and ran into a question:

Are you sure that Popen vs subprocess.run opens the processes in separate process trees? They seem to have the same process group id when I execute a test program on macOS 10.12.x is why I ask.

Also, are we sure this is an actual requirement for the QSharedMemory exclusion to work?

import os
import subprocess
import sys
import time

def test_subProcess():
    # matches existing Mu test.
    # Does *not* actually run these two processes in different group ids, 
    # at least not on macOS

    processOneLaunchTime = 0.5
    processTwoLaunchAndExitTime = 1

    cmd1 = "".join(
        (
            "-c",
            "import time;",
            "import os;",
            "pid = os.getpid();",
            "print('process 1 id: {}'.format(pid));",
            "print('process 1 ppid: {}'.format(os.getppid())) if hasattr(os, 'getppid') else '';", 
            "print('process 1 pgid: {}'.format(os.getpgid(pid))) if hasattr(os, 'getpgid') else '';", 
            "time.sleep({0});".format(processTwoLaunchAndExitTime),
            "print('process 1 returning')"
        )
    )
    cmd2 = "".join(
        (
            "-c",
            "import os;",
            "pid = os.getpid();",
            "print('process 2 id: {}'.format(pid));",
            "print('process 2 ppid: {}'.format(os.getppid())) if hasattr(os, 'getppid') else '';", 
            "print('process 2 pgid: {}'.format(os.getpgid(pid))) if hasattr(os, 'getpgid') else '';"
            "print('process 2 returning');"
            "exit(2)"   # simulate blocked second copy of app launch result
        )
    )

    child1 = subprocess.Popen([sys.executable, cmd1])
    # let child 1 fully launch first (required)
    time.sleep(processOneLaunchTime)
    child2 = subprocess.run([sys.executable, cmd2])

    # wait for process 1 to exit _after_ launching process 2
    # and wait until it's done to access return code
    result1 = child1.wait()
    result2 = child2.returncode

    assert result1 == 0
    assert result2 == 2

gives output of:

test_first.py::test_subProcess process 1 id: 24158 process 1 ppid: 24143 process 1 pgid: 24143 process 2 id: 24161 process 2 ppid: 24143 process 2 pgid: 24143 process 2 returning process 1 returning

haikusw commented 1 year ago

I'm closing this in favor of #2415 (though I need someone to try that on Windows to make sure it works there). We can always re-open it if #2415 doesn't work out.

mu-editor / mu

Reworked fix for #2375 #2386