qutebrowser / qutebrowser

A keyboard-driven, vim-like browser based on Python and Qt.
https://www.qutebrowser.org/
GNU General Public License v3.0
9.45k stars 1.01k forks source link

Qt 6.7: Weird IPC timeouts on Windows #8191

Closed The-Compiler closed 1 month ago

The-Compiler commented 1 month ago

Regarding the failure on windows

Had another crack at this today but didn't get anywhere:

Originally posted by @toofar in https://github.com/qutebrowser/qutebrowser/issues/8175#issuecomment-2094046510

The-Compiler commented 1 month ago

I can somewhat reliably reproduce via running qutebrowser and this script:

import time
import json
import getpass
import random
from PyQt6.QtNetwork import QLocalSocket

socket = QLocalSocket()

while True:
    name = f"qutebrowser-{getpass.getuser()}"
    socket.connectToServer(name)
    ok = socket.waitForConnected(1000)
    assert ok

    cmd = random.choice([":version", ":tab-close"])
    json_data = {"args": [cmd], "target_arg": None, "version": "3.1.0", "protocol_version": 1}
    line = json.dumps(json_data) + '\n'
    data = line.encode("utf-8")
    socket.writeData(data)
    ok = socket.waitForBytesWritten(1000)
    if not ok:
        print("!", end="", flush=True)
        continue
    assert socket.error() == QLocalSocket.LocalSocketError.UnknownSocketError
    socket.disconnectFromServer()
    if socket.state() != QLocalSocket.LocalSocketState.UnconnectedState:
        ok = socket.waitForDisconnected(1000)
        assert ok
    print(".", end="", flush=True)
    time.sleep(random.uniform(0.01, 0.2))

It seems to be easier to reproduce when I do some clicking in the UI, e.g. switching tabs.

The logging I added in db3d15d9dac0f8d3ea6a86540c1b970db80badcb might still be a red herring, because the ID is fetched when the timeout function runs, and not when the timeout is started...

The-Compiler commented 1 month ago

My best bet at getting a Windows C++ stack trace by inserting a breakpoint() and letting DebugDiag do its thing:

readline/Python stuff:

``` ntdll!NtDeviceIoControlFile+14 KERNELBASE!ConsoleCallServerGeneric+e9 KERNELBASE!ReadConsoleInternal+18d KERNELBASE!ReadConsoleW+1a python310!PyOS_Readline+542 python310!PyOS_Readline+38d python310!PyOS_Readline+132 python310!Py_gitversion+63ba2 python310!Py_SetStandardStreamEncoding+807 python310!PyEval_EvalFrameDefault+1aa2 python310!PyEval_EvalFrameDefault+f65 python310!PyEval_EvalFrameDefault+f65 python310!PyEval_EvalFrameDefault+f65 python310!PyEval_EvalFrameDefault+f65 python310!PyEval_EvalFrameDefault+f65 python310!PyFunction_Vectorcall+87 python310!PyObject_GC_Malloc+1133 python310!PySys_GetSizeOf+141 python310!PyUnicode_ToUppercase+3ef python310!PyUnicode_ToUppercase+20c python310!PyUnicode_ToUppercase+186 python310!PyUnicode_ToUppercase+325 python310!PyEval_EvalFrameDefault+9338 python310!PyFunction_Vectorcall+87 python310!PyObject_GC_Malloc+1133 python310!Py_HandleSystemExit+80c python310!Py_HandleSystemExit+58f python310!PyVectorcall_Call+5c python310!PyObject_Call+4f QtCore!PyInit_QtCore+eb2f5 QtCore!PyInit_QtCore+eb73a QtCore!PyInit_QtCore+eb7f4 QtCore!PyInit_QtCore+ed7ab Qt6Core!QMetaObject::metacall+41 Qt6Core!QObject::qt_static_metacall+1842 Qt6Core!QMetaObject::activate+84 ```

The interesting part:

Qt6Core!QTimer::timerEvent+cf     
QtCore+66e56      
Qt6Core!QObject::event+64     
QtCore+590f3      
Qt6Widgets!QApplicationPrivate::notify_helper+10e     
Qt6Widgets!QApplication::notify+18c1      
QtWidgets!PyInit_QtWidgets+1c3dcf     
Qt6Core!QCoreApplication::notifyInternal2+11c     
Qt6Core!QEventDispatcherWin32Private::sendTimerEvent+1a6      
Qt6Core!QEventDispatcherWin32::event+1fc      
Qt6Widgets!QApplicationPrivate::notify_helper+10e     
Qt6Widgets!QApplication::notify+18c1      
QtWidgets!PyInit_QtWidgets+1c3dcf     
Qt6Core!QCoreApplication::notifyInternal2+11c     
Qt6Core!QCoreApplicationPrivate::sendPostedEvents+209     
Qt6Gui!QWindowsGuiEventDispatcher::sendPostedEvents+f     
Qt6Core!QEventDispatcherWin32::processEvents+90   
Qt6Gui!QWindowsGuiEventDispatcher::processEvents+19   
Qt6Core!QEventLoop::exec+19f      
Qt6Core!QCoreApplication::exec+15d    

Init stuff:

``` QtWidgets!PyInit_QtWidgets+7c3e5 python310!PyObject_IsTrue+101 python310!PyEval_EvalFrameDefault+3ffe python310!PyFunction_Vectorcall+87 python310!PyEval_EvalFrameDefault+707 python310!PyFunction_Vectorcall+87 python310!PyEval_EvalFrameDefault+1b88 python310!PyFunction_Vectorcall+87 python310!PyEval_EvalFrameDefault+1b88 python310!PyObject_FastCallDictTstate+15d python310!PyEval_EvalCode+82 python310!PyNumber_Negative+45f python310!PyNumber_Negative+35f python310!PyEval_EvalFrameDefault+1aa2 python310!PyFunction_Vectorcall+87 python310!PyEval_EvalFrameDefault+707 python310!PyFunction_Vectorcall+87 python310!PyVectorcall_Call+5c python310!PyObject_Call+4f python310!PyArg_Parse+1ba python310!Py_RunMain+13d python310!Py_RunMain+15 python310!Py_Main+25 python+1230 kernel32!BaseThreadInitThunk+10 ntdll!RtlUserThreadStart+2b ```
The-Compiler commented 1 month ago

I tried to improve the logging in f9ae555c841748ee2586a33fd07e985e3ef08343 and 245d117cd96d7e0b1f79ec41c55fb1c508350468... and everything points to Windows/Qt/PyQt just immediately calling the timeout function after starting the timer.

Why? How?? Why (hopefully!) only this timer? I have no clue! I was unable to find anything that sounded remotely related. No bugreports, no changes that look remotely relevant, nothing.

I think this is user-facing: It happens before the IPC connection ever gets read, and then we disconnect, so we'd (even if rarely) sometimes miss IPC commands.

The best I can come up with is:

@@ -168,6 +168,7 @@ def __init__(self, socketname, parent=None):
         self._timer = usertypes.Timer(self, 'ipc-timeout')
         self._timer.setInterval(READ_TIMEOUT)
         self._timer.timeout.connect(self.on_timeout)
+        self._timer_start_time = None

         if utils.is_windows:  # pragma: no cover
             self._atime_timer = None
@@ -261,6 +262,7 @@ def handle_connection(self):
         log.ipc.debug("Client connected (socket {}).".format(self._socket_id))
         self._socket = socket
         self._timer.start()
+        self._timer_start_time = time.monotonic()
         socket.readyRead.connect(self.on_ready_read)
         if socket.canReadLine():
             log.ipc.debug("We can read a line immediately.")
@@ -393,11 +395,23 @@ def on_ready_read(self):

         if self._socket is not None:
             self._timer.start()
+            self._timer_start_time = time.monotonic()

     @pyqtSlot()
     def on_timeout(self):
         """Cancel the current connection if it was idle for too long."""
         assert self._socket is not None
+        assert self._timer_start_time is not None
+        if (
+            time.monotonic() - self._timer_start_time < READ_TIMEOUT / 1000 / 10
+            and qtutils.version_check("6.7.0", exact=True, compiled=False)
+            and utils.is_windows
+        ):
+            # WORKAROUND for unknown Qt bug (?) where the timer triggers immediately
+            # https://github.com/qutebrowser/qutebrowser/issues/8191
+            log.ipc.debug("Ignoring early on_timeout call")
+            return
+
         log.ipc.error("IPC connection timed out "
                       "(socket {}).".format(self._socket_id))
         self._socket.disconnectFromServer()

I hate it, but I spent the better part of my day with this now, and I can't come up with anything better. I think any further spelunking would require building Qt from source and firing up a C++ debugger to see what happens in Qt... but I'm exhausted and would rather move on to something more interesting.

@toofar thoughts?

toofar commented 1 month ago

That's very strange. Had a look at recent commits to QTimer and QLocalSocket and nothing jumped out at me. I had a thought that maybe it could be due to calls getting interleaved, like if handle_connection() gets called when we are in the middle of reading data from the socket. But there's nothing I could find in search results warning of anything like that and the docs for QTimer and QLocalSocket don't call it out either.

I guess if identifying and ignoring the situation is the best we can do then we can try that and look into it further if it comes up again.

A few questions:

  1. should we be setting singleShot() on this timer? The default is false right? (I tested and it didn't fix the issue though.)
  2. when the timer fires early, what does timer.remainingTime() report?
  3. assuming it is 0, should we be restarting the time when it has fired early?

I also worry that this will affect other timers, since we don't know what causes it. But since we haven't seen this issue on any of the linux test runs and people are already using 6.7 without complaint, I guess we hope it remains this isolated ¯_(ツ)_/¯

The-Compiler commented 1 month ago

That's very strange. Had a look at recent commits to QTimer and QLocalSocket and nothing jumped out at me.

Agreed on both parts. I also looked at changes in qeventdispatcher_win.cpp which has the Windows-specific implementation, and Qt changelogs, and PyQt changelogs, and even diffed PyQt... no dice. Nothing I could find in Qt issues either. I have no idea what's going on there, but it does seem pretty isolated.

I had a thought that maybe it could be due to calls getting interleaved, like if handle_connection() gets called when we are in the middle of reading data from the socket. But there's nothing I could find in search results warning of anything like that and the docs for QTimer and QLocalSocket don't call it out either.

In f9ae555c841748ee2586a33fd07e985e3ef08343 I tried making sure that every connection is a separate object with its own QTimer. I like the result of that refactoring, but for now I want to do something different than poking at IPC stuff. I'll probably open an issue to finish that later™. However, it didn't help either.

I guess if identifying and ignoring the situation is the best we can do then we can try that and look into it further if it comes up again.

Agreed. I want to wait for Qt 6.7.1 before pushing it. It's now planned for next week apparently... but once it's out, I hope we'll be able to easily try it out with PyQt, as it should be binary compatible. This will probably delay v3.2.0 even further, but eh... Qt 6.7.0 seems to have other annoying issues on Windows, so it's probably for the better.

A few questions:

  1. should we be setting singleShot() on this timer? The default is false right? (I tested and it didn't fix the issue though.)

Good point. It shouldn't matter much because in on_timeout we call self._socket.disconnectFromServer(), and that calls the on_disconnected slot, which stops the timer.

  1. when the timer fires early, what does timer.remainingTime() report?

0

  1. assuming it is 0, should we be restarting the time when it has fired early?

See above - in theory yes, but since it's currently a recurring timer, it does already restart anyways.

I also worry that this will affect other timers, since we don't know what causes it. But since we haven't seen this issue on any of the linux test runs and people are already using 6.7 without complaint, I guess we hope it remains this isolated ¯\(ツ)

With 01cc3dfc30dbdad081dfe8fd08d592a885574ee1 and 31ab3431f1ffa7146a0c428c96434fb902951c9e I tried to add a generic check to all timers that logs a warning. The CI run passed. I'm somewhat tempted to commit it and see how it fares over time (hah!). Then again, in case this issue just magically vanishes with Qt 6.7.1 again, we might as well pretend it was never there.

The-Compiler commented 1 month ago

Well, I finally found something! [QTBUG-124496] QObject::timerEvent may be called before the timer elapsed - Qt Bug Tracker

I wonder if we have a single-shot QTimer running somewhere in the IPC code, which Qt confuses with our timeout timer? Or maybe QLocalS(ocket|erver) has one internally?

That also explains why anecdotally, "clicking around" in the UI made it reproduce faster, because that probably leads to more timers being used.

The bug also says Qt 6.5 in addition to 6.7, but the linked bug says 6.5.6 (or .5 in the comments?). So chances are it was a change that landed in Qt 6.7 but also was backported to 6.5 LTS?

Assuming it's still there in Qt 6.7.1, I suppose it might be a good idea to leave the warning logging in.

toofar commented 1 month ago

Ohh, good spotting! So on Windows the timers are backed by some Windows API and somehow the Windows things aren't cleared when the Qt ones are and then a new Qt one will re-use a still active windows thing? Hmm, that doesn't sound right. Does stopping a timer fire a timer event and that's what's being picked up by the re-used one? Maybe we could make a new timer object before killing the old one and that would resolve this? We do stop and start the timer a couple of times in this class.

Anyway, it's reassuring to know it's for sure a Windows only problem. And that Thiago is assigned to the ticket.

I tried to add a generic check to all timers

Damn I saw that and though "wait, we are trying to get a release out, have you heard of scope creep!". Then I looked at the commits and remembered we already have a QTimer wrapper :rofl: Looks pretty reasonable to me.

toofar commented 1 month ago

I pushed some more stuff to your branch! So the current set of stuff on there is:

The changes I made were to

In my opinion that's all good to go, although I'm not hung up on much of it if you think some of it is risky and want to drop it (and I haven't looked at the socket ID stuff at all). Also I probably left a bunch of typos in there as usual. Personally I would edit the commit history to have less incidental stuff in it though ;) (If only so we don't have to skip over commits when running git blame on the ci workflow file.)

I should have some time tomorrow to follow up on this or have a look at pdf.js, so let me know how you want to move forward.