lxqt / qterminal

A lightweight Qt-based terminal emulator
https://lxqt.github.io
GNU General Public License v2.0
611 stars 153 forks source link

replacement character aka `�` appears out of nowhere sometimes #1143

Closed correabuscar closed 5 months ago

correabuscar commented 5 months ago

EDIT2: fixed in qterminal 2.0.1 EDIT: jump to working/tested patch by tsujan here below.

Expected Behavior

this didn't happen in qterminal 1.4.0 or so, iirc so, not randomly replacing chars with sounds like the expected behavior.

Current Behavior

a braille character like :

$ chars ⠿
U+283F, ⠿ 0x283F, \024077, UTF-8: e2 a0 bf, UTF-16BE: 283f
Width: 1, prints as ⠿
Quotes as \u{283f}
Unicode name: BRAILLE PATTERN DOTS-123456

is being shown, sometimes, as ��� note how it's made out of UTF-8: e2 a0 bf, this must be why there's 3 replacement character(s) :

$ chars �
U+FFFD, � 0xFFFD, \0177775, UTF-8: ef bf bd, UTF-16BE: fffd
Width: 1 (2 in CJK context), prints as �
Quotes as \u{fffd}
Unicode name: REPLACEMENT CHARACTER
Possible Solution

I don't know yet, didn't look in the code. Took a while to find reproduction steps.

Steps to Reproduce (for bugs)
  1. try to run this bash script(after you get the two files from below first):

    #!/usr/bin/env -vS -i PATH="/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/opt/bin:${PATH}" LANG="${LANG}" TERM='xterm-256color' DISPLAY="${DISPLAY}" HOME="${HOME}" bash --noprofile --norc
    cp ./qterminal.ini ~/.config/foo_qterminal.conf
    qterminal -p foo_qterminal -e 'less -R --shift 5 -- ./less.txt'

    Note that, apparently the cp line isn't necessary[^1]. You'll need less.txt file which I've generated from nvim via script command... well it doesn't matter I guess: less.txt qterminal.ini.txt (^ you'll have to rename this to remove the .txt extension, couldn't find the .ini to upload here without make it a .txt) put both files in current dir before running the above bash script from current dir.

  2. That should've made the less command open in a new qterminal of 172x54 size(but size doesn't matter apparently, not with less), you can now proceed as follows:

  3. either press Ctrl+L once and you shall see the aforementioned braille char transform into 3 replacement chars (futher Ctrl+L presses have no effect here)

  4. or press right arrow key to move 5 chars to the right and see the same braille char transform into those 3 replacement chars (after this pressing Ctrl+L here would "fix" it)

Here's before step 3 and after step 3 is done: before_ctrl_L after_ctrl_L

Context

btop gets messed up visually, but only in qterminal, not inside say alacritty.

Untitled

System Information

[^1]: so it would work with empty factory-generated profile(on my Gentoo system, but who knows if I've some patches or what not), so I see it work even in a small-sized qterminal the same, but when I tried it with nvim this was definitely not at all reproducible with any other sizes - seems to work now with less.

tsujan commented 5 months ago

I can reproduce it without foo_qterminal and simply by entering less -R --shift 5 -- ./less.txt. It doesn't happen in Konsole.

this didn't happen in qterminal 1.4.0

If so, since apparently nothing was changed in this regard when QTerminal was ported to Qt6 (but that should be checked again), it may be related to Qt6. Although Konsole is also based on Qt6, it uses qt6-5compat, while QTerminal doesn't.

tsujan commented 5 months ago

I don't know yet, didn't look in the code.

If you want to check the code (which would be very appreciated), this is what was done for porting to Qt6: https://github.com/lxqt/qtermwidget/pull/532/files (unfortunately, its commits weren't squashed before merging; hence mentioning the PR and not a commit). Small changes were also made after it, but I don't think they're relevant. It consists of using Qt6 methods and dropping of legacy encodings (QTextCodec).

correabuscar commented 5 months ago

I superficially checked out the PR and whatever was after it as well, but, to by novice eye, nothing popped out as the culprit.

Further, I switched qtermwidget and qterminal repos to git tag 1.4.0 and downgraded dev-util/lxqt-build-tools-2.0.0-r1::gentoo to dev-util/lxqt-build-tools-0.13.0::gentoo to be able to build, and can confirm the issue doesn't exist in 1.4.0. EDIT: actually I'm not sure I'm testing it properly, because it doesn't have a colorscheme(this is due to me specifying LD_LIBRARY_PATH= to point to qtermwidget's build dir, ok i'll use make install DESTDIR= somewhere) so it has white background, this might affect the issue. And left/right arrows don't work at all!ok it works with cmake -DCMAKE_INSTALL_PREFIX=/tmp/qtw1 .. for both repos and make install then run them from there with LD_LIBRARY_PATH="/tmp/qtw1/lib64" So for sure it works on 1.4.0 now that I properly checked!

I'm unclear whether this means that 1.4.0 is using some kind of qt5 even though I've still only dev-qt/qtbase 6.7.2 (maybe it does, it it's backward compatible?), but if not, then maybe something between that tag and the current git master on one or both of the repos could be the culprit?

Looking a bit harder, I do seem to still have something about qt5, so I guess maybe it's using that?

dev-qt/qtcore Installed versions:  5.15.14
dev-qt/qtconcurrent Installed versions:  5.15.14
dev-qt/qtbase Installed versions:  6.7.2

I'll check the diff since 1.4.0 tag

tsujan commented 5 months ago

What I know for sure is that there were characters which Qt5 showed correctly (in any Qt5 app) but Qt6 didn't. That regression is fixed in qt6-base 6.7.1-4 from Arch. Qt 6.7.2 will come soon to my distro (Manjaro), but it's very unlikely that the regression has come back with it (I'll see soon).

tsujan commented 5 months ago

... but if not, then maybe something between that tag and the current git master on one or both of the repos could be the culprit?

Your tests suggest it. I have 1.4.0 on my old laptop and will check it later. So, if it works fine there, we only need to find the needle in the haystack ;)

correabuscar commented 5 months ago

I think you were right though, 1.4.0 working and 2.0.0 not, given the difference in changes, it doesn't seem to be either qterminal or qtermwidget.

I had a superficial look and found, possibly not relevant: https://bugreports.qt.io/browse/QTBUG-117051 but there's something that ah it's removed 2 which strengthens my suspicion that it might be in qt itself. but still, something in there can do this.

That being said, i still haven't checked qterminal's 1.4.0 diff, only qtermwidget's (and superficially) EDIT: done, nothing jumps out to me, but I should note I've never programmed anything with qt before, so I've no idea what's what :D

I'm switching my focus on qt itself

tsujan commented 5 months ago

That being said, i still haven't checked qterminal's 1.4.0 diff, only qtermwidget's (and superficially)

I'll join you soon. If the problem is in our commits, two pairs of eyes couldn't miss it.

correabuscar commented 5 months ago

Progress: It does look related to the issue I mentioned above.

diff --git a/lib/Emulation.cpp b/lib/Emulation.cpp
index c88ac12..a8ed6ea 100644
--- a/lib/Emulation.cpp
+++ b/lib/Emulation.cpp
@@ -224,9 +224,13 @@ void Emulation::receiveData(const char* text, int length)
      * U+10FFFF
      * https://unicodebook.readthedocs.io/unicode_encodings.html#surrogates
      */
+    qDebug() << "!!!!!!!!!! text:" << text;
     QString str = QString::fromUtf8(text, length);
+    qDebug() << "!!!!!!!!!! str:" << str;
     auto encoded = _fromUtf8(str);
-    std::wstring unicodeText = encoded.data.toStdWString(); 
+    qDebug() << "!!!!!!!!!! encoded:" << encoded;
+    std::wstring unicodeText = encoded.data.toStdWString();
+    qDebug() << "!!!!!!!!!! unicodeText:" << QString::fromStdWString(unicodeText.c_str());

     //send characters to terminal emulator
     for (size_t i=0;i<unicodeText.length();i++)

with this patch i can redirect output to file and have export QT_LOGGING_RULES="*=true" (unsure if needed atm), but in that file I see that the very braille char itself which is made out of 3 bytes, ends before the 3rd byte is processed, in that chunk or what. Therefore qt6 thinks it's invalid utf8 and replaces those 2 chars that made it in with replacement char. Then the next chunk, has that 3rd char which is also invalid utf8 (i presume, didn't double check), and it's replaced with replacement char as well.

foo

Well, maybe I jumped the gun a little, as the above shows that 'text' was already having those replacement chars, so the transform didn't happen in this very patch place, but somewhere else, but the idea remains that something split it into chunks that didn't end on utf8 boundary or something, and then something else tried to make the chunks utf8 and thought the last char was invalid utf8 because it was split due to chunking of sorts. You'd know exactly what/why, I'm guessing. Remember I've 0 knowledge of Qt stuff.

tsujan commented 5 months ago

I think that it may be a Qt6 problem which Konsole avoids automatically by using the Qt5 methods provided by qt6-5compat.

tsujan commented 5 months ago

This is what I've done so far.

First I checked QTerminal 1.4.0 and found no issue in this regard. A screenshot:

qterminal

Then I started to compare 1.4.0 and 2.0.0 with each other and focused on Emulation.cpp. The only relevant change I found was in Emulation::receiveData (as you did), and its Qt6 version was OK.

Later I'll compare other files that may be related to this, but I think the problem happens in Emulation::receiveData. If that's the case, the cause should be in Qt6.

correabuscar commented 5 months ago

The only relevant change I found was in Emulation::receiveData (as you did), and its Qt6 version was OK.

the thing is that for some reason the text parameter to receiveData already seems to have the replacement chars set, so that means whatever calls receiveData, passes us the already-broken chunk...

correabuscar commented 5 months ago

the thing is that for some reason the text parameter to receiveData already seems to have the replacement chars set, so that means whatever calls receiveData, passes us the already-broken chunk...

ok I see the wrong assumption here on my part: qDebug() << "!!!!!!!!!! text:" << str; in fact converts the const char* text arg of receiveData into UTF8 before printing it, that's why I see it already as , so it's not that receiveData already receives it like that, but the subsequent UTF8 conversions inside receiveData transform it into (including my qDebug() attempt)

I realized that because I just couldn't get this patch to work(it just wouldn't trigger), heh:

diff --git a/CMakeLists.txt b/CMakeLists.txt
index 9cf3516..3533c86 100644
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -153,6 +153,7 @@ set_target_properties( ${QTERMWIDGET_LIBRARY_NAME} PROPERTIES
                        VERSION ${QTERMWIDGET_VERSION}
                      )

+set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -fexceptions")

 if(APPLE)
     target_compile_definitions(${QTERMWIDGET_LIBRARY_NAME}
diff --git a/lib/Emulation.cpp b/lib/Emulation.cpp
index c88ac12..afc2c1e 100644
--- a/lib/Emulation.cpp
+++ b/lib/Emulation.cpp
@@ -224,9 +224,31 @@ void Emulation::receiveData(const char* text, int length)
      * U+10FFFF
      * https://unicodebook.readthedocs.io/unicode_encodings.html#surrogates
      */
+    qDebug() << "!!!!!!!!!! length:" << length;
+    qDebug() << "!!!!!!!!!! text:" << text;
     QString str = QString::fromUtf8(text, length);
+    qDebug() << "!!!!!!!!!! str:" << str;
     auto encoded = _fromUtf8(str);
-    std::wstring unicodeText = encoded.data.toStdWString(); 
+    qDebug() << "!!!!!!!!!! encoded:" << encoded;
+    std::wstring unicodeText = encoded.data.toStdWString();
+    qDebug() << "!!!!!!!!!! unicodeText:" << QString::fromStdWString(unicodeText.c_str());
+    if (length >= 3 &&
+        ((text[length - 3] == '\xEF' &&
+        text[length - 2] == '\xBF' &&
+        text[length - 1] == '\xBD')||(text[0]=='\xEF' && text[1]=='\xBF' && text[2]=='\xBD'))) {
+
+        qDebug() << "Replacement character detected at the end of the input.";
+
+        // Option 1: Print stack trace
+        //printStackTrace();
+
+        // Option 2: Crash the program
+        // This can be done using assertions or by throwing an exception
+        // Q_ASSERT(false); // This will crash the program with a debug assertion
+        throw std::runtime_error("Replacement character detected at the end of the input. Crashing as requested.");
+        // Option 2: Crash the program using assertion
+        Q_ASSERT(false); // This will crash the program with a debug assertion
+    }

     //send characters to terminal emulator
     for (size_t i=0;i<unicodeText.length();i++)

therefore you may be right indeed the cause should be in Qt6.

However, I'm unclear what calls receiveData and why would it have worked before in 1.4.0 if the 'text' arg doesn't end in utf8 boundary. Will explore more. EDIT: Hmm, I'm thinking in qt5 there was no replacement? and that's why it worked with splitting on non-utf8 char boundary ? because the original bytes were left intact? unclear

correabuscar commented 5 months ago

I've a stacktrace with who's calling receiveData:

(gdb) bt2
executing: 'frame apply all -q frame'
#0  __pthread_kill_implementation (threadid=<optimized out>, signo=signo@entry=6, no_tid=no_tid@entry=0) at pthread_kill.c:44
44        return INTERNAL_SYSCALL_ERROR_P (ret) ? INTERNAL_SYSCALL_ERRNO (ret) : 0;
#1  0x00007ffff5ea9513 in __pthread_kill_internal (threadid=<optimized out>, signo=6) at pthread_kill.c:78
78    return __pthread_kill_implementation (threadid, signo, 0);
#2  0x00007ffff5e5164e in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
26    int ret = __pthread_kill (__pthread_self (), sig);
#3  0x00007ffff5e399c5 in __GI_abort () at abort.c:79
79        raise (SIGABRT);
#4  0x00007ffff60a6198 in __gnu_cxx::__verbose_terminate_handler () at /usr/src/debug/sys-devel/gcc-14.1.1_p20240622/gcc-14-20240622/libstdc++-v3/libsupc++/vterminate.cc:95
95      abort();
#5  0x00007ffff60bfa3d in __cxxabiv1::__terminate (handler=<optimized out>) at /usr/src/debug/sys-devel/gcc-14.1.1_p20240622/gcc-14-20240622/libstdc++-v3/libsupc++/eh_terminate.cc:48
48        handler ();
#6  0x00007ffff60a5b0f in std::terminate () at /usr/src/debug/sys-devel/gcc-14.1.1_p20240622/gcc-14-20240622/libstdc++-v3/libsupc++/eh_terminate.cc:58
58    __cxxabiv1::__terminate (get_terminate ());
#7  0x00007ffff60bfe5c in __cxxabiv1::__cxa_throw (obj=<optimized out>, tinfo=0x7ffff629ddd8 <typeinfo for std::runtime_error>, dest=0x7ffff60d8b50 <std::runtime_error::~runtime_error()>) at /usr/src/debug/sys-devel/gcc-14.1.1_p20240622/gcc-14-20240622/libstdc++-v3/libsupc++/eh_throw.cc:98
98    std::terminate ();
#8  0x00007ffff7f34220 in Konsole::Emulation::receiveData (this=<optimized out>, text=text@entry=0x555555dba100 "\240\277\"\033[m\033[38;2;204;204;204m\033[48;2;0;0;0m, \033[m\033[38;2;179;246;192m\033[48;2;0;0;0m\"⢿\"\033[m\033[38;2;204;204;204m\033[48;2;0;0;0m,\033[m\r\n", ' ' <repeats 12 times>, "\033[m\033[38;2;179;246;192m\033[48;2;0;0;0m\"⡇\"\033[m\033[38;2;204;204;204m\033[48;"..., length=length@entry=1027) at /tmp/qtermwidget/lib/Emulation.cpp:247
247         throw std::runtime_error("Replacement character detected at the front of the input. Crashing as requested.");
#9  0x00007ffff7f7369b in Konsole::Session::onReceiveBlock (this=0x5555558d96f0, buf=0x555555dba100 "\240\277\"\033[m\033[38;2;204;204;204m\033[48;2;0;0;0m, \033[m\033[38;2;179;246;192m\033[48;2;0;0;0m\"⢿\"\033[m\033[38;2;204;204;204m\033[48;2;0;0;0m,\033[m\r\n", ' ' <repeats 12 times>, "\033[m\033[38;2;179;246;192m\033[48;2;0;0;0m\"⡇\"\033[m\033[38;2;204;204;204m\033[48;"..., len=1027) at /tmp/qtermwidget/lib/Session.cpp:902
902     _emulation->receiveData( buf, len );
#10 0x00007ffff67485b8 in doActivate<false> (sender=0x5555556fe530, signal_index=16, argv=0x7fffffffe170) at /usr/src/debug/dev-qt/qtbase-6.7.2/qtbase-everywhere-src-6.7.2/src/corelib/kernel/qobject.cpp:4098
4098                        callFunction(receiver, QMetaObject::InvokeMetaMethod, method_relative, argv);
#11 0x00007ffff66f8ce9 in QMetaObject::activate (sender=sender@entry=0x5555556fe530, m=m@entry=0x7ffff7fb8100 <Konsole::Pty::staticMetaObject>, local_signal_index=local_signal_index@entry=0, argv=argv@entry=0x7fffffffe170) at /usr/src/debug/dev-qt/qtbase-6.7.2/qtbase-everywhere-src-6.7.2/src/corelib/kernel/qobject.cpp:4146
4146            doActivate<false>(sender, signal_index, argv);
#12 0x00007ffff7f952aa in Konsole::Pty::receivedData (this=this@entry=0x5555556fe530, _t1=<optimized out>, _t2=<optimized out>) at /tmp/qtermwidget/build/lib/moc_Pty.cpp:178
178     QMetaObject::activate(this, &staticMetaObject, 0, _a);
#13 0x00007ffff7f64ef0 in Konsole::Pty::dataReceived (this=0x5555556fe530) at /tmp/qtermwidget/lib/Pty.cpp:323
323     emit receivedData(data.constData(),data.size());
#14 0x00007ffff67485b8 in doActivate<false> (sender=0x5555556fe220, signal_index=3, argv=0x7fffffffe248) at /usr/src/debug/dev-qt/qtbase-6.7.2/qtbase-everywhere-src-6.7.2/src/corelib/kernel/qobject.cpp:4098
4098                        callFunction(receiver, QMetaObject::InvokeMetaMethod, method_relative, argv);
#15 0x00007ffff66f8ce9 in QMetaObject::activate (sender=sender@entry=0x5555556fe220, m=m@entry=0x7ffff6a31de0 <QIODevice::staticMetaObject>, local_signal_index=local_signal_index@entry=0, argv=argv@entry=0x0) at /usr/src/debug/dev-qt/qtbase-6.7.2/qtbase-everywhere-src-6.7.2/src/corelib/kernel/qobject.cpp:4146
4146            doActivate<false>(sender, signal_index, argv);
#16 0x00007ffff675772b in QIODevice::readyRead (this=this@entry=0x5555556fe220) at /usr/src/debug/dev-qt/qtbase-6.7.2/qtbase-everywhere-src-6.7.2_build/src/corelib/Core_autogen/include/moc_qiodevice.cpp:211
211     QMetaObject::activate(this, &staticMetaObject, 0, nullptr);
#17 0x00007ffff7f62fb4 in KPtyDevicePrivate::_k_canRead (this=0x5555556f4c50) at /tmp/qtermwidget/lib/kptydevice.cpp:146
146             emit q->readyRead();
#18 0x00007ffff7f94fc9 in KPtyDevice::qt_static_metacall (_c=QMetaObject::InvokeMetaMethod, _id=1, _o=<optimized out>, _a=0x7fffffffe430) at /tmp/qtermwidget/build/lib/moc_kptydevice.cpp:105
105         case 1: { bool _r = _t->d_func()->_k_canRead();
#19 KPtyDevice::qt_static_metacall (_o=<optimized out>, _c=<optimized out>, _id=<optimized out>, _a=0x7fffffffe430) at /tmp/qtermwidget/build/lib/moc_kptydevice.cpp:98
98  void KPtyDevice::qt_static_metacall(QObject *_o, QMetaObject::Call _c, int _id, void **_a)
#20 0x00007ffff67485b8 in doActivate<false> (sender=0x5555556fe200, signal_index=5, argv=0x7fffffffe430) at /usr/src/debug/dev-qt/qtbase-6.7.2/qtbase-everywhere-src-6.7.2/src/corelib/kernel/qobject.cpp:4098
4098                        callFunction(receiver, QMetaObject::InvokeMetaMethod, method_relative, argv);
#21 0x00007ffff66f8ce9 in QMetaObject::activate (sender=sender@entry=0x5555556fe200, m=m@entry=0x7ffff6a2f620 <QSocketNotifier::staticMetaObject>, local_signal_index=local_signal_index@entry=2, argv=argv@entry=0x7fffffffe430) at /usr/src/debug/dev-qt/qtbase-6.7.2/qtbase-everywhere-src-6.7.2/src/corelib/kernel/qobject.cpp:4146
4146            doActivate<false>(sender, signal_index, argv);
#22 0x00007ffff66f96a3 in QSocketNotifier::activated (this=this@entry=0x5555556fe200, _t1=<optimized out>, _t2=...) at /usr/src/debug/dev-qt/qtbase-6.7.2/qtbase-everywhere-src-6.7.2_build/src/corelib/Core_autogen/include/moc_qsocketnotifier.cpp:202
202     QMetaObject::activate(this, &staticMetaObject, 2, _a);
#23 0x00007ffff66f97ea in QSocketNotifier::event (this=0x5555556fe200, e=<optimized out>) at /usr/src/debug/dev-qt/qtbase-6.7.2/qtbase-everywhere-src-6.7.2/src/corelib/kernel/qsocketnotifier.h:82
82      Q_DECL_CONSTEXPR operator DescriptorType() const noexcept { return sockfd; }
#24 0x00007ffff7a1c7ed in QApplicationPrivate::notify_helper (this=<optimized out>, receiver=0x5555556fe200, e=0x7fffffffe5a0) at /usr/src/debug/dev-qt/qtbase-6.7.2/qtbase-everywhere-src-6.7.2/src/widgets/kernel/qapplication.cpp:3287
3287        consumed = receiver->event(e);
#25 0x00007ffff676578c in QCoreApplication::notifyInternal2 (receiver=0x5555556fe200, event=0x7fffffffe5a0) at /usr/src/debug/dev-qt/qtbase-6.7.2/qtbase-everywhere-src-6.7.2/src/corelib/kernel/qcoreapplication.cpp:1142
1142        return self->notify(receiver, event);
#26 0x00007ffff6765924 in QCoreApplication::sendEvent (receiver=<optimized out>, event=<optimized out>) at /usr/src/debug/dev-qt/qtbase-6.7.2/qtbase-everywhere-src-6.7.2/src/corelib/kernel/qcoreapplication.cpp:1583
1583        return notifyInternal2(receiver, event);
#27 0x00007ffff6503b2e in socketNotifierSourceDispatch (source=0x555555664770) at /usr/src/debug/dev-qt/qtbase-6.7.2/qtbase-everywhere-src-6.7.2/src/corelib/kernel/qeventdispatcher_glib.cpp:75
75              QCoreApplication::sendEvent(p->socketNotifier, &event);
#28 0x00007ffff5d18d36 in g_main_dispatch (context=context@entry=0x7ffff0000ef0) at ../glib-2.78.6/glib/gmain.c:3476
3476              need_destroy = !(* dispatch) (source, callback, user_data);
#29 0x00007ffff5d1caa8 in g_main_context_dispatch_unlocked (context=0x7ffff0000ef0) at ../glib-2.78.6/glib/gmain.c:4284
4284          g_main_dispatch (context);
#30 g_main_context_iterate_unlocked (context=context@entry=0x7ffff0000ef0, block=block@entry=1, dispatch=dispatch@entry=1, self=<optimized out>) at ../glib-2.78.6/glib/gmain.c:4349
4349        g_main_context_dispatch_unlocked (context);
#31 0x00007ffff5d1d3b6 in g_main_context_iteration (context=0x7ffff0000ef0, may_block=1) at ../glib-2.78.6/glib/gmain.c:4414
4414      retval = g_main_context_iterate_unlocked (context, may_block, TRUE, G_THREAD_SELF);
#32 0x00007ffff64ed625 in QEventDispatcherGlib::processEvents (this=0x55555565e900, flags=...) at /usr/src/debug/dev-qt/qtbase-6.7.2/qtbase-everywhere-src-6.7.2/src/corelib/kernel/qeventdispatcher_glib.cpp:394
394     bool result = g_main_context_iteration(d->mainContext, canWait);
#33 0x00007ffff67927fa in QEventLoop::exec (this=this@entry=0x7fffffffe7f0, flags=..., flags@entry=...) at /usr/src/debug/dev-qt/qtbase-6.7.2/qtbase-everywhere-src-6.7.2/src/corelib/global/qflags.h:34
34      constexpr inline Q_IMPLICIT operator uint() const noexcept { return uint(i); }
#34 0x00007ffff67929ac in QCoreApplication::exec () at /usr/src/debug/dev-qt/qtbase-6.7.2/qtbase-everywhere-src-6.7.2/src/corelib/global/qflags.h:74
74      constexpr inline Q_IMPLICIT QFlags(Enum flags) noexcept : i(Int(flags)) {}
#35 0x00007ffff6d9460d in QGuiApplication::exec () at /usr/src/debug/dev-qt/qtbase-6.7.2/qtbase-everywhere-src-6.7.2/src/gui/kernel/qguiapplication.cpp:1926
1926        return QCoreApplication::exec();
#36 0x00007ffff7980930 in QApplication::exec () at /usr/src/debug/dev-qt/qtbase-6.7.2/qtbase-everywhere-src-6.7.2/src/widgets/kernel/qapplication.cpp:2555
2555        return QGuiApplication::exec();
#37 0x000055555557e615 in main (argc=<optimized out>, argv=<optimized out>) at /tmp/qterminal/src/main.cpp:219
219     int ret = app->exec();
(gdb)

I used this patch:

diff --git a/CMakeLists.txt b/CMakeLists.txt
index 9cf3516..3533c86 100644
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -153,6 +153,7 @@ set_target_properties( ${QTERMWIDGET_LIBRARY_NAME} PROPERTIES
                        VERSION ${QTERMWIDGET_VERSION}
                      )

+set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -fexceptions")

 if(APPLE)
     target_compile_definitions(${QTERMWIDGET_LIBRARY_NAME}
diff --git a/lib/Emulation.cpp b/lib/Emulation.cpp
index c88ac12..5cfc653 100644
--- a/lib/Emulation.cpp
+++ b/lib/Emulation.cpp
@@ -224,9 +224,30 @@ void Emulation::receiveData(const char* text, int length)
      * U+10FFFF
      * https://unicodebook.readthedocs.io/unicode_encodings.html#surrogates
      */
+    qDebug() << "!!!!!!!!!! length:" << length;
+    qDebug() << "!!!!!!!!!! text:" << text;
     QString str = QString::fromUtf8(text, length);
+    qDebug() << "!!!!!!!!!! str:" << str;
     auto encoded = _fromUtf8(str);
-    std::wstring unicodeText = encoded.data.toStdWString(); 
+    qDebug() << "!!!!!!!!!! encoded:" << encoded;
+    std::wstring unicodeText = encoded.data.toStdWString();
+    qDebug() << "!!!!!!!!!! unicodeText:" << QString::fromStdWString(unicodeText.c_str());
+    if (length >= 3 &&
+        (
+         /*(str[length - 3] == '\xEF' &&
+        str[length - 2] == '\xBF' &&
+        str[length - 1] == '\xBD')
+         ||*/
+         //(str[0]=='\xEF' && str[1]=='\xBF' && str[2]=='\xBD'))
+    //str.contains(QChar::ReplacementCharacter))
+    str[0]==QChar::ReplacementCharacter)
+        ) {
+
+        qDebug() << "Replacement character detected at the front of the input.";
+        throw std::runtime_error("Replacement character detected at the front of the input. Crashing as requested.");
+        // Option 2: Crash the program using assertion
+        Q_ASSERT(false); // This will crash the program with a debug assertion
+    }

     //send characters to terminal emulator
     for (size_t i=0;i<unicodeText.length();i++)

then ran this:

#!/usr/bin/env -S -i PATH="/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/opt/bin:${PATH}" LANG="${LANG}" TERM='xterm-256color' DISPLAY="${DISPLAY}" HOME="${HOME}" bash --noprofile --norc

cp ./qterminal.ini ~/.config/foo_qterminal.conf
declare -x XDG_RUNTIME_DIR="/run/user/1000"
export QT_LOGGING_RULES="*=true"
#export QT_LOGGING_RULES="qt.*=true"
#qterminal -p foo_qterminal -e 'less -R --shift 5 -- ./less.txt'
#LD_LIBRARY_PATH="/var/tmp/portage/x11-libs/qtermwidget-2.0.0/work/qtermwidget-2.0.0_build" qterminal -p foo_qterminal -e 'less -R --shift 5 -- ./less.txt'
#LD_LIBRARY_PATH="/tmp/qtw1/usr/local/lib64" /tmp/qterminal/build/qterminal -p foo_qterminal -e 'less -R --shift 5 -- ./less.txt'
#LD_LIBRARY_PATH="/tmp/qtw1" /tmp/qterminal/build/qterminal -p foo_qterminal -e 'less -R --shift 5 -- ./less.txt'
#export XDG_RUNTIME_DIR="/run/user/1000"
#/tmp/qterminal/build/qterminal -p foo_qterminal -e 'less -R --shift 5 -- ./less.txt'
#/tmp/qterminal/build/qterminal -p foo_qterminal -e 'less -R --shift 5 -- ./less.txt'
#LD_LIBRARY_PATH="/tmp/qtw1/lib64" /tmp/qtw1/bin/qterminal -p foo_qterminal -e 'less -R --shift 5 -- ./less.txt'
#LD_LIBRARY_PATH="/tmp/qtw2/lib64" /tmp/qtw2/bin/qterminal -p foo_qterminal -e 'less -R --shift 5 -- ./less.txt'

echo "set environment LD_LIBRARY_PATH /tmp/qtw2/lib64" > /tmp/gdb_cmds.txt
echo "run -p foo_qterminal -e 'less -R --shift 5 -- ./less.txt' > /tmp/foo7.log 2>&1 " >> /tmp/gdb_cmds.txt
gdb -x /tmp/gdb_cmds.txt /tmp/qtw2/bin/qterminal

then held down Ctrl+L inside that less that started in qterminal Ah and qterminal and qtermwidged were both built like:

git cloned ... chdir into it
mkdir build
cd build
cmake -DCMAKE_INSTALL_PREFIX=/tmp/qtw2 ..
make
make install

A more robust patch which wouldn't trigger if the input has replacement chars, might be:

diff --git a/CMakeLists.txt b/CMakeLists.txt
index 9cf3516..3533c86 100644
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -153,6 +153,7 @@ set_target_properties( ${QTERMWIDGET_LIBRARY_NAME} PROPERTIES
                        VERSION ${QTERMWIDGET_VERSION}
                      )

+set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -fexceptions")

 if(APPLE)
     target_compile_definitions(${QTERMWIDGET_LIBRARY_NAME}
diff --git a/lib/Emulation.cpp b/lib/Emulation.cpp
index c88ac12..31cbdab 100644
--- a/lib/Emulation.cpp
+++ b/lib/Emulation.cpp
@@ -208,6 +208,23 @@ void Emulation::sendMouseEvent(int /*buttons*/, int /*column*/, int /*row*/, int
     // default implementation does nothing
 }

+int countReplacementCharactersInText(const char* text, int length) {
+    int count = 0;
+    for (int i = 0; i < length - 2; ++i) {
+        if (static_cast<unsigned char>(text[i]) == 0xEF &&
+            static_cast<unsigned char>(text[i + 1]) == 0xBF &&
+            static_cast<unsigned char>(text[i + 2]) == 0xBD) {
+            count++;
+            i += 2; // Skip the next two bytes
+        }
+    }
+    return count;
+}
+
+int countReplacementCharactersInString(const QString& str) {
+    return str.count(QChar::ReplacementCharacter);
+}
+
 /*
    We are doing code conversion from locale to unicode first.
 TODO: Character composition from the old code.  See #96536
@@ -224,9 +241,41 @@ void Emulation::receiveData(const char* text, int length)
      * U+10FFFF
      * https://unicodebook.readthedocs.io/unicode_encodings.html#surrogates
      */
+    qDebug() << "!!!!!!!!!! length:" << length;
+    qDebug() << "!!!!!!!!!! text:" << text;
     QString str = QString::fromUtf8(text, length);
+    qDebug() << "!!!!!!!!!! str:" << str;
     auto encoded = _fromUtf8(str);
-    std::wstring unicodeText = encoded.data.toStdWString(); 
+    qDebug() << "!!!!!!!!!! encoded:" << encoded;
+    std::wstring unicodeText = encoded.data.toStdWString();
+    qDebug() << "!!!!!!!!!! unicodeText:" << QString::fromStdWString(unicodeText.c_str());
+//    if (length >= 3 &&
+//        (
+//         /*(str[length - 3] == '\xEF' &&
+//        str[length - 2] == '\xBF' &&
+//        str[length - 1] == '\xBD')
+//         ||*/
+//         //(str[0]=='\xEF' && str[1]=='\xBF' && str[2]=='\xBD'))
+//    //str.contains(QChar::ReplacementCharacter))
+//    str[0]==QChar::ReplacementCharacter)
+//        ) {
+//
+//        qDebug() << "Replacement character detected at the front of the input.";
+//        throw std::runtime_error("Replacement character detected at the front of the input. Crashing as requested.");
+//        // Option 2: Crash the program using assertion
+//        Q_ASSERT(false); // This will crash the program with a debug assertion
+//    }
+    // Count replacement characters in 'text'
+    int textReplacementCount = countReplacementCharactersInText(text, length);
+    // Count replacement characters in 'str'
+    int strReplacementCount = countReplacementCharactersInString(str);
+
+    // Compare the counts
+    if (strReplacementCount > textReplacementCount) {
+        qDebug() << "More replacement characters found in 'str' than in 'text'.";
+        throw std::runtime_error("Something got replaced with replacement character. Crashing as requested.");
+        Q_ASSERT(false);//not reached
+    }

     //send characters to terminal emulator
     for (size_t i=0;i<unicodeText.length();i++)

So it would only trigger upon detecting replacements with replacement chars. Only tested on non-replacement chars input, though.

correabuscar commented 5 months ago

This is the qt6 code where the replacement is made: https://github.com/qt/qtbase/blob/c0afb1ae836b779fbf0b0b60b7c55517e2e2f00f/src/corelib/text/qstringconverter.cpp#L663

I've patched out that 'if' and can confirm it's replacing the braille char with nothing on screen. So that's the code.

But I think this might be solvable by qterminal/qtermwidget if someone figures out how that receiveData is called(I assume it's not called by qt6 but I may be wrong here, even though I did show the stacktrace above) and then maybe do a tentative QString str = QString::fromUtf8(text, length); before actually sending the data, then check if the number of replacement chars in str is higher than those in text then know that replacements were made (just like in my above comment, at the end, that patch - can do the counting) - or something simpler and just check if the end 1 to 3 utf8 chars (since valid utf8 is max 4 bytes, thus 3 of 4 would be potentially invalid when split and each of those max. 3 bytes gets replaced by a full utf8 replacement char aka ) are replacement chars, and thus then maybe try to send the buffer minus at most those 3 bytes at the end, which would be presumably replaced when converted to utf8(with replacement chars), and remember to ensure that next buffer to send contains these unsent ones at the beginning. So that then, the received buffer is always on utf8 boundary and qt6 won't be replacing anything.

I've no idea how this actually works, but definitely ensuring(somehow) that the send buffer is split only on utf8 char boundaries would prevent qt6 from spewing replacement chars.

I'm guessing somehow 1.4.0 qt5 didn't replace those chars? else I can't explain how it worked before.

EDIT: A first look (superficial) suggests that the reason it worked in 1.4.0 is because it was using a converter with internal state like QString QUtf8::convertToUnicode(const char *chars, int len, QTextCodec::ConverterState *state), but in qt6 it doesn't. I haven't really double checked this, so I might be wrong.

I followed it to that code (unless I made a mistake along the way), which is here: https://github.com/qt/qtbase/blob/86c62c8f6088ec148512457cb7e964661ba643b0/src/corelib/codecs/qutfcodec.cpp#L543

But if it did somehow keep internal state, as per the comment, it suggests it would recognize valid continuation(for example) if it was split in chunks like it is above in receiveData:

   // See above for buffer requirements for stateless decoding. However, that
    // fails if the state is not empty. The following situations can add to the
    // requirements:
    //  state contains      chars starts with           requirement
    //   1 of 2 bytes       valid continuation          0
    //   2 of 3 bytes       same                        0
    //   3 bytes of 4       same                        +1 (need to insert surrogate pair)
    //   1 of 2 bytes       invalid continuation        +1 (need to insert replacement and restart)
    //   2 of 3 bytes       same                        +1 (same)
    //   3 of 4 bytes       same                        +1 (same)
    QString result(len + 1, Qt::Uninitialized);
correabuscar commented 5 months ago

Wow I made a discovery, the replacement chars exist in receiveData even in 1.4.0, but somehow they didn't make it on screen. Will update this with the patch to test this.

Ok I jumped the gun a little. What I meant to say is that using the same utf8 conversion like qtermwidget uses in version 2, even tho using qt5, the same replacement chars can be seen.

In other words, this change is really the culprit:

diff --git a/lib/Emulation.cpp b/lib/Emulation.cpp
index 723f5d2..c88ac12 100644
--- a/lib/Emulation.cpp
+++ b/lib/Emulation.cpp
@@ -250,8 +224,9 @@ void Emulation::receiveData(const char* text, int length)
      * U+10FFFF
      * https://unicodebook.readthedocs.io/unicode_encodings.html#surrogates
      */
-    QString utf16Text = _decoder->toUnicode(text,length);
-    std::wstring unicodeText = utf16Text.toStdWString();
+    QString str = QString::fromUtf8(text, length);
+    auto encoded = _fromUtf8(str);
+    std::wstring unicodeText = encoded.data.toStdWString();

     //send characters to terminal emulator
     for (size_t i=0;i<unicodeText.length();i++)

because that QString::fromUtf8(text, length); is stateless and it thus has no choice but to transform any byte at the end of the chunk that isn't valid utf8 into the replacement char, while the previous way of doing it, was keeing a state (presumably) inside that _decoder.

This patch to test on 1.4.0:

diff --git a/CMakeLists.txt b/CMakeLists.txt
index ce50ed0..1747931 100644
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -153,6 +153,7 @@ set_target_properties( ${QTERMWIDGET_LIBRARY_NAME} PROPERTIES
                        VERSION ${QTERMWIDGET_VERSION}
                      )

+set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -fexceptions")

 if(APPLE)
     target_compile_definitions(${QTERMWIDGET_LIBRARY_NAME}
diff --git a/lib/Emulation.cpp b/lib/Emulation.cpp
index 723f5d2..19afe3a 100644
--- a/lib/Emulation.cpp
+++ b/lib/Emulation.cpp
@@ -38,6 +38,7 @@
 #include <QThread>

 #include <QTime>
+#include <QDebug>

 // KDE
 //#include <kdebug.h>
@@ -239,6 +240,23 @@ void Emulation::sendMouseEvent(int /*buttons*/, int /*column*/, int /*row*/, int
 TODO: Character composition from the old code.  See #96536
 */

+int countReplacementCharactersInText(const char* text, int length) {
+    int count = 0;
+    for (int i = 0; i < length - 2; ++i) {
+        if (static_cast<unsigned char>(text[i]) == 0xEF &&
+            static_cast<unsigned char>(text[i + 1]) == 0xBF &&
+            static_cast<unsigned char>(text[i + 2]) == 0xBD) {
+            count++;
+            i += 2; // Skip the next two bytes
+        }
+    }
+    return count;
+}
+
+int countReplacementCharactersInString(const QString& str) {
+    return str.count(QChar::ReplacementCharacter);
+}
+
 void Emulation::receiveData(const char* text, int length)
 {
     emit stateSet(NOTIFYACTIVITY);
@@ -250,8 +268,24 @@ void Emulation::receiveData(const char* text, int length)
      * U+10FFFF
      * https://unicodebook.readthedocs.io/unicode_encodings.html#surrogates
      */
+    qDebug() << "!!!!!!!!!! length:" << length;
+    qDebug() << "!!!!!!!!!! text:" << text; //yes replacement chars here because it get converted to utf8
+    QString str = QString::fromUtf8(text, length);
+    qDebug() << "!!!!!!!!!! str:" << str; //yes replacement chars here
     QString utf16Text = _decoder->toUnicode(text,length);
+    qDebug() << "!!!!!!!!!! utf16Text:" << utf16Text;//no repl. chars here, gets truncated before it
     std::wstring unicodeText = utf16Text.toStdWString();
+    qDebug() << "!!!!!!!!!! unicodeText:" << QString::fromStdWString(unicodeText.c_str());//none here too
+    // Count replacement characters in 'text'
+    int textReplacementCount = countReplacementCharactersInText(text, length);
+    // Count replacement characters in 'str'
+    int strReplacementCount = countReplacementCharactersInString(str);
+
+    // Compare the counts
+    if (strReplacementCount > textReplacementCount) {
+        qDebug() << "More replacement characters found in 'str' than in 'text'.";
+        throw std::runtime_error("Something got replaced with replacement character. Crashing as requested.");
+    }

     //send characters to terminal emulator
     for (size_t i=0;i<unicodeText.length();i++)

in 1.4.0 it is ignoring the incomplete utf8 char at the end, and on next chunk it's somehow[^1] able to see the correct whole braille char, as if it remembered that it ignored the first byte of it and considered it when processing the first 2 bytes of the next chunk which are the last 2 bytes of the braille char.

foo2

I think maybe that _decoder field is keeping track of what's seen before, which means, it sees the first byte (of the 3) of the braille char at the end of a chunk and it ignores it but keeps track of its state, internally, then when second chunk is given to it, then it uses what's seen(that 1 byte) and thus reconstructs the braille char at the beginning of this second chunk. And that's why it works in 1.4.0 . But in 2.0.0 since it's stateless, well it can't remember that it's seen a first byte in a prev. chunk, so it has to replacement char all of the 3 because they're split into two different calls. So I guess this means that qtermwidget would have to keep track of at most 3 bytes at the end of the chunk of text(+length) that it received in receiveData, so that on the next chunk can use those when processing the start of that chunk, and thus reconstruct whatever utf8 char was split into two chunks/calls. Or, get back to using _decoder as field and it presumably keeps track of this. Therefore the problem isn't really in qt6.

[^1]: I think this is because of that internal state maybe?! which I mentioned in the edit at the bottom of prev comment.

correabuscar commented 5 months ago

Here I have a proof-of-concept patch that does what I suspect _decoder did in 1.4.0, but this is for 2.0.0, that is: it keeps track of the last 1 to 3 bytes of the chunk(of text argument) if they're not valid utf8 and reuses those in the next chunk, by prepending them at the beginning.

It's very crappy patch, but as a p.o.c. it should do, to show how 2.0.0 can be made to work, if we keep track of the state just like I presume _decoder did in 1.4.0.

diff --git a/CMakeLists.txt b/CMakeLists.txt
index 9cf3516..3533c86 100644
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -153,6 +153,7 @@ set_target_properties( ${QTERMWIDGET_LIBRARY_NAME} PROPERTIES
                        VERSION ${QTERMWIDGET_VERSION}
                      )

+set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -fexceptions")

 if(APPLE)
     target_compile_definitions(${QTERMWIDGET_LIBRARY_NAME}
diff --git a/lib/Emulation.cpp b/lib/Emulation.cpp
index c88ac12..ed9c7f1 100644
--- a/lib/Emulation.cpp
+++ b/lib/Emulation.cpp
@@ -25,6 +25,7 @@
 // System
 #include <cstdio>
 #include <cstdlib>
+#include <cstring> // for memcpy
 #include <unistd.h>
 #include <string>

@@ -208,6 +209,23 @@ void Emulation::sendMouseEvent(int /*buttons*/, int /*column*/, int /*row*/, int
     // default implementation does nothing
 }

+int countReplacementCharactersInText(const char* text, int length) {
+    int count = 0;
+    for (int i = 0; i < length - 2; ++i) {
+        if (static_cast<unsigned char>(text[i]) == 0xEF &&
+            static_cast<unsigned char>(text[i + 1]) == 0xBF &&
+            static_cast<unsigned char>(text[i + 2]) == 0xBD) {
+            count++;
+            i += 2; // Skip the next two bytes
+        }
+    }
+    return count;
+}
+
+int countReplacementCharactersInString(const QString& str) {
+    return str.count(QChar::ReplacementCharacter);
+}
+
 /*
    We are doing code conversion from locale to unicode first.
 TODO: Character composition from the old code.  See #96536
@@ -219,14 +237,105 @@ void Emulation::receiveData(const char* text, int length)

     bufferedUpdate();

-    /* XXX: the following code involves encoding & decoding of "UTF-16
-     * surrogate pairs", which does not work with characters higher than
-     * U+10FFFF
-     * https://unicodebook.readthedocs.io/unicode_encodings.html#surrogates
-     */
+//    /* XXX: the following code involves encoding & decoding of "UTF-16
+//     * surrogate pairs", which does not work with characters higher than
+//     * U+10FFFF
+//     * https://unicodebook.readthedocs.io/unicode_encodings.html#surrogates
+//     */
+//    qDebug() << "!!!!!!!!!! length:" << length;
+//    qDebug() << "!!!!!!!!!! text:" << text;
+//    QString str = QString::fromUtf8(text, length);
+//    //QString str = fromUtf8(text, length);
+//    qDebug() << "!!!!!!!!!! str:" << str;
+//    auto encoded = _fromUtf8(str);
+//    qDebug() << "!!!!!!!!!! encoded:" << encoded;
+//    std::wstring unicodeText = encoded.data.toStdWString();
+//    qDebug() << "!!!!!!!!!! unicodeText:" << QString::fromStdWString(unicodeText.c_str());
+////    if (length >= 3 &&
+////        (
+////         /*(str[length - 3] == '\xEF' &&
+////        str[length - 2] == '\xBF' &&
+////        str[length - 1] == '\xBD')
+////         ||*/
+////         //(str[0]=='\xEF' && str[1]=='\xBF' && str[2]=='\xBD'))
+////    //str.contains(QChar::ReplacementCharacter))
+////    str[0]==QChar::ReplacementCharacter)
+////        ) {
+////
+////        qDebug() << "Replacement character detected at the front of the input.";
+////        throw std::runtime_error("Replacement character detected at the front of the input. Crashing as requested.");
+////        // Option 2: Crash the program using assertion
+////        Q_ASSERT(false); // This will crash the program with a debug assertion
+////    }
+//    // Count replacement characters in 'text'
+//    int textReplacementCount = countReplacementCharactersInText(text, length);
+//    // Count replacement characters in 'str'
+//    int strReplacementCount = countReplacementCharactersInString(str);
+//
+//    // Compare the counts
+//    if (strReplacementCount > textReplacementCount) {
+//        qDebug() << "More replacement characters found in 'str' than in 'text'.";
+//        throw std::runtime_error("Something got replaced with replacement character. Crashing as requested.");
+//        Q_ASSERT(false);//not reached
+//    }
+    qDebug() << "!!!!!!!!!! length:" << length;
+    qDebug() << "!!!!!!!!!! text:" << text; //yes replacement chars here because it get converted to utf8
     QString str = QString::fromUtf8(text, length);
-    auto encoded = _fromUtf8(str);
-    std::wstring unicodeText = encoded.data.toStdWString(); 
+    qDebug() << "!!!!!!!!!! str:" << str; //yes replacement chars here
+    // Count replacement characters in 'text'
+    int textReplacementCount = countReplacementCharactersInText(text, length);
+    // Count replacement characters in 'str'
+    int strReplacementCount = countReplacementCharactersInString(str);
+
+    //char combinedBuffer[3 + length];
+    //std::vector<char> combinedBuffer(_last_max_three_bytes_of_text.size() + length);
+
+    // do we have any new invalid utf8 bytes at the end of this new chunk?
+    int the_diff=strReplacementCount-textReplacementCount;
+    qDebug() << "!!!!!!!!!! the_diff:" << the_diff;
+    qDebug() << "!!!!!!!!!! _how_many_of_last_max_three_are_in:" << _how_many_of_last_max_three_are_in;
+
+
+
+    std::vector<uint8_t> combinedBuffer(_how_many_of_last_max_three_are_in + length);
+    //"allocates memory on the heap, but it is automatically deallocated when the vector goes out of scope, such as when the function exits"
+    int combinedLength = 0;
+    if (_how_many_of_last_max_three_are_in > 0) {
+        //copy last state bytes from the end of a prev. chunk which woulda been invalid utf8
+        //copy to the beginning of this new chunk
+        std::memcpy(combinedBuffer.data(), _last_max_three_bytes_of_text.data(), _how_many_of_last_max_three_are_in);
+        combinedLength+=_how_many_of_last_max_three_are_in;
+    }
+
+    // Compare the counts
+    if (the_diff>0) {
+        qDebug() << "!!!!!!!!!! More replacement characters("<<the_diff<<") found in 'str' than in 'text'.";
+        //throw std::runtime_error("Something got replaced with replacement character. Crashing as requested.");
+    }
+
+    // Copy new text to the combined buffer
+    // but not any invalid utf8 bytes from the end, if any.
+    std::memcpy(combinedBuffer.data() + combinedLength, text, length-the_diff);
+    combinedLength += length-the_diff;
+
+    // Try to convert the combined buffer to UTF-8
+    QString str2 = QString::fromUtf8(reinterpret_cast<const char*>(combinedBuffer.data()), combinedLength);
+    qDebug() << "!!!!!!!!!! resultAftermods:" << str2;
+    auto encoded = _fromUtf8(str2);
+    qDebug() << "!!!!!!!!!! encoded:" << encoded;
+    std::wstring unicodeText = encoded.data.toStdWString();
+    qDebug() << "!!!!!!!!!! unicodeText:" << QString::fromStdWString(unicodeText.c_str());
+//    //QString utf16Text = _decoder->toUnicode(text,length-the_diff);
+//    QString utf16Text = _decoder->toUnicode(reinterpret_cast<const char*>(combinedBuffer.data()),combinedLength);
+//    qDebug() << "!!!!!!!!!! utf16Text:" << utf16Text;
+
+    // Store the last up to 3 bytes of invalid UTF-8 from the end of the text
+    if (the_diff > 0) {
+        _how_many_of_last_max_three_are_in = the_diff;
+        std::memcpy(_last_max_three_bytes_of_text.data(), text + length - the_diff, the_diff);
+    } else {
+        _how_many_of_last_max_three_are_in = 0;
+    }

     //send characters to terminal emulator
     for (size_t i=0;i<unicodeText.length();i++)
diff --git a/lib/Emulation.h b/lib/Emulation.h
index 98ba2fe..27edaaf 100644
--- a/lib/Emulation.h
+++ b/lib/Emulation.h
@@ -470,6 +470,8 @@ protected:

   const KeyboardTranslator* _keyTranslator; // the keyboard layout
+  std::array<uint8_t, 3> _last_max_three_bytes_of_text = {0};
+  int _how_many_of_last_max_three_are_in = 0;

 protected slots:
   /**
same patch but for 1.4.0 (not needed as it works already without it, doh, but just to show that it works for 1.4.0 too, even if `_decoder` wasn't already keeping track of state internally. ```diff diff --git a/CMakeLists.txt b/CMakeLists.txt index ce50ed0..1747931 100644 --- a/CMakeLists.txt +++ b/CMakeLists.txt @@ -153,6 +153,7 @@ set_target_properties( ${QTERMWIDGET_LIBRARY_NAME} PROPERTIES VERSION ${QTERMWIDGET_VERSION} ) +set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -fexceptions") if(APPLE) target_compile_definitions(${QTERMWIDGET_LIBRARY_NAME} diff --git a/lib/Emulation.cpp b/lib/Emulation.cpp index 723f5d2..8be0797 100644 --- a/lib/Emulation.cpp +++ b/lib/Emulation.cpp @@ -25,6 +25,7 @@ // System #include #include +#include // for memcpy #include #include @@ -38,6 +39,7 @@ #include #include +#include // KDE //#include @@ -239,6 +241,23 @@ void Emulation::sendMouseEvent(int /*buttons*/, int /*column*/, int /*row*/, int TODO: Character composition from the old code. See #96536 */ +int countReplacementCharactersInText(const char* text, int length) { + int count = 0; + for (int i = 0; i < length - 2; ++i) { + if (static_cast(text[i]) == 0xEF && + static_cast(text[i + 1]) == 0xBF && + static_cast(text[i + 2]) == 0xBD) { + count++; + i += 2; // Skip the next two bytes + } + } + return count; +} + +int countReplacementCharactersInString(const QString& str) { + return str.count(QChar::ReplacementCharacter); +} + void Emulation::receiveData(const char* text, int length) { emit stateSet(NOTIFYACTIVITY); @@ -250,8 +269,79 @@ void Emulation::receiveData(const char* text, int length) * U+10FFFF * https://unicodebook.readthedocs.io/unicode_encodings.html#surrogates */ - QString utf16Text = _decoder->toUnicode(text,length); + qDebug() << "!!!!!!!!!! length:" << length; + qDebug() << "!!!!!!!!!! text:" << text; //yes replacement chars here because it get converted to utf8 + QString str = QString::fromUtf8(text, length); + qDebug() << "!!!!!!!!!! str:" << str; //yes replacement chars here + // Count replacement characters in 'text' + int textReplacementCount = countReplacementCharactersInText(text, length); + // Count replacement characters in 'str' + int strReplacementCount = countReplacementCharactersInString(str); + + //char combinedBuffer[3 + length]; + //std::vector combinedBuffer(_last_max_three_bytes_of_text.size() + length); + + // do we have any new invalid utf8 bytes at the end of this new chunk? + int the_diff=strReplacementCount-textReplacementCount; + qDebug() << "!!!!!!!!!! the_diff:" << the_diff; + qDebug() << "!!!!!!!!!! _how_many_of_last_max_three_are_in:" << _how_many_of_last_max_three_are_in; + //qDebug() << "!!!!!!!!!! _last_max_three_bytes_of_text:" << _last_max_three_bytes_of_text; +// qDebug() << "!!!!!!!!!! _last_max_three_bytes_of_text:"; +// for (const auto& byte : _last_max_three_bytes_of_text) { +// qDebug() << QString::number(byte, 16).rightJustified(2, '0'); // Print each byte as a 2-digit hexadecimal number +// } +// qDebug() << "!!!!!!!!!! _last_max_three_bytes_of_text:"; +// for (const auto& byte : _last_max_three_bytes_of_text) { +// qDebug() << QString::number(static_cast(byte), 16).rightJustified(2, '0'); +// } +// qDebug() << "!!!!!!!!!! _last_max_three_bytes_of_text:"; +// for (const auto& byte : _last_max_three_bytes_of_text) { +// qDebug().noquote() << QString::number(static_cast(byte), 16).rightJustified(2, '0'); +// } +// qDebug() << "!!!!!!!!!! _last_max_three_bytes_of_text:"; +// for (const auto& byte : _last_max_three_bytes_of_text) { +// qDebug().noquote() << QString("%1").arg(static_cast(byte), 2, 16, QChar('0')); +// } + + + + std::vector combinedBuffer(_how_many_of_last_max_three_are_in + length); + //"allocates memory on the heap, but it is automatically deallocated when the vector goes out of scope, such as when the function exits" + int combinedLength = 0; + if (_how_many_of_last_max_three_are_in > 0) { + //copy last state bytes from the end of a prev. chunk which woulda been invalid utf8 + //copy to the beginning of this new chunk + std::memcpy(combinedBuffer.data(), _last_max_three_bytes_of_text.data(), _how_many_of_last_max_three_are_in); + combinedLength+=_how_many_of_last_max_three_are_in; + } + + // Compare the counts + if (the_diff>0) { + qDebug() << "!!!!!!!!!! More replacement characters("<(combinedBuffer.data()), combinedLength); + qDebug() << "!!!!!!!!!! resultAftermods:" << str2; + //QString utf16Text = _decoder->toUnicode(text,length-the_diff); + QString utf16Text = _decoder->toUnicode(reinterpret_cast(combinedBuffer.data()),combinedLength); + qDebug() << "!!!!!!!!!! utf16Text:" << utf16Text;//no repl. chars here, gets truncated before it std::wstring unicodeText = utf16Text.toStdWString(); + qDebug() << "!!!!!!!!!! unicodeText:" << QString::fromStdWString(unicodeText.c_str());//none here too + + // Store the last up to 3 bytes of invalid UTF-8 from the end of the text + if (the_diff > 0) { + _how_many_of_last_max_three_are_in = the_diff; + std::memcpy(_last_max_three_bytes_of_text.data(), text + length - the_diff, the_diff); + } else { + _how_many_of_last_max_three_are_in = 0; + } //send characters to terminal emulator for (size_t i=0;i _last_max_three_bytes_of_text = {0}; + int _how_many_of_last_max_three_are_in = 0; protected slots: /** ```

So to apply that, on qtermwidget:

cd /tmp
git clone https://github.com/lxqt/qtermwidget.git
cd qtermwidget
patch -p1 -i /tmp/theabove.patch
mkdir build
cd build
cmake -DCMAKE_INSTALL_PREFIX=/tmp/qtw2 ..
make && make install

Then same build steps(minus the patch) for qterminal itself, (or use system one? then mod the path below), and then:

I run it like:

#!/usr/bin/env -S -i PATH="/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/opt/bin:${PATH}" LANG="${LANG}" TERM='xterm-256color' DISPLAY="${DISPLAY}" HOME="${HOME}" bash --noprofile --norc

cp ./qterminal.ini ~/.config/foo_qterminal.conf
declare -x XDG_RUNTIME_DIR="/run/user/1000"
export QT_LOGGING_RULES="*=true"

echo "set environment LD_LIBRARY_PATH /tmp/qtw2/lib64" > /tmp/gdb_cmds.txt
echo "run -p foo_qterminal -e 'less -R --shift 5 -- ./less.txt' > /tmp/foo7.log 2>&1 " >> /tmp/gdb_cmds.txt
gdb -x /tmp/gdb_cmds.txt /tmp/qtw2/bin/qterminal

Then press and hold Ctrl+L while less is showing inside the started qterminal. I notice it doesn't break (well, to be fair I also removed the throwing of exception), and then i can quit and check the /tmp/foo7.log file with nvim, search for and while I see it shows after and before on the text and str lines (see screenshot below) it doesn't show on the resultAftermods line at all, and futher, the next chunk has the braille char proper.

one two

tsujan commented 5 months ago

@correabuscar

This worked here (based on https://doc.qt.io/qt-6/qstringdecoder.html):

diff -ruNp qtermwidget-orig/lib/Emulation.cpp qtermwidget/lib/Emulation.cpp
--- qtermwidget-orig/lib/Emulation.cpp
+++ qtermwidget/lib/Emulation.cpp
@@ -52,7 +52,7 @@ Emulation::Emulation() :
   _keyTranslator(nullptr),
   _usesMouse(false),
   _bracketedPasteMode(false),
-  _fromUtf8(QStringEncoder::Utf16)
+  _toUtf16(QStringEncoder::Utf8)
 {
   // create screens with a default size
   _screen[0] = new Screen(40,80);
@@ -224,9 +224,9 @@ void Emulation::receiveData(const char*
      * U+10FFFF
      * https://unicodebook.readthedocs.io/unicode_encodings.html#surrogates
      */
-    QString str = QString::fromUtf8(text, length);
-    auto encoded = _fromUtf8(str);
-    std::wstring unicodeText = encoded.data.toStdWString(); 
+    QByteArray ba(text, length);
+    QString str = _toUtf16(ba);
+    std::wstring unicodeText = str.toStdWString();

     //send characters to terminal emulator
     for (size_t i=0;i<unicodeText.length();i++)
diff -ruNp qtermwidget-orig/lib/Emulation.h qtermwidget/lib/Emulation.h
--- qtermwidget-orig/lib/Emulation.h
+++ qtermwidget/lib/Emulation.h
@@ -494,7 +494,7 @@ private:
   bool _bracketedPasteMode;
   QTimer _bulkTimer1{this};
   QTimer _bulkTimer2{this};
-  QStringEncoder _fromUtf8;
+  QStringDecoder _toUtf16;
 };

 }

Please test it.

If I'm not mistaken, the logic used in the Qt6 port was incorrect.

correabuscar commented 5 months ago

@correabuscar

This worked here (based on https://doc.qt.io/qt-6/qstringdecoder.html):

diff -ruNp qtermwidget-orig/lib/Emulation.cpp qtermwidget/lib/Emulation.cpp
--- qtermwidget-orig/lib/Emulation.cpp
+++ qtermwidget/lib/Emulation.cpp
@@ -52,7 +52,7 @@ Emulation::Emulation() :
   _keyTranslator(nullptr),
   _usesMouse(false),
   _bracketedPasteMode(false),
-  _fromUtf8(QStringEncoder::Utf16)
+  _toUtf16(QStringEncoder::Utf8)
 {
   // create screens with a default size
   _screen[0] = new Screen(40,80);
@@ -224,9 +224,9 @@ void Emulation::receiveData(const char*
      * U+10FFFF
      * https://unicodebook.readthedocs.io/unicode_encodings.html#surrogates
      */
-    QString str = QString::fromUtf8(text, length);
-    auto encoded = _fromUtf8(str);
-    std::wstring unicodeText = encoded.data.toStdWString(); 
+    QByteArray ba(text, length);
+    QString str = _toUtf16(ba);
+    std::wstring unicodeText = str.toStdWString();

     //send characters to terminal emulator
     for (size_t i=0;i<unicodeText.length();i++)
diff -ruNp qtermwidget-orig/lib/Emulation.h qtermwidget/lib/Emulation.h
--- qtermwidget-orig/lib/Emulation.h
+++ qtermwidget/lib/Emulation.h
@@ -494,7 +494,7 @@ private:
   bool _bracketedPasteMode;
   QTimer _bulkTimer1{this};
   QTimer _bulkTimer2{this};
-  QStringEncoder _fromUtf8;
+  QStringDecoder _toUtf16;
 };

 }

Please test it.

If I'm not mistaken, the logic used in the Qt6 port was incorrect.

The patch works indeed! (EDIT: real world btop is fixed as well) That is amazing what you can do if you know what's going on (and if you know QT/C++, unlike me haha) ! Much appreciated and very cool to see such a tiny change that does exactly the same thing that 1.4.0 did with _decoder (i've tested it to be so)

This is really cool patch! Cheers!

I don't know why it didn't even cross my mind to check online for documentation, as you pointed out in another place "The encoder remembers any state that is required between calls,..."

modified version of the above patch, that I used to check that it does indeed work: ```patch diff --git a/CMakeLists.txt b/CMakeLists.txt index 9cf3516..3533c86 100644 --- a/CMakeLists.txt +++ b/CMakeLists.txt @@ -153,6 +153,7 @@ set_target_properties( ${QTERMWIDGET_LIBRARY_NAME} PROPERTIES VERSION ${QTERMWIDGET_VERSION} ) +set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -fexceptions") if(APPLE) target_compile_definitions(${QTERMWIDGET_LIBRARY_NAME} diff --git a/lib/Emulation.cpp b/lib/Emulation.cpp index c88ac12..dee470b 100644 --- a/lib/Emulation.cpp +++ b/lib/Emulation.cpp @@ -52,7 +52,7 @@ Emulation::Emulation() : _keyTranslator(nullptr), _usesMouse(false), _bracketedPasteMode(false), - _fromUtf8(QStringEncoder::Utf16) + _toUtf16(QStringEncoder::Utf8) { // create screens with a default size _screen[0] = new Screen(40,80); @@ -213,6 +213,23 @@ void Emulation::sendMouseEvent(int /*buttons*/, int /*column*/, int /*row*/, int TODO: Character composition from the old code. See #96536 */ +int countReplacementCharactersInText(const char* text, int length) { + int count = 0; + for (int i = 0; i < length - 2; ++i) { + if (static_cast(text[i]) == 0xEF && + static_cast(text[i + 1]) == 0xBF && + static_cast(text[i + 2]) == 0xBD) { + count++; + i += 2; // Skip the next two bytes + } + } + return count; +} + +int countReplacementCharactersInString(const QString& str) { + return str.count(QChar::ReplacementCharacter); +} + void Emulation::receiveData(const char* text, int length) { emit stateSet(NOTIFYACTIVITY); @@ -224,9 +241,29 @@ void Emulation::receiveData(const char* text, int length) * U+10FFFF * https://unicodebook.readthedocs.io/unicode_encodings.html#surrogates */ + qDebug() << "!!!!!!!!!! length:" << length; + qDebug() << "!!!!!!!!!! text:" << text; //yes replacement chars here because it get converted to utf8 QString str = QString::fromUtf8(text, length); - auto encoded = _fromUtf8(str); - std::wstring unicodeText = encoded.data.toStdWString(); + qDebug() << "!!!!!!!!!! str:" << str; //yes replacement chars here + // Count replacement characters in 'text' + int textReplacementCount = countReplacementCharactersInText(text, length); + // Count replacement characters in 'str' + int strReplacementCount = countReplacementCharactersInString(str); + int the_diff=strReplacementCount-textReplacementCount; + qDebug() << "!!!!!!!!!! the_diff:" << the_diff; + // Compare the counts + + + + QByteArray ba(text, length); + QString str2 = _toUtf16(ba); + qDebug() << "!!!!!!!!!! str2:" << str2; + std::wstring unicodeText = str2.toStdWString(); + qDebug() << "!!!!!!!!!! unicodeText:" << unicodeText.c_str(); + if (the_diff>0) { + qDebug() << "!!!!!!!!!! More replacement characters("<
tsujan commented 5 months ago

I don't know why it didn't even cross my mind to check online for documentation

Actually, this part of Qt's documentation is exceptionally obscure and needs to be rewritten (I think I've seen a bug report about the doc), perhaps because it's new. If it wasn't for a few examples in it, I wouldn't understand anything.

Since your experience in dealing with characters is more than mine, please test the patch under various circumstances. I'll make a PR and release 2.0.1 after merging it. But first I should check another part of the code, where the new Qt6 encoding/decoding methods are also used.

Thanks a lot for the report and also for the discussion!

correabuscar commented 5 months ago

nit: I've corrected insecure shebang in the above comments, added PATH prefix: /usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/opt/bin: full: #!/usr/bin/env -vS -i PATH="/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/opt/bin:${PATH}" LANG="${LANG}" TERM='xterm-256color' diff:

-#!/usr/bin/env -vS -i PATH="${PATH}" LANG="${LANG}" TERM='xterm-256color'
+#!/usr/bin/env -vS -i PATH="/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/opt/bin:${PATH}" LANG="${LANG}" TERM='xterm-256color'

because I don't want a user set PATH (possibly unintentionally if not maliciously) to execute a non-system bash executable. And just in case anyone stumbled upon this thinking #!/usr/bin/env bash is a secure portable shebang, it's not without that PATH prefix esp. if the potential script ends up being installed system-wide.

(anyway, I'm hiding this comment as off-topic)