Workaround for the missing UTF-8 support in Speakup I18N files and /speakup/synth_direct

Palacee-hun commented 9 months ago

Speakup scrambles characters in the 0x80 - 0xff code range in two cases:

If a software speech connector client (like Espeakup) reads out speech via /dev/softsynthu (I think almost always nowadays), then Speakup converts each code point in its internal synth buffer with a code above 0x7f to UTF-8 before passing it on. But UTF-8 itself uses bytes in the 0x80 - 0xff range for its multibyte encoding scheme. So Speakup encodes the individual bytes of a UTF-8 sequence themselves to UTF-8! For example 'á' (code 0xe1) is 0xc3 0xa1 in UTF-8, but Speakup scrambles this to 0xc3 0x83 0xc2 0xa1 when passing text on to Espeakup. The result is weird speech in languages with code points above 0x7f.
There are code paths in Speakup which give the 'synth_buffer_add' internal Speakup function a 'char', which is a signed 8-bit value. 'synth_buffer_add' takes a 'u16' (which is appropriate taking Unicode into consideration), but the weak type system of C converts the 'char' to 'u16' by sign-extension, which is not what we want. In these code paths characters in the 0x80 - 0xff range get sign-extended to the 0xff80 - 0xffff range when added to the internal Speakup synth buffer. Of course Espeakup gets these as UTF-8-encoded sequences (3 bytes in this case). The result is even more garbled text in UTF-8 locales.

About 6 years ago I tracked these problems down with very hard work when I began to use the first Git version of Espeakup that supported /dev/softsynthu. I decided to implement a workaround for these scramblings on the Espeakup side, because I found that much, much easier than fixing them in the Speakup kernel code. My opinion on this hasn't changed since. I managed to implement the workaround in Espeakup on my Linux box and have been field-testing it ever since. It works very well. Now I find it high time to create a pull request with my workaround, and that is what I am about to do.
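The double encoding described in the first case can be reproduced outside Speakup. The following is only a minimal Python sketch of the effect, not Speakup's actual code; the function name `utf8_encode_each_byte` is made up for illustration. It treats every byte of an already UTF-8 encoded string as a separate code point and encodes it again.

```python
def utf8_encode_each_byte(data: bytes) -> bytes:
    """Treat every byte as its own code point and UTF-8 encode it again.

    Hypothetical helper: it only models the effect described above.
    """
    # Latin-1 maps each byte value 0x00-0xff to the code point of the same value.
    return data.decode("latin-1").encode("utf-8")


correct = "\u00e1".encode("utf-8")          # 'á' encoded once
scrambled = utf8_encode_each_byte(correct)  # 'á' encoded twice

print(correct.hex(" "))    # c3 a1
print(scrambled.hex(" "))  # c3 83 c2 a1
```

The second line of output matches the scrambled sequence quoted in the report.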

sthibaul commented 9 months ago

If this had been reported 6 years ago, perhaps we would have fixed it and people wouldn't have been bugged by this...

If a software speech connector client (like Espeakup) reads out speech via /dev/softsynthu (I think almost always nowadays), then Speakup converts each code point in its internal synth buffer with a code above 0x7f to UTF-8 before passing it on. But UTF-8 itself uses bytes in the 0x80 - 0xff range for its multibyte encoding scheme. So Speakup encodes the individual bytes of a UTF-8 sequence themselves to UTF-8! For example 'á' (code 0xe1) is 0xc3 0xa1 in UTF-8, but Speakup scrambles this to 0xc3 0x83 0xc2 0xa1 when passing text on to Espeakup. The result is weird speech in languages with code points above 0x7f.

I'm really surprised by this report, as I cannot reproduce it with e.g. linux 6.4.0. When I type 'é' 0xe9, what I read from /dev/softsynthu with hexdump -C is c3 a9 18, which is just alright. Running echo é also gives that.

There are code paths in Speakup which give the 'synth_buffer_add' internal Speakup function a 'char'

Ah, indeed, these need fixing. But again, better just fix the kernel (and have the fix backported to stable kernels); the fix will actually reach the average users, who don't build espeakup themselves, sooner that way.

sthibaul commented 9 months ago

There are code paths in Speakup which give the 'synth_buffer_add' internal Speakup function a 'char'

Did you notice several of them? I can only see the call from synth_write, which is only called from speakup_file_write (triggered by writing to /dev/synth) and from synth_direct_store (triggered by writing to /sys/accessibility/speakup/synth_direct)

sthibaul commented 9 months ago

and these have always been 8bit files, not utf-8

sthibaul commented 9 months ago

when I properly emit latin1 to these, I do get the expected behavior

sthibaul commented 9 months ago

I'll however submit adding the cast since it really deserves adding.
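Why a cast matters here can be sketched in Python by emulating the C integer conversion. This is an illustration only: it assumes 'char' is signed, as the report states, and that the cast is to an unsigned 8-bit type. Both function names are hypothetical and do not appear in the kernel.

```python
def as_u16_from_signed_char(byte: int) -> int:
    """Emulate C converting a signed 8-bit char to a 16-bit unsigned value."""
    signed = byte - 0x100 if byte >= 0x80 else byte  # reinterpret as signed
    return signed & 0xFFFF                           # sign extension to 16 bits


def as_u16_from_unsigned_char(byte: int) -> int:
    """Emulate the same conversion after a cast to an unsigned 8-bit type."""
    return byte & 0xFF


raw = 0xE1  # Latin-1 'á'
extended = as_u16_from_signed_char(raw)
fixed = as_u16_from_unsigned_char(raw)

print(hex(extended))                           # 0xffe1
print(hex(fixed))                              # 0xe1
print(chr(extended).encode("utf-8").hex(" "))  # ef bf a1
```

The last line shows the 3-byte UTF-8 sequence a reader of /dev/softsynthu would receive for the sign-extended value.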

sthibaul commented 9 months ago

So, do you see garbled output without using /dev/synth or /sys/accessibility/speakup/synth_direct ?

sthibaul commented 9 months ago

I'll however submit adding the cast since it really deserves adding.

Here it is: https://lore.kernel.org/lkml/20240203233600.gu4qci36fpnro3ui@begin/T/#u

Palacee-hun commented 9 months ago

"If this had been reported 6 years ago, perhaps we would have fixed it and people wouldn't have been bugged by this..."

You have absolutely zero right to say this. If you are faced with the passing of all your relatives one after the other in unbelievable agonies, then halfway through this you are confronted with two years of Covid restrictions, and if you survive all this, you get high inflation to manage due to the Ukrainian war in your neighbourhood, then please believe me, reporting a Linux bug is in the zillionth place on your to-do list. And this all happened to me. People too often forget nowadays that life is not just about IT. You only have the right to say "thank you for your hard work". This was quite a mind-boggling bug.

"I'm really surprised by this report, as I cannot reproduce it with e.g. linux 6.4.0. When I type 'é' 0xe9, what I read from /dev/softsynthu with hexdump -C is c3 a9 18, which is just alright. Running echo é also gives that."

Yes, this misled me too back then. Try to do a screen review command on a line with that 'é'. Or try typing it into Nano, and review the line. I am sure you will see what I mean. Before preparing this issue and my pull request, now that I have some energy and mood for this, I dug through the recent Speakup code, and nothing has changed in this respect: the double UTF-8 encoding must still happen, as nothing prevents it. Fixing it in the kernel is hard because of the architecture of the 'read_softsynthx' internal function (or something called nearly so), which processes character by character, and handling this requires another viewpoint.

sthibaul commented 9 months ago

Try to do a screen review command on a line with that 'é'.

That does produce the expected result:

62 61 73 68 3a 20 c3 a9 c3 a9 c3 a9 3a 20 63 6f 6d 6d 61 6e 64 20 6e 6f 74 20 66 6f

try typing it into Nano, and review the line

That does also work, be it by line

c3 a9 0a 18

or by char

c3 a9 18

Palacee-hun commented 9 months ago

Have you done those hexdumps on /dev/softsynthu, and not on /dev/softsynth accidentally? I ask this specifically because I strongly remember that I could not access /dev/softsynthu for examination by any tool while espeakup was running, i.e. when using speech, not even as root (which I found logical, by the way, as it was being used by espeakup). That was one reason the diagnosis was so hard. I had to decipher by ear what might be going on from the speech output.

This matters very much because if /dev/softsynth is read, then the internal helper function that is responsible for reading from both devices is called with unicode = false, and in that case all 256 characters pass through unchanged. The described scrambling only occurs if that function is called with unicode = true, as in that case anything above 0x7f is UTF-8 encoded there. And that takes place only when reading /dev/softsynthu.

As an addendum I note that my locale is Hungarian, and I use espeakup with the hu+Max voice. My distro is Arch, which has mostly localised messages. I use the 'setfont eurlatgr' command to set my console font, otherwise some Hungarian characters are replaced with squares, and announced so.

I note also that I first noticed this issue in earnest when I began to put together a quick-and-dirty partial localisation of Speakup to Hungarian. I prepared a characters file for this in Nano, and noticed that two weird characters were read for some Hungarian accented characters. I note here that some Hungarian accented characters are below 0x100, others are above, like 'ő' (0x151). I also remember, through all those things and years, that back then I was conversing with a Hungarian guy about unicode support in espeakup and he also noted weirdness with some Hungarian code points. I definitely know also that the code I have submitted now solved all those weirdnesses for good.

sthibaul commented 9 months ago

Have you done those hexdumps on /dev/softsynthu

I am talking about /dev/softsynthu, sure.

I note here that some Hungarian accented characters are below 0x100, others are above, like 'ő' (0x151)

Ok, but your reproduction case was below 0x100, right? Characters above 0x100 are not supposed to face stronger issues than those below. And below 0x800 they still take only 2 bytes in UTF-8.

ő does show up properly as c5 91 in softsynthu with linux 6.4.
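The byte counts mentioned here follow from the UTF-8 encoding rules and can be checked with a short Python snippet. This is just an illustration; the helper name `utf8_hex` is made up.

```python
def utf8_hex(code_point: int) -> str:
    """Return the UTF-8 encoding of a code point as space-separated hex bytes."""
    return chr(code_point).encode("utf-8").hex(" ")


print(utf8_hex(0x00E9))  # 'é'                       -> c3 a9
print(utf8_hex(0x0151))  # 'ő'                       -> c5 91
print(utf8_hex(0x07FF))  # last 2-byte code point    -> df bf
print(utf8_hex(0x0800))  # first 3-byte code point   -> e0 a0 80
```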

I definitely know also that the code I have submitted now solved all those weirdnesses for good.

Perhaps "for good", but not in a good way. If the user happens to have some output which is valid double-utf-8, your workaround will mangle that. We do not want this.
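The objection can be made concrete with a small Python sketch (an illustration only, not code from espeakup or from the pull request; `naive_undo` is a hypothetical stand-in for any byte-level workaround). Text that legitimately contains the two characters 'Ã' and '¡' produces exactly the same bytes as a double-encoded 'á', so the two cannot be told apart from the byte stream alone.

```python
def naive_undo(data: bytes) -> bytes:
    """Hypothetical workaround: reverse one level of UTF-8 encoding."""
    return data.decode("utf-8").encode("latin-1")


# Legitimate text: the characters U+00C3 and U+00A1, encoded once.
legit_bytes = "\u00c3\u00a1".encode("utf-8")

# 'á' (U+00E1) encoded twice, as in the report.
double_encoded = "\u00e1".encode("utf-8").decode("latin-1").encode("utf-8")

print(legit_bytes.hex(" "))           # c3 83 c2 a1
print(legit_bytes == double_encoded)  # True

# Applied to the legitimate text, the workaround silently changes it to 'á'.
print(naive_undo(legit_bytes).decode("utf-8") == "\u00e1")  # True
```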

Palacee-hun commented 9 months ago

"Perhaps "for good", but not in a good way. If the user happens to have some output which is valid double-utf-8, your workaround will mangle that. We do not want this."

I note here that it is theoretically impossible to tell with 100 % certainty without a BOM whether an encoded byte sequence is real UTF-8 or just happens to fulfill the encoding scheme by chance. Of course that holds true for my algorithm as well. But a scrambled double UTF-8-encoded sequence has a very distinct pattern. If it is made from a 2 byte UTF-8 sequence, then it is 0xc3 x 0xc2 y, where x is between 0x80 and 0x9f and y is between 0x80 and 0xbf. For an originally 3 byte UTF-8 sequence, it is 0xc3 z 0xc2 y 0xc2 y, where z is between 0xa0 and 0xaf. Such a byte sequence definitely looks and sounds like gibberish. Although it is not totally impossible, it is extremely unlikely that it can have any meaning in a text. That is one reason I took this route.

Nevertheless please just come up with anything better, please only note however that a perfect solution for this simply doesn't exist. By the way, even in 2023 I have encountered some web-based systems that produce such double-encoded scrambled gibberish totally independent of Espeakup or any speech system whatsoever. I recognise them instantly since this Linux adventure. Until the day when every system always uses one and only one encoding standard, problems like this will eventually pop up here and there. And I think that day will never come.
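The byte pattern described above can be written down as a regular expression. The following Python sketch is not the code from the pull request; it is only one possible rendering of the described heuristic, using the ranges exactly as stated, and the names `DOUBLE_ENCODED` and `undo_double_encoding` are made up.

```python
import re

# 2-byte origin: 0xc3 x 0xc2 y, x in 0x80-0x9f, y in 0x80-0xbf
DOUBLE_2 = rb"\xc3[\x80-\x9f]\xc2[\x80-\xbf]"
# 3-byte origin: 0xc3 z 0xc2 y 0xc2 y, z in 0xa0-0xaf
DOUBLE_3 = rb"\xc3[\xa0-\xaf]\xc2[\x80-\xbf]\xc2[\x80-\xbf]"
DOUBLE_ENCODED = re.compile(DOUBLE_3 + b"|" + DOUBLE_2)


def undo_double_encoding(data: bytes) -> bytes:
    """Replace every sequence matching the pattern with its single encoding."""
    def fix(match):
        # Decode one UTF-8 level, then map code points 0x80-0xff back to bytes.
        return match.group().decode("utf-8").encode("latin-1")
    return DOUBLE_ENCODED.sub(fix, data)


# 'ő' (U+0151) encoded twice, then repaired.
scrambled = "\u0151".encode("utf-8").decode("latin-1").encode("utf-8")
print(scrambled.hex(" "))                        # c3 85 c2 91
print(undo_double_encoding(scrambled).hex(" "))  # c5 91
```

Sequences that do not match the pattern, such as plain ASCII or a correctly encoded 'ő', pass through unchanged.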

sthibaul commented 9 months ago

it is extremely unlikely that it can have any meaning in a text

We don't want "extremely unlikely" issues. Some people do work on character encoding, and do expect their screen reader to properly render really exactly what is being shown by their code, and not worked around.

Nevertheless please just come up with anything better

As I wrote, I am not getting any issue. If there is an issue with your setup, we need to determine how particular your setup is, to determine why things are going wrong. We will very likely discover something that might have had other consequences, and thus fix even more things in a go.

a perfect solution for this simply doesn't exist

It does. Whether some /dev reads/writes utf-8 or latin1 or anything else is something that is defined. By just obeying that, we'll have a computer that just works. Always.

some web-based systems that produce such double-encoded scrambled gibberish

Yes. That doesn't mean it's up to the screen reader to compensate for this. That can only bring harm.

sthibaul commented 9 months ago

Just to be sure: is your linux console really set up in utf-8 mode? You can try to enable it with

printf '\033%%G'

If your distribution wasn't doing this for you, it really should since all kinds of breakage follows otherwise.

Palacee-hun commented 9 months ago

Okay, I have managed to clear this up finally. It was not a walk in the park, but strace was my friend while doing it. Thanks strace, you have saved my day! For reference here is the command line that helps to see exactly what is read from /dev/softsynthu by Espeakup (quite impossible to do otherwise when Espeakup provides speech): 'strace -p $(pgrep espeakup) -ff -o strace_espeakup --strings-in-hex=non-ascii-chars -s 1024 -P /dev/softsynthu'

The problem is that back then, when I did my localisation attempt to Hungarian on the Speakup i18n files, I didn't realise that they didn't support UTF-8, and furthermore that using UTF-8 there would result in garbled speech and many hours of puzzling and frustration. They simply treat UTF-8 as 8-bit data and of course "logically" double-encode it as described here. I don't remember any material about Speakup clearly stating this back then. And that is no wonder, as I perceived that most folks using Speakup/Espeakup were English speakers, for whom this is no problem. /speakup/synth_direct also doesn't understand UTF-8, so double encoding occurs there too.

So all in all the code I have submitted here has been working around the missing UTF-8 support in those Speakup files. And of course it's not the proper way to do it. But as I didn't see much activity around Speakup/Espeakup, I thought that they were dormant. Most Linux users I heard about had already switched to an Orca-based screen reading setup. I was quite surprised to see a new Espeakup version when I recently did a system upgrade on my Linux box (which I hadn't cared about for years). That's why I brought up this issue here at all.

As Speakup has /dev/softsynthu, I think it would be a logical step to implement UTF-8 support for the i18n files and /speakup/synth_direct. I find this worthwhile as 8-bit encodings for non-English languages are quite obsolete and very messy and painful at times (the latter is especially true for Hungarian).
Furthermore taking this step doesn't seem hard to me: just interpret the text sent to those files as UTF-8 and then put the decoded code points into the already Unicode-ready internal synth buffer.
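The idea can be sketched in Python as follows. This is an illustration under stated assumptions, not kernel code: the function name `utf8_to_synth_buffer` is hypothetical, the buffer is modelled as a list of 16-bit values because 'synth_buffer_add' is described above as taking a 'u16', and replacing anything outside that range with U+FFFD is a choice made only for this example.

```python
def utf8_to_synth_buffer(data: bytes) -> list:
    """Decode UTF-8 input and return its code points as 16-bit values."""
    buffer = []
    # Invalid byte sequences are decoded as U+FFFD instead of raising an error.
    for ch in data.decode("utf-8", errors="replace"):
        code_point = ord(ch)
        # A 16-bit buffer cannot hold code points above U+FFFF (assumption:
        # substitute the replacement character in that case).
        buffer.append(code_point if code_point <= 0xFFFF else 0xFFFD)
    return buffer


written = "\u0151 \u00e1".encode("utf-8")  # the text 'ő á' as written to the file
print([hex(v) for v in utf8_to_synth_buffer(written)])
# ['0x151', '0x20', '0xe1']
```

With the decoded code points in the buffer, the existing conversion to UTF-8 on the /dev/softsynthu side would encode each character exactly once.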

sthibaul commented 8 months ago

Is there a reason for using synth_direct rather than /dev/synth? AFAIK, the former has a limited size while /dev/synth is not limited.

Palacee-hun commented 8 months ago

I haven't been aware of /dev/synth and the differences between the two. Thanks for pointing that out. Is /dev/synth UTF-8 / unicode ready yet? I haven't seen any recent / updated docs about Speakup, I wonder if there's any.

sthibaul commented 8 months ago

Is /dev/synth UTF-8 / unicode ready yet?

Not yet, patches are pending for linux 6.9 or 6.10

sthibaul commented 8 months ago

That'll be /dev/synthu

sthibaul commented 7 months ago

For information, I have also submitted a patch to make the i18n files utf-8.

Palacee-hun commented 7 months ago

Thanks much, this is good news, very good indeed. Is a patch also underway for the sign-extension bug I have described above?

sthibaul commented 7 months ago

it's already queued for 6.10 too

sthibaul commented 7 months ago

synthu and the sign-extension patch are actually queued for 6.9, it's the utf-8 i18n which will be for 6.10

linux-speakup / espeakup

Workaround for the missing UTF-8 support in Speakup I18N files and /speakup/synth_direct #54