wchar_t / unicode support broken? - initscr breaking putwc

0xE70000000 commented 2 years ago

Hi,

i've spent quite a bit of time trying to debug this issue and at this point i feel confident to say it might be a bug. If putwc is called with a valid codepoint (i've used 0xF6 for testing which should render as "ö") before initscr everything works as expected but anywhere after initscr non 7bit codepoints render as unknown characters (inverse color question marks) and as putwc is used by all functions printing wchar_t strings (for example waddwstr) this basically breaks unicode support. Locale settings (setlocale) don't seem to affect this at all (en_US.UTF-8, C or an invalid setting all result in the same behavior).

Trying to track down the cause i've come to the conclusion that it seems to be related to tputs/ti_puts calls (adding macros to curses_private.h turning those into noops keeps putwc functional but obviously otherwise breaks rendering badly). Sadly i wasn't able to pin this to a certain call as it seems at least multiple (or even all of them?) result in putwc breaking. Any kind of hint or pointer on to how to get this working would be greatly appreciated.

Testing was done on Devuan Ascii (a linux distribution pretty much identical to Debian 9.0 sans systemd). Curses version used is 0.3.2 release. I've also tried a handful of different terminal definitions which made no difference regarding putwc behavior (terminal used for most testing is sakura which is libvte based identifying as xterm-256color. Xterm (the application, not just the definition) gave the same results though.

0xE70000000 commented 2 years ago

Small update: In the meantime i have rebuilt my application against ncursesw which resolved the issue with no code changes whatsoever besides including the appropriate header files. This is somewhat unfortunate as i would rather use netbsd-curses but given that even after extensive debugging i have failed to track down the cause of putwc failing it seems that is the best i can do for now.

rofl0r commented 2 years ago

hmm, interesting, i didnt get a github notification when you opened this issue 7 days ago. anyway, did you try the 0.3.1 release ? it's what i'm still using and i haven't come across any issues so far. if you can confirm that 0.3.1 works we can try to bisect the commit that broke it.

0xE70000000 commented 2 years ago

Sadly building against 0.3.1 does not seem to make a difference. Non 7bit characters still show up as inverse question marks. What might be somewhat interesting though is that adding borders generally works. I guess the characters used to render the border aren't 7bit either but then rendering those is done using "alternative charset mode" (i am not sure how that works). I've tried patching one of the codepoints that seemed to be assigned to the be the "alternate" representation of a 7bit character but the change was ignored so being able to render those might not mean much as it seems the association is hardcoded one way or another.

Edit: I have just tried all versions down to 0.0.4 which while somewhat differing in rendering all exhibit the same behavior regarding non 7bit wchars.

rofl0r commented 2 years ago

thanks for testing. could you come up with a minimal example program exposing the behaviour ?

0xE70000000 commented 2 years ago

I've tried but it proved difficult. My application does not do a whole lot before initializing curses but even after stripping basically everything it still seems to do "something" that results in a different scenario than just calling putwc, initscr and then putwc again. The most similar behavior i managed to trigger was put putwc failing before and after initscr. Trying to replicate the original scenario i noticed a lot of weird behavior from putwc for no obvious reason though, so i decided to dig into glibc to see what putwc was actually doing and that's when things started making a lot more sense:

glibc handles wide characters pretty much separate from normal output. Down to using different buffers based on which kind of character is being written. Intermixing both kind of writes seems like a pretty bad idea with this logic and from what i saw in my testing i am almost certain doing so causes errors in regards to what is actually written to the screen (missing output - likely due to reading from the wrong internal buffer - and similar weird behavior).

There is also another thing which is probably the main cause of what i was seeing earlier. Glibc keeps kind of a "wide character state" on the FILE handle (mostly stdout in this case). It's not obvious (especially in regards to stdout/stderr) where this is actually decided (it's possible to force it on open - i highly doubt this is done for the default descriptors though) but it seems possible for glibc to set this the fly when handling functions that write to the FILE handle. Now this state contains a reference to a conversion function that deals with turning the wchar_t data into something the terminal understands. This state is set ONCE and never touched again, so if there is any output before a usable LC_CTYPE (which is queried it seems) is setup or it changes later on the conversion function reference on the file handle goes out of sync...

Now my theory is that my original code somehow triggers glibc to set the wide character conversion function to something that goes out of sync during initscr. Either that or the tputs/ti_puts calls using non wide writes simply confuse glibc's internal state.

I am somewhat short on time right now but i'll do a bit more testing to see if a call to freopen (which should reset the wide character state) has any effect. It also seems that fprintf( stdout, "%lc", wchar ) has a slightly more intelligent logic when it comes to intermixing character types. So if the freopen approach fails i'll probaly try to replace putwc with fprintf which would admittedly be kinda ugly but so seems glibc's approach at handling wide character output.

0xE70000000 commented 2 years ago

I've had a little more time than i initially thought so i investigated a bit. Here is some code demonstrating glibc's confusion when mixing putwc with putc calls:

#include <stdio.h>
#include <locale.h>
#include <unistd.h>
#include ".build/include/curses.h"

// broken putwc output
void test1( void ) {
    fprintf( stderr, "locale = %s\n", setlocale( LC_ALL, "en_US.UTF-8" ) );
    putc( 'f', stdout );
    putc( '\n', stdout );
    putwc( 0xf6, stdout );
    putwc( '\n', stdout );
}

// putc output ignored
void test2( void ) {
    fprintf( stderr, "locale = %s\n", setlocale( LC_ALL, "en_US.UTF-8" ) );
    putwc( 0xf6, stdout );
    putwc( '\n', stdout );
    putc( 'f', stdout );
    putc( '\n', stdout );
}

// broken putwc after initscr (likely due to calls to putc)
void test3( void ) {
    fprintf( stderr, "locale = %s\n", setlocale( LC_ALL, "en_US.UTF-8" ) );
    initscr();
    putwc( 0xf6, stdout );
    putwc( '\n', stdout );
}

// surprisingly working putwc
// (maybe due to putwc before initscr setting stdout to wide orientation?)
void test4( void ) {
    fprintf( stderr, "locale = %s\n", setlocale( LC_ALL, "en_US.UTF-8" ) );
    putwc( 0xf6, stdout );
    putwc( '\n', stdout );
    sleep( 5 );
    initscr();
    putwc( 0xf6, stdout );
    putwc( '\n', stdout );
}

// unsurprisingly broken putwc
void test5( void ) {
    fprintf( stderr, "locale = %s\n", setlocale( LC_ALL, "en_US.UTF-8" ) );
    putc( 'f', stdout );
    putc( '\n', stdout );
    sleep( 5 );
    initscr();
    putwc( 0xf6, stdout );
    putwc( '\n', stdout );
}

// fputwc is not identical to putwc (no output)
void test6( void ) {
    fprintf( stderr, "locale = %s\n", setlocale( LC_ALL, "en_US.UTF-8" ) );
    putc( 'f', stdout );
    putc( '\n', stdout );
    sleep( 5 );
    initscr();
    fputwc( 0xf6, stdout );
    fputwc( '\n', stdout );
}

// unsurprisingly broken putc
void test7( void ) {
    fprintf( stderr, "locale = %s\n", setlocale( LC_ALL, "en_US.UTF-8" ) );
    putwc( 0xf6, stdout );
    putwc( '\n', stdout );
    sleep( 5 );
    initscr();
    putc( 'f', stdout );
    putc( '\n', stdout );
}

// missing linebreak in waddstr and waddwstr
void test8( void ) {
    char test1[] = { 'f', '\n', 0 };
    wchar_t test2[] = { 0xf6, '\n', 0x00 };
    WINDOW *win;

    fprintf( stderr, "locale = %s\n", setlocale( LC_ALL, "en_US.UTF-8" ) );
    putwc( 0xf6, stdout );
    putwc( '\n', stdout );
    sleep( 5 );
    initscr();
    win = newwin( 3, 10, 0, 0 );
    waddstr( win, test1 );
    waddwstr( win, test2 );
    wrefresh( win );
    sleep( 5 );
}

// waddwstr output broken
void test9( void ) {
    char test1[] = { 'f', '\n', 0 };
    wchar_t test2[] = { 0xf6, '\n', 0x00 };
    WINDOW *win;

    fprintf( stderr, "locale = %s\n", setlocale( LC_ALL, "en_US.UTF-8" ) );
    putc( 'f', stdout );
    putc( '\n', stdout );
    sleep( 5 );
    initscr();
    win = newwin( 3, 10, 0, 0 );
    waddstr( win, test1 );
    waddwstr( win, test2 );
    wrefresh( win );
    sleep( 5 );
}

// waddwstr output broken
void test10( void ) {
    char test1[] = { 'f', '\n', 0 };
    wchar_t test2[] = { 0xf6, '\n', 0x00 };
    WINDOW *win;

    fprintf( stderr, "locale = %s\n", setlocale( LC_ALL, "en_US.UTF-8" ) );
    sleep( 5 );
    initscr();
    win = newwin( 3, 10, 0, 0 );
    waddstr( win, test1 );
    waddwstr( win, test2 );
    wrefresh( win );
    sleep( 5 );
}

// waddwstr output broken
void test11( void ) {
    char test1[] = { 'f', '\n', 0 };
    wchar_t test2[] = { 0xf6, '\n', 0x00 };
    WINDOW *win;

    fprintf( stderr, "locale = %s\n", setlocale( LC_ALL, "en_US.UTF-8" ) );
    sleep( 5 );
    initscr();
    win = newwin( 3, 10, 0, 0 );
    waddwstr( win, test2 );
    waddstr( win, test1 );
    wrefresh( win );
    sleep( 5 );
}

// missing linebreak in waddstr and waddwstr
void test12( void ) {
    char test1[] = { 'f', '\n', 0 };
    wchar_t test2[] = { 0xf6, '\n', 0x00 };
    WINDOW *win;

    fprintf( stderr, "locale = %s\n", setlocale( LC_ALL, "en_US.UTF-8" ) );
    sleep( 5 );
    initscr();
    stdout = freopen( NULL, "w", stdout );
    win = newwin( 3, 10, 0, 0 );
    waddwstr( win, test2 );
    waddstr( win, test1 );
    wrefresh( win );
    sleep( 5 );
}

int main( int argc, char **argv ) {
    test12();
    return 0;
}

Interestingly from reading glibc source it seems neither putc nor putwc itself set the stream orientation (which is usually done using an implicit call to fwide) but obviously something down the line does. Calling freopen after initscr to reset the internal state actually did make a difference but output is still broken (ignoring line breaks). To be honest i am pretty baffled with glibc's behavior. I've tried patching putchar.c:92 to use status = fprintf(outfd, "%lc", wch); instead of status = putwc(wch, outfd); which does fix the weirdness but not really because it's more intelligent or even "doing the right thing". What fprintf does for wide characters is call wcrtomb to convert to multibyte no matter what which does work in my case because of of the UTF-8 locale but would likely fail for anything else.

rofl0r commented 2 years ago

ouch, sorry that i didn't realize it before, but you cannot safely mix stdio output with curses output, because both systems use internal buffering and bookkeeping of positions and such, and are unaware of each other. a curses application shouldnt output anything before calling initscr(), and after that only use curses routines for output.

0xE70000000 commented 2 years ago

No need to be sorry. It's not really related to stdio anyways. You can take for example test10 and remove the fprintf to stderr (which was actually chosen to avoid messing with stdouts internal state) and the result is still the same:

// waddwstr output broken
void test10( void ) {
    char test1[] = { 'f', '\n', 0 };
    wchar_t test2[] = { 0xf6, '\n', 0x00 };
    WINDOW *win;

    setlocale( LC_ALL, "en_US.UTF-8" );
    initscr();
    win = newwin( 3, 10, 0, 0 );
    waddstr( win, test1 );
    waddwstr( win, test2 );
    wrefresh( win );
    sleep( 5 );
}

It's not even about mixing waddstr and waddwstr. The following version just calling waddwstr does the same:

// waddwstr output broken
void test10( void ) {
    wchar_t test2[] = { 0xf6, '\n', 0x00 };
    WINDOW *win;

    setlocale( LC_ALL, "en_US.UTF-8" );
    initscr();
    win = newwin( 3, 10, 0, 0 );
    waddwstr( win, test2 );
    wrefresh( win );
    sleep( 5 );
}

I am pretty sure it's related to glibc's internal stream state in regard to wide character handling associated with stdout being chosen by the first write occurring to it which is likely some putc call invoked by tputs/ti_puts during initscr (like i've said initially, hacking libcurses to turn tputs/ti_puts into noops keeps wide character output functional). At least i don't have a better explanation for what i am seeing. It's not like i've read every single line of libcurses / glibc and fully comprehend every little detail (especially in regards to libcurses - the code is pretty crazy at times!).

rofl0r commented 2 years ago

test10 prints f and ö separated by a newline on my musl-libc based system, and i suspect that's what you expect it to do. i can't currently test on glibc as i have to set up a test env first. what happens if you remove the setlocale call and export LC_ALL=C before running the program on glibc ?

0xE70000000 commented 2 years ago

Yes, that would be exactly what i'd expect to show up on the screen. What i am getting is an f and an � (inverse color question mark) though. Removing setlocale and exporting LC_ALL makes no difference. Same goes for calling setlocale( LC_ALL, "C" ) or setlocale( LC_ALL, "some-garbage" ).

0xE70000000 commented 2 years ago

By the way there is no need to rush this issue. I've added a switch to my build scripts for now that selects between netbsd-curses and ncurses (if i end up with a bit of free time i might also take a look to see if i can figure out what ncurses does differently in regards to wide character handling) so i can easily switch back if a solution is found and it doesn't bother me much right now. I am generally quite grateful you are taking the time to look into this.

richfelker commented 2 years ago

It looks like netbsd curses is internally mixing byte and wide stdio calls, which produces undefined behavior. See at least:

https://github.com/sabotage-linux/netbsd-curses/blob/ae69600380da3e0a8ea509fee65dbd5575a34daa/libcurses/putchar.c#L62-L92

I think you want to use narrow stdio functions entirely regardless of the type of character being printed. See if replacing the putwc call with fprintf(outfd, "%lc", wch) fixes it. There may be other places where a similar fix is needed. Of course printf is probably unwanted overhead here, so if it does work, you should probably switch to manually doing mbrtowc then fwrite or fputs.

0xE70000000 commented 2 years ago

Well put. I pretty much agree with you. Regarding wcrtomb (i think that's what you actually meant?) i am little worried though if it's really sufficient when dealing with non UTF-8 capable terminals. Glibc is jumping through a lot of hoops to properly convert wide character output (see iofwide.c). According to the comments in that file the logic chooses appropriate conversion routines based on the value of LC_CTYPE. If replacing putwc is found to be the solution somewhat replicating that logic (which mostly seems to rely on __wcsmbs_clone_conv returning pointers to the needed functions) might be preferable to simply calling wcrtomb. I have to admit that i am not really sure if this is just kind of a low level way of using wcrtomb though.

rofl0r commented 2 years ago

@michaelforney has produced this patch: http://ix.io/49Ui - you might want to try it out.

0xE70000000 commented 2 years ago

Nice, works for me! I've build the test10 code against libcurses patched with your link and it printed f newline ö newline. If you feel confident with the approach this issue can be closed i guess. Thanks a lot!

rofl0r commented 2 years ago

thanks for confirmation. i'll close the issue once the patch is merged - which can take some time as michael wants to upstream it to netbsd proper so i have to backport the changes since last year from there.

0xE70000000 commented 2 years ago

Alright. Seems sensible.

richfelker commented 2 years ago

Regarding wcrtomb (i think that's what you actually meant?) i am little worried though if it's really sufficient when dealing with non UTF-8 capable terminals.

wcrtomb will convert to whatever the locale's multibyte encoding is, which is exactly what wide character stdio would do except that the choice of encoding is based on the current locale rather than the locale that was active at the time the FILE stream became wide-oriented.

0xE70000000 commented 2 years ago

I see. The proposed solution is probably fine then. Sorry, i didn't realize until now that you are the author of musl and likely know a thing or two about what standard functions do ;)

sabotage-linux / netbsd-curses

wchar_t / unicode support broken? - initscr breaking putwc #53