Better unicode support - Githubissues

rcr commented 9 years ago

I received some feedback demonstrating some bugs in how unicode is handled. This is caused by the way rirc discards non-printable ascii bytes. The fix for this would require a bit of refactoring to replace all instances of char with wide chars, and improve the receiving/input routines.

I'll be tackling this fix next.

ghost commented 8 years ago

Thank you for working on this issue. I also noticed it. But there is no real hurry of course :)

seanmadden commented 8 years ago

👍

samcf commented 8 years ago

👍

mjhale commented 8 years ago

👍

CamilleScholtz commented 8 years ago

There is a thumbs-up button, so this stupid spam isn't necessary anymore...

ghost commented 8 years ago

Here is what I have found so far. Maybe you already know it, but it can still serve as documentation for the other contributors up there ^ ;)

The function called to print a message is in draw.c, and it prints the message char by char: printchar(char c). This indeed drops the wide chars (the non-ASCII ones).

Replacing it with printwchar(wchar_t wc) should allow printing all the non-ASCII characters.

I had to add libraries and set the locale, as in this minimal program that prints a non-ASCII char. For now I put the setlocale call in the redraw() function to be sure it is executed, until I find a proper place for it.

#include <stdio.h>
#include <locale.h>
#include <wchar.h>

int main(int argc, char **argv)
{
       setlocale(LC_ALL, "");

       putwchar(L'█');
}

Doing this compiles without errors or warnings, and the client runs without crashing, but there is still no unicode support: the characters are still ignored.

That was expected: the arguments passed to putwchar are still chars, not wchar_ts.

So what I will try now is changing the type from char to wchar_t all along the path from the server to the draw function (if I can do that), to make sure putwchar() receives a wchar_t.

Hopefully this should suffice.
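The observation above, that putwchar() only helps once it is handed real wide characters, can be illustrated with a decoding step between the received bytes and the output call. The following is a minimal sketch, not rirc code: the helper name `bytes_to_wide` and its placeholder policy are assumptions made for illustration, and it assumes the caller has already run setlocale(LC_ALL, "") with a UTF-8 locale.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/*
 * Hypothetical helper (not part of rirc): decode the multibyte string `src`
 * into `dst`, writing at most `cap - 1` wide characters plus a terminator.
 * Invalid or incomplete sequences become L'?', consuming one byte each.
 * Returns the number of wide characters written, excluding the terminator.
 */
size_t
bytes_to_wide(const char *src, wchar_t *dst, size_t cap)
{
    mbstate_t st;
    size_t len, out = 0;

    assert(src != NULL && dst != NULL && cap > 0);

    memset(&st, 0, sizeof st);
    len = strlen(src);

    while (len > 0 && out + 1 < cap) {
        wchar_t wc;
        size_t n = mbrtowc(&wc, src, len, &st);

        if (n == (size_t)-1 || n == (size_t)-2) {
            /* invalid or incomplete sequence: placeholder, skip one byte */
            wc = L'?';
            n = 1;
            memset(&st, 0, sizeof st);
        } else if (n == 0) {
            /* decoded a NUL; cannot happen here, but never loop forever */
            n = 1;
        }

        dst[out++] = wc;
        src += n;
        len -= n;
    }

    dst[out] = L'\0';
    return out;
}
```

Each element of `dst` can then be passed to putwchar(). Note that a C stream becomes either byte-oriented or wide-oriented on first use, so byte output (printf, putchar) and wide output (putwchar) should not be mixed on the same stdout.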

ghost commented 8 years ago

After reading a bit more, this could boil down to making the function newline() support both wchar_t and char, and finding which call adds the actual message to the buffer. But this would imply a function that accepts several possible argument types, and I'm not sure that is possible in C. Checking the type inside the function would not prevent the compilation warnings...

It would also require changing the buffer_line struct's text from char to wchar_t. But this would involve many replacements here and there to avoid the warnings.

It compiles, but when I run it, it blocks the terminal and the output stays the same (the shell prompt, the last output...).

I just pushed the changes to a unicode branch in a fork, as a WIP.
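On the question of one function accepting several argument types: C11 provides `_Generic`, which selects an expression at compile time based on the type of its argument, so no runtime type check is needed and no warnings arise. The sketch below only illustrates the mechanism; `line_len`, `line_len_bytes` and `line_len_wide` are hypothetical names and do not match the signature of rirc's newline().

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical byte-string variant. */
static size_t
line_len_bytes(const char *text)
{
    assert(text != NULL);
    return strlen(text);
}

/* Hypothetical wide-string variant. */
static size_t
line_len_wide(const wchar_t *text)
{
    assert(text != NULL);
    return wcslen(text);
}

/* Dispatch on the static type of `text` (requires a C11 compiler). */
#define line_len(text) _Generic((text),    \
        char *: line_len_bytes,            \
        const char *: line_len_bytes,      \
        wchar_t *: line_len_wide,          \
        const wchar_t *: line_len_wide)(text)
```

With `const char *b` and `const wchar_t *w` in scope, `line_len(b)` calls the byte variant and `line_len(w)` the wide one. This removes the warnings at call sites, but every structure that stores the text would still have to settle on one representation.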

rcr commented 8 years ago

There's a quick way to fix most of the issues here:

This line: https://github.com/rcr/rirc/blob/aad5807f20b157a31fabf6b0e377b89958885403/src/mesg.c#L658 filters out valid bytes from unicode character sequences. You could simply allow those bytes through and they would be printed correctly. However, the big issue (and the reason I haven't made this fix myself) is that the draw routines that split lines on whitespace expect each byte to be a printable character, so line splitting on long text would probably be broken. But it's a temporary fix that gets us most of the way there, I suppose.
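The quick fix described above amounts to relaxing a byte filter so that bytes with the high bit set survive. The sketch below shows the idea in isolation; it is not the code at the linked line, and `keep_byte` and `strip_unprintable` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical predicate: keep printable ASCII and every byte >= 0x80,
 * since those are lead or continuation bytes of UTF-8 sequences.
 */
static int
keep_byte(unsigned char c)
{
    if (c >= 0x80)
        return 1;

    return c >= 0x20 && c != 0x7f;
}

/* Remove rejected bytes from `s` in place and return the new length. */
size_t
strip_unprintable(char *s)
{
    size_t r, w = 0;

    assert(s != NULL);

    for (r = 0; s[r] != '\0'; r++) {
        if (keep_byte((unsigned char)s[r]))
            s[w++] = s[r];
    }

    s[w] = '\0';
    return w;
}
```

This filter does not validate the sequences it lets through, so malformed UTF-8 reaches the terminal unchanged, and the resulting byte length no longer equals the number of columns drawn.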

ghost commented 8 years ago

This may be simpler to implement, and feels much saner than my heavy proposal. I will probably look into this during the summer holidays, or before if I can.

For the line size calculation, FRIGN talked about this library (and may soon re-write it from scratch, but I'm not sure): http://git.suckless.org/libutf/tree. MIT licence as well.

EDIT: s/made/talked about/

ghost commented 8 years ago

I just noticed this function in the repo of the vis editor, which seems to calculate line lengths.
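For the line-length calculation itself, a first approximation needs no library: count one column per UTF-8 code point by skipping continuation bytes (those of the form 10xxxxxx). The sketch below is neither the vis function nor libutf; `utf8_columns` and `wrap_offset` are hypothetical names, and the approach deliberately ignores double-width and combining characters.

```c
#include <assert.h>
#include <stddef.h>

/* Approximate display columns: one per byte that is not 10xxxxxx. */
size_t
utf8_columns(const char *s)
{
    size_t cols = 0;

    assert(s != NULL);

    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            cols++;
    }

    return cols;
}

/*
 * Return the byte offset at which to break `s` so that at most `max_cols`
 * code points stay on the first line. Breaks at the last space seen when
 * there is one, otherwise before the first code point that does not fit,
 * and never inside a multibyte sequence.
 */
size_t
wrap_offset(const char *s, size_t max_cols)
{
    size_t i, cols = 0, last_space = 0;

    assert(s != NULL && max_cols > 0);

    for (i = 0; s[i] != '\0'; i++) {
        if (((unsigned char)s[i] & 0xC0) == 0x80)
            continue;           /* continuation byte: same column */

        if (cols == max_cols)   /* s[i] starts a code point that overflows */
            return last_space ? last_space : i;

        if (s[i] == ' ')
            last_space = i;

        cols++;
    }

    return i;                   /* the whole string fits */
}
```

A more accurate version could decode each code point and ask wcwidth() for its width, at the cost of depending on the locale.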

rcr commented 8 years ago

That could work as a template. It's probably good enough to start with even if it has buggy edge cases. I'm hoping to have some time to work on this stuff shortly, it's been a busy couple months :)

ghost commented 8 years ago

I will have exams soon, but I would be glad to contribute (not spamming like I did), which would mean that I managed to learn C properly.

https://github.com/martanne/vis/issues/117 could also help, maybe.

Busy times... Seems so frequent amongst devs!

ghost commented 7 years ago

I'm working on a side project and implemented UTF-8 support there. It does not handle invalid text very well, but it works well for valid text.

At least, this works ([EDIT]: fixed for invalid UTF-8 characters):

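#include <stddef.h>

/*
 * Returns the number of bytes for the first rune of `str`: 1, 2, 3, 4 or 0 for
 * malformed UTF-8. Only the byte layout is checked; overlong encodings and
 * surrogates are not rejected.
 */
size_t
utf8_byte_count(const char *str)
{
    size_t n = 1;

    /* check if multibyte */
    if (str[0] & 1 << 7) {

        /* a continuation byte (10xxxxxx) cannot start a rune */
        if (!(str[0] & 1 << 6))
            return 0;

        /* get the number of continuation bytes */
        for (n = 1; (str[0] & 1 << (7 - n)); n++) {

            /* check formatting: at most 3 continuation bytes, each 10xxxxxx */
            if (n > 3 || !(str[n] & 1 << 7 && ~str[n] & 1 << 6))
                return 0;
        }
    }

    return n;
}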

ghost commented 7 years ago

I fixed remaining UTF-8 and Unicode-related issues. The snippet above had a small bug and I fixed it.

rcr commented 7 years ago

Looks good. I finished refactoring all the buffer stuff and next I'm going to do the same for inputs. I wanted to make them more modular and tested, and it should make implementing unicode across the whole project much easier.

ghost commented 7 years ago

That sounds good too. I will be glad to read it, as I am also trying to build a text interface without a library (i.e. without ncurses), and draw.c is a great help.

rcr commented 3 years ago

In progress, 5 years later. The draw code will account for UTF-8 column width, and control characters will be printed safely in a distinct colour. This ships with the next version.

rcr / rirc

Better unicode support #14