Explicit UTF-8 support - Githubissues

ghost commented 7 years ago

This is what I tried to do. I have not tested all of it; I will post an update when I consider it stable enough.

In the meantime, here are helpful resources:

https://en.wikipedia.org/wiki/UTF-8#Description
https://github.com/cls/libutf
https://github.com/tmux/tmux/blob/master/utf8.c
https://en.wikibooks.org/wiki/C_Programming/Reference_Tables#Table_of_Data_Types

xxd(1) from vim

$ printf '┬' | xxd -b
00000000: 11100010 10010100 10101100                             ...

$ printf '┬' | ./test | xxd -b
00000000: 11100000 10010100 10101100                             ...

I had a tricky, poorly documented issue: char is signed by default (on my setup), so the leading bit of a 1xxxxxxx byte is taken as the sign. To retrieve the leading 1xxxxxxx, I have to cast to (unsigned char).
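To illustrate the signedness issue, here is a minimal sketch. The function names (naive_byte_value, first_byte_value, is_multibyte_part) are made up for this example and are not taken from the project.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical: reads the byte without a cast. Where char is signed,
 * any byte >= 0x80 is widened to a negative int. */
int naive_byte_value(const char *s)
{
    assert(s != NULL);
    return s[0];
}

/* Hypothetical: casts to unsigned char first, so the result is always
 * in the range 0..255 regardless of the signedness of char. */
int first_byte_value(const char *s)
{
    assert(s != NULL);
    return (unsigned char)s[0];
}

/* Hypothetical: nonzero if the first byte has its high bit set, i.e. it
 * belongs to a multi-byte UTF-8 sequence rather than being plain ASCII. */
int is_multibyte_part(const char *s)
{
    assert(s != NULL);
    return ((unsigned char)s[0] & 0x80) != 0;
}
```

With the first byte of '┬' (0xE2), first_byte_value() gives 226 on any platform, while naive_byte_value() may give a negative number where char is signed.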

ghost commented 7 years ago

Not to forget the utf-8(7) man page!

rnoth commented 7 years ago

Hey! Sorry about taking so long to notice this, I need to do something about notifications soon.

Thank you for all the resources. It'll likely take a day or two for me to work on the utf8 issue, but I can work out an implementation plan in the chunks of the day where I don't have access to a proper *nix box to code on.

At the moment I took a page from your book and broke everything on a 'revamp' branch, and sort of relatedly started using Valgrind to iron out some illegal memory accesses.

(I also reneged and started doing ugly stuff with macros again. Gaze in horror at vector.h)

rnoth commented 7 years ago

Ugh, went off on a tangent and forgot one point -- I wanted to mention that I've started using a new typedef'd String struct in edna (that hasn't permeated throughout the codebase yet) primarily because I already find myself passing a size_t to almost every string-related function (plus I think there was an inconsistency on whether it was the string's length or allocated size). I'm not sure, since I haven't dug deep into utf8, but this may make the integration easier.

ghost commented 7 years ago

There is no problem, you can delay.

The idea is that a char is too small to store all Unicode characters. It can only hold ASCII ones.

UTF-8 keeps all ASCII characters as they are, but for every other character, the code point number is split into multiple parts, stored in multiple chars.

Those chars have their first bit set to 1 to differentiate them from ASCII chars (which only use the last 7 bits).
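As a rough illustration of that layout, here is a sketch of a decoder for a single character. The name utf8_decode and its signature are assumptions made for this example. It follows the bit patterns from the Wikipedia description linked above and, to stay short, does not reject overlong encodings or surrogate code points.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: decode one code point from the n bytes at s.
 * Returns the number of bytes consumed, or 0 on malformed input. */
size_t utf8_decode(const char *s, size_t n, unsigned long *out)
{
    const unsigned char *p = (const unsigned char *)s;
    unsigned long cp;
    size_t len, i;

    assert(s != NULL && out != NULL);
    if (n == 0)
        return 0;

    /* The lead byte tells how many bytes the sequence has. */
    if (p[0] < 0x80) {                  /* 0xxxxxxx: ASCII */
        cp = p[0];
        len = 1;
    } else if ((p[0] & 0xE0) == 0xC0) { /* 110xxxxx */
        cp = p[0] & 0x1F;
        len = 2;
    } else if ((p[0] & 0xF0) == 0xE0) { /* 1110xxxx */
        cp = p[0] & 0x0F;
        len = 3;
    } else if ((p[0] & 0xF8) == 0xF0) { /* 11110xxx */
        cp = p[0] & 0x07;
        len = 4;
    } else {
        return 0; /* stray continuation byte or invalid lead byte */
    }

    if (len > n)
        return 0; /* sequence is cut short */

    /* Each continuation byte is 10xxxxxx and carries 6 payload bits. */
    for (i = 1; i < len; i++) {
        if ((p[i] & 0xC0) != 0x80)
            return 0;
        cp = (cp << 6) | (p[i] & 0x3F);
    }

    *out = cp;
    return len;
}
```

For the three bytes of '┬' from the xxd output above (E2 94 AC), this consumes 3 bytes and yields U+252C.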

So I guess a choice opens up to you as you are redesigning the string handling:

Continue using char[]s and manage the special cases, calling a UTF-8 parser only when needed (calculating the number of characters in a string, the position...).
Make the text an array of something bigger than a char, able to hold every character, converting all the buffers to arrays of long ints (or something else) rather than chars.
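For the first option, the "special case" for counting characters could look roughly like this. It is only a sketch: utf8_count is a made-up name, and it assumes the buffer already holds valid UTF-8.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: count the characters (code points) in the n bytes
 * at s. Continuation bytes look like 10xxxxxx, so every byte that does
 * NOT match that pattern starts a new character. */
size_t utf8_count(const char *s, size_t n)
{
    size_t i, count = 0;

    assert(s != NULL);
    for (i = 0; i < n; i++)
        if (((unsigned char)s[i] & 0xC0) != 0x80)
            count++;
    return count;
}
```

With the second option, this function is not needed at all: the number of characters is simply the length of the array.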

I guess it is easier to do this early. That's the reason why I was telling you this.

ghost commented 7 years ago

Your vector.h is certainly not horrible, it's like generalizing arrays to any data type. It is nice.

There are actually standard implementations of stacks using macros only, so it is far from being lame: queue(3) and sys/queue.h.

rnoth commented 7 years ago

Yes, I understand the reasoning of doing it early, since the buggy code will just pile up. It occurred to me multiple times, but I always found an excuse to work on something else.

The first approach has a certain convenience to it, but there doesn't seem to be any actual gain from it versus the second, where the mapping from entries in memory to characters is explicit and obvious. Space isn't a concern, and {de,re}coding only ever needs to happen at the three places where the text flow extends out of the program -- readline(), writebuf(), and print() -- instead of anytime you do a certain operation on lines in memory.
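To make the output side of the second approach concrete, here is a sketch of the re-encoding step that something like writebuf() or print() would need for each stored character. The name utf8_encode and its signature are assumptions for this example, not edna's actual code, and surrogate code points are not rejected.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: encode one code point as UTF-8 into out, which
 * must have room for 4 bytes. Returns the number of bytes written, or
 * 0 if cp is beyond the Unicode range (above U+10FFFF). */
size_t utf8_encode(unsigned long cp, char *out)
{
    assert(out != NULL);

    if (cp < 0x80) {            /* 7 bits: plain ASCII */
        out[0] = (char)cp;
        return 1;
    }
    if (cp < 0x800) {           /* up to 11 bits: 110xxxxx 10xxxxxx */
        out[0] = (char)(0xC0 | (cp >> 6));
        out[1] = (char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {         /* up to 16 bits: 1110xxxx + 2 bytes */
        out[0] = (char)(0xE0 | (cp >> 12));
        out[1] = (char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp < 0x110000) {        /* up to 21 bits: 11110xxx + 3 bytes */
        out[0] = (char)(0xF0 | (cp >> 18));
        out[1] = (char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

Encoding U+252C ('┬') this way should give back the bytes E2 94 AC, matching the first xxd output at the top of the thread.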

Is there anything I'm missing?

Also, that queue manpage is an interesting find. It's like discovering hcreate(3) et al. -- I keep finding myself surprised at how much general stuff POSIX ships without me ever finding out.

ghost commented 7 years ago

That is probably all, as long as you do not use standard string functions like strcat or others that act on char *. Only the functions that move text in and out of the program need to deal with the encoding.

I picked long[].

I imagine some people use a database library where hcreate(3) would have been enough. It is a nice surprise to find it there.

ghost commented 7 years ago

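The C standard library (C99, and C89 before it) has built-in functions to deal with "multibyte" encodings such as UTF-8: mbstowcs(3), mbtowc(3), mblen(3)... They decode according to the current locale, so they only handle UTF-8 when a UTF-8 locale is active.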
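A small sketch of how mbstowcs(3) might be used. The wrapper name to_wide is made up for this example. The conversion follows the LC_CTYPE category of the current locale, so the caller is assumed to have selected a UTF-8 locale first, for instance with setlocale(LC_CTYPE, ""). Which locale names exist depends on the system.

```c
#include <assert.h>
#include <locale.h>
#include <stdlib.h>
#include <wchar.h>

/* Hypothetical wrapper: convert the NUL-terminated multibyte string s
 * into at most cap wide characters stored in dst. Returns the number of
 * wide characters written (not counting a terminating L'\0'), or
 * (size_t)-1 if s is not valid in the current locale's encoding.
 * Assumes the caller already called setlocale() as needed. */
size_t to_wide(const char *s, wchar_t *dst, size_t cap)
{
    assert(s != NULL && dst != NULL);
    return mbstowcs(dst, s, cap);
}
```

In the default "C" locale this is only guaranteed to work for ASCII text. Whether wchar_t values are Unicode code points is also implementation-dependent (it is the case on glibc, for example).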

rnoth / edna

Explicit UTF-8 support #1