Closed ghost closed 7 years ago
Not to forget the utf-8(7) man page!
Hey! Sorry about taking so long to notice this, I need to do something about notifications soon.
Thank you for all the resources. It'll likely take a day or two for me to work on the utf8 issue, but I can work out an implementation plan in the chunks of the day where I don't have access to a proper *nix box to code on.
At the moment I took a page from your book and broke everything on a 'revamp' branch, and somewhat relatedly started using Valgrind to iron out some illegal memory accesses.
(I also reneged and started doing ugly stuff with macros again. Gaze in horror at vector.h)
Ugh, went off on a tangent and forgot one point -- I wanted to mention that I've started using a new typedef'd String struct in edna (that hasn't permeated throughout the codebase yet), primarily because I already find myself passing a size_t to almost every string-related function (plus I think there was an inconsistency on whether it was the string's length or allocated size). I'm not sure, since I haven't dug deep into utf8, but this may make the integration easier.
There is no problem, you can delay.
The idea is that a char is too small to store all Unicode characters. It can only hold ASCII ones.
UTF-8 keeps all ASCII characters as they are, but for every other character, its number (code point) is split into multiple parts, stored in multiple chars.
Those split chars have their first (highest) bit set to 1 to differentiate them from ASCII chars, which only use the last 7 bits.
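To make the bit layout concrete, here is a minimal sketch in C of how a single byte can be classified. The function names utf8_seq_len and utf8_is_continuation are made up for illustration and are not from edna. The sketch only looks at bit patterns and does not reject overlong or out-of-range sequences.

```c
#include <stddef.h>

/* Hypothetical helper: returns how many bytes the UTF-8 sequence that
   starts with `lead` occupies, or 0 if `lead` cannot start a sequence.
   It only inspects the bit pattern; overlong forms are not rejected. */
size_t utf8_seq_len(unsigned char lead)
{
    if (lead < 0x80)           return 1;  /* 0xxxxxxx: plain ASCII   */
    if ((lead & 0xE0) == 0xC0) return 2;  /* 110xxxxx: 2-byte lead   */
    if ((lead & 0xF0) == 0xE0) return 3;  /* 1110xxxx: 3-byte lead   */
    if ((lead & 0xF8) == 0xF0) return 4;  /* 11110xxx: 4-byte lead   */
    return 0;                             /* 10xxxxxx or invalid     */
}

/* Hypothetical helper: true for the trailing bytes of a multi-byte
   sequence, which always look like 10xxxxxx. */
int utf8_is_continuation(unsigned char byte)
{
    return (byte & 0xC0) == 0x80;
}
```

For example, "é" is stored as the two bytes 0xC3 0xA9: the first is a 2-byte lead, the second a continuation byte.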
So I guess a choice opens up to you as you are re-designing the string handling:
1. Continue using char[]s and manage the special cases, calling a UTF-8 parser only when needed (calculating the number of chars in a string, the position...).
2. Make the text an array of something bigger than a char, able to hold every character, converting all the buffers to arrays of long ints (or something else) rather than chars.
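As a rough illustration of the first option, counting characters in a plain char[] can be done by skipping continuation bytes. This is only a sketch: utf8_count is a hypothetical name, and it assumes the input is already valid UTF-8.

```c
#include <stddef.h>

/* Hypothetical helper: counts code points in a NUL-terminated UTF-8
   string by counting every byte that is not a continuation byte
   (10xxxxxx). Assumes the input is valid UTF-8. */
size_t utf8_count(const char *s)
{
    size_t n = 0;

    for (; *s != '\0'; s++) {
        /* Cast first: plain char may be signed on this platform. */
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    }
    return n;
}
```

With this approach strlen() still gives the size in bytes, while utf8_count() gives the number of characters, so the two have to be kept apart everywhere a position is computed.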
I guess it is easier to do this early. That's the reason why I was telling you this.
Your vector.h is certainly not horrible, it's like generalizing arrays to any data type. It is nice.
There actually are standard implementations of linked lists and queues using macros only, so it is far from being lame: queue(3) and sys/queue.h.
Yes, I understand the reasoning of doing it early, since the buggy code will just pile up. It occurred to me multiple times, but I always found an excuse to work on something else.
The first approach has a certain convenience to it, but there doesn't seem to be any actual gain from it versus the second, where the mapping from entries in memory to characters is explicit and obvious. Space isn't a concern, and {de,re}coding only ever needs to happen at the three places where the text flow extends out of the program -- readline(), writebuf(), and print() -- instead of anytime you do a certain operation on lines in memory.
Is there anything I'm missing?
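For reference, a minimal sketch of what the {de,re}coding at those boundaries could look like with the second approach. The names utf8_decode and utf8_encode are hypothetical, the choice of long as the element type is an assumption, and the decoder checks only the byte structure (it does not reject overlong encodings or surrogates).

```c
#include <stddef.h>

/* Hypothetical decoder: turns `len` bytes of UTF-8 into code points in
   `out` (room for `cap` entries). Returns the number of code points,
   or (size_t)-1 on malformed or truncated input or if `out` is full. */
size_t utf8_decode(const char *in, size_t len, long *out, size_t cap)
{
    size_t i = 0, n = 0;

    while (i < len) {
        unsigned char b = (unsigned char)in[i];
        long cp;
        size_t extra, k;

        if (b < 0x80)                { cp = b;        extra = 0; }
        else if ((b & 0xE0) == 0xC0) { cp = b & 0x1F; extra = 1; }
        else if ((b & 0xF0) == 0xE0) { cp = b & 0x0F; extra = 2; }
        else if ((b & 0xF8) == 0xF0) { cp = b & 0x07; extra = 3; }
        else return (size_t)-1;      /* stray continuation byte */

        if (extra > len - i - 1 || n == cap)
            return (size_t)-1;       /* truncated input or no room */

        for (k = 1; k <= extra; k++) {
            unsigned char c = (unsigned char)in[i + k];
            if ((c & 0xC0) != 0x80)
                return (size_t)-1;   /* expected 10xxxxxx */
            cp = (cp << 6) | (c & 0x3F);
        }
        out[n++] = cp;
        i += extra + 1;
    }
    return n;
}

/* Hypothetical encoder: writes one code point as UTF-8 into `out`
   (at least 4 bytes). Returns bytes written, or 0 if out of range. */
size_t utf8_encode(long cp, char *out)
{
    if (cp < 0 || cp > 0x10FFFF)
        return 0;
    if (cp < 0x80) {
        out[0] = (char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (char)(0xC0 | (cp >> 6));
        out[1] = (char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        out[0] = (char)(0xE0 | (cp >> 12));
        out[1] = (char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (char)(0x80 | (cp & 0x3F));
        return 3;
    }
    out[0] = (char)(0xF0 | (cp >> 18));
    out[1] = (char)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (char)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (char)(0x80 | (cp & 0x3F));
    return 4;
}
```

The idea would be to call the decoder once where a line enters the program and the encoder once per character where it leaves, so everything in between indexes the array directly.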
Also, that queue manpage is an interesting find. It's like discovering hcreate(3) et al. -- I keep finding myself surprised at how much general stuff POSIX ships without me ever finding out.
Maybe this is it, as long as you do not use standard string manipulation functions like strcat or others that act on char *. Only the functions that move text in and out of the program are affected.
I picked long[].
I imagine some people use a database library when hcreate(3) could have been enough. It is a good surprise to find it here.
Standard C (C99, and for these three functions already C89) has built-in functions to deal with "multibyte" encodings such as UTF-8: mbstowcs(3), mbtowc(3), mblen(3)...
This is what I tried to do. I have not tested all of it; I will post an update when I consider it stable enough.
In the meantime, here are helpful resources:
https://en.wikipedia.org/wiki/UTF-8#Description
https://github.com/cls/libutf
https://github.com/tmux/tmux/blob/master/utf8.c
https://en.wikibooks.org/wiki/C_Programming/Reference_Tables#Table_of_Data_Types
xxd(1) from vim
I had a tricky, hardly documented issue:
char is signed by default (on my setup), so the leading 1 in 1xxxxxxx sets the sign. To retrieve the leading 1xxxxxxx, I have to cast to (unsigned char).
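A short sketch of the cast in question, with hypothetical helper names. Whether plain char is signed is implementation-defined, so going through unsigned char gives the same result on either kind of setup.

```c
/* Hypothetical helper: returns the byte value 0..255 of a char.
   Without the cast, a byte such as 0xC3 may read as a negative
   number on platforms where plain char is signed. */
int byte_value(char c)
{
    return (unsigned char)c;
}

/* Hypothetical helper: true if the high bit is set, i.e. the byte
   is part of a multi-byte UTF-8 sequence rather than plain ASCII. */
int is_non_ascii(char c)
{
    return (unsigned char)c >= 0x80;
}
```

The same applies to comparisons and shifts: doing them on the unsigned char value avoids surprises from sign extension.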