UTF-8 string literals

srackham commented 2 years ago

Does Virgil support UTF-8 string literals?

The documentation suggests it does: https://github.com/titzer/virgil/blob/3038dead280099b736f312e2b091b053cb0cfbf7/doc/lib-issues.txt#L116

Here I've inserted the copyright character in a string literal:

$ cat hello.v3
def main() {
        System.puts("Hello World ©\n");
}

$ virgil run tmp/hello.v3
[tmp/hello.v3 @ 2:21] ParseError: invalid string literal
        System.puts("Hello World ©\n");
                    ^

Hex byte values work though:

$ cat hello.v3
def main() {
        System.puts("Hello World \xC2\xA9\n");
}

$ virgil run hello.v3
Hello World ©

titzer commented 2 years ago

You're right, that's a bug. It should handle UTF-8 in string literals, but it does not yet.

I was planning on improving the support for Unicode by changing the string type (currently an alias for Array<byte>), but this is something that could maybe be supported by just allowing the UTF-8 representation through.
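
As a rough illustration of what "the UTF-8 representation" means at the byte level, the sketch below uses Python (not Virgil, and not part of the Virgil toolchain) to show that © occupies two bytes in UTF-8. These are the same two bytes written as \xC2\xA9 in the literal that already works:

```python
# Illustration only: this is Python, not Virgil.
text = "Hello World ©"
encoded = text.encode("utf-8")

# 13 characters but 14 bytes: © (U+00A9) takes two bytes in UTF-8.
print(len(text), len(encoded))

# The final two bytes are the ones spelled \xC2\xA9 in the working literal.
print(encoded[-2:].hex())
```

A string type that is just an array of bytes would hold those 14 bytes unchanged.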

srackham commented 2 years ago

Thanks.

A workaround is to convert UTF-8 strings to hex byte values with, for example:

$ echo -n "Hello World ©" | od -A n -t x1 | tr -d '\n' | sed 's/ /\\x/g'
\x48\x65\x6c\x6c\x6f\x20\x57\x6f\x72\x6c\x64\x20\xc2\xa9
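
A variation on the same workaround, sketched in Python, keeps printable ASCII readable and hex-escapes only the remaining bytes. The helper name to_virgil_literal is hypothetical, and the sketch assumes that \xNN escapes are accepted for any byte value (including the double quote and backslash), which this thread only confirms for \xC2\xA9:

```python
def to_virgil_literal(text: str) -> str:
    # Hypothetical helper: build a double-quoted literal from the UTF-8
    # bytes of `text`. Assumes \xNN is a valid escape for every byte.
    parts = []
    for byte in text.encode("utf-8"):
        # Keep printable ASCII as-is, except the double quote (0x22) and
        # backslash (0x5C), which would otherwise end or alter the literal.
        if 0x20 <= byte <= 0x7E and byte not in (0x22, 0x5C):
            parts.append(chr(byte))
        else:
            parts.append(f"\\x{byte:02X}")
    return '"' + "".join(parts) + '"'


print(to_virgil_literal("Hello World ©\n"))
```

The printed literal can be pasted into a source file in place of the original string. Control characters such as the newline are also emitted as hex escapes, so the result avoids relying on any other escape sequence.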

titzer / virgil

UTF-8 string literals #77