protopopov1122 / kefir

C17 compiler implementation from scratch

kefir silently chokes on UTF8 in comments #3

Closed mulle-nat closed 1 year ago

mulle-nat commented 1 year ago

I wanted to try kefir on my projects, but I hit a really strange error early on. It turns out that UTF-8 in comments breaks things silently:

//
//  mulle_c11.h
//
//  Copyright © 2016 Mulle kybernetiK. All rights reserved.
//  Copyright © 2016 Codeon GmbH. All rights reserved.
//
int a;

kefir vs gcc:

$ $CC -E /tmp/x.c 

$ gcc -E /tmp/x.c  
# 0 "/tmp/x.c"
# 0 "<built-in>"
# 0 "<command-line>"
# 1 "/usr/include/stdc-predef.h" 1 3 4
# 0 "<command-line>" 2
# 1 "/tmp/x.c"

int a;

protopopov1122 commented 1 year ago

Thanks for reporting the issue. Could you please provide more details about your environment (locale, operating system, libc)? Please also make sure that your locale charmap is actually UTF-8 (on Linux, you can usually check that with the locale charmap command). Kefir uses the system locale when preprocessing and lexing input files. For instance:

$ cat 1.c
//
//  mulle_c11.h
//
//  Copyright © 2016 Mulle kybernetiK. All rights reserved.
//  Copyright © 2016 Codeon GmbH. All rights reserved.
//
int a;
$ LC_ALL=C locale charmap                             
ANSI_X3.4-1968
$ LC_ALL=C kefir --target x86_64-host-none -E 1.c 

$ LC_ALL=C.UTF-8 locale charmap                       
UTF-8
$ LC_ALL=C.UTF-8 kefir --target x86_64-host-none -E 1.c

int a;

mulle-nat commented 1 year ago

Sure. Here's my OS environment:

$ locale charmap
UTF-8
$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 22.04.1 LTS
Release:    22.04
Codename:   jammy

I set kefir's environment variables the way I saw them defined in ubuntu-misc.yml, which worked fine until I hit the header. I don't think the libc comes into play, since I am just preprocessing and not including anything.

protopopov1122 commented 1 year ago

It seems I was able to reproduce the issue; it might be caused by missing locale definitions for the user-preferred locale. The underlying problem can be reproduced with this code snippet:

#include <uchar.h>
#include <stdlib.h>
#include <stdio.h>
#include <locale.h>

const char string[] = "\xc2\xa9";

int main(int argc, const char **argv) {
    setlocale(LC_ALL, "");
    printf("%s\n", string);
    mbstate_t mbstate = {0};
    char32_t chr = U'\0';
    size_t rc = mbrtoc32(&chr, string, sizeof(string), &mbstate);
    printf("%d %u\n", (int) rc, chr);
    return EXIT_SUCCESS;
}

It outputs the following:

root@29eba25fedbc:/# locale -a
C
C.utf8
POSIX
root@29eba25fedbc:/# gcc -o test test.c && LC_ALL=en_US.UTF-8 ./test
©
-1 0
root@29eba25fedbc:/# locale-gen en_US.UTF-8
Generating locales (this might take a while)...
  en_US.UTF-8... done
Generation complete.
root@29eba25fedbc:/# locale -a
C
C.utf8
POSIX
en_US.utf8
root@29eba25fedbc:/# gcc -o test test.c && LC_ALL=en_US.UTF-8 ./test
©
2 169

Kefir relies on the mbrtoc32 function for decoding, and glibc seems to use the system locale definitions to implement that function. Can you check your current locale (locale command) and make sure that it's actually available on the system (locale -a)?
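
A quick way to check both things from C is to look at what setlocale returns and which encoding is then active. The snippet below is only a minimal sketch: active_codeset is a hypothetical helper name, not something Kefir provides. It relies on setlocale returning NULL when the requested locale cannot be installed (in which case the previous locale, initially "C", stays in effect) and on the POSIX call nl_langinfo(CODESET) to name the active encoding.

```c
#include <langinfo.h>
#include <locale.h>
#include <stddef.h>

/*
 * Hypothetical helper (not part of Kefir): try to activate the locale
 * called `name` ("" means "take it from the environment") and return the
 * name of its character encoding, or NULL if the locale is unavailable.
 */
const char *active_codeset(const char *name) {
    /* A NULL result means the request could not be honored; the locale
       that was active before the call remains in effect. */
    if (setlocale(LC_ALL, name) == NULL) {
        return NULL;
    }
    /* CODESET names the encoding of the now-active locale; this is
       typically the same string that `locale charmap` prints. */
    return nl_langinfo(CODESET);
}
```

Calling active_codeset("") and printing the result (or a warning when it is NULL) before running the mbrtoc32 test should show whether the program is really decoding as UTF-8 or has fallen back to the "C" locale.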

mulle-nat commented 1 year ago

Interesting. So today after a fresh reboot I did this:

$ cat > x.c
#include <uchar.h>
#include <stdlib.h>
#include <stdio.h>
#include <locale.h>

const char string[] = "\xc2\xa9";

int main(int argc, const char **argv) {
    setlocale(LC_ALL, "");
    printf("%s\n", string);
    mbstate_t mbstate = {0};
    char32_t chr = U'\0';
    size_t rc = mbrtoc32(&chr, string, sizeof(string), &mbstate);
    printf("%d %u\n", (int) rc, chr);
    return EXIT_SUCCESS;
}
}
$ cc -o x x.c
$ ./x
©
2 169
$ locale -a
C
C.utf8
de_DE.utf8
en_AG
en_AG.utf8
en_AU.utf8
en_BW.utf8
en_CA.utf8
en_DK.utf8
en_GB.utf8
en_HK.utf8
en_IE.utf8
en_IL
en_IL.utf8
en_IN
en_IN.utf8
en_NG
en_NG.utf8
en_NZ.utf8
en_PH.utf8
en_SG.utf8
en_US.utf8
en_ZA.utf8
en_ZM
en_ZM.utf8
en_ZW.utf8
POSIX
$ cat /home/src/srcO/mulle-cc/kefir/env 
export KEFIR_RTLIB="/home/src/srcO/mulle-cc/kefir/bin/libs/libkefirrt.a"
export KEFIR_RTINC="/home/src/srcO/mulle-cc/kefir/headers/kefir/runtime"
export KEFIR_GNU_INCLUDE="/usr/lib/gcc/x86_64-linux-gnu/11/include;/usr/include/x86_64-linux-gnu;/usr/include;/usr/local/include"
export KEFIR_GNU_LIB="/usr/lib/x86_64-linux-gnu;/usr/lib/gcc/x86_64-linux-gnu/11/;/usr/lib;/usr/local/lib"
export KEFIR_GNU_DYNAMIC_LINKER="/lib64/ld-linux-x86-64.so.2"
export KEFIRCC=/home/src/srcO/mulle-cc/kefir/bin/kefir
export CC="${KEFIRCC}"
$ . /home/src/srcO/mulle-cc/kefir/env 
$ cat > y.c
//
//  mulle_c11.h
//
//  Copyright © 2016 Mulle kybernetiK. All rights reserved.
//  Copyright © 2016 Codeon GmbH. All rights reserved.
//
int a;
$ ${CC} -E y.c 

int a;

This indicates to me that, for some reason, the locale of my last session must have been corrupted (in multiple terminals, even), though I didn't do that intentionally, nor would I know what may have caused it. In other words, I can't reproduce this.

Nevertheless, the silent failure of the compiler was unfortunate, and the problematic part was really hard to track down.

protopopov1122 commented 1 year ago

Agreed, silent failure is unhelpful. I've pushed some fixes to produce an error when input decoding fails.
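
For illustration, reporting a decoding failure instead of silently stopping could look roughly like the sketch below. This is not Kefir's actual code: count_code_points and its bad_offset parameter are hypothetical names, and the loop assumes that a decoded null character occupies exactly one byte, which holds for ASCII and UTF-8 but not necessarily for other encodings.

```c
#include <stddef.h>
#include <string.h>
#include <uchar.h>

/*
 * Hypothetical sketch, not Kefir's implementation: decode `len` bytes of
 * multibyte input under the current locale. Returns the number of code
 * points decoded, or -1 if the input contains an invalid or truncated
 * sequence; in that case the byte offset of the offending sequence is
 * stored in *bad_offset (when bad_offset is not NULL).
 */
long count_code_points(const char *input, size_t len, size_t *bad_offset) {
    mbstate_t state;
    memset(&state, 0, sizeof state); /* start in the initial shift state */
    size_t pos = 0;
    long count = 0;

    while (pos < len) {
        char32_t c;
        size_t rc = mbrtoc32(&c, input + pos, len - pos, &state);
        if (rc == (size_t) -1 || rc == (size_t) -2) {
            /* -1: invalid sequence, -2: sequence cut off by end of input.
               Surface the failure to the caller instead of ignoring it. */
            if (bad_offset != NULL) {
                *bad_offset = pos;
            }
            return -1;
        }
        if (rc == (size_t) -3) {
            /* A code point was produced without consuming new input. */
            count++;
            continue;
        }
        /* rc == 0 means a null character was decoded; assume it took one
           byte (see the assumption stated above). */
        pos += (rc == 0) ? 1 : rc;
        count++;
    }
    return count;
}
```

A caller can turn a -1 result into a diagnostic that names the file and the byte offset, so an unavailable or non-UTF-8 locale shows up as an explicit error rather than as empty preprocessor output.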