a1k0n opened this issue 12 years ago
Thanks for spotting this. This is a serious issue. :(
And actually, I didn't even notice this at first but it's not even calling "mymemclr4" -- it's inferring I'm doing a memset, and then generating an (incorrect) call to memset with a length of 0 per my other bug. :(
I'm pretty sure that this is an LLVM bug and that it has the same cause as issue #153 of LLVM-DCPU16.
Well, you and Blei had to do a ton of work to make LLVM even familiar with the concept of 16-bit bytes, so I imagine this is going to be a continuation of that work rather than a bug in the original code per se. It looked to me like it was trying to figure out these store spans using BitsPerByte in the memset generator, so it's probably incompletely converted but I haven't found the fix.
Still, when did we last sync with upstream?
Thanks for the reminder. Syncing to upstream now...
The merge is complete. See https://github.com/llvm-dcpu16/llvm-dcpu16/pull/165 and https://github.com/llvm-dcpu16/clang/pull/15
Ohho!
define i16 @main() nounwind {
entry:
store i32 0, i32* bitcast (i16* getelementptr inbounds ([128 x i16]* @buf, i16 0, i16 124) to i32*), align 1
store i32 0, i32* bitcast (i16* getelementptr inbounds ([128 x i16]* @buf, i16 0, i16 120) to i32*), align 1
store i32 0, i32* bitcast (i16* getelementptr inbounds ([128 x i16]* @buf, i16 0, i16 116) to i32*), align 1
store i32 0, i32* bitcast (i16* getelementptr inbounds ([128 x i16]* @buf, i16 0, i16 112) to i32*), align 1
store i32 0, i32* bitcast (i16* getelementptr inbounds ([128 x i16]* @buf, i16 0, i16 108) to i32*), align 1
store i32 0, i32* bitcast (i16* getelementptr inbounds ([128 x i16]* @buf, i16 0, i16 104) to i32*), align 1
store i32 0, i32* bitcast (i16* getelementptr inbounds ([128 x i16]* @buf, i16 0, i16 100) to i32*), align 1
It thinks it can make 32-bit stores. That's why it's doing this crazy stuff.
Okay, I have a patch which fixes a bunch of these issues in LLVM (basically replacing <8> with getBitsPerByte() everywhere). However, clang needs similar patching as it generates code which assumes, for instance, a 32-bit store is 4 "bytes" wide and it just generates invalid code. I'll have a pull request ready later.
I'm starting to wonder whether this is too invasive and we should just somehow force pointer alignment to 2 8-bit bytes.
Also the memset call is 32 bits wide and it puts the lower 16 bits of the length arg on the stack via [A], which I guess is correct. I didn't notice that before. We need to implement a memset builtin to avoid that.
> Okay, I have a patch which fixes a bunch of these issues in LLVM (basically replacing <8> with getBitsPerByte() everywhere).
Sounds promising.
> I'm starting to wonder whether this is too invasive and we should just somehow force pointer alignment to 2 8-bit bytes.
It might be that forcing pointer alignment to 2 8-bit bytes and tricking the assembler printer is easier, but it's clearly a hack. The getBitsPerByte approach appears to be as invasive as we feared, but in the long term it might be cleaned up enough to be upstreamed. The llvmdev mailing list has a number of requests to add support for non-8-bit bytes, but it has never been implemented properly. It would be nice if the dcpu16 community could accomplish this mission.
> Also the memset call is 32 bits wide and it puts the lower 16 bits of the length arg on the stack via [A], which I guess is correct. I didn't notice that before. We need to implement a memset builtin to avoid that.
Agree.
Just for reference: how did we fix this? With the LLVM patch from Blei? (8675f9174f35bb539082709a65b83cf8b1a376b8)
It discarded my comment again. But I shouldn't have closed this one.
It seems Blei's fix-framepointer patch doesn't fix this as I had thought. If, in the above example, you change the loop to *buf++ = 1, then it works as intended -- this is specifically a bug in clearing memory, and it seems to stem from the clang side generating incorrect llvm code.
I have just verified with the latest llvm that clang -fno-builtins cures this problem, so I was right to suspect the memset builtin initially.
The optimizer seems to think a 16-bit set is enough to clear two adjacent words in some cases. The code is correct without -O.