a1k0n opened this issue 12 years ago
Thanks for spotting this. This is a serious issue. :(
And actually, I didn't even notice this at first but it's not even calling "mymemclr4" -- it's inferring I'm doing a memset, and then generating an (incorrect) call to memset with a length of 0 per my other bug. :(
I'm pretty sure that this is an LLVM bug and that it has the same cause as issue #153 of LLVM-DCPU16.
Well, you and Blei had to do a ton of work to make LLVM even familiar with the concept of 16-bit bytes, so I imagine this is going to be a continuation of that work rather than a bug in the original code per se. It looked to me like it was trying to figure out these store spans using BitsPerByte in the memset generator, so it's probably incompletely converted but I haven't found the fix.
Still, when did we last sync with upstream?
Thanks for the reminder. Syncing to upstream now...
The merge is complete. See https://github.com/llvm-dcpu16/llvm-dcpu16/pull/165 and https://github.com/llvm-dcpu16/clang/pull/15
Ohho!
define i16 @main() nounwind {
entry:
store i32 0, i32* bitcast (i16* getelementptr inbounds ([128 x i16]* @buf, i16 0, i16 124) to i32*), align 1
store i32 0, i32* bitcast (i16* getelementptr inbounds ([128 x i16]* @buf, i16 0, i16 120) to i32*), align 1
store i32 0, i32* bitcast (i16* getelementptr inbounds ([128 x i16]* @buf, i16 0, i16 116) to i32*), align 1
store i32 0, i32* bitcast (i16* getelementptr inbounds ([128 x i16]* @buf, i16 0, i16 112) to i32*), align 1
store i32 0, i32* bitcast (i16* getelementptr inbounds ([128 x i16]* @buf, i16 0, i16 108) to i32*), align 1
store i32 0, i32* bitcast (i16* getelementptr inbounds ([128 x i16]* @buf, i16 0, i16 104) to i32*), align 1
store i32 0, i32* bitcast (i16* getelementptr inbounds ([128 x i16]* @buf, i16 0, i16 100) to i32*), align 1
It thinks it can make 32-bit stores. That's why it's doing this crazy stuff.
Okay, I have a patch which fixes a bunch of these issues in LLVM (basically replacing <8> with getBitsPerByte() everywhere). However, clang needs similar patching as it generates code which assumes, for instance, a 32-bit store is 4 "bytes" wide and it just generates invalid code. I'll have a pull request ready later.
I'm starting to wonder whether this is too invasive and we should just somehow force pointer alignment to 2 8-bit bytes.
Also the memset call is 32 bits wide and it puts the lower 16 bits of the length arg on the stack via [A], which I guess is correct. I didn't notice that before. We need to implement a memset builtin to avoid that.
> Okay, I have a patch which fixes a bunch of these issues in LLVM (basically replacing <8> with getBitsPerByte() everywhere).
Sounds promising.
> I'm starting to wonder whether this is too invasive and we should just somehow force pointer alignment to 2 8-bit bytes.
It might be that forcing pointer alignment to 2 8-bit bytes and tricking the assembler printer is easier, but it's clearly a hack. The getBitsPerByte approach appears to be as invasive as we feared, but in the long term it might be cleaned up enough to be upstreamed. The llvmdev mailing list has a number of requests to add support for non-8-bit bytes, but it has never been implemented properly. It would be nice if the dcpu16 community could accomplish this mission.
> Also the memset call is 32 bits wide and it puts the lower 16 bits of the length arg on the stack via [A], which I guess is correct. I didn't notice that before. We need to implement a memset builtin to avoid that.
Agree.
Just for reference: how did we fix this? With the LLVM patch from Blei? (8675f9174f35bb539082709a65b83cf8b1a376b8)
It discarded my comment again. But I shouldn't have closed this one.
It seems Blei's fix-framepointer patch doesn't fix this as I had thought. If, in the above example, you change the loop to *buf++ = 1, then it works as intended -- this is specifically a bug in clearing memory, and it seems to stem from the clang side generating incorrect llvm code.
I have just verified with the latest llvm that clang -fno-builtins cures this problem, so I was right to suspect the memset builtin initially.
The optimizer seems to think a 16-bit set is enough to clear two adjacent words in some cases. The code is correct without -O.