juj / fbcp-ili9341

A blazing fast display driver for SPI-based LCD displays for Raspberry Pi A, B, 2, 3, 4 and Zero
MIT License

Pi Zero W compiler error: "asm operand has impossible constraints" #171

Open davidje13 opened 3 years ago

davidje13 commented 3 years ago

When trying to build on a Pi Zero W (using DietPi as the base OS) I get the following:

.../fbcp-ili9341/diff.cpp: In function "int coarse_linear_diff(uint16_t*, uint16_t*, uint16_t*)":
.../fbcp-ili9341/diff.cpp:69:4: error: "asm" operand has impossible constraints

I am entirely unfamiliar with ASM programming, but THE INTERNET suggests this is due to the target architecture not having enough scratch registers (in this case, presumably 3 are needed. Maybe 4 if it has to include the return too).

By following that answer's advice (changing the r flags to g for the 3 input params) I am able to continue to the next error (i.e. the next assembly block in the same file). And fixing that gets me to:

/tmp/cc3jYkbu.s:40: Error: immediate expression requires a # prefix -- `mov r0,[fp,#-48]'
[etc]

Which is presumably because "g" doesn't actually mean "magically make this thing work please".

I see from the readme and config options that this library has been built for the Pi Zero before, so hopefully this is something easy to fix.

My full installation looked like this:

sudo apt-get install -y git gcc g++ libc-dev libraspberrypi-dev make cmake;
git clone https://github.com/juj/fbcp-ili9341.git;
mkdir -p fbcp-ili9341/build;
cd fbcp-ili9341/build;
cmake \
  -DILI9341=ON \
  -DGPIO_TFT_DATA_CONTROL=25 \
  -DGPIO_TFT_RESET_PIN=23 \
  -DGPIO_TFT_BACKLIGHT=18 \
  -DSPI_BUS_CLOCK_DIVISOR=16 \
  -DSINGLE_CORE_BOARD=ON \
  -DSTATISTICS=2 \
  ..;

make -j;

davidje13 commented 3 years ago

I bumbled around a bit and eventually got something which at least builds:

I replaced one of the input r flags with g in each assembly block in diff.cpp, and changed the corresponding mov instructions to str. For the first block I changed prevFramebuffer. For the second I changed framebuffer.

I cannot emphasise enough that I have no idea what I am doing, but with those changes it compiled "successfully".

From what I gather, g is choosing to represent these values as frame-pointer-relative memory references (generating [fp,#-??] operands), hence the need to switch mov to str. But it seems g is fickle, and there's nothing to guarantee it will do this (it could pick something else). Somebody with actual ASM experience will need to weigh in on that.

The really weird part (for me) is that I can't just change any input to g and expect it to work. It's pretty obvious why I can't change prevFramebuffer in the second block (it's a calculated value, so it has no memory location to point to), but I was surprised to find that I couldn't change either of the other two inputs in the first block, nor framebufferBegin in the second, without it complaining about "internal_relocation (type OFFSET_IMM) not fixed up". From what I can gather, that relates to near vs far memory pointers, but I don't know why that would be relevant when reading something from the stack (presumably pretty near?). Then again, I don't know for sure what the generated assembly looks like, so maybe the compiler re-ordered something, or the error comes from later in the assembly block.


So it compiles with that, but when I run it, it promptly segfaults just after the "All initialized, now running main loop..." log line. Which implies it's crashing the first time it encounters my mangled assembly. Guess I got something wrong then.


Here's the patch so far (minus some windows/unix line ending changes which hopefully won't break things):

diff --git a/diff.cpp b/diff.cpp
index 8966a15..fd0cc43 100644
--- a/diff.cpp
+++ b/diff.cpp
@@ -30,7 +30,7 @@ static int coarse_linear_diff(uint16_t *framebuffer, uint16_t *prevFramebuffer,
   asm volatile(
     "mov r0, %[framebufferEnd]\n" // r0 <- pointer to end of current framebuffer
     "mov r1, %[framebuffer]\n"   // r1 <- current framebuffer
-    "mov r2, %[prevFramebuffer]\n" // r2 <- framebuffer of previous frame
+    "str r2, %[prevFramebuffer]\n" // r2 <- framebuffer of previous frame

   "start_%=:\n"
     "pld [r1, #128]\n" // preload data caches for both current and previous framebuffers 128 bytes ahead of time
@@ -64,7 +64,7 @@ static int coarse_linear_diff(uint16_t *framebuffer, uint16_t *prevFramebuffer,
   "done_%=:\n"
     "mov %[endPtr], r1\n" // output endPtr back to C code
     : [endPtr]"=r"(endPtr)
-    : [framebuffer]"r"(framebuffer), [prevFramebuffer]"r"(prevFramebuffer), [framebufferEnd]"r"(framebufferEnd)
+    : [framebuffer]"r"(framebuffer), [prevFramebuffer]"g"(prevFramebuffer), [framebufferEnd]"r"(framebufferEnd)
     : "r0", "r1", "r2", "r3", "r4", "r5", "r6", "r7", "r8", "r9", "r10", "cc"
   );
   return endPtr - framebuffer;
@@ -77,7 +77,7 @@ static int coarse_backwards_linear_diff(uint16_t *framebuffer, uint16_t *prevFra
   uint16_t *endPtr;
   asm volatile(
     "mov r0, %[framebufferBegin]\n" // r0 <- pointer to beginning of current framebuffer
-    "mov r1, %[framebuffer]\n"   // r1 <- current framebuffer (starting from end of framebuffer)
+    "str r1, %[framebuffer]\n"   // r1 <- current framebuffer (starting from end of framebuffer)
     "mov r2, %[prevFramebuffer]\n" // r2 <- framebuffer of previous frame (starting from end of framebuffer)

   "start_%=:\n"
@@ -112,7 +112,7 @@ static int coarse_backwards_linear_diff(uint16_t *framebuffer, uint16_t *prevFra
   "done_%=:\n"
     "mov %[endPtr], r1\n" // output endPtr back to C code
     : [endPtr]"=r"(endPtr)
-    : [framebuffer]"r"(framebufferEnd), [prevFramebuffer]"r"(prevFramebuffer+(framebufferEnd-framebuffer)), [framebufferBegin]"r"(framebuffer)
+    : [framebuffer]"g"(framebufferEnd), [prevFramebuffer]"r"(prevFramebuffer+(framebufferEnd-framebuffer)), [framebufferBegin]"r"(framebuffer)
     : "r0", "r1", "r2", "r3", "r4", "r5", "r6", "r7", "r8", "r9", "r10", "cc"
   );
   return endPtr - framebuffer;

I'd love it if somebody who actually understands this stuff could weigh in and explain how to fix this correctly.
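
On the scratch-register question raised at the start of the thread, the arithmetic can be sketched as follows. The clobber list in the patch above names r0 through r10, and GCC will not place an operand in a clobbered register. The frame-pointer case is an assumption and is not confirmed anywhere in this thread: it is one plausible way to end up one register short, for example in a build without optimization. The names below are hypothetical and only illustrate the counting.

```cpp
// Register budget sketch for the asm block in coarse_linear_diff on 32-bit ARM.
// All names are hypothetical; this only illustrates the counting.
constexpr int kTotalRegs = 16; // r0..r15
constexpr int kReserved  = 2;  // sp (r13) and pc (r15) are never handed to operands
constexpr int kClobbered = 11; // r0..r10, as listed in the clobber list
constexpr int kInputs    = 3;  // framebuffer, prevFramebuffer, framebufferEnd

// Registers left for "r" operands. ASSUMPTION: when the compiler keeps a frame
// pointer in ARM mode, r11 is taken as well.
constexpr int free_regs(bool framePointerInUse)
{
  return kTotalRegs - kReserved - kClobbered - (framePointerInUse ? 1 : 0);
}

// Without a frame pointer, r11, r12 and lr are just enough for three inputs.
static_assert(free_regs(false) >= kInputs, "three inputs fit");
// With r11 reserved, only two registers remain, so three "r" inputs cannot fit.
static_assert(free_regs(true) < kInputs, "three inputs do not fit");
```

The "=r" output does not add to the count here, because an output without an early-clobber marker may share a register with an input.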

davidje13 commented 3 years ago

Switching from DietPi to Pi OS Lite resolves this, but I don't know what the differences are. Happy to run experiments if anybody wants to investigate further.

I'll probably try PiCore later so I'll see if that has the same problem.

juj commented 3 years ago

hmmhmm, very odd indeed. This kind of thing seems to be about the compiler version, rather than the hardware or OS itself (though OS distros certainly dictate which compiler is available by default). Might try running gcc --version to check if it's an older or newer compiler that complains about the multitude of r constraints.
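
As a complement to gcc --version, here is a minimal sketch that reports what the compiler says about itself when built with the same flags as the project. The function name compiler_report is hypothetical. It relies only on macros that GCC and Clang predefine (__VERSION__, __OPTIMIZE__, __ARM_ARCH); other compilers may not define them, which is why each one is guarded.

```cpp
#include <string>

// Hypothetical helper: summarises compiler-defined macros as text.
std::string compiler_report()
{
  std::string report;
#ifdef __VERSION__
  report += "compiler: " __VERSION__ "\n"; // version string of the compiler in use
#else
  report += "compiler: unknown\n";
#endif
#ifdef __OPTIMIZE__
  report += "optimization: on\n";          // defined by GCC when optimizing
#else
  report += "optimization: off\n";
#endif
#ifdef __ARM_ARCH
  report += "ARM architecture: " + std::to_string(__ARM_ARCH) + "\n";
#endif
  return report;
}
```

Comparing this output between the failing and the working system would show whether the compiler version or the optimization setting differs.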

Using g constraint should be fine as well, although not sure about some of the str instructions there. At a glance they seem to turn a register load into a store.
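
To illustrate the load-versus-store point: on ARM, ldr reads memory into a register and str writes a register out to memory. The sketch below is not code from fbcp-ili9341, and load_pointer is a hypothetical helper. It uses the "m" constraint instead of "g" so that the operand is always a memory reference and ldr is always the matching instruction. The non-ARM branch is only there so the sketch builds on other machines.

```cpp
#include <cstdint>

// Hypothetical helper, not part of fbcp-ili9341: passes a pointer value
// through inline asm, the way the diff functions take their inputs.
static uint16_t *load_pointer(uint16_t *p)
{
#if defined(__arm__)
  uint16_t *out;
  asm volatile(
    "ldr %[out], %[in]\n" // load the value stored at the memory operand into a register
    : [out]"=r"(out)      // compiler picks the destination register
    : [in]"m"(p));        // "m": p is passed as a memory operand, e.g. [fp, #-8]
  return out;
#else
  return p;               // fallback for non-ARM builds; no asm involved
#endif
}
```

With str in place of ldr, the instruction would instead write whatever happened to be in the register over the stored pointer, which matches the crash described earlier in the thread.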

davidje13 commented 3 years ago

Hmm, I should have checked the compiler version before I clobbered it with Pi OS Lite. I'll find another SD card and check soon. In theory it uses the same apt repositories (debian + raspbian) so should get the same versions, but maybe something funny is going on.

For what it's worth, the working compiler is gcc (Raspbian 8.3.0-6+rpi1) 8.3.0

As I say, I have basically zero experience with assembly so I have no idea what I'm doing there. What I know is that changing "r" to "g" caused it to enter [fp,#-??] values into the assembly, which it then complained weren't preceded by # for some reason. Some searching showed me str which appeared to be used to load values from a memory location into a register, and I saw a few examples of people using it with [fp,...] constructs to read (or maybe write?) function parameters, but although it compiled like that it immediately exploded at runtime, so clearly it's wrong.

juj commented 3 years ago

My expectation with the code in that function was that it would get inlined and avoid any memory loads/stores in the first place. I.e. the movs from those variables should be getting optimized out as redundant reg->reg moves, and they'd only serve to "document" what data is flowing in. A g constraint should work for that purpose as well, but if it did introduce frame pointer related loads, then they were certainly not getting optimized out. In this particular case, though, the ingress/egress of those functions is not so hot that an extra instruction or two would make a large difference.